Blog Moved

Future technology-related posts are published directly on LinkedIn.

Monday, April 23, 2012

Identifying Bottlenecks

Recently, I was asked to review the performance of an Informatica ETL workflow provided by Oracle as part of an OBIEE upgrade. The OLAP schema upgrade includes several transformations that migrate Siebel Analytics 7.7 to OBIA.

The problem:
An innocent-looking "Insert" load reads a source table, applies a set of lookup transformations, generates a sequence number as a unique key, and inserts into a target table in bulk mode, yet it achieves a throughput of only about 65 records per second on a 4-core, 8 GB RAM server. The source table has over 20 million records.

Other workflows were running at throughputs of more than 2,000 records per second, so I started investigating this "performance problem".

1. Looked at the AWR reports from the source and target databases, as well as the Informatica repository database. There was no indication of any bottleneck on the database side.

2. Looked at the Informatica PowerCenter session log, which showed the writer thread 99% busy.

3. Took a subset of 100K records using the source qualifier and ran multiple scenarios to isolate the bottleneck.

  • The first step was to remove the target database: I converted the target to a local flat file, so the workflow now just reads the source, applies all the transformations and lookups, and writes to the local flat file. The throughput did not improve.
  • Next, I removed the large table lookups; one of them was on a table of more than 20 million records. Even with all the lookups removed from the workflow mapping, the throughput was still only 65 records per second.
  • As the third step, I removed the sequence number generation. The workflow now just reads from the source table and writes to a flat file after applying a few compute transformations. The throughput reached 2,400 records per second.
  • By now the probable bottleneck was the "Sequence Generator". To confirm it, I re-ran the workflow with all the lookup transformations restored and only the sequence generator disabled. The throughput was 2,400 records per second.

4. The sequence generator turned out to be reusable and set to cache only one value. This causes a round trip from the Informatica PowerCenter server to the Informatica repository for each sequence number generated. The server could make only about 65 such round trips per second, which was the bottleneck in this workflow.
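The effect of the cache setting can be illustrated with a toy model (plain Python, not Informatica internals; the class and counts are illustrative only): with a cache of one, every row pays a repository round trip for a fresh value, while a larger cache amortizes one round trip over a whole block of values.

```python
class RepositorySequence:
    """Toy model of a repository-backed sequence generator.

    Each time the cached block of values is exhausted, the generator
    must make one round trip to the repository to reserve the next
    block. We count those round trips instead of timing them.
    """

    def __init__(self, cache_size):
        self.cache_size = cache_size
        self.next_value = 0   # next value to hand out
        self.block_end = 0    # end of the currently reserved block
        self.round_trips = 0  # simulated repository round trips

    def next(self):
        if self.next_value >= self.block_end:
            # Cached block exhausted: reserve a new block from the repository.
            self.round_trips += 1
            self.block_end = self.next_value + self.cache_size
        self.next_value += 1
        return self.next_value


def round_trips_for(rows, cache_size):
    seq = RepositorySequence(cache_size)
    for _ in range(rows):
        seq.next()
    return seq.round_trips


rows = 100_000
print(round_trips_for(rows, 1))     # → 100000 (one round trip per row)
print(round_trips_for(rows, 1000))  # → 100   (one round trip per 1,000 rows)
```

If each round trip costs on the order of 10 to 15 milliseconds, a cache of one caps throughput at roughly 65 to 100 rows per second regardless of how fast the rest of the pipeline is, which matches the behavior observed above.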

After setting an appropriate cache size on the sequence generator, we finally achieved a throughput of 2,400 records per second and completed the load in less than 2.5 hours. That is a throughput improvement of about 37 times!

Spending a few hours identifying the bottleneck and removing it was well worth the effort.

Thursday, April 12, 2012

Data Replication

The need for data replication is evergreen. The use case may vary from high availability to operational reporting (BI) to real-time analytics; whatever the case, the need to replicate data exists across information systems and solutions.

In this post, I try to summarize the evolution of data replication technology from an Oracle Database standpoint.

  1. Looking back at Oracle-based data replication technologies, the first is the "Database Link". Starting with Oracle 5 or 6, one can create a database link and pull or push data directly into a remote database. This is the earliest method of replication: the application has to push or pull the data from the remote database and apply the necessary logic to identify what data has changed and what to do with those changes.
  2. The next improvement came around Oracle 8 with trigger-based replication. Whenever a transaction changes data in a table, a trigger can invoke a function that handles the replication without changing the application. The database thus began providing a way to replicate data using triggers.
  3. The next improvement arrived around 9.2 with Streams and log-based replication: the ability to mine the redo logs and move committed transactions to the target systems (the DBMS_CDC_PUBLISH and DBMS_CDC_SUBSCRIBE packages were introduced).
  4. Oracle Advanced Queuing enhanced Streams with a robust publish-and-subscribe replication model built on enqueue and dequeue communication. I was involved in a very large project that set up a custom changed data capture to migrate data from a legacy system into SAP ERP, involving large tables with over 100 million records.
  5. Later, the log-mining technology was used for physical and logical standby databases and evolved into the Active Data Guard technology.
  6. With GoldenGate, Oracle's heterogeneous log-based data replication solution is complete, with capabilities to extract, replicate, and synchronize data bi-directionally.
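The enqueue/dequeue pattern behind the publish-and-subscribe model in step 4 can be sketched in a few lines. This is a minimal in-memory illustration of the concept only; the class and method names are invented for this sketch and are not Oracle Advanced Queuing APIs.

```python
from collections import deque


class ChangeQueue:
    """Conceptual sketch of publish/subscribe change replication:
    a capture process enqueues each committed change once, and every
    subscriber dequeues and applies changes at its own pace."""

    def __init__(self):
        self.subscribers = {}  # subscriber name -> its pending-change queue

    def subscribe(self, name):
        self.subscribers[name] = deque()

    def enqueue(self, change):
        # Publish side: fan the change out to every subscriber's queue.
        for queue in self.subscribers.values():
            queue.append(change)

    def dequeue(self, name):
        # Subscribe side: consume the oldest pending change, if any.
        queue = self.subscribers[name]
        return queue.popleft() if queue else None


cq = ChangeQueue()
cq.subscribe("reporting_db")
cq.enqueue({"op": "INSERT", "table": "ORDERS", "id": 42})
print(cq.dequeue("reporting_db"))  # → {'op': 'INSERT', 'table': 'ORDERS', 'id': 42}
print(cq.dequeue("reporting_db"))  # → None (queue drained)
```

The key property, shared by Streams, AQ, and GoldenGate alike, is that capture and apply are decoupled: a slow or offline subscriber does not block the source transaction.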

Depending on the need, one should choose the right technology to achieve the required data replication.