Blog Moved

Future posts related to technology are directly published to LinkedIn

Friday, February 22, 2013

Data agility in Master Data Management

Recently, I came across a few master data management implementations and the database performance problems related to them.
What is master data management (MDM)? It is a set of processes that centralizes the management of “master data”, i.e., the reference data for a given business. For an insurance company this typically consists of Customers, Agents, and Products. An MDM instance typically focuses on one data domain: when the domain is Customers the implementation is called Customer Data Integration, and when the domain is Products it is called Product Information Management.
In the insurance industry, even though “customer centricity” is the latest trend, the business has traditionally been “agent”-driven. So one key subject area for MDM is “Agent Data”: the agents who actually sell “Products” to “Customers”. To generalize, we can call this “Channel Master Data.” We therefore have three subject areas to deal with in a multi-domain MDM implementation.
The processes include:
1. Data acquisition (Input)
2. Data standardization, duplicate removal, handling missing attributes (Quality)
3. Data mining, analytics and reporting (Output)
The industry tools have standard data and process models for customers and products to achieve the required MDM functionality. We can adopt a customer model to represent the agent domain, or keep both customers and agents in a single “party” domain.
The problem with multi-domain MDM:
Here comes the problem. For the agent domain, the customer requirement states “get me the agent and his top n customers acquired” or “get me the agent and his top n products sold.”
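To make the requirement concrete, here is a minimal Python sketch of "top n customers per agent" over a few in-memory records. The record layout, the sample values, and the choice to rank customers by total premium are all assumptions for illustration; this is not the query the MDM product runs.

```python
import heapq
from collections import defaultdict

# Hypothetical flattened records: (agent_id, customer_id, premium).
sales = [
    ("A1", "C1", 500.0),
    ("A1", "C2", 1200.0),
    ("A1", "C3", 800.0),
    ("A2", "C4", 300.0),
]

def top_n_customers(records, n):
    """Return, per agent, the n customers with the highest total premium."""
    totals = defaultdict(lambda: defaultdict(float))
    for agent_id, customer_id, premium in records:
        totals[agent_id][customer_id] += premium
    return {
        agent: heapq.nlargest(n, by_customer.items(), key=lambda kv: kv[1])
        for agent, by_customer in totals.items()
    }

top2 = top_n_customers(sales, 2)
```

On flat data like this the ranking is trivial; the cost described in this post comes from assembling such a flat view out of a heavily normalized MDM model.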
In a renowned MDM product, an SQL query needs to join 20+ tables (including some outer joins) to meet this requirement. Even after caching all the tables in memory, processing each agent ID takes 20+ seconds on an Oracle 11.2 database running on decent production-sized hardware. Neither of the obvious options works:
a) Querying the data dynamically and reshaping it into the target structure adds load and latency to the screen, so the SLA is not met.
b) Pre-populating the data in a materialized view with a periodic refresh means the data is not current in real time.
Since a traditional design cannot both return the data within 2 seconds and keep it current, we need to design a hybrid approach.
Any ideas or experience in this area are welcome. Has anyone implemented an MDM solution on a NoSQL database yet?

Friday, February 15, 2013

Measure, Metric and Key Performance Indicator (KPI)

In a BI project it is very important to design the system so that it confidently visualizes the indicators that matter to the business. Even senior BI professionals sometimes confuse terms such as measure, metric and KPI.
This post is to give some simple definitions for the hierarchy of KPIs:
Measure: something that is measured, such as the value of an order, the number of items on the order, lines of code, or the number of defects identified during unit testing. Measures are generally captured during the regular operations of the business. They sit at the lowest granularity of the fact table in the star schema.
Metric: generally derived from one or more measures. For example, defect density is the number of defects per 1,000 lines of code in a COBOL program; it is derived from the lines-of-code and number-of-defects measures. A metric can also be the max, min, or average of a single measure. In general, metrics are computed or derived from the underlying facts / measures.
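As a small illustration of a metric derived from measures, the Python sketch below computes defect density from two measures and also derives max and average from a single measure. The program names and numbers are made up for the example.

```python
from statistics import mean

# Hypothetical measures captured per program during unit testing.
measures = [
    {"program": "PAYROLL", "lines_of_code": 4000, "defects": 12},
    {"program": "BILLING", "lines_of_code": 2500, "defects": 5},
]

def defect_density(defects, lines_of_code):
    """Metric derived from two measures: defects per 1,000 lines of code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return defects * 1000 / lines_of_code

# Metric derived from two measures.
densities = {
    m["program"]: defect_density(m["defects"], m["lines_of_code"])
    for m in measures
}

# Metrics derived from a single measure.
max_defects = max(m["defects"] for m in measures)
avg_defects = mean(m["defects"] for m in measures)
```

Here PAYROLL has a density of 3.0 and BILLING 2.0 defects per 1,000 lines, even though PAYROLL has more than twice the raw defect count.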
Performance Indicator: brings the business context into the metric. For example, the reduction in defect density after introducing a more rigorous review process (review effectiveness) could be a performance indicator in a software development context. In this case it reduces the rework effort of fixing unit-test bugs and re-testing, thereby improving the performance of a software development organization.
Carefully selecting KPIs and presenting them at a suitable granularity to the right level of management users within the business makes a BI project successful.
Bringing the data / measures from multiple sources into a star schema is typically the ETL cycle of the warehouse. Once the data is refreshed, incrementally summarizing it and computing all the metrics is the next stage. Finally, visualizing the KPIs is the art of dashboarding.
I have seen several BI projects run into trouble by mixing up these three steps: trying to clean data during loading, trying to summarize the data into metrics during ETL, or trying to summarize during the reporting phase, all of which cause performance problems. A good design should keep the stages independent of each other and handle issues such as missing refreshes and duplicate refreshes from the source systems feeding the warehouse. It should also consider parallelizing tasks to take advantage of multi-core processors and large clusters of computing resources.
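One possible way to guard the load stage against duplicate and missing refreshes is sketched below in Python. It assumes each source system tags its feed with a sequential batch number; that convention, the `RefreshLog` class, and the source name are hypothetical and not taken from any particular ETL tool.

```python
class RefreshLog:
    """Tracks which batch numbers have been loaded for each source system."""

    def __init__(self):
        # Maps source name -> set of batch numbers already applied.
        self._applied = {}

    def should_apply(self, source, batch_no):
        """Return False for a batch already loaded (a duplicate refresh)."""
        seen = self._applied.setdefault(source, set())
        if batch_no in seen:
            return False
        seen.add(batch_no)
        return True

    def missing_batches(self, source):
        """List gaps between the lowest and highest batch seen so far."""
        seen = self._applied.get(source, set())
        if not seen:
            return []
        return [n for n in range(min(seen), max(seen) + 1) if n not in seen]

log = RefreshLog()
# Batch 2 arrives twice and batch 3 never arrives.
decisions = [log.should_apply("policy_admin", n) for n in (1, 2, 2, 4)]
gaps = log.missing_batches("policy_admin")
```

In this run the second arrival of batch 2 is rejected and batch 3 is reported as a gap, so the summarization stage can decide whether to wait or proceed without touching the load logic.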

Friday, February 8, 2013

Cloud Computing workshop slides

Cloud Computing - Foundations, Perspectives and Challenges workshop slides.

Presented at the BITES state-level faculty development program in Mangalore today, to around 60 faculty members from regional engineering colleges.

Friday, February 1, 2013

Few thoughts on Data Preparation for #Analytics

It is Friday and it is time for a blog post.
A typical analysis project spends 70 to 80% of its time preparing the data. Achieving the right data quality and the right data format is a primary success factor for an analytics project.
What makes this task very knowledge intensive and why is a multifaceted skill required to carry out this task?
I will give a quick, simple example of how “functional knowledge”, in addition to technical knowledge, matters in preparing the data. There is a functional distinction between missing data and non-existing data.
For example, consider a customer data set. If the customer is married and the age of the spouse is not available, this is missing data. If the customer is single, the age of the spouse is non-existing. In the data mart these two scenarios need to be represented differently so that the analytic model behaves properly.
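A minimal Python sketch of one way to keep the two cases apart is shown below: an explicit status is stored next to the value instead of relying on a bare null. The `Customer` fields and the `ValueStatus` names are assumptions made for the example, not a prescribed data mart design.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ValueStatus(Enum):
    KNOWN = "known"
    MISSING = "missing"                # value should exist but was not captured
    NOT_APPLICABLE = "not_applicable"  # value cannot exist for this record

@dataclass
class Customer:
    customer_id: int
    marital_status: str
    spouse_age: Optional[int] = None

def spouse_age_status(customer):
    """Classify spouse_age using the functional rule from the example."""
    if customer.marital_status == "single":
        return ValueStatus.NOT_APPLICABLE
    if customer.spouse_age is None:
        return ValueStatus.MISSING
    return ValueStatus.KNOWN

married_unknown = Customer(1, "married")
single = Customer(2, "single")
married_known = Customer(3, "married", spouse_age=41)
```

With the status carried as its own attribute, an imputation step can fill in only the `MISSING` values and leave the `NOT_APPLICABLE` ones alone.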
How missing data is dealt with while preparing the data set (data imputation techniques) affects the results of the analytical models.
Dr. Gerhard Svolba of SAS has written extensively on Data Preparation as well as Data Quality (for Analytics) and this presentation gives more details on the subject.
I have made a blog post earlier dealing with these challenges in the “Big data” world -