Recently, I came across a few master data management implementations and the database performance problems related to them.
What is master data management (MDM):
It is a set of processes to centralize the management of “master data,” i.e., the reference data for a given business. For an insurance company, this typically consists of Customers, Agents, and Products. An MDM instance usually focuses on one data domain: if the focus is the Customer domain, the implementation is called Customer Data Integration; when the domain is Products, it is called Product Information Management.
When it comes to the insurance industry, even though “customer centricity” has been the latest trend, it has traditionally been an “agent”-driven business. So one key subject area for MDM is “Agent” data, since agents are the ones who actually sell “Products” to “Customers.” To generalize, we can call this “Channel Master Data.” So we have three subject areas to deal with in a multi-domain MDM implementation.
The processes include:
1. Data acquisition (Input)
2. Data standardization, duplicate removal, handling missing attributes (Quality; a sketch of the duplicate-removal step follows this list)
3. Data mining, analytics and reporting (Output)
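To make the duplicate-removal step concrete, here is a minimal Oracle SQL sketch. The staging table CUSTOMER_STG and its columns are hypothetical names invented for illustration, not part of any particular MDM product:

-- Flag likely duplicate customer records by matching on a
-- standardized (upper-cased, trimmed) name plus birth date,
-- keeping the earliest-loaded record as the survivor.
-- Table and column names are illustrative only.
SELECT stg.*,
       ROW_NUMBER() OVER (
         PARTITION BY UPPER(TRIM(stg.name)), stg.birth_date
         ORDER BY stg.created_ts
       ) AS dup_rank
FROM   customer_stg stg;
-- Rows with dup_rank > 1 are candidate duplicates to merge or reject.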
The industry tools ship standard data and process models for customers and products to achieve the required MDM functionality. We can adapt a customer model to model the agent domain, or keep both in a “party” domain.
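As an illustration of what a “party” domain can look like, here is a minimal sketch of two such tables; the names are hypothetical and not taken from any vendor's model:

-- A generic PARTY table holds customers and agents alike;
-- the role table distinguishes how a party participates.
CREATE TABLE party (
  party_id    NUMBER PRIMARY KEY,
  party_name  VARCHAR2(200) NOT NULL,
  birth_date  DATE
);

CREATE TABLE party_role (
  party_id    NUMBER REFERENCES party (party_id),
  role_code   VARCHAR2(20) NOT NULL,  -- e.g. 'CUSTOMER' or 'AGENT'
  valid_from  DATE DEFAULT SYSDATE,
  PRIMARY KEY (party_id, role_code)
);

Keeping agents and customers in one PARTY table with roles avoids duplicating the person model across domains.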
The problem with multi-domain MDM:
Here comes the problem. When it comes to the agent domain, the customer requirement states “get me the agent and his top n customers acquired” or “get me the agent and his top n products sold.”
In a renowned MDM product, an SQL query needs to join 20+ tables (involving some outer joins) to satisfy this requirement, and even after caching all the tables in memory, the processing for each agent id takes 20+ seconds on an Oracle 11.2 database on decent production-sized hardware.
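For context, the requirement is essentially a top-n-per-group query. A minimal sketch of its shape, written against a hypothetical, already-flattened AGENT_CUSTOMER table rather than the vendor's 20+ table model, would be:

-- Top n customers acquired per agent, using an analytic function.
-- Table and column names are assumptions for illustration.
SELECT agent_id, customer_id, acquired_date
FROM (
  SELECT ac.agent_id,
         ac.customer_id,
         ac.acquired_date,
         ROW_NUMBER() OVER (
           PARTITION BY ac.agent_id
           ORDER BY ac.acquired_date DESC
         ) AS rn
  FROM   agent_customer ac
)
WHERE  rn <= :n;

The pain in the real product comes from the fact that the equivalent of AGENT_CUSTOMER has to be assembled on the fly from those 20+ joined tables.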
a) Dynamically querying for the data and assembling it into the target structure adds load and latency to the screen, so the SLA is not met.
b) Pre-populating the data in a materialized view with a periodic refresh does not keep the data current in real time.
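As a sketch of option (b), assuming the same hypothetical AGENT_CUSTOMER table, a periodically refreshed Oracle materialized view might look like this; between refreshes it serves stale data, which is exactly the currency problem:

-- Pre-populate the agent-to-customer data, refreshed hourly.
-- Between refreshes the view is stale, so it cannot meet a
-- real-time currency requirement. Names are illustrative only.
CREATE MATERIALIZED VIEW agent_customer_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE
  START WITH SYSDATE NEXT SYSDATE + 1/24
AS
SELECT agent_id, customer_id, acquired_date
FROM   agent_customer;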
As a traditional design cannot satisfy both requirements, returning the data within 2 seconds and keeping it current in real time, we need to design a hybrid approach.
Any ideas or experience in this area are welcome. Has anyone implemented an MDM solution on a NoSQL database yet?