Blog Moved

Future posts related to technology are published directly on LinkedIn

Tuesday, November 19, 2013

Context analytics for better decisions - Analytics 3.0

In today's #BigData world, #analytics has taken on complexity beyond pure statistics: pattern recognition using clustering and segmentation, and predictive analytics using methods such as logistic regression.

One of the great challenges for big data's unstructured analytics is 'context'. In traditional data processing, we removed the context and recorded only the content. All that we do with sentiment analysis is based on extracting words, phrases and entities, combining them into 'concepts', scoring them by matching known patterns of pre-discovered knowledge, and assigning a sentiment to the content.

The success rate of this method is fairly low. (This is my own personal observation!) One way to improve the quality of this data is to add the context back to the content. The technology enabler for doing that is, again, a 'Big Data' solution. That is, we start with a big data problem and find the solution in the big data space. Interesting, isn't it?

Take the content data at rest, analyze it, and enrich it with context information such as spatial and temporal attributes, then derive knowledge from it. Visualize the data by putting similar concepts together and by merging identical concepts into a single entity.

Big Blue (IBM) is doing this, having realized the fact. A few months back they published a 'red paper' that can be found here.

Finally, putting the discovered learning into action in real time delivers the needed business impact and takes us into the world of Analytics 3.0.

Exciting world of opportunities....

Thursday, November 14, 2013

Analytical Processing of Data

A short presentation on Analytical Processing of Data; very high level overview.....

Friday, October 11, 2013

Transfer orbits, low-energy transfers and 20 years of career!

This day 20 years back (11th October 1993), a young graduate with a bag full of science books and a few pairs of clothes landed here in Bangalore to pursue a career. Born in Andhra Pradesh and educated in Tamil Nadu, I came to the third south Indian state, Karnataka, to join the Indian Space Research Organization as 'Scientific Assistant – B'.

It was a long selection process to get the job: a written test, an interview with a big panel of ISRO and IISc scientists, and a police verification to join the central government of India. With the Mangalyaan launch planned for 28th October, I would like to give some of the science behind travelling beyond Earth's orbit.

If we want to go to the Moon or Mars, we can't simply aim a rocket at the target and fire it. The distance between Earth and the Moon is about 400,000 kilometers, and to Mars it can be as much as 400 million km. So we need a slightly more intelligent way of getting there. One way is the Hohmann transfer orbit: in simple terms, the spacecraft is placed in a highly elliptical orbit around Earth and, using a delta-v burn at the right point and at a suitable time, is transferred into another elliptical orbit that reaches the target orbit.
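As a rough illustration, the total delta-v of an idealized Hohmann transfer between two circular, coplanar orbits follows from the vis-viva equation. The sketch below uses approximate Sun-centered orbital radii for Earth and Mars; a real mission design is far more involved.

```python
import math

MU_SUN = 1.32712440018e20  # Sun's gravitational parameter, m^3/s^2
R1 = 1.496e11              # Earth orbit radius, m (~1 AU)
R2 = 2.279e11              # Mars orbit radius, m (~1.52 AU)

def hohmann_delta_v(mu, r1, r2):
    a = (r1 + r2) / 2                          # semi-major axis of transfer ellipse
    v1 = math.sqrt(mu / r1)                    # circular speed at r1
    v2 = math.sqrt(mu / r2)                    # circular speed at r2
    v_peri = math.sqrt(mu * (2 / r1 - 1 / a))  # vis-viva speed at perihelion
    v_apo = math.sqrt(mu * (2 / r2 - 1 / a))   # vis-viva speed at aphelion
    return (v_peri - v1) + (v2 - v_apo)        # two burns: departure + arrival

dv = hohmann_delta_v(MU_SUN, R1, R2)
print(f"Total delta-v: {dv / 1000:.2f} km/s")  # roughly 5.6 km/s
```

The two terms are the departure burn (raising perihelion speed above Earth's circular speed) and the arrival burn (circularizing at Mars's orbital radius).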

There is another, low-energy transfer using Lagrange points, which typically takes a longer transit time. The Interplanetary Transport Network (ITN) is a set of such pathways for deep space missions that need very little fuel for their thrusters.

Mangalyaan is taking the first route, a Hohmann transfer orbit between Earth and Mars, starting its journey in this opportunity window to reach Mars by November 2014. I wish all the best to my first employer in this mission, which is going to prove the technology and ISRO's ability to apply the science to take a satellite into orbit around Mars. There are challenges in handling the launch, the orbit maneuvers, and the deep space communication network for both payload data and spacecraft control. (At maximum separation, Earth and Mars are about 20 light-minutes apart, so two-way communication takes about 40 minutes, making communications complex to manage!) It is only unfortunate to see critics criticizing the cost of this mission, around 450 crore Indian Rupees, whereas ONE fodder scam alone cost the nation 950 crore; not to mention the other scams of India's recent past.

How is this related to my career? I too ended up using high-energy transfer orbits and low-energy transfers during this journey, across different companies and in quantum leaps between roles: from working on orbits to orbitals, from providing management solutions for energy grids to computing grids, from optimizing satellite operations to smart metering operations, handling data movements in and out of commercial ERP systems and geospatial databases, and from forecasting the orbits of satellites to deriving insights from big data using analytics, all in these 20 years.

Apart from the science, math and technology, these 20 years took me around this little globe: from India to Singapore, to various European countries (Belgium, Germany, France, the Netherlands, Luxembourg), to the UK, then to the USA, Japan and Korea. They gave me the opportunity to work with large enterprises from Australia, China, South Africa and beyond, and to meet exceptional personalities from varied cultures and walks of life, interacting with and learning about the most colorful part of God's creation.

A majority of the orbiting has been with TCS (my second employer, for 10 years between 1996 and 2006), followed by Oracle Corporation and IBM India Pvt. Ltd., before coming back to TCS in 2011.

I take this opportunity to thank everyone who helped me through this journey, those who challenged me, and those who remained neutral towards me; each of those gestures gave me immense experience and made the journey colorful and interesting.

With entry and re-entry tested, I hope to go along on this orbit for a few more years, continuing my learning until the mission retires….

Thursday, September 12, 2013

Bayesian Belief Networks for inference and learning

I attended a day-long seminar at IIM Bangalore on 10th September on the subject of Bayesian Belief Networks. Dr. Lionel Jouffe presented four case studies during the one-day technical session.

The introduction to Bayesian Belief Networks (BBNs) covered building the network in multiple modes: (a) mined from past data, (b) built purely from captured expert knowledge, or (c) a combination of both. Once the conditional probabilities for each node exist and the associations between the nodes are built, both supervised and unsupervised learning can be used.
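A hand-crafted BBN of this kind can be sketched in a few lines of Python. The network, its probability tables and the enumeration-based inference below are illustrative (the classic rain/sprinkler/wet-grass example), not material from the seminar:

```python
from itertools import product

# Expert-knowledge BBN: Rain -> Sprinkler, and WetGrass <- (Sprinkler, Rain)
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: {True: 0.01, False: 0.99},   # P(S | Rain=True)
               False: {True: 0.4, False: 0.6}}    # P(S | Rain=False)
p_wet = {(True, True): 0.99, (True, False): 0.8,  # P(W=True | S, Rain)
         (False, True): 0.9, (False, False): 0.0}

def joint(r, s, w):
    # Chain rule over the network structure
    pw = p_wet[(s, r)]
    return p_rain[r] * p_sprinkler[r][s] * (pw if w else 1 - pw)

# Inference by enumeration: P(Rain=True | WetGrass=True)
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
print(f"P(Rain | WetGrass) = {num / den:.3f}")  # 0.413
```

The same network could instead have its conditional probability tables estimated from data, which is the "mining" mode described above.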

The first case study involved knowledge discovery in the stock market: publicly available stock market data was loaded, a BBN was built, and an automatic clustering algorithm was run on the 'discretized' continuous variables to find similar tickers.

The second case study was on segmentation using BBNs. The input contained the market shares of 58 stores selling juices: 11 brands across three groups (local, national and premium) sold in one US state. Using this data, a BBN was built and automatic segmentation into 5 segments was performed, with a good statistical description of each segment.

The third case study involved a marketing-mix analysis to describe and predict the effect of multiple campaign channels (TV, radio, online) on product sales.

The fourth case study covered a vehicle safety scenario, using publicly available accident data to discover the two key factors, based on vehicle and driver parameters, that can reduce the fatality of an injury.

The conclusion: any analytic problem can be converted into a BBN and solved. I see a few advantages of this approach:
1. A BBN can be built in a no-data scenario, completely hand-crafted from expert knowledge. It can also be built in a big data scenario, deriving the conditional probabilities by mining the data.
2. One strong theoretical framework solves all these problems, making the technique easier to learn. There is no need to learn multiple theories.
As a technique, it has some promising features. The whitepapers presented are useful in understanding the technique in different scenarios. Views? Comments?

Friday, August 30, 2013

F1 Database from Google: A scalable distributed SQL database


This world is a sphere; we keep going round and round. After the great hype around NoSQL highly distributed databases, Google has now presented a paper at the 39th VLDB conference on how they implemented a SQL-based, highly scalable database to support their "AdWords" business.


The key changes I liked:
1. Hierarchically clustered physical schema model: I always thought a hierarchical model is better suited to real-life applications than a pure relational model. This implementation proves it.

2. Protocol Buffers: columns allow structured types. This saves a lot of ORM-style conversions when moving data from storage to memory and vice versa.

A quote from the paper's conclusion:
In recent years, conventional wisdom in the engineering community has been that if you need a highly scalable, high-throughput data store, the only viable option is to use a NoSQL key/value store, and to work around the lack of ACID transactional guarantees and the lack of conveniences like secondary indexes, SQL, and so on. When we sought a replacement for Google's MySQL data store for the AdWords product, that option was simply not feasible: the complexity of dealing with a non-ACID data store in every part of our business logic would be too great, and there was simply no way our business could function without SQL queries.
So, ACID is needed and SQL is essential for running businesses! Have a nice weekend!!

Friday, August 23, 2013

Anticipatory Computing and Functional Programming – some rambling…


After an early morning discussion on Anticipatory Computing on TCS's enterprise social network, Knome, I thought of making this blog post linking the "functional orientation" of complex systems with consciousness.

In the computing world, it is a widely accepted fact that data can exist without any prescribed associated process. Once data is stored on a medium (generally called memory), it can be put through any abstract process to derive conclusions. (This trend is generally called big-data analytics, leading to predictive and prescriptive analytics.)


If I say that a function can exist without any prescribed data, with multiple outcomes, it is not as easily accepted. The only thing people can imagine here is a completely chaotic random number generator. But a completely data-independent, pure function that returns a function based on its own "anticipation" is what I call consciousness.

This is one of my areas of interest in computability and information theory. A complex system's behavior is not driven entirely by the data presented to it. Trying to model a complex system purely from the past data it has emitted is not going to work; one should also consider the anticipatory bias of the system while modeling.

Functional programming comes a step closer to this paradigm: it tries to define functions without intermediate state-preserving variables. In mathematical terms, a function maps elements of its domain to its range. Abstracting this into an anticipation model, we get consciousness (or free will) as a function with three possible return functions:

1. Will do
2. Will NOT do
3. Will do differently
(I have derived this based on an ancient Sanskrit statement regarding free will – kartum, akartum, anyathA vA kartum saktaH)

The third option (it is beyond binary, 0 or 1) leads to recursion of this alternative-evaluating function, and again at (t+Δt) the system has all three options. When the anticipatory system responds, "data" starts emitting from it. The environment in which this micro anticipatory system operates is itself a macro anticipatory system.

The ongoing hype around big data is about establishing the patterns in the data emitted by various micro-systems and estimating the function of the macro free will. It is easier for a micro free will to dynamically model this function, which we call "intuition", and that lies beyond the limits of computability.

Enough of techno-philosophical rambling for this Friday! Have a nice weekend.

Thursday, August 8, 2013

Science, Research, Consulting and Philosophy

It was this day 25 years back (08-08-1988) that I joined the Bachelor of Science course in Computer Science. The aim at that time was to become a scientist. As the years passed, I completed my Masters and joined the Indian Space Research Organization.

Due to various reasons, I could neither register for a PhD nor continue my research career. Instead, I started doing software consulting, joining TCS, the largest software services company in India. That took me through various business domains, starting with banking and moving into utilities (gas transportation), retail, financial services and insurance. Working in roles such as developer, tester, modeller, designer, architect, pre-sales solution support and offshore delivery manager gave me experience worth a PhD.

Later there was a period of working with Oracle in the core Server Technologies division, where we worked closely with select elite customers of the Enterprise Manager product who were monitoring and managing large data centers.

A still later period turned out to be about philosophy: the philosophy of data, information and knowledge, trying to optimize end-to-end information flows using the right strategies for the information life cycle. Efficient data capture from individual transactions; supporting operational requirements with the needed latency; making data available in the right format for humans and other computing systems; transforming and moving it around efficiently to derive much-needed long-term strategic decisions.

Most of my career till date has moved through the highs and lows of information technology hype cycles, peaks, waves and magic quadrants.....



Friday, August 2, 2013

Crisscrossing thoughts around #Cloud and #BigData

While "Big Data Analytics" runs on cloud-based infrastructure with thousands of (virtual) servers, cloud infrastructure management has itself become a big data problem!

Given that all key availability and performance metrics need to be collected and processed regularly, both to keep the cloud infrastructure running within the agreed performance service levels and to identify demand trends for the cloud services, there is an absolute need for predictive analytics on the collected metrics data.

As data centers gradually turn into private clouds with heavy virtualization, it becomes increasingly important to manage the underlying grid of resources efficiently by allocating the best possible resources to high-priority jobs. An integrated infrastructure monitoring and analytics framework, running on the grid itself and dynamically optimizing resource allocation to fit workload characteristics, could make the data center more efficient and green.

Taking the same approach to business services across organizational boundaries, there could be an automated marketplace where public cloud providers trade their available computing resources; consumers "buy" the computing resources they need in this market and get their processing executed, probably by combining multiple providers' resources into an extended hybrid cloud in a highly dynamic configuration.

The data and processing would have to be encapsulated as micro- or nano-scale objects, taking computing out of the current storage-processor architecture into a more connected, neuron-like architecture with billions of nodes: a really BIG big data.


If all the computing needed on this tiny globe can be unified into a single harmonic process, the amount of data that needs moving comes to a minimum and a “single cloud” serves the purpose.

Conclusion: cloud management using big data and big data running on cloud infrastructure complement each other to improve the future of computing!

Question: If I have a $1 today, where should I invest for better future? In big data? Or in Cloud startup??

Have a fabulous Friday!

Friday, July 12, 2013

Models of Innovation diffusion in social networks


Having covered trust modeling and centrality in social networks, this post is the third and last of the series on social network analysis.

Innovation diffusion, influence propagation, or 'viral marketing' is one of the most researched subjects of the contemporary era.

Some theory:

Compartmental models for studying the spread of epidemics, with susceptible (S), infected (I) and recovered (R) states ('SIR'), are used to study influence propagation in electronic social networks as well. Initially these were descriptive models: to describe the behavior of nodes exposed to a new innovation or piece of information, each node is given an initial probability of adopting it. As each node adopts the innovation, it exerts a specific amount of influence on the nodes connected to it.

Primarily two basic models are used to study spread in a social network. An initial set of 'active' nodes at time t0 exerts influence on the connected nodes, and at t1 some of those nodes become 'active' in turn. In the 'Linear Threshold' model, each node i has a threshold θi and becomes active when the weighted influence from its active neighbors exceeds that threshold; at each step, all nodes active up to the previous step remain active and keep influencing their neighbors with their weights. In the 'Independent Cascade' model, each newly activated node gets only one chance to activate each of its neighbors, succeeding with a probability p(i).
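As a sketch, the Independent Cascade model can be simulated in a few lines of Python; the toy network, activation probability and seed set below are hypothetical:

```python
import random

def independent_cascade(graph, seeds, p=0.3, rng=random.Random(42)):
    # Each newly activated node gets exactly one chance to activate
    # each still-inactive neighbor, succeeding with probability p.
    active = set(seeds)
    frontier = list(seeds)          # nodes activated in the last step
    while frontier:
        new = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)   # v is active; it gets its own chance next step
                    new.append(v)
        frontier = new
    return active

# Toy network as an adjacency list (hypothetical data)
g = {0: [1, 2], 1: [2, 3], 2: [3, 4], 3: [5], 4: [5], 5: []}
spread = independent_cascade(g, seeds={0})
print(f"Activated nodes: {sorted(spread)}")
```

Running the simulation many times over candidate seed sets and averaging the spread is the usual Monte Carlo approach to the influence maximization problem discussed next.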

Based on the above two diffusion models, the maximization problem is to determine the best initial set of 'active' nodes in a network so as to maximize the spread of influence for a 'viral marketing' campaign.

This is an NP-hard problem, and this paper discusses some interesting approximation algorithms under general cascade, threshold and triggering models.

Have a good weekend reading!

Friday, June 21, 2013

On Centrality and Power in social networks

After last week's post on 'Trust', let us quickly review another important measure of (social) network structure.

Centrality is a structural measure of a network that indicates the relative importance of a node in the graph.
The simplest way of measuring centrality is to count the number of connections a node has. This is called 'degree centrality'.

Another way of measuring centrality is to see how far a node is from all other nodes of the graph. This measure is called 'closeness centrality', as it is based on the path lengths between pairs of nodes.

'Betweenness centrality' measures the number of times a node acts as a bridge on the shortest path between two other nodes. That indicates how important each node is in connecting the whole network.

To complicate matters further, we have a measure called 'eigenvector centrality', which considers the influence of a node in the network by taking into account the power of the nodes it is connected to. To explain it simply: my being connected to 500 people on LinkedIn is different from Barack Obama being connected to 500 of his friends, because his 500 connections are (probably) more influential than my 500 connections. Google's PageRank is a variant of eigenvector centrality.
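A minimal power-iteration sketch in Python illustrates the idea; the toy graph is hypothetical, and a real analysis would use a library such as networkx:

```python
def eigenvector_centrality(adj, iters=200):
    # Power iteration: repeatedly apply the adjacency matrix to a
    # uniform starting vector and re-normalize; the vector converges
    # to the dominant eigenvector for a connected non-bipartite graph.
    nodes = list(adj)
    x = {n: 1.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: sum(x[m] for m in adj[n]) for n in nodes}  # x <- A x
        norm = max(nxt.values()) or 1.0
        x = {n: v / norm for n, v in nxt.items()}
    return x

# Toy undirected graph: 'c' is connected to the best-connected nodes
adj = {'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a', 'b', 'd'], 'd': ['c']}
scores = eigenvector_centrality(adj)
best = max(scores, key=scores.get)
print(best)  # 'c' scores highest
```

Note that 'c' wins not just by its degree but because its neighbors are themselves well connected, which is exactly the point of the measure.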

When an external factor α is considered for each node and included in the eigenvector centrality computation, the measure is called 'alpha centrality'.

When we extend the alpha centrality measure beyond one node to cover multiple radii (first degree, second degree and so on), with factors β(i), and measure centrality as a function of influence at varying degrees, it is called 'beta centrality'.

The key problem with centrality computation is the amount of computing power needed to arrive at the beta centrality measure of a social network with millions of nodes. I recently came across a paper which proposes an alternative, computationally efficient approximation algorithm that estimates a fairly accurate centrality measure. This alter-based, non-recursive method works well on non-bipartite networks and suits social networks well.

The title of this post mentions "power", yet the content has said nothing about it. Generally, centrality is considered an indicator of power or influence; but in some situations power is not directly proportional to centrality. Think about it.

Friday, June 14, 2013

Trust modeling in social media


After last week's "tie strength" post, this week let me give some fundamentals on the importance of modeling TRUST in social media.

What is trust?
It is difficult to define. But questions like "Would you loan a moderate amount to this person?" or "Would you seek a reference or recommendation from this person regarding a key decision?" help in understanding the term TRUST.

There are two components to TRUST. The first is how trusting a person is: some people establish trust quickly, whereas others take a long time. This component is not easy to model. The second component is the credibility of the person being trusted.

Measuring Trust:
In social media, the second component can be measured by analyzing the sentiment of blog posts referencing a person. This is called "network-based trust inference".

This paper describes a model for measuring trust using link polarity.

Have a good weekend reading!

Friday, June 7, 2013

"tie strength" in social media

What is "tie strength”?

When analyzing the social web, we see various edges (ties or relationships) connecting the nodes (individuals or organizations). Theoretically, the strength of an edge or relationship is categorized as strong or weak. In his 1973 paper titled "The Strength of Weak Ties", Mark Granovetter laid the foundations for the importance of tie strength at the micro and macro levels of sociology.

Predictive model
Recently I came across a predictive model developed using Facebook data which considers seven dimensions of "tie strength". They are: Intensity, Intimacy, Duration, Reciprocal Services, Structural, Emotional Support and Social Distance.

32 predictive variables from Facebook interactions were used, along with a survey deriving 5 dependent variables, to fit the predictive model.
The model uses a statistical linear method to predict the strength of a relationship on a continuous 0-1 scale.
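As an illustration only, a linear tie-strength predictor of this kind might look like the sketch below; the features, weights and maxima are invented for the example and are not the coefficients from the cited study:

```python
# Hypothetical interaction features and assumed weights (sum to 1)
FEATURES = ["wall_posts", "days_since_first_contact", "mutual_friends"]
WEIGHTS = {"wall_posts": 0.5,
           "days_since_first_contact": 0.2,
           "mutual_friends": 0.3}

def tie_strength(raw, maxima):
    # Normalize each feature to [0, 1], then take the weighted sum,
    # giving a continuous score on the 0-1 scale.
    return sum(WEIGHTS[f] * min(raw[f] / maxima[f], 1.0) for f in FEATURES)

maxima = {"wall_posts": 100, "days_since_first_contact": 3650,
          "mutual_friends": 200}
close_friend = {"wall_posts": 80, "days_since_first_contact": 3000,
                "mutual_friends": 150}
acquaintance = {"wall_posts": 2, "days_since_first_contact": 100,
                "mutual_friends": 5}
print(tie_strength(close_friend, maxima) > tie_strength(acquaintance, maxima))  # True
```

In the real model the weights would be fitted by regression against the survey-derived dependent variables rather than assumed.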


I like the methodology used and the practical approach towards predictive modelling. The stronger the tie, the better the influence....

Friday, May 24, 2013

Data Philosophers and data quality

After data scientists and data artists, there is another need: "data philosophers". A piece I read recently made me think about data philosophers.

So, data scientists focus on the underlying technology to gather, validate and process the 'big' data, and the artists use the processed 'big' data to paint and visualize the insights.

In this whole process, with its wide variety and velocity (two 'V's of big data!), are we missing the rigor of data quality?

Considering the 36 attributes of data quality in Kristo Ivanov's 1972 paper, and evaluating today's big data insights against them, I somehow feel there is a 'big' gap in the quality of 'big data'.

I see some parallels between big data processing and orbit determination. As long as the key laws governing planetary motion remained unknown, no amount of observational data could explain the 'retrograde motion' of the planets. In the same way, if we do not have a clear understanding of the underlying principles of our data streams, we will not be able to explain them. That is where we need the philosophers!

Now, I think I am becoming a “Data Philosopher” already!

Friday, May 17, 2013

Data Artist - A new professional skillset?

In the past few days, I have seen at least two blogs talking about the "Data Artist".


The trend seems to be going towards business-centric data visualization of so-called "big data".

A data artist is one who can use data as paint: creating art that represents massive flows of data and visualizes the patterns so that business users receive a lot of "information" in a single glance.

It is slightly different from the "Data Scientist" profession. Data scientists focus on the technical process of collecting, preparing and analyzing data for patterns, whereas data artists specialize in visualizing the discoveries in an artistic manner!

"Scientific Artists" and "Artistic Scientists" with Data! Are we complicating the matter too much??

Friday, April 5, 2013

Accelerating Analytics using “Blink” aka “BLU acceleration”

This Friday marks the completion of 2 years of my second innings with TCS's Technology Excellence Group, and it is time for a technical blog post.
During this week, IBM announced the new "BLU acceleration" enabled DB2 10.5, which claims a 10 to 20 times out-of-the-box performance improvement.
This post aims to give a brief summary of the Blink project, which brought this acceleration to analytic queries.
The Blink technology has primarily two components that achieve the claimed acceleration of analytic processing:
1. Compression at load time
2. Query processing
Compression & Storage:
At load time, each column is compressed using "frequency partitioning": an order-preserving, fixed-length dictionary encoding. Each partition of a column has a dictionary of its own, allowing shorter column codes. Because the encoding preserves order, comparison operators and predicates can be applied directly to the encoded values without uncompressing them.
Rows are packed using the bit-aligned columns into byte-aligned banks of 8, 16, 32 or 64 bits for efficient ALU operations. This bank-major storage is combined into blocks that are then loaded into memory (or storage). Bank-major storage exploits the SIMD (Single Instruction, Multiple Data) capability of IBM's modern POWER processor chips.
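The order-preserving encoding idea can be sketched in a few lines of Python; this toy ignores frequency partitioning and bit packing and just shows why predicates can run on codes directly:

```python
from bisect import bisect_left

# Order-preserving dictionary for one column partition: sorted distinct
# values get consecutive integer codes, so value order == code order.
def build_dictionary(values):
    return sorted(set(values))

def encode(dictionary, value):
    return bisect_left(dictionary, value)   # rank of value = its code

cities = ["Oslo", "Bangalore", "Paris", "Bangalore", "Austin"]
d = build_dictionary(cities)                # ['Austin', 'Bangalore', 'Oslo', 'Paris']
codes = [encode(d, c) for c in cities]

# Predicate "city < 'Paris'" evaluated directly on the codes,
# without decompressing any stored value:
threshold = encode(d, "Paris")
matches = [c for c, code in zip(cities, codes) if code < threshold]
print(matches)  # ['Oslo', 'Bangalore', 'Bangalore', 'Austin']
```

Because order is preserved, range scans compare small fixed-length integers instead of the original variable-length strings, which is what makes the SIMD scans below so effective.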
Query Processing:
In Blink there are no indexes, no materialized views, and no run-time query optimizer, so it is simple. But each query must be compiled to handle the different encoded column lengths of each horizontal partition of the data.
Each SQL query is split into a series of single-table queries (STQs) that perform scans with filtering. All joins are hash joins. These scans happen in an outside-in fashion on a typical snowflake schema, creating intermediate hybrid STQs.
Blink executes these STQs in multiple blocks across threads, each running on a processor core. As most modern ALUs can operate on 128-bit registers, all operations are bit operations exploiting SIMD, which makes the processing fast.
For more technical details of Blink project refer to -
Hope this brings "Analytics" a boost and some competition to Oracle's Exa- appliances. Views, comments?

Friday, March 29, 2013

Can we save capitalism from itself?

Thoughts from reading
The Trouble With Markets: Saving Capitalism from Itself, Second Edition
by Roger Bootle

This book has three sections, and in the first three chapters the Economist author describes how we ended up here.

Section 1: The great implosion:
The 1930s had seen the Great Depression and the 1970s the Great Inflation. The 1990s had seen the Great Moderation. This was the Great Implosion.
In the next 4 chapters he deals with the trouble with the markets.

Section 2: The trouble with the Markets
As Robert Heilbroner put it: “The profit motive, we are constantly being told, is as old as man himself. But it is not. The profit motive as we know it is only as old as modern man.”
OK, on to the next section of three chapters.

Section 3: From implosion to Recovery
Keynes was right in three major respects:

  • Economic activity is permeated by fundamental uncertainty.
  • As a result, many of the major factors that affect the economy are psychological and depend critically on the state of confidence, which is not readily analyzable or predictable.
  • Consequently, the modern economy is inherently unstable and fragile.
The conclusion starts with this quote:
All happy families are alike; each unhappy family is unhappy in its own way.
--Leo Tolstoy, 1873

Overall this book is a good read, but I am still unsure of one thing:
Is it really possible to save capitalism from itself?

Friday, March 22, 2013

De-normalizing with join materialized views fast refresh on commit

Two weeks back, I wrote a post on the result_cache feature of the Oracle 11g database, used to solve a specific performance scenario in an MDM implementation. Working on the same set of performance issues, we encountered another situation: a normalized structure that forces us to write queries with OUTER JOINs to achieve the required aggregation.

The structure contains a set of tables for PERSON and another set of tables representing ORGANIZATION, where a CUSTOMER can be either a PERSON or an ORGANIZATION.
The requirement is a consolidated view of all persons and organizations together, with certain attributes. We need a UNION ALL query joining a total of 8 tables, which results in something like 10 million records. We cannot result_cache a result of that size in memory.

Inevitably, we need a persistent version of the UNION ALL result in a materialized view. But the customer needs real-time data and can't afford any latency, so we need a view that gets updated whenever the underlying tables change. That is where "REFRESH FAST ON COMMIT" comes into the picture.
To enable fast refresh, a MATERIALIZED VIEW LOG has to be created on each of the underlying tables; we selected "WITH ROWID". All 8 underlying tables need their MV logs created before creating an MV along the following lines:

CREATE MATERIALIZED VIEW customer_rowid_mv  -- view name illustrative
REFRESH FAST ON COMMIT
AS
SELECT
  p.rowid   AS prowid,
  xp.rowid  AS xprowid,
  xpn.rowid AS xpnrowid,
  pn.rowid  AS pnrowid
FROM person p,
  xperson xp,
  xpersonname xpn,
  personname pn
WHERE p.PID  = xp.XPid
AND pn.CId  = p.CId
AND xpn.preferred_ind  = 'Y'
UNION ALL
SELECT
  o.rowid    AS orowid,
  xo.rowid   AS xorowid,
  xon.rowid  AS xonrowid,
  orgn.rowid AS orgnrowid
FROM org o,
  xorg xo,
  xorgname xon,
  orgname orgn
WHERE o.cid  = xo.xoid
AND xon.xON_id = orgn.ONid
AND orgn.cId  = o.Cid
AND xon.preferred_ind  = 'Y';

This MV now holds de-normalized data which can be used in higher-level queries to look up the required data without costly joins. We can also create indexes on the MV to improve lookups.

Any experiences? (Both good and bad are welcome for discussion.)

Friday, March 15, 2013

“White noise” and “Big Data”

Those familiar with physics and communications would have heard the term "white noise". In simple terms, it is the noise produced by combining all different frequencies together.
So, what is the relationship between the white noise and big data?
At present, there is a lot of "noise" about big data, in both positive and negative frequencies. Some feel it is data in high volume; some, unstructured data; some relate it to analytics, some to real-time processing, some to machine learning, some to very large databases, some to in-memory computing, others to regression, still others to pattern recognition, and so on….
People have started defining "big data" with 4 Vs (Volume, Velocity, Variety and Variability) and have gone on to add multiple other Vs. I have somewhere seen a list of 21 Vs defining big data.
So, in simple terms, big data is unstructured data, mostly machine-generated in quick succession and in high volumes (one scientific example is the Large Hadron Collider, which generates huge amounts of data from each of its experiments), that must be handled where traditional computing models fail.
Most of this high-volume data is also "white noise" that combines signals of all frequencies produced simultaneously on social feeds like Twitter (the fourth goal by Spain in the Euro 2012 final resulted in 15K tweets per second!), which only proves that many people were watching and excited about that event; such a piece of information adds minimal "business value".
How to derive “Value” then?
The real business value of big data can only be realized when the right data sources are identified and the right data is channeled through the processing engine, applying the right technique to separate the right signal from the white noise. That is precisely the job of a "Data Scientist", in my honest opinion.
I have not found a really good general use case in the insurance industry for big data yet (other than stray cases related to vehicle telematics in the auto sector and some weather/flood/tsunami hazard modeling cases in corporate specialty).
But I am tuned in to the white noise anyway, listening for clues that identify some real use cases in insurance and, more broadly, in financial services (other than the "machine trading" algorithms, which are already well developed in that field!).
Comments? Views?

Friday, March 8, 2013

SQL result_cache in Oracle 11g


For the problem mentioned in my past blog post: there are times when SQL queries are expensive and need a lot of processing to generate a result set. These queries are executed from multiple sessions, and it would be good if we could keep the prepared result in memory.

SQL Result Cache:

This feature, available in Oracle Database 11g, is enabled with the initialization parameter result_cache_mode. The possible values for this parameter are FORCE (which caches all results and is not recommended) and MANUAL. With MANUAL, one can selectively cache the results of SQL statements by adding the hint /*+ RESULT_CACHE */ just after the SELECT keyword.

RESULT_CACHE_MAX_SIZE and RESULT_CACHE_MAX_RESULT are the other parameters that impact the way the result cache will function by defining the maximum amount of memory used and the maximum amount of memory a single result set can occupy.
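A hedged sketch of the usage: the session setting and hint syntax are as documented for Oracle 11g, while the table and column names below are hypothetical.

```sql
-- Cache results only where the hint asks for it (illustrative session setting)
ALTER SESSION SET result_cache_mode = MANUAL;

-- First execution computes and caches the result set;
-- subsequent executions, from any session, can read it from memory.
SELECT /*+ RESULT_CACHE */ region, SUM(premium) AS total_premium
FROM   policy_facts            -- hypothetical table
GROUP  BY region;
```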

More Information:

Please use the following links to get a better understanding of this feature.

Friday, March 1, 2013

Graph Theory & Pregel River

Euler's original publication in Latin on Graph Theory (The seven bridges of Königsberg) - 

The basis of all the latest hype around Facebook's Graph Search, Social Graphs, Knowledge Graphs, Interest Graphs, etc., is the set of modern implementations of the above publication, which formulated Graph Theory in 1735.

I was fascinated by Graph Theory during my college days, and I feel a graph is one of the most natural structures for organizing information. A graph used as a thinking tool, as in a "mind map", is very effective as well.

The beauty of a graph is its simplicity: it can be built and used as a data structure that is natural to traverse and easy to split into sub-graphs. It is one of the best-suited structures for massively parallel algorithms.

As the seven bridges of the original paper were on the River Pregel, Google named its research project for large-scale graph processing after the river. Let programmers start thinking in the realm of vertices and edges….

What is offered by Pregel:
1.      A large scale, distributed, parallel graph processing API in C++
2.      Fault tolerance of distributed commodity clusters with a worker and master implementation. 
3.      Persistent data can be stored on Google’s GFS or Bigtable.
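The vertex-centric ("think like a vertex") model can be sketched in miniature. The following is a single-process Python toy, not Pregel's actual C++ API; vertices exchange messages in synchronous supersteps until all vote to halt, here propagating the maximum value through the graph:

```python
# Toy single-process sketch of Pregel-style supersteps (illustrative only;
# real Pregel is a distributed C++ API with master/worker fault tolerance).
def pregel_max(graph, values):
    """graph: vertex -> list of out-neighbours; values: vertex -> number.
    Spreads the maximum value to every vertex, one superstep at a time."""
    # Superstep 0: every vertex sends its value along its out-edges.
    messages = {v: [] for v in graph}
    for v in graph:
        for n in graph[v]:
            messages[n].append(values[v])
    while True:
        changed = False
        new_messages = {v: [] for v in graph}
        for v in graph:                      # each vertex runs its compute()
            best = max([values[v]] + messages[v])
            if best > values[v]:             # improved: stay active and
                values[v] = best             # send the new value onwards
                changed = True
                for n in graph[v]:
                    new_messages[n].append(best)
        if not changed:                      # all vertices vote to halt
            return values
        messages = new_messages
```

The superstep barrier is what makes the model easy to distribute: within a superstep, vertices only read messages from the previous one, so each worker can process its partition of vertices independently.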

More in Google's Pregel research paper -


I expect that, in the future, quantum computing will build virtualized, software-defined dynamic cliques of order 'n' on demand to solve computing problems using graph algorithms, with high fault tolerance and highly parallel "just right" performance! Let me name it "Goldilocks Computing". It is not far in the future...

Friday, February 22, 2013

Data agility in Master Data Management

Recently, I came across a few master data management implementations and database performance problems related to them.
What is master data management (MDM)? It is a set of processes to centralize the management of "master data", i.e., the reference data for a given business; for an insurance company this typically consists of Customers, Agents, and Products. Typically, an MDM instance focuses on one data domain: when the focus is the Customer domain, the implementation is called Customer Data Integration; when the domain is Products, it is called Product Information Management.
When it comes to the insurance industry, even though "customer centricity" has been the latest trend, it has traditionally been an "agent"-driven business. So one key subject area for MDM is "Agent Data" — the agents who actually sell "Products" to "Customers". For generalization we can call this "Channel Master Data". So we have three subject areas to deal with in a multi-domain MDM implementation.
The processes include:
1.       Data acquisition (Input)
2.       Data standardization, duplicate removal, handling missing attributes (Quality)
3.       Data mining, analytics and reporting  (Output)
The industry tools have standard data and process models for customers and products to achieve the required functionality of MDM. We can adopt some customer model to model the agent domain or keep both in a “party” domain.
The problem with multi-domain MDM:
Here comes the problem. In the agent domain, the customer requirement states "get me the agent and his top n customers acquired" or "get me the agent and his top n products sold".
In a renowned MDM product, an SQL query needs to join 20+ tables (involving some outer joins) to meet this requirement, and even after caching all the tables in memory, the processing for each agent id takes 20+ seconds on an Oracle 11.2 database on decent production-sized hardware.
a)      Dynamically querying for such data and building it into the target structure adds load and latency to the screen; the SLA is not met.
b)      Pre-populating the data in a materialized view with a periodic refresh does not keep the data current in real time.
Since a traditional design cannot satisfy both requirements (returning the data within 2 seconds and keeping it current), we need to design a hybrid approach.
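One hedged sketch of such a hybrid, in Python with entirely hypothetical names: lookups consult a small real-time delta first and fall back to the periodically refreshed snapshot (the materialized view), so reads are both fast and current.

```python
# Hybrid lookup sketch: a periodically refreshed snapshot (cheap but possibly
# stale) overlaid with a real-time delta of changes since the last refresh.
# All names are illustrative; the delta feed would come from triggers,
# change data capture, or a message queue in a real system.
class HybridAgentView:
    def __init__(self, snapshot):
        self.snapshot = dict(snapshot)  # e.g. loaded from the materialized view
        self.delta = {}                 # changes streamed in since last refresh

    def apply_change(self, agent_id, record):
        self.delta[agent_id] = record   # real-time feed lands here

    def lookup(self, agent_id):
        # Delta wins over snapshot, so the answer is current without a big join.
        return self.delta.get(agent_id, self.snapshot.get(agent_id))

    def refresh(self, new_snapshot):
        self.snapshot = dict(new_snapshot)  # periodic MV refresh
        self.delta.clear()                  # changes are now in the snapshot
```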
Any ideas or experience in this area are welcome. Has anyone implemented an MDM solution on a NoSQL database yet?

Friday, February 15, 2013

Measure, Metric and Key Performance Indicator (KPI)

In a BI project it is very important to design the system to confidently visualize the indicators that matter to the business. Even senior BI professionals sometimes get confused by terms like measure, metric, and KPI.
This post is to give some simple definitions for the hierarchy of KPIs:
Measure: something that is measured, such as the value of an order, the number of items on the order, lines of code, or the number of defects identified during unit testing. Measures are generally captured during the regular operations of the business and form the lowest granularity of the FACT in the star schema.
Metric: generally derived from one or more measures. For example, defect density is the number of defects per 1000 lines of code in a COBOL program; it is derived from the lines-of-code and defect-count measures. A metric can also be the max, min, or average of a single measure. Generally, metrics are computed or derived from the underlying facts/measures.
Performance Indicator: brings in the business context into the metric. For example reduction of defect density after introducing more rigorous review process (review effectiveness) could be a performance indicator in a software development context. In this case it reduces the rework effort of fixing the unit test bugs and re-testing thereby improving the performance of a software development organization.
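The measure, metric, and KPI chain above can be made concrete with a small sketch; the figures and function names are invented for illustration.

```python
# Measures -> metric -> performance indicator, using the defect-density
# example from the text. All figures are illustrative.
def defect_density(defects, lines_of_code):
    """Metric: defects per 1000 lines of code, derived from two measures."""
    return defects / (lines_of_code / 1000.0)

def review_effectiveness(density_before, density_after):
    """KPI: relative reduction in defect density after stricter reviews."""
    return (density_before - density_after) / density_before

# Measures captured during normal operations (facts in the star schema).
before = defect_density(defects=45, lines_of_code=15000)  # 3.0 per KLOC
after  = defect_density(defects=18, lines_of_code=15000)  # 1.2 per KLOC
kpi = review_effectiveness(before, after)                 # ~0.6, a 60% reduction
```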
A careful selection of KPIs, presented at a suitable granularity to the right level of management users within the business, makes a BI project successful.
Bringing the data/measures from multiple sources into a star schema is typically the ETL cycle of the warehouse. Once the data is refreshed, incrementally summarizing and computing all the metrics is the next stage. Finally, visualizing the KPIs is the art of dashboarding.
I have seen several BI projects run into trouble by mixing up all three steps: trying to clean data during loading, trying to summarize the data into metrics during ETL, or trying to summarize during the reporting phase, causing multiple performance problems. A good design should keep the stages independent of each other and handle issues like missing and duplicate refreshes from feeding source systems. One also needs to consider parallelizing tasks to take advantage of multiple processor cores and large clusters of computing resources.

Friday, February 8, 2013

Cloud Computing workshop slides

Cloud Computing - Foundations, Perspectives and Challenges workshop slides.

Presented at the BITES state-level faculty development program at Mangalore today to around 60 faculty members from regional engineering colleges.

Friday, February 1, 2013

Few thoughts on Data Preparation for #Analytics

It is Friday and it is time for a blog post.
A typical analytics project spends 70% to 80% of its time preparing the data. Achieving the right data quality and the right data format is a primary factor in the success of an analytics project.
What makes this task so knowledge-intensive, and why does it require a multifaceted skill set?
I will give a quick/simple example of how the “Functional knowledge” other than the technical knowledge is important in the preparation of the data. There is a functional distinction between missing data and non-existing data.
For example consider a customer data set. If the customer is married and the age of spouse is not available this is missing data. If customer is single, age of spouse is non-existing. In the data mart these two scenarios need to be represented differently so that the analytic model behaves properly.
How missing data is dealt with (data imputation techniques) while preparing the data set impacts the results of the analytical models.
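The spouse-age example can be encoded directly. The representation below, NaN for missing and a sentinel object for non-existing, is one design choice for the data mart, not a standard:

```python
import math

# Sentinel for non-existing data: the value cannot exist for this record,
# so it must never become a candidate for imputation. Names are illustrative.
NOT_APPLICABLE = object()

def spouse_age_feature(customer):
    """Distinguish missing data (married, age unknown -> NaN, imputable)
    from non-existing data (single -> not applicable, never impute)."""
    if customer.get('marital_status') == 'single':
        return NOT_APPLICABLE                        # no spouse exists
    return customer.get('spouse_age', float('nan'))  # married but unknown

married_known   = spouse_age_feature({'marital_status': 'married', 'spouse_age': 34})
married_unknown = spouse_age_feature({'marital_status': 'married'})
single          = spouse_age_feature({'marital_status': 'single'})
```

A model-building step can then impute only the NaN values while treating the sentinel as a structurally different case.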
Dr. Gerhard Svolba of SAS has written extensively on Data Preparation as well as Data Quality (for Analytics) and this presentation gives more details on the subject.
I have made a blog post earlier dealing with these challenges in the “Big data” world -

Friday, January 25, 2013

Machine Learning Algorithms – Classification, Clustering and Regression

For a data scientist, among the other skill sets, good fundamentals in "data mining" or "machine learning" are the icing on the cake. These algorithms are also used in predictive analytics, and it is immaterial whether the data is "big data" or not!
We end up with several variables, each containing numeric or nominal attributes describing an entity. Numeric variables can be continuous or discrete (like ordinals). Nominal variables take values from a known list, either binary (two values, i.e., yes or no, male or female, etc.) or with more possible values.
Given a data set consisting of millions of records, each containing some hundreds of variables, an analyst's job is to derive insights that solve known or unknown business problems! This is where machine learning algorithms come into play.
Broadly Machine Learning can be put into two groups.
1.       Predicting a target variable for a given instance of a data record. We have a set of records with known values for the target variable, from which we can develop a model, train it, test it, and put it into production – this is called supervised learning.
a.       If the target variable is nominal, these algorithms are called classification.
b.      If the target variable is a continuous numeric variable, we apply regression.
2.       There is no target variable; we need to group the data records into distinct groups based on multiple variables within the dataset – This is called unsupervised learning. Clustering and Association Analysis algorithms are used to achieve this. 
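As a deliberately tiny illustration of supervised classification, here is a one-nearest-neighbour classifier in plain Python; it is library-free for brevity, whereas real work would use a proper toolkit, and the data is invented:

```python
import math

def nearest_neighbour_classify(train, point):
    """train: list of (features, label) pairs; point: a feature tuple.
    Supervised classification: predict the label of the closest known record."""
    def dist(a, b):
        # Euclidean distance over the numeric feature variables.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    features, label = min(train, key=lambda rec: dist(rec[0], point))
    return label

# Records with known values of the nominal target variable 'label'.
train = [((1.0, 1.0), 'small'), ((1.2, 0.8), 'small'),
         ((8.0, 9.0), 'large'), ((9.0, 8.5), 'large')]
```

Swapping the nominal label for a continuous number and averaging the nearest neighbours would turn the same skeleton into a regression; dropping the labels entirely and grouping by distance alone is the unsupervised (clustering) setting.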
Formulating the problem, preparing the data, visualizing the data, training the model, testing the model, interpreting the results to generate insights, and then implementing the derived knowledge in business operations require multidisciplinary skills in the business domain, operations management, and technology.
A real business analytics solution uses multiple machine learning techniques to achieve customer segmentation, cross-selling, customer behavior analysis, customer retention, marketing analytics and campaign management, fraud detection, profit optimization, and so on.
A recent quick read through the book Machine Learning in Action by Peter Harrington prompted me to write this tech capsule on this Friday… 

Friday, January 11, 2013

A journey through Grid and Virtualization leading to Cloud computing

In computing, these are two trends I have seen crisscrossing throughout my career:
1.       Making a computing node look like multiple nodes using virtualization.
2.       Making multiple computing nodes work as a single whole called a cluster or grid
So, there was an era when computing was really "big iron": mainframe-sized computers that provided a lot of computing capacity. One computer handled multiple users and multiple operations at the same time. The virtual machine was part of IBM's mainframe operating systems, and DEC VAX mainframes had similar concepts. Of late, we see the trend even on desktops, with hypervisors like VMware coming out.
With the advent of mid-range servers of limited capacity, there was a need to put them together to obtain higher computing power to deal with the demand. Datapoint's ARCnet is generally regarded as the first commercial cluster, even though there has always been a fight between IBM and DEC over who invented clusters. Clustering also provides high availability and fault tolerance along with higher computing capacity. Oracle was the first database to implement a parallel server, on clustered VAX systems.
This trend of cluster computing led to supercomputing: breaking a complex task into multiple parallel streams and executing them on multiple processors. The fundamental challenge in clustering is process coordination and access to shared resources, which requires the nodes to be locally networked over a high-performance interconnect.
Another related concept is grid computing, where an administrative domain connects loosely coupled nodes to perform a task. So we have more and more cores, processors, and nodes in a grid providing low-cost, fault-tolerant computing: smaller components put together to look like a giant computing capacity.
Finally, what I see today is the "cloud", which creates a grid of elastic nodes that appears as a single (large) computing resource and gives a slice of virtualized capacity to each of the multiple tenants of that resource.
Designing solutions in each of these technologies — big iron, virtualization, clusters and grids, and now the cloud — has really been challenging and keeps the job lively…

Thursday, January 3, 2013

Small & Big Data processing philosophies

In this first post of 2013, I would like to cover some fundamental philosophical aspects of “data” & “processing”.
With the buzz around "Big Data" running high, I classify the original structured, relational data as "small data", even though I have seen some very large databases with 100+ terabytes of data and an I/O volume of 150+ terabytes per day.
Present-day data processing predominantly uses the von Neumann architecture of computing, in which "data" and its "processing" are distinct, separated into "memory" and "processor" connected by a "bus". Any data that needs to be processed is moved into the processor over the bus; the required arithmetic or logical operation is performed on it, producing the "result" of the operation; and the result is then moved back to memory/storage for further reference. The list of operations to be performed (the processing logic, or program) is stored in memory as well, and the next instruction to be carried out must likewise be moved from memory to the processor over the bus.
So, in essence, both the data and the operations to be performed on it live in memory, which cannot process data, while the processor, which can, is always dependent on memory in the von Neumann architecture.
Traditionally, the data has been moved to the place where the processing logic is deployed, because the amount of data was small while the processing needed was relatively large and involved complex logic. RDBMS engines like Oracle read blocks of storage into the SGA buffer cache of the running database instance for processing, and transactions modified small amounts of data at any given time.
Over time, analytical processing required bringing huge amounts of data from storage to the processing node, which created a bottleneck on the network pipe. Add to that the large growth in semi-structured and unstructured data that started flowing in, which needed a different philosophy of data processing.
Enter HDFS and the map-reduce framework of Hadoop, which took the processing to the data. Around the same time came Oracle Exadata, which took database processing to the storage layer with a feature called "query offloading".
In the new paradigm, snippets of processing logic are shipped to a cluster of connected nodes where the data, distributed by a hashing algorithm, resides; the results of processing are then reduced to produce the final result sets. It has become economical to take the processing to the data because the amount of data is relatively large and the required processing consists of fairly simple tasks: matching, aggregating, indexing, and so on.
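That map-shuffle-reduce shape can be sketched in a few lines of single-process Python; the names and partition count are illustrative, and a real cluster would run each partition on the node that holds its slice of the data:

```python
from collections import defaultdict

def map_reduce_wordcount(documents, partitions=4):
    """Toy map-reduce: map emits (word, 1) pairs; a hash partitioner decides
    which 'node' owns each key; each partition then reduces independently."""
    # Map phase: each input record becomes a list of (key, value) pairs.
    mapped = [(word, 1) for doc in documents for word in doc.split()]
    # Shuffle: hash each key to the partition (node) that will reduce it.
    shuffled = [defaultdict(list) for _ in range(partitions)]
    for key, value in mapped:
        shuffled[hash(key) % partitions][key].append(value)
    # Reduce phase: simple aggregation, run independently per partition.
    result = {}
    for partition in shuffled:
        for key, values in partition.items():
            result[key] = sum(values)
    return result
```

The reduce step is trivially parallel precisely because the shuffle guarantees all values for a key land in one partition, which is why "taking the processing to the data" pays off for these simple aggregating tasks.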
So, we now have the choice of taking small amounts of data to complex processing, with structured RDBMS engines in the traditional shared-everything architecture, or taking the processing to the data in the shared-nothing big data architectures. It depends purely on the type of data processing problem at hand: traditional RDBMS technologies will not be replaced by the new big data architectures, nor can the new big data problems be solved by traditional RDBMS technologies. They go hand in hand, complementing each other and adding value to the business when properly implemented.
Non-von Neumann architectures still deserve better attention from technologists; they probably hold the key to the way the human brain seamlessly processes information, whether structured data or unstructured streams of voice, video, and the like.
Any non-Von-Neumann architecture enthusiasts over here?