Blog Moved

Future posts related to technology are directly published to LinkedIn

Thursday, October 25, 2012

Data Munging in the Big Data world

Recently, NASA announced a competition on a data munging problem involving large US government data sets. It is called the NITRD Big Data Challenge series @

The first of the challenges was primarily about "How to create a homogeneous big-data dataset from the large siloed data sets available with multiple government departments, such that some meaningful societal decisions can be derived from the knowledge generated by big data analytics".

So "Data Munging", a term coined almost a decade ago, has come back as a key skill in today's "Data Science" discipline.

What is data munging?
In the simplest possible terms, it is converting data generated on heterogeneous platforms and in heterogeneous formats into a common, processable format for further munging or analytics.
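To make the definition concrete, here is a minimal sketch of bringing two differently shaped sources into one common format. The sample data, the field names and the FIELD_MAP table are all made up for illustration; a real munging job would have far messier inputs.

```python
import csv
import io
import json

# Hypothetical inputs: the same kind of record arriving from two sources.
CSV_SOURCE = "Name,Dept,Budget_USD\nAlice,Energy,1200\n"
JSON_SOURCE = '[{"full_name": "Bob", "department": "Health", "budget": "950"}]'

# Hypothetical mapping from each source's field names to a common schema.
FIELD_MAP = {
    "Name": "name", "full_name": "name",
    "Dept": "dept", "department": "dept",
    "Budget_USD": "budget", "budget": "budget",
}


def normalize(record):
    """Rename fields to the common schema and coerce the budget to an int."""
    out = {FIELD_MAP[key]: value for key, value in record.items()}
    out["budget"] = int(out["budget"])
    return out


def munge(csv_text, json_text):
    """Read both sources and return records in one common format."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows += json.loads(json_text)
    return [normalize(row) for row in rows]


records = munge(CSV_SOURCE, JSON_SOURCE)
```

After this, both records carry the same keys (name, dept, budget) and the same types, so they can go into one analytics pipeline.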

How is it different from ETL/data integration?
Data integration and ETL are fully automated and programmed, whereas munging is semi-automatic: it relies on human-assisted machine learning algorithms.
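The "semi-automatic" part can be sketched roughly like this: the machine proposes a mapping when it is confident and hands everything else to a person. This is only an illustration. It uses plain string similarity from Python's difflib as a stand-in for a real machine learning model, and the field names and the 0.7 threshold are assumptions of mine.

```python
from difflib import SequenceMatcher

# Hypothetical target schema that every source should be mapped onto.
COMMON_FIELDS = ["name", "department", "budget"]


def suggest_mapping(source_fields, threshold=0.7):
    """Map fields automatically when similarity is high, else queue for review."""
    auto, needs_review = {}, []
    for field in source_fields:
        # Score the source field against every target field, keep the best.
        scored = [
            (SequenceMatcher(None, field.lower(), target).ratio(), target)
            for target in COMMON_FIELDS
        ]
        score, best = max(scored)
        if score >= threshold:
            auto[field] = best          # confident enough: map automatically
        else:
            needs_review.append(field)  # not confident: a human decides
    return auto, needs_review


auto, needs_review = suggest_mapping(["Name", "Dept", "Budget_USD", "FY"])
```

Here "Name" and "Budget_USD" are mapped automatically, while "Dept" and "FY" land in the review queue. A fully programmed ETL job would have no such queue; in munging, the human answers would normally be fed back to improve the next round of suggestions.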

Why is it important now?
With the rise of massively parallel processing based on MapReduce and other so-called "big data" technologies, the key question now is how the vast amounts of existing data can be made available for such processing, so that knowledge can be derived from it by means of analytic and machine learning algorithms.

Start-ups building platforms and tools for data munging are now emerging in the market. In my opinion, this is going to be a key skill in the future big-data-based "Data Science" discipline.

So, if you have good skills in data manipulation and in algorithms based on assisted machine learning, go for it!

Wednesday, October 3, 2012

Oracle database 12c

I have seen the Oracle database go from 6 and 7 through 8i, 9i, 10g and 11g, and now it is going to be called 12c, c as in Cloud. OOW12 revealed the new architecture. (There was no suffix for 6 and 7; i stood for Internet, g for Grid.)

The new release comes with a fundamental architectural change to the data dictionary that introduces the concept of a "Pluggable Database" (PDB) on top of the standard Oracle system "Container Database" (CDB).

In the current architecture, the obj$ table of the Oracle data dictionary contains the information for all objects. Going forward, it is being split between the CDB and the PDBs to make "multi-tenant" private cloud databases easier.

This fundamental separation of the data dictionary provides
a. Better security for multiple "databases" on the same instance.
b. Separation of the CDB from the PDBs, which will allow easy upgrades.
c. A shared instance for all PDBs, so overall management of the consolidated database should be much simpler.
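As a purely conceptual illustration of the three points above, here is a toy model of a split dictionary. It is not how Oracle implements it; the class names, methods and sample objects are all invented for this sketch.

```python
class ContainerDatabase:
    """Toy CDB: holds the shared, system-supplied dictionary entries."""

    def __init__(self, version):
        self.version = version
        self.dictionary = {"DBMS_OUTPUT": "system package"}
        self.pdbs = {}

    def plug(self, pdb):
        """Attach a PDB so it shares this container's system dictionary."""
        self.pdbs[pdb.name] = pdb
        pdb.container = self

    def unplug(self, name):
        """Detach a PDB; its own dictionary travels with it."""
        pdb = self.pdbs.pop(name)
        pdb.container = None
        return pdb


class PluggableDatabase:
    """Toy PDB: holds only the tenant's own dictionary entries."""

    def __init__(self, name):
        self.name = name
        self.dictionary = {}
        self.container = None

    def lookup(self, obj):
        """Resolve in the PDB first, then in the container's dictionary."""
        if obj in self.dictionary:
            return self.dictionary[obj]
        if self.container is not None:
            return self.container.dictionary.get(obj)
        return None


old_cdb = ContainerDatabase("12.1")
hr, sales = PluggableDatabase("hr"), PluggableDatabase("sales")
hr.dictionary["EMPLOYEES"] = "table"
sales.dictionary["ORDERS"] = "table"
old_cdb.plug(hr)
old_cdb.plug(sales)

# "Upgrade" by moving one PDB to a newer container.
new_cdb = ContainerDatabase("12.2")
new_cdb.plug(old_cdb.unplug("hr"))
```

In this toy, hr cannot see ORDERS from sales (point a), hr moves to the newer container without its own dictionary being touched (point b), and both PDBs resolve DBMS_OUTPUT from whichever container they are plugged into (point c).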

I am excited about the cloud-enabled, multi-tenant, pluggable database from Oracle!

So, let us wait and see when a stable 12.2 comes out, so that the new database can be rolled out into production.