Thursday, October 25, 2012
Data Munging in the Big Data world
Recently, NASA announced a competition built around the problem of munging large US government data sets. It is called the NITRD Big Data Challenge series, hosted at http://community.topcoder.com/
The first of the challenges was primarily about "How to create a homogeneous big-data dataset from large siloed data sets available with the multiple government departments such that some meaningful societal decisions can be derived from the knowledge generated from big data analytics".
So "data munging", a term coined almost a decade ago, has come back as a key skill in today's "Data Science" discipline.
What is data munging?
In the simplest possible terms, it is converting data generated on heterogeneous platforms and in heterogeneous formats into a common, processable format for further munging or analytics!
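To make that concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the two sample sources, their column names, the date formats and the target schema (`name`, `birth_date`, `state`) are hypothetical and do not come from the challenge data. It only shows the general idea of pulling records from two differently shaped sources into one common format.

```python
import csv
import io
import json
from datetime import datetime

# Hypothetical sample inputs: the same kind of record held by two
# siloed sources, each with its own layout and conventions.
CSV_SOURCE = "Name,DOB,State\nJane Doe,1980-02-14,TX\n"
JSON_SOURCE = '[{"full_name": "John Roe", "birth_date": "14/03/1975", "state": "ca"}]'


def normalize_date(text):
    """Return an ISO date string, trying each assumed source format in turn."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unknown format: leave it for a later pass


def from_csv(text):
    """Yield records from the CSV source in the common schema."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "name": row["Name"].strip(),
            "birth_date": normalize_date(row["DOB"]),
            "state": row["State"].strip().upper(),
        }


def from_json(text):
    """Yield records from the JSON source in the common schema."""
    for item in json.loads(text):
        yield {
            "name": item["full_name"].strip(),
            "birth_date": normalize_date(item["birth_date"]),
            "state": item["state"].strip().upper(),
        }


def munge():
    """Combine both sources into one homogeneous list of records."""
    return list(from_csv(CSV_SOURCE)) + list(from_json(JSON_SOURCE))


if __name__ == "__main__":
    for record in munge():
        print(record)
```

Once every source is expressed in the same schema, the combined list can be handed to whatever analytics step comes next.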
How is it different from ETL/data integration?
Data integration and ETL are fully automatic and programmed, whereas munging is semi-automatic: it is based on human-assisted machine learning algorithms.
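The semi-automatic part can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain string similarity from Python's standard `difflib` module as a stand-in for a real machine learning model, and the target field list, the 0.8 threshold and the function name `suggest_mapping` are all hypothetical. Column names that clearly match a target field are mapped automatically; anything uncertain is set aside for a human to decide.

```python
from difflib import SequenceMatcher

# Hypothetical target schema that every source should be mapped onto.
TARGET_FIELDS = ["name", "birth_date", "state"]


def suggest_mapping(source_columns, threshold=0.8):
    """Split source columns into auto-accepted mappings and a human review queue.

    String similarity stands in here for a learned model; the threshold
    is an arbitrary illustrative value.
    """
    accepted = {}
    needs_review = {}
    for column in source_columns:
        # Score the column name against every target field and keep the best.
        scored = [
            (SequenceMatcher(None, column.lower(), target).ratio(), target)
            for target in TARGET_FIELDS
        ]
        score, best = max(scored)
        if score >= threshold:
            accepted[column] = best  # confident: apply automatically
        else:
            needs_review[column] = (best, round(score, 2))  # ask a human
    return accepted, needs_review


if __name__ == "__main__":
    accepted, needs_review = suggest_mapping(["Birth_Date", "State", "DOB"])
    print("auto-mapped:", accepted)
    print("for human review:", needs_review)
```

In a real tool the human's decisions would presumably be fed back so that the next suggestion round improves, which is where the "assisted" learning comes in.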
Why is it important now?
As the massively parallel processing paradigm based on MapReduce and other so-called "big data" technologies takes hold, the key question now is how the existing vast amounts of "data" can be made available for that kind of processing, so that knowledge can be derived by means of analytic and machine learning algorithms.
Start-ups building platforms and tools for data munging are now emerging in the market. In my opinion, this is going to be a key "skill" in the future big-data-based "Data Science" discipline.
So, if you have good skills in data and in algorithms based on assisted machine learning for data manipulation, then go for it!