Recently, NASA announced a competition on a data munging problem involving large US government data sets, the NITRD Big Data Challenge series: http://community.topcoder.com/coeci/nitrd/
The first of the challenges was essentially about how to create a homogeneous big-data dataset from the large siloed data sets held by multiple government departments, such that meaningful societal decisions can be derived from the knowledge generated by big data analytics.
So the term "data munging", coined almost a decade back, has come back as a key skill in today's "Data Science" discipline.
What is data munging?
In the simplest possible terms, it is converting data generated on heterogeneous platforms/formats into a common processable format for further munging or analytics!
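As a minimal sketch of the idea (the field names, sources, and schema here are hypothetical, not from the challenge), here is how records arriving as CSV from one source and JSON from another could be normalized into one common layout:

```python
import csv
import io
import json

# Hypothetical inputs: two sources describe the same kind of entity
# in different formats and with different field names.
CSV_SOURCE = "id,full_name,dob\n101,Ada Lovelace,1815-12-10\n"
JSON_SOURCE = '[{"person_id": 102, "name": "Alan Turing", "born": "1912-06-23"}]'

def from_csv(text):
    # Map CSV rows onto the common schema.
    for row in csv.DictReader(io.StringIO(text)):
        yield {"id": int(row["id"]), "name": row["full_name"], "dob": row["dob"]}

def from_json(text):
    # Map JSON objects onto the same common schema.
    for obj in json.loads(text):
        yield {"id": obj["person_id"], "name": obj["name"], "dob": obj["born"]}

# One homogeneous list, ready for further munging or analytics.
records = list(from_csv(CSV_SOURCE)) + list(from_json(JSON_SOURCE))
print(records)
```

Real munging jobs face messier mismatches (units, encodings, duplicates), but the core move is the same: map every source into one agreed-upon schema.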
How is it different from ETL/data integration?
Data integration and ETL are fully automatic and programmed, whereas munging is semi-automatic: it relies on human-assisted machine learning algorithms.
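To make the "human-assisted" part concrete, here is a hypothetical sketch (the names, threshold, and difflib-based scoring are my assumptions, not a prescribed method) of a semi-automatic matching step: high-confidence matches between two siloed datasets are accepted automatically, and borderline ones are deferred to a human:

```python
import difflib

# Hypothetical name lists from two siloed datasets.
dept_a = ["Jon Smith", "Maria Garcia"]
dept_b = ["John Smith", "M. Garcia"]

AUTO_ACCEPT = 0.9  # assumed threshold; would be tuned per dataset

for a in dept_a:
    for b in dept_b:
        # Simple string-similarity score standing in for a learned model.
        score = difflib.SequenceMatcher(None, a, b).ratio()
        if score >= AUTO_ACCEPT:
            print(f"auto-matched: {a!r} <-> {b!r} ({score:.2f})")
        elif score >= 0.5:
            # In a real tool this would be a review queue, not input().
            if input(f"match {a!r} with {b!r}? (y/n) ").lower() == "y":
                print(f"human-confirmed: {a!r} <-> {b!r}")
```

The human decisions can then be fed back as labeled examples, which is what lets the automatic part of the pipeline improve over time.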
Why is it important now?
As the massively parallel processing paradigm based on map-reduce and other so-called "big data" technologies takes hold, the key question now is how the existing vast amounts of "data" can be made available for that kind of processing, so that knowledge can be derived by means of analytics and machine learning algorithms.
A wave of start-ups building platforms and tools for data munging is now entering the market. In my opinion, this is going to be a key "skill" in the future big-data-based "Data Science" discipline.
So, if you have good skills in data manipulation and in assisted machine learning algorithms, go for it!