In this first post of 2013, I would like to cover some fundamental philosophical aspects of “data” & “processing”.
With the buzz around “Big Data” running high, I have classified the original structured, relational data as “small data”, even though I have seen some very large databases holding 100+ terabytes of data with an I/O volume of 150+ terabytes per day.
Present-day data processing predominantly uses the Von-Neumann architecture of computing, in which “data” and its “processing” are distinct and separated into “memory” and “processor” connected by a “bus”. Any data that needs to be processed is moved into the processor over the bus, the required arithmetic or logical operation is performed on it to produce the “result” of the operation, and the result is then moved back to “memory/storage” for further reference. The list of operations to be performed (the processing logic or program) is stored in “memory” as well, and each instruction to be carried out must likewise be moved from memory into the processor over the bus.
So, in essence, in the Von-Neumann architecture both the data and the operations to be performed on it sit in memory, which cannot process data, while the processor, which can, is always dependent on that memory and the bus connecting them.
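To make the cycle concrete, here is a minimal sketch in Python (a toy illustration, not any real machine): both the “program” and the “data” live in one shared memory list, and every instruction and operand has to cross the notional bus into the processor before anything is computed.

# A toy Von-Neumann machine: program and data share one memory,
# and everything must cross the "bus" into the processor to be used.
memory = [
    ("LOAD", 100),   # fetch memory[100] into the accumulator
    ("ADD", 101),    # add memory[101] to the accumulator
    ("STORE", 102),  # write the accumulator back to memory[102]
    ("HALT", None),
]
memory += [0] * 96       # padding so that addresses 100-102 exist
memory += [2, 3, 0]      # the data lives in the same memory as the program

accumulator = 0
pc = 0                   # program counter

while True:
    op, addr = memory[pc]            # fetch: the instruction crosses the bus
    pc += 1
    if op == "LOAD":
        accumulator = memory[addr]   # the data crosses the bus
    elif op == "ADD":
        accumulator += memory[addr]
    elif op == "STORE":
        memory[addr] = accumulator   # the result moves back to memory
    elif op == "HALT":
        break

print(memory[102])  # 5

Every useful step above requires a trip across the bus, which is exactly the dependency described here.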
Traditionally, the “data” has been moved to the place where the processing logic is deployed, because the amount of data involved is small while the processing required is relatively large and involves complex logic. RDBMS engines like Oracle read blocks from storage into the SGA buffer cache of the running database instance for processing, and transactions modify only small amounts of data at any given time.
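This “move the data to the processing” pattern can be sketched generically in a few lines of Python (an illustration of the idea only, not of Oracle's internals): blocks are pulled from slow storage into a local cache on the processing node, and all computation happens against the cached copies.

# Generic "move the data to the processing" sketch: blocks are fetched
# from (slow) storage into a local cache on the processing node, and
# the computation runs against the cached copy.
storage = {blk: list(range(blk, blk + 8)) for blk in range(0, 80, 8)}  # fake disk blocks

buffer_cache = {}

def get_block(block_id):
    # On a cache miss, the block travels from storage into the node's memory.
    if block_id not in buffer_cache:
        buffer_cache[block_id] = storage[block_id]   # simulated physical read
    return buffer_cache[block_id]

# A small "transaction" touches only a couple of blocks at a time.
total = sum(get_block(0)) + sum(get_block(8))
print(total, len(buffer_cache))   # result computed locally; only 2 blocks cached

This works well as long as the data shipped to the processing node stays small relative to the work done on it.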
Over a period of time, “analytical processing” required bringing huge amounts of data from storage to the processing node, which created a bottleneck on the network pipe. Add to that the large growth in semi-structured and unstructured data that started flowing, which needed a different philosophy of data processing.
Enter the HDFS and MapReduce framework of Hadoop, which took the processing to the data. Around the same time came Oracle Exadata, which took database processing down to the storage layer with a feature called “query offloading”.
In the new paradigm, snippets of processing logic are shipped to a cluster of connected nodes where the data, partitioned by a hashing algorithm, already resides; the results of that processing are then reduced to produce the final result sets. It has become economical to take the processing to the data because the amount of data is relatively large while the required processing consists of fairly simple tasks such as matching, aggregating, indexing, etc.
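To make the map/reduce shape of this concrete, here is a minimal word-count sketch in plain Python (not Hadoop itself): the map step emits key/value pairs, a hash of the key decides which “node” (here just a bucket) each pair lands on, and every bucket is then reduced independently.

from collections import defaultdict

# Toy map-reduce word count: "nodes" are simulated as buckets chosen by
# hashing the key, mirroring how records are partitioned across a cluster.
documents = [
    "big data small data",
    "processing moves to data",
    "data moves to processing",
]

NUM_NODES = 3

def map_phase(doc):
    # Emit (word, 1) for every word in the document.
    return [(word, 1) for word in doc.split()]

# Shuffle: route each pair to a node based on a hash of the key.
buckets = [defaultdict(list) for _ in range(NUM_NODES)]
for doc in documents:
    for word, count in map_phase(doc):
        node = hash(word) % NUM_NODES
        buckets[node][word].append(count)

# Reduce: each node aggregates its own keys independently.
totals = {}
for node_bucket in buckets:
    for word, counts in node_bucket.items():
        totals[word] = sum(counts)

print(totals)   # e.g. {'data': 4, 'processing': 2, ...}

Only the small snippets of logic and the final per-key totals move between nodes; the bulk of the data never leaves the place where it was stored.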
So, we now have a choice: take small amounts of data to complex processing in the traditional, shared-everything architecture of structured RDBMS engines, or take the processing to the data in shared-nothing big data architectures. The choice depends purely on the type of “data processing” problem at hand; traditional RDBMS technologies will not be replaced by the new big data architectures, nor can the new big data problems be solved with traditional RDBMS technologies. They go hand in hand, complementing each other and adding value to the business when properly implemented.
Non-Von-Neumann architectures still deserve closer attention from technologists; they probably hold the key to the way the human brain seamlessly processes and deals with information, whether it is structured data or unstructured streams of voice, video, etc., with ease.
Any non-Von-Neumann architecture enthusiasts over here?