Blog Moved

Future posts related to technology are published directly on LinkedIn.

Friday, January 25, 2013

Machine Learning Algorithms – Classification, Clustering and Regression

For a Data Scientist, among the other skill sets, a good grounding in “data mining” or “machine learning” is the icing on the cake. These algorithms also underpin predictive analytics, and it is immaterial whether the data is “big data” or not!
We typically end up with several variables, either numeric or nominal, describing an entity. Numeric variables can be continuous or discrete (such as ordinals). Nominal variables take their values from a known list, either binary (two values, e.g., yes/no or male/female) or with more possible values.
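To make the variable types above concrete, here is a minimal sketch of one such record in plain Python. The record and its field names are hypothetical, invented purely for illustration:

```python
# A hypothetical customer record illustrating the variable types described above.
record = {
    "age": 34,               # numeric, discrete
    "monthly_spend": 56.75,  # numeric, continuous
    "gender": "female",      # nominal, binary
    "segment": "gold",       # nominal, more than two values
    "churned": "no",         # nominal, binary -- a typical target variable
}

# Nominal variables come from a known list of values, which we can check.
allowed = {
    "gender": {"male", "female"},
    "segment": {"gold", "silver", "bronze"},
    "churned": {"yes", "no"},
}
assert all(record[k] in values for k, values in allowed.items())
```

A real dataset would have millions of such records with hundreds of variables each, but the distinction between numeric and nominal attributes is exactly this.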
Given a data set consisting of millions of records, each containing some hundreds of variables, an analyst’s job is to derive insights that solve known or unknown business problems! This is where machine learning algorithms come into play.
Broadly, machine learning algorithms can be put into two groups.
1. Predicting a target variable for a given data record. We start with a set of records where the value of the target variable is known, develop and train a model on them, test it, and put it into production. This is called supervised learning.
a. If the target variable is nominal, these algorithms are called classification.
b. If the target variable is a continuous numeric variable, we apply regression.
2. There is no target variable; we need to group the data records into distinct groups based on multiple variables within the dataset. This is called unsupervised learning. Clustering and association analysis algorithms are used to achieve this.
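The classification/regression split above can be seen in a single supervised predictor: the setup is identical, and only the type of the target decides which task it is. Below is a minimal pure-Python sketch using a 1-nearest-neighbour rule (the data is invented for illustration):

```python
# Sketch: the same supervised learner does classification or regression
# depending on whether the target variable is nominal or numeric.

def nearest_neighbour(train, query):
    """Return the target of the training record closest to `query`.

    `train` is a list of (features, target) pairs; features are numeric tuples.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, target = min(train, key=lambda rec: sq_dist(rec[0], query))
    return target

# Nominal target -> classification
labelled = [((1.0, 1.0), "yes"), ((5.0, 5.0), "no")]
print(nearest_neighbour(labelled, (1.2, 0.9)))   # "yes"

# Continuous numeric target -> regression
valued = [((1.0, 1.0), 10.5), ((5.0, 5.0), 42.0)]
print(nearest_neighbour(valued, (4.8, 5.1)))     # 42.0
```

Unsupervised learning drops the target column entirely and groups the feature vectors among themselves, as in clustering.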
Formulating the problem, preparing the data, visualizing it, training and testing the model, interpreting the results to generate insights, and then applying the derived knowledge to business operations require multidisciplinary skills spanning the business domain, operations management and technology.
A real business analytics solution combines multiple machine learning techniques to achieve customer segmentation, cross-selling, customer behavior analysis, customer retention, marketing analytics and campaign management, fraud detection, profit optimization and more.
A recent quick read through the book Machine Learning in Action by Peter Harrington prompted me to write this tech capsule this Friday…

Friday, January 11, 2013

A journey through Grid and Virtualization leading to Cloud computing

In computing, there are two trends I have seen crisscrossing throughout my career:
1. Making a single computing node look like multiple nodes, using virtualization.
2. Making multiple computing nodes work as a single whole, called a cluster or grid.
So, there was an era when computing really meant “Big Iron”: computers were mainframe-sized and provided a great deal of computing capacity. One computer handled multiple users and multiple operations at the same time. The virtual machine was built into IBM’s mainframe operating systems, and DEC’s VAX systems had similar concepts. Of late, we see the same trend even on desktops, with hypervisors like VMware coming out.
With the advent of mid-range servers of limited capacity, there was a need to put them together to get higher computing power to meet demand. The first commercial cluster was developed by DEC on ARCnet, even though there has always been a dispute between IBM and DEC over who invented clusters. Clustering also provides high availability and fault tolerance along with higher computing capacity. Oracle was the first database to implement a parallel server, on an ARCnet cluster for the VAX operating system.
This trend of cluster computing led to supercomputing: breaking a complex task into multiple parallel streams and executing them on multiple processors. The fundamental challenge in “clustering” is process coordination and access to shared resources, which requires the nodes to be close together and connected by a high-performance local network.
A related concept is grid computing, where an administrative domain connects loosely coupled nodes to perform a task. So we have more and more cores, processors and nodes in a grid, providing low-cost, fault-tolerant computing: smaller components put together to look like one giant computing capacity.
Finally, what I see today is the “Cloud”, which creates a grid of elastic nodes that appears as a single (large) computing resource and gives a slice of virtualized capacity to each of the multiple tenants of that resource.
Designing solutions across each of these technologies (big iron, virtualization, clusters and grids, and now the Cloud) has really been challenging and keeps the job lively…

Thursday, January 3, 2013

Small & Big Data processing philosophies

In this first post of 2013, I would like to cover some fundamental philosophical aspects of “data” & “processing”.
As the buzz around “Big Data” runs high, I have classified the original structured, relational data as “small data”, even though I have seen some very large databases holding 100+ terabytes of data with an I/O volume of 150+ terabytes per day.
Present-day data processing predominantly uses the Von Neumann architecture of computing, in which “data” and its “processing” are distinct, separated into “memory” and “processor” connected by a “bus”. Any data that needs to be processed is moved into the processor over the bus; the required arithmetic or logical operation then happens on it, producing the “result” of the operation, and the result is moved back to “memory/storage” for further reference. The list of operations to be performed (the processing logic, or program) is stored in memory as well: each next instruction must be moved from memory into the processor over the bus.
So, in essence, both the data and the operations to be performed on it live in memory, which cannot process data, while the facility that can process data is always dependent on memory in the Von Neumann architecture.
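The fetch-execute cycle just described can be sketched as a toy simulator. The instruction set here is invented for illustration; the point is that program and data share one memory, and every instruction and operand must cross the “bus” into the processor before anything is computed:

```python
# Toy Von Neumann machine: program and data share one memory dictionary.
memory = {
    0: ("LOAD", 100),    # move data word 100 into the accumulator
    1: ("ADD", 101),     # add data word 101 to the accumulator
    2: ("STORE", 102),   # write the accumulator back to memory
    3: ("HALT", None),
    100: 7, 101: 35,     # the data, living alongside the program
}

pc, acc = 0, 0           # program counter and accumulator in the "processor"
while True:
    op, addr = memory[pc]        # fetch: the instruction crosses the bus
    pc += 1
    if op == "LOAD":
        acc = memory[addr]       # the operand crosses the bus too
    elif op == "ADD":
        acc += memory[addr]
    elif op == "STORE":
        memory[addr] = acc       # the result goes back to memory
    elif op == "HALT":
        break

print(memory[102])  # 42
```

Every step of useful work costs at least one trip over the bus, which is exactly the bottleneck the rest of this post is about.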
Traditionally, the “data” has been moved to where the processing logic is deployed, because the amount of data was small while the processing needed on it was relatively large and involved complex logic. RDBMS engines like Oracle read blocks from storage into the SGA buffer cache of the running database instance for processing, and transactions modified small amounts of data at any given time.
Over a period of time, “analytical processing” required bringing huge amounts of data from storage into the processing node, which created a bottleneck on the network pipe. Add to that the large growth in semi-structured and unstructured data that started flowing in, and a different philosophy of data processing was needed.
This is where HDFS and the MapReduce framework of Hadoop come in, taking the processing to the data. Around the same time came Oracle Exadata, which took database processing down to the storage layer with a feature called “query offloading”.
In the new paradigm, snippets of processing logic are shipped to a cluster of connected nodes where the data, distributed there by a hashing algorithm, resides; the results of that processing are then reduced to produce the final result sets. It is now economical to take the processing to the data, because the amount of data is relatively large while the required processing consists of fairly simple tasks such as matching, aggregating and indexing.
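The map/shuffle/reduce flow described above can be sketched in a few lines of pure Python. This is not Hadoop’s API, just an illustrative word-count pipeline; the hash-partitioning step mirrors how a real cluster routes each key’s records to one worker node for the reduce phase:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (key, value) pairs -- here, (word, 1) for every word.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs, n_partitions=2):
    # Shuffle: hash-partition the intermediate keys, as a cluster would
    # route them to worker nodes, keeping each key's values together.
    partitions = [defaultdict(list) for _ in range(n_partitions)]
    for key, value in pairs:
        partitions[hash(key) % n_partitions][key].append(value)
    return partitions

def reduce_phase(partitions):
    # Reduce: simple aggregation per key -- the "fairly simple tasks
    # of matching, aggregating, indexing" the post mentions.
    counts = {}
    for partition in partitions:
        for key, values in partition.items():
            counts[key] = sum(values)
    return counts

lines = ["big data small data", "big data"]
print(reduce_phase(shuffle(map_phase(lines))))
# counts: big -> 2, data -> 3, small -> 1 (key order may vary)
```

In a real deployment the partitions live on different machines and the mapper code travels to the node holding each block of data, which is precisely the “processing to the data” inversion.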
So, we now have a choice: taking small amounts of data to complex processing in structured RDBMS engines with the shared-everything architecture of the traditional model, or taking the processing to the data in shared-nothing big data architectures. It depends purely on the type of “data processing” problem at hand; traditional RDBMS technologies will not be replaced by the new big data architectures, nor can the new big data problems be solved by traditional RDBMS technologies. They go hand in hand, complementing each other and adding value to the business when properly implemented.
Non-Von Neumann architectures still deserve closer attention from technologists; they probably hold the key to the way the human brain processes and deals with information seamlessly, whether it is structured data or unstructured streams of voice, video and the like.
Any non-Von-Neumann architecture enthusiasts over here?