Blog Moved

Future posts related to technology are published directly on LinkedIn:
https://www.linkedin.com/today/author/prasadchitta

Friday, December 30, 2011

Pensions Regulation in UK

As the year 2011 draws to a close, we are entering a year that is very important for UK workplace pensions: the new regulation on workplace pensions comes into force in 2012.

Having been purely technical for a few years, I just want to test my skills at understanding regulatory documents and extracting the "Functional Information Needs" for a business.

A good review published in October 2010 can be accessed here: http://www.dwp.gov.uk/docs/cp-oct10-full-document.pdf


Basically, all employers in the UK have to automatically enroll every eligible worker falling within a given AGE range and EARNINGS range into a suitable pension scheme. They also need to "certify" that the selected pension scheme meets the required quality criteria (refer to section 6.5 of the review document above). The requirement is that at least 8% of earnings is paid towards a pension fund.

The regulation defines "Qualifying Earnings" as gross earnings, including commissions, bonuses, overtime, etc., but most employers use "Basic Pay", i.e., pensionable pay, as the pension contribution basis. So, an employer primarily has the following options to "certify" the pension scheme and comply with it.

Pseudo logic in plain English (a short code sketch follows the list):
1. IF the pension contribution basis IS qualifying earnings (within the band) THEN pay contributions of 8%; no certification is required.
2. IF the pension contribution basis IS pensionable pay THEN
2a. CASE pensionable pay is 100% of gross pay THEN pay contributions of 7%
2b. CASE pensionable pay is at least 85% of gross pay THEN pay contributions of 8%
2c. CASE pensionable pay is less than 85% of gross pay THEN pay contributions of 9%
AND self-certify the pension scheme for all employees participating in it.
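To make the tiers concrete, here is a minimal Python sketch of the certification logic above. It is purely illustrative: the function and field names are my own, and only the thresholds and rates stated in the pseudo logic are assumed.

def required_contribution_rate(basis, pensionable_pay, gross_pay):
    """Return the minimum total contribution rate implied by the pseudo logic above.

    basis is either "qualifying_earnings" or "pensionable_pay"; the 100%/85%
    thresholds and 7%/8%/9% rates follow the tiers described in the post.
    """
    if basis == "qualifying_earnings":
        return 0.08  # option 1: no self-certification required
    # Option 2: pensionable-pay basis; the rate depends on how much of gross pay is pensionable
    ratio = pensionable_pay / gross_pay
    if ratio >= 1.0:
        return 0.07  # 2a
    if ratio >= 0.85:
        return 0.08  # 2b
    return 0.09      # 2c

# Example: basic pay is 80% of gross pay, so the scheme must contribute 9%
print(required_contribution_rate("pensionable_pay", 32000, 40000))  # 0.09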

So, the core information needed to implement this regulation is employee payroll data covering age and all components of qualifying earnings for every employee. A bit of intelligence is then needed to "model" the best possible grouping of employees and assign each group to a suitable pension scheme with one (or more) of the pension providers in the market.

The overall goal for a techno-functional consultant like me is to optimize the value of new pension regulations for employees, employers, pension providers and IT consulting companies by optimizing the information flows across various stakeholders!!

Wishing everyone a great new year 2012. Let there be peace, security and prosperity for one and all.

Sunday, December 4, 2011

Storing Rows and Columns

A fundamental requirement of a database is to store and retrieve data. In Relational Database Management Systems (RDBMS) the data is organized into tables that contain rows and columns. Traditionally the data is stored in blocks of rows. For example, a "sales transaction row" may have 30 data items representing 30 columns. Assuming a record occupies 256 bytes, an 8KB block can hold 32 such records, and a million such transactions per day need about 31,250 blocks. All this works well as long as we need the data as ROWS! If we access one row or a group of rows at a time to process the data, this organization poses no issues.

Now consider a query for the total value of items of type X sold in the past seven days. In a row store, this query has to retrieve seven million records of 30 columns each just to sum the sales of type X, even though only two columns, item type and amount, are actually needed. This type of analytical requirement leads us to store the data by columns: we group the values of each column together and store them in blocks, which makes retrieving just those columns from the overall table much faster for the purpose of analyzing the data.
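A toy Python sketch of the difference (illustrative only; real engines work at the block level, not on Python lists):

# Row layout: every query touches whole records, even when it needs only two fields.
rows = [
    {"item_type": "x", "amount": 10.0, "store": "A"},  # plus ~27 more columns in a real sales row
    {"item_type": "y", "amount": 5.0,  "store": "B"},
    {"item_type": "x", "amount": 7.5,  "store": "A"},
]
total_row_store = sum(r["amount"] for r in rows if r["item_type"] == "x")

# Column layout: each column is stored (and read) separately, so the same
# aggregate touches only the two columns it actually needs.
columns = {
    "item_type": ["x", "y", "x"],
    "amount":    [10.0, 5.0, 7.5],
    "store":     ["A", "B", "A"],
}
total_col_store = sum(a for t, a in zip(columns["item_type"], columns["amount"]) if t == "x")

assert total_row_store == total_col_store == 17.5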

But column storage has its limitations when it comes to writes and updates.

With social data, where a high volume of writes is needed (messages and status updates, likes and comments, etc.), highly distributed, NoSQL-based column stores are emerging into the mainstream. Apache Cassandra is a new breed of NoSQL column store, initially developed at Facebook.

So, we now have a variety of databases / data stores available: standard RDBMS engines with SQL support for OLTP applications, column-based engines for OLAP processing, NoSQL key-value stores for in-memory processing, highly clustered Hadoop-style map/reduce frameworks for big data processing, and NoSQL column stores for high-volume social write and read efficiency.


Making the right choice of data store for the problem in hand is becoming tough with so many solution options. But that is the job of an architect; is it not?


Friday, November 11, 2011

numbers and counters

There is some amount of hype around 11-11-11, i.e., 11-November-2011... I see numbers as just counters; by themselves they do not make much sense unless identified with something meaningful.

It is 14,581 days since I was born, 8,495 days since I became associated with software/computers, 6,240 days since I started working, and so on. In all these counters, "I" remains constant while the numbers move on...

Several other numbers (top 10s, Fortune 500s, etc.) also create some hype from time to time; but each is continuously replaced in the flow of numbers.

Especially in the current era, where numbers and counters are given such high importance, the true importance of "Identity" and "Intelligence" seems to have been lost...

I hope 11:11 AM IST on 11-11-11 brings some common sense to the world in general and to the Information Technology world in particular...

All the best, folks!

I would like to add one memorable item from the year 2001 on this occasion. We completed our first "Consolidation" project a decade back, and for it we received a 500-million-year-old natural slate piece, printed with a small message, as a memento.




Friday, October 14, 2011

Most efficient multi-set Cartesian join in C

At the beginning of my career, with the Indian Space Research Organization, I was posed a challenge that required implementing a multi-set Cartesian product with absolutely minimal memory usage to solve an optimization problem. (See my old post describing the problem: Simple looking Complex problem.)


As a tribute to Dennis M. Ritchie (also known as dmr), the creator of the C language, who passed away yesterday, I am posting my implementation of this algorithm in C.

I consider the seven "highlighted" lines of C code above to be one of the earliest and most notable achievements of my career!
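The original C listing was posted as an image and is not reproduced in this text. For readers curious about the general idea, here is a rough sketch (in Python, for brevity) of one constant-memory approach, an odometer-style index counter that never materializes the full product. This is my illustration of the technique, not the original implementation.

def multiset_product(sets):
    """Yield one combination at a time, keeping only one index per input set in memory."""
    k = len(sets)
    idx = [0] * k                      # the "odometer": one digit per set
    while True:
        yield [sets[i][idx[i]] for i in range(k)]
        pos = k - 1                    # advance the rightmost digit, carrying leftwards
        while pos >= 0:
            idx[pos] += 1
            if idx[pos] < len(sets[pos]):
                break
            idx[pos] = 0
            pos -= 1
        if pos < 0:                    # every digit rolled over: all combinations produced
            return

# Example: 3 x 2 x 2 = 12 combinations, produced one at a time
for combo in multiset_product([[1, 2, 3], [10, 20], [100, 200]]):
    print(combo)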


If there is any better implementation to solve the stated problem please let me know by posting a comment.....  

Sunday, October 9, 2011

ACID and BASE of data

I am completing 18 years of working in the field of Information Technology.

All these years, an enterprise data store has generally provided four qualities to its transactions: Atomicity, Consistency, Isolation and Durability (ACID). Oracle has emerged as a leader in providing enterprise-class ACID transactional capabilities to applications.

Recently, at OpenWorld 2011, Oracle announced a NoSQL database, which is typically characterized by the BASE acronym: Basically Available, Soft state, Eventually consistent.

I see a lot of debate of late on SQL vs NoSQL, ACID vs BASE and Shared Everything vs Shared Nothing architectures for data stores; and with Oracle getting onto the NoSQL bandwagon, this debate has just taken on additional momentum.

Oracle has posted this paper nicely explaining their NoSQL database. http://www.oracle.com/technetwork/database/nosqldb/learnmore/nosql-database-498041.pdf

In my opinion, the SQL vs NoSQL choice is straightforward to make:

The big question: are we storing data, or BIG DATA? (Read my old post on transactional data vs machine-generated big data: http://technofunctionalconsulting.blogspot.com/2011/02/analytics.html)

With the new trends in "BIG DATA", almost all the data becomes key-value pairs with read and insert-only operations and minimal or no updates to the data records. NoSQL/BASE is best suited to handle this type of data. Still, traditional transactional databases of an OLTP nature need ACID-compliant transactions.

So, when designing big data solutions, an architect should surely look at a NoSQL dataBASE. Is it not?

Publishing this post on 09/10/11 (dd/mm/yy) and this is my 85th post to this blog.

Wednesday, September 28, 2011

User Experience - HTML5

Approximately two years back, I made a post on "Rich Internet Applications", where the development of user experience focused on browser plug-ins or run-time environments like Adobe AIR for a rich experience. (See the Desktop Widgets post....)

As technology has progressed over the last two years, the new/emerging HTML5 seems to be taking web user experience design into a standards-based, plug-in-independent mode.

Some examples can be found on : http://www.apple.com/html5/

Another dimension today is the "mobile devices" along with the browser on desktop/laptop. 

When it comes to mobile devices and integration with specific device capabilities, one should develop "native applications" to take full advantage of the device's native hardware. Standards are good, but native applications can do better. On the other hand, using standards we can develop once and deploy on multiple devices, whereas native application development requires "effort/time" on each platform...

So, there is no silver bullet for the rich user experience needs of an ever-changing world!

Saturday, September 24, 2011

Ancient advice applicable for projects & tasks

I have been writing this techno-functional consulting blog for the past 4 years, and I would like to bring an ancient touch to modern project management:

अफलानि दुरन्तानि समव्ययफलानि च
अशक्यानि च कार्याणि नारभेत विचक्षणः

aphalAni = those without fruit (i.e., fruitless); durantAni = those with a bad ending (i.e., ending in failure); sama-vyaya-phalaani ca = and those whose result merely equals the effort (i.e., ending in neither profit nor loss); aSakyAni ca = and those which are beyond one's capability (i.e., impossible ones); kAryANi = activities, projects, tasks; na+ArabhEta = should not be started or initiated; vicakshaNaH = by the wise man.

In my view, there is only a 50% success rate in Information Technology projects. So, it is wise to start only the projects that are sure to be successful. The ancient scholar of the above verse is saying:

Don't take up a task/project if it is known to be:
a. meaningless or fruitless,
b. sure to end badly,
c. of no gain and no loss, or
d. impossible or beyond one's own capability.

We will improve the "success rate" if we follow this basic advice before taking up the projects & tasks!!

Given a chance, I would put this at the beginning of the PMP and PRINCE2 certification material.




Saturday, September 10, 2011

Tiered data storage

Hierarchical Storage Management
Not too long ago (about 10 years back), I worked out a strategy for "Data Archival" options for a system with lots of data that needed to be preserved for 25 years for legal reasons (a sort of records management requirement). The requirement was to keep fully queryable, fine-grained data in the system. The key challenge was keeping all the data in online storage with the technology available at that time. So, we needed a clear "Archival Strategy" to move data off disk to tape and to preserve the "tapes" in a way that they could be retrieved (by means of proper labelling, etc.) on demand within the given service levels. This approach was later named Hierarchical Storage Management. The overall strategy included manual tiering of data between disks and tapes, sometimes using mechanical robotic arms and the associated software around them.

Information Life-cycle Management 
As technology advanced, disk storage evolved into multiple bands of cost/functionality. Database software like Oracle came up with options such as table partitioning and advanced compression. Combining these advances in database management systems and in storage, a new strategy emerged: Information Life-cycle Management. Logically partitioning the tables and placing the partitions on different types of storage, such as Enterprise Flash Disks (EFD), Fibre Channel (FC) and SATA disks, using automated storage tiering is the trend of the day.
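As a simple illustration of the idea (my own sketch; the tier names and age cut-offs below are invented for the example), an age-based placement policy for table partitions might look like this:

from datetime import date

TIER_POLICY = [
    ("EFD", 90),       # partitions younger than ~3 months stay on flash
    ("FC", 730),       # up to ~2 years on Fibre Channel disks
    ("SATA", None),    # everything older goes to cheaper SATA storage
]

def choose_tier(partition_high_date, today):
    """Pick a storage tier for a partition based on the age of the data it holds."""
    age_days = (today - partition_high_date).days
    for tier, max_age in TIER_POLICY:
        if max_age is None or age_days <= max_age:
            return tier

print(choose_tier(date(2011, 8, 31), date(2011, 9, 10)))   # EFD
print(choose_tier(date(2010, 1, 31), date(2011, 9, 10)))   # FC
print(choose_tier(date(2005, 1, 31), date(2011, 9, 10)))   # SATA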

Thin provisioning technologies like EMC's Fully Automated Storage Tiering - Virtual Pooling (FAST VP) and Hitachi's Dynamic Tiering, when used with Oracle's ASM and the partitioning and advanced compression options, give the best flexibility, performance and value for money. There is a good whitepaper from EMC, published a few months back, that can be found here.

Conclusion:
Most storage vendors now have tiered storage technology embedded in the disk controller software layer, which can automate data migration or intelligently cache and tier the data across multiple types of storage. Using the available technology with the right mix of logical database features and storage virtualization leads to better data availability at optimal cost. Still, the "right solution" is the job of a knowledgeable architect (one who understands both the business and the technology well!).

Monday, August 8, 2011

web age of WWW

As the WWW turned 20 over the weekend (link to the first webpage), my association with computers turns 23 years today. The WWW is estimated to have approximately 20 billion pages as of today.

The information-hungry world has started making "Assets" out of information. Information is classified as confidential, sensitive, internal, limited circulation, public, etc., and some companies live purely on "Informational Assets" today...

Protecting these information assets in the present-day scenario of hacking (Operation Shady RAT, and reports stating that the claims of Shady RAT are themselves shady!!) is truly a challenge. Information storage and its regulated flow to different end points need to be fully governed and secured.

My past blog posts related to the Information Security:
1. Data Security Technologies
2. Maximum Security Architecture
3. Identity and Access Management

With all this technology, there is still a lot of "insecurity" among technologists. Why?
Originally, information was published by its owner, who would secure it with the necessary proven authentication; overall, the information flow was between two known entities (e-mail, etc.).
OR
Public information was broadcast to reach the maximum number of recipients (spam mails, etc.).

As the WWW advanced to "Social" media, information is now published by individuals for consumption by like-minded individuals who may be directly known or unknown to the original publisher. This mode of information flow makes the whole process of information security very complex.

Technology surely can live up to the challenges posed by the trends in the information management area. The only thing needed now is clever brains to tackle the threats... It is all in the proper implementation of the available technology...


On this 8,400th day of my association with computers and software, I am working on securing information in the financial industry... Let us all hope for another 20 years of a flourishing, safe and secure WWW...

Sunday, July 10, 2011

SQL Plan Stability

Recently I came across a "performance problem" on an Oracle database. A fairly innocent-looking insert statement was intermittently taking "hours" to complete.

Problem:
INSERT INTO TARGET_TABLE ("Column List")
SELECT "values"
FROM SOURCE_TAB1, SOURCE_TAB2, SOURCE_TAB3
WHERE "All necessary join conditions and other conditions"

As the statement was performing well in some instances and causing problems only in some cases, I looked at the plans. It was generating two different plans: one with simple nested loops and another with a Cartesian join. When the second execution path was taken, it needed a lot of CPU and memory resources.

When I looked at the source of this query, it originated in a job that uploads a set of files into the database. Table stats are collected just before running the job. The development team had tested it several times in their database and never had a performance problem, even in the UAT environment. This behavior appeared only in the new environment that was built to become production!

Reason:
As the optimizer statistics fluctuate highly from one file load to the next, the Oracle database generates different plans when the query is executed with different bind variables.

Solution:
Plan stability can be achieved by:
1. Importing the statistics from a stable environment and locking them
2. Providing hints
3. Creating SQL profiles
4. Generating stored outlines
5. Using the SQL Plan Management (SPM) functionality in 11g
We took the simple approach of importing and locking the statistics for the intermediate file-upload schema, which worked well in that environment. But the more sophisticated SQL Plan Management functionality in 11g can solve most plan stability issues. See this Oracle TWP for more details: http://www.oracle.com/technetwork/database/focus-areas/bi-datawarehousing/twp-sql-plan-management-11gr2-133099.pdf
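For illustration, a minimal sketch of the "import and lock" idea using the DBMS_STATS package, driven here from Python with the python-oracledb driver. The connection details, schema name, table names and statistics-table name are placeholders, and the exact calls we ran in that environment may have differed.

import oracledb

conn = oracledb.connect(user="app_owner", password="***", dsn="newprod")
cur = conn.cursor()

# Staging tables whose statistics keep fluctuating between file loads (names are illustrative).
for tab in ["SOURCE_TAB1", "SOURCE_TAB2", "SOURCE_TAB3"]:
    # Bring in statistics previously exported from the stable environment into STABLE_ENV_STATS.
    cur.execute(
        "begin dbms_stats.import_table_stats("
        "ownname => :own, tabname => :tab, stattab => 'STABLE_ENV_STATS'); end;",
        {"own": "APP_OWNER", "tab": tab},
    )
    # Lock them so subsequent gathers cannot flip the optimizer onto the Cartesian-join plan.
    cur.execute(
        "begin dbms_stats.lock_table_stats(ownname => :own, tabname => :tab); end;",
        {"own": "APP_OWNER", "tab": tab},
    )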

Notes:
As there is no silver bullet, one should be careful in implementing new features. This blog post explains the flip side of SPM and how to be careful with it.

To put it simply, be careful with setting optimizer_capture_sql_plan_baselines to TRUE. One can enable this parameter at the session level, capture the needed baselines and use them to get consistent performance!

Finally, however clever the RDBMS engine becomes, it can still commit blunders! An experienced DBA can never be replaced when dealing with performance issues!!

Thursday, June 16, 2011

data consolidation (ETL) and data federation (EII)

Operational IT systems focus on supporting business operations and enable the capture, validation, storage and presentation of transactional data during the normal running of operations. They contain the latest view of the organization's operational state.

Traditionally, the data from various operational systems is extracted, transformed and loaded into a central warehouse for historical trending and analytic purposes. This ETL process needs separate IT infrastructure to hold the data, and it introduces a time lag before information in the OLTP systems becomes available in the central data warehouse.

When the costs/resources required to consolidate data in the traditional way are not suitable (due to the latest trends of acquisitions, etc.), there is a need for a different mechanism of data integration. A relatively different way of looking at this problem is to provide a semantic layer that can be used to access data across heterogeneous sources for analytical purposes. This new way is called "Data Federation", "Data Virtualization" or EII (Enterprise Information Integration).

The key advantages of EII are quick delivery and lower cost. The key disadvantages are the performance of the solution and its dependence on the source systems.

A good use case of data virtualization in my view is to consolidate different enterprise data warehouses due to mergers/acquisitions.

Traditional ETL and data warehouse technology vendors are coming up with data federation tools. Informatica Data Services follows a consolidated data integration philosophy, whereas Business Objects Data Federator uses virtual tables in the BO universes to provide the same functionality. Composite Integration Server is the independent technology provider in this area.

Key considerations in selecting data federation and associated technologies are:
1. native access to the heterogeneous source systems
2. capabilities of access method optimization
3. caching capabilities of the federation platform
4. metadata discovery capabilities from various sources
5. ease of development

A carefully chosen hybrid approach of consolidation and federation of data is required for a successful enterprise in the modern world.

Saturday, April 30, 2011

SQL performance tuning

Having seen several performance problems in IT systems, I have a methodology for performance tuning. When it comes to SQL query tuning, it needs to be slightly different.

Nine out of ten performance problems on relational database systems relate to bad SQL programming. Even with the latest "optimizers" within the commercial database core execution engines, it is the developer's skill in using these facilities effectively that gets the best out of the RDBMS.

I recently came across a typical problem with a "junction table" design.
A set of tables represent USER and all are connected by a USERID
Another set of tables represent ACCOUNT and all are connected by ACCOUNTID

The software product implements the one to many relationship using a junction table called USER_ACCOUNTS which contains (USERID, ACCOUNTID) with the composite primary key USERID, ACCOUNTID.

Now there are 30K users, 120K accounts and 120K USER_ACCOUNTS rows, and a query that needs to get data from the USER tables (involving some outer joins on itself) and the ACCOUNT tables (which join multiple tables to get various attributes), all linked together in one join using the junction table. That query runs for 18 hours.

When the query is split into two inline views, with all the required filtering applied separately on the USER side and on the ACCOUNT side, and the two are then joined using the junction table, it completes in 43 seconds.

So, FILTER and then JOIN is better than JOIN and then FILTER in terms of resource consumption. Hence performance tuning is all about doing the sequence of actions in the right order, to minimize the resources consumed in performing the job!
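A toy Python sketch of the same principle (the tables, filters and sizes are invented; a real database applies this at the optimizer/execution-plan level):

users = [{"userid": u, "region": "EU" if u % 10 == 0 else "US"} for u in range(30000)]
accounts = [{"accountid": a, "status": "OPEN" if a % 4 == 0 else "CLOSED"} for a in range(120000)]
user_accounts = [{"userid": a % 30000, "accountid": a} for a in range(120000)]

users_by_id = {u["userid"]: u for u in users}
accounts_by_id = {a["accountid"]: a for a in accounts}

# JOIN then FILTER: build the full 120K-row joined set, then throw most of it away.
joined = [(users_by_id[ua["userid"]], accounts_by_id[ua["accountid"]]) for ua in user_accounts]
slow = [(u, a) for u, a in joined if u["region"] == "EU" and a["status"] == "OPEN"]

# FILTER then JOIN: shrink each side first, then join only the survivors via the junction table.
eu_users = {u["userid"] for u in users if u["region"] == "EU"}
open_accounts = {a["accountid"] for a in accounts if a["status"] == "OPEN"}
fast = [
    (users_by_id[ua["userid"]], accounts_by_id[ua["accountid"]])
    for ua in user_accounts
    if ua["userid"] in eu_users and ua["accountid"] in open_accounts
]

assert len(slow) == len(fast)  # same result, far fewer intermediate rows touched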

Tuesday, April 26, 2011

Data Serialization, Process Parallelization!

There is a lot of buzz about moving away from traditional data processing, i.e., a relational database with persistent data in relational form, processed by a set of processes that capture, process (validate, summarize, re-format, etc.) and present (display in multiple formats over multiple channels, in verbal and multimedia forms) that data.

But where are we going? Towards object orientation: encapsulating data with its own operations to make loosely coupled application services that can be orchestrated to form business services within an enterprise... Those enterprise business services are further choreographed to form business-to-business flows across common interfacing models...

Traditional computer architecture has a processor that processes data stored in a distinct memory; the processor and memory are two distinct components of the basic architecture of the modern computer. When an "object" needs to be stored, or shared between two different applications, it has to be "serialized"; we have ORM layers like Hibernate and formats like JSON developed for this data serialization...
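For example, a minimal Python sketch of serializing an object to JSON so another application (or a data store) can consume it; the class and fields are invented for illustration:

import json
from dataclasses import asdict, dataclass

@dataclass
class Order:               # an invented domain object, purely for illustration
    order_id: int
    item_type: str
    amount: float

order = Order(order_id=42, item_type="x", amount=17.5)

# Serialize: turn the in-memory object into text that can cross process,
# machine or storage boundaries.
payload = json.dumps(asdict(order))
print(payload)             # {"order_id": 42, "item_type": "x", "amount": 17.5}

# Deserialize on the other side and rebuild an equivalent object.
restored = Order(**json.loads(payload))
assert restored == order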

At the same time, there is a trend towards processing the data more and more in parallel streams on shared-nothing-style clusters: breaking a typical task into smaller pieces and summarizing the results in a hierarchical fashion to arrive at the final result. This can happen as the data becomes more and more unstructured with the help of objects!
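A minimal sketch of that split-then-summarize (map/reduce) idea in Python, using a local process pool to stand in for a cluster of shared-nothing workers (the data, chunk size and worker count are arbitrary):

from multiprocessing import Pool

def partial_sum(chunk):
    # "map" step: each worker summarizes its own slice of the data independently
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1000000))
    chunks = [data[i:i + 100000] for i in range(0, len(data), 100000)]

    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, chunks)   # process the pieces in parallel

    total = sum(partials)                          # "reduce" step: combine the partial results
    assert total == sum(data)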

Overall, the trend means we are slowly moving away from structured data stores in traditional relational databases and closer to natural-language, fault-tolerant and predictive data capture and processing (e.g., you can type any spelling into Google and it will return results for the right word!), and towards more visual and multimedia presentation of information (in mashups, maps) with bi-directional interaction (as in social media, where I can "like", "comment", etc. on the presented data as feedback!).

That is just my view.... Businesses have to gear-up quickly to adapt to these trends!

Thursday, March 31, 2011

Leaving IBM...

Having worked with IBM for a few months, I have decided to move on. It has been a great experience to work with the world's No. 1 software company. It is very special, as IBM is celebrating its centenary as an organization.

I have been working on smart energy initiatives in the Application Innovation Services division. The theme of my work at IBM has been "Smarter Grids" (follow the link to see what a smart grid is!).

I take this opportunity to wish all my colleagues "all the best" until our paths cross again in this small IT world, sometime, somewhere!

Tuesday, March 22, 2011

Integration - Data or Information, Application

Having written posts on the topological differences in integration (EAI vs ESB) and on management approaches (centralized vs distributed) a long time back, I am now making an attempt to look at integration in terms of data integration vs application integration.

We touched upon the subject of application integration in an earlier post - Binding energy in software systems.

So, what is data integration? Typical Extract-Transform-Load (ETL), data migration, Change Data Capture (CDC) and Master Data Management scenarios are classified as data integration. There are several platforms from big and small vendors to achieve this (IBM InfoSphere/DataStage, Informatica, Microsoft, Oracle and Pervasive have suites/products under this head).

How is this different from traditional application integration? Application integration focuses on integrating business processes and supporting the information workflows. Data integration focuses primarily on the propagation and synchronization of data across the enterprise system landscape, sometimes spanning into the cloud.

When volumes are high, a data-integration-based approach has an advantage; if the process/workflow complexity is high and orchestration is needed to achieve the integration, then one should look at an ESB/SOA-style integration.

Enterprise data integration platforms are quite similar to EAI platforms in their technical architecture (a hub/spoke topology, as in EAI), with an ETL hub that loads the data into a warehouse.

Of late, the buzz is moving toward real-time data integration based on CDC, etc.; let us see how it goes and changes the game!

Friday, February 25, 2011

Architectural Approaches

Having written a small post on "Enterprise Solution Architecture" on this blog some time back, I now want to touch upon architectural styles/approaches.

Typically, the architecture discipline is about making "models", and architecture is thereby naturally model-driven. Model Driven Architecture (MDA) is an approach to building various models using UML (the Unified Modeling Language) to define the "structure" and "behavior" of the system being modeled.

But a "model" can only represent a specific viewpoint of the "system" being modeled. So, there should be a standard set of viewpoints, fitted into a framework, to describe a typical enterprise. The Open Group's TOGAF and its ADM try to do that by defining specific "architectural layers" (business, information, application and technology) with multiple viewpoints such as the functional view, security view, user view, communications view, management view, etc. The ADM gives a methodology to select the key stakeholders and the viewpoints for which the architecture needs to be developed.

IEEE 1471 gives a recommended practice for "Architecture Descriptions" that generalizes specific frameworks in generating these models within the system's context, its stakeholders and their specific needs.

But there has never been an Information Technology architecture that is purely greenfield. There is always an "As-Is" architecture and a "To-Be" architecture that will be built based on the current problem domain. SEI's ATAM is a tradeoff analysis methodology to evaluate architectures and evolve them.

So, the architecture discipline develops models that go out of sync by the time the solution goes live in production and starts solving the problem. It is really difficult to keep the models in sync with the reality on the ground.

Service Oriented Architecture (SOA) looks at the enterprise as a set of loosely coupled services that interact to run the enterprise in its environment. This gave rise to deploying and hosting services in a marketplace-like environment called the "Cloud", which shifts the architecture paradigm towards "Cloud Computing".

While the majority of architectures, even in SOA, are flow-based, a different approach is available: Event Driven Architecture (EDA). This approach looks at processing events and at events triggering various workflows in a business environment. With complex event processing, which correlates events over cause-effect, spatial or temporal dimensions, it has specific uses in the service management and Business Performance Management areas.

When solving complex problems, all the above architectural approaches can be used, based on their fit and on the availability of time and resources.

But the key is to have the right set of people, processes and tools in developing these architectural views!

Wednesday, February 16, 2011

Analytics

Analytics is the "science of analysis": dividing a large set of data into certain "themes" and understanding the various relationships between them, to make some business sense of the data and use it for strategic advantage or for efficient operations.

In the olden days, the data stored in computers was predominantly "human-generated transactional data", i.e., when a transaction happened between a producer and a consumer, the data related to that transaction was stored in the software systems. This is a relatively small amount of data.

As the days progressed, various "machine-generated events", like a customer's website views, individual clicks, data from automated sensors (like RFID, etc.) and various system log events, began to be stored for analyzing behavior.

This post enumerates different mathematical models and their uses in the field of business and web analytics (a small code sketch of a predictive model follows the list).

1. Descriptive models: used to classify the data into different groups; for example, deriving a person's age from the date of their first driving license, or determining sex from height and weight. The focus is on as many variables as possible.

2. Predictive models: used to find causal relationships between the themes in the data. The focus is on specific variables. These models give a probability for a set of outcomes.

3. Optimization/decision models: used to derive the definite impact of a certain decision and to optimize the result within a set of constraints, based on the data.
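For instance, a minimal predictive-model sketch in Python; the training data is made up, and scikit-learn is used only as one convenient example toolkit:

from sklearn.linear_model import LogisticRegression

# Invented training data: [page_views, minutes_on_site] -> did the visitor purchase (1) or not (0)?
X = [[2, 1.0], [3, 2.5], [5, 4.0], [12, 9.0], [15, 11.5], [20, 16.0]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)

# A predictive model yields a probability for each outcome rather than a certainty.
print(model.predict_proba([[8, 6.0]]))   # e.g. [[p_no_purchase, p_purchase]]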

PMML (Predictive Model Markup Language), from the DMG, is the XML-based standard that can be used to exchange models across multiple supporting applications.

The trend is towards in-database analytics, which brings data analytics into the database core engine, and towards databases built specifically for analytics on columnar storage, making the database an "analytical database" instead of a mere data storage and retrieval engine.

Oracle has published a good reference paper on this subject that can be found here -Predictive Analytics: Bringing tools to data.

Over and above thematic analysis, there is an increasing demand for spatial and temporal analysis of the data. A future in which the field of analytics converges into a single set of tools, where one can slice and dice the data across thematic, spatial and temporal dimensions at the same time over loads and loads of machine-generated data, is not far away...

Recently, I came across this paper, which presents a framework for thematic, spatial and temporal analytics that could possibly be combined with a data mining option...

Tuesday, January 11, 2011

Accountability and Authority

Back to fundamentals of project management on this 11/1/11.

The RACI matrix is a well known tool to identify the

Responsible
Accountable/Authority
Consulted and
Informed

parties involved in successful completion of a "Task".

Out of these four types of parties, there can be many responsible, many consulted and many informed, but there should be one and ONLY ONE ultimately accountable party for every given task within a project plan.

Why?
There should be one person in complete authority over a task, empowered to take decisions during task execution and to make sure the task is completed with the required quality, within the time and within the effort allocated to it. This person should have the final "authority" to sign off the deliverable.
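As a small illustration, the "one and only one accountable" rule can even be checked mechanically over a plan. A hypothetical sketch (the task names, people and matrix format are invented):

# A toy RACI matrix: task -> {party: role}, with roles R/A/C/I.
raci = {
    "Design data model": {"Asha": "A", "Ravi": "R", "PMO": "I"},
    "Build ETL job":     {"Ravi": "A", "Kiran": "R", "Asha": "C", "John": "A"},  # two A's: invalid
    "Sign off UAT":      {"Kiran": "R", "PMO": "I"},                             # no A: invalid
}

for task, assignments in raci.items():
    accountable = [party for party, role in assignments.items() if role == "A"]
    if len(accountable) != 1:
        print(f"'{task}' breaks the rule: {len(accountable)} accountable parties")
    else:
        print(f"'{task}' OK - accountable: {accountable[0]}")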

These days, technical architects are made accountable for tasks without the necessary authority being assigned.

Influencing people without authority, and the ability to remain accountable for tasks, is one of the key (soft) skills of a technical architect.