IT strategy, big data, data quality
In a previous post (see here), I wrote about how the buzz around big data technologies is forcing financial services (and indeed, other) organizations to take a fresh look at their data ‘pipelines’ – the mechanisms they use to store data and ferry it between its sources and its consumers.
Arguably, the data pipeline process that is changing the most because of these technologies is the extraction / transformation / loading (ETL) of data.
A much needed (data) journey
Even if an organization knows what specific data it holds, where that data lies and how it can be accessed, that awareness is not the same as putting the data to effective use in data consumer-facing processes like reporting & BI.
This is what ETL processes do. They help get this ‘discovered’ data from its existing locations to the places where it can be readied for use in the data consumer-facing processes (e.g. data warehouses, datamarts, ODS etc.). In addition to being transported between applications, the data can also be improved, enriched and transformed along the journey for more efficient use.
The promise of speed
Technology that was originally built for use with big data can transform how data acquisition processes are executed. If one reads the current discourse from the big data technology vendors, one is likely to believe that traditional ETL has undergone a transformational change (excuse the pun!). The elements of the ETL process have been moved around – its current avatar is the ELT process, driven by the ‘schema-on-read’ philosophy.
This philosophy focuses on getting data into the big data infrastructure from the source systems in its as-is form, without bothering much about its structure (‘schema’). (E)xtraction and (L)oading of data gain prominence. (T)ransformation of data, when required, is carried out afterwards within this infrastructure itself, depending on the need the data is required to serve.
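To make the schema-on-read idea concrete, here is a minimal Python sketch. The record layout, the field names (acct_id, acct_type, aum) and both helper functions are hypothetical illustrations and do not come from any particular big data product. Raw records are stored untouched, and a structure is imposed only when a consumer reads them.

```python
import json

# Hypothetical raw extracts from two source systems, landed as-is (E + L).
RAW_LANDING_ZONE = [
    '{"acct_id": "A1", "acct_type": "IRA", "aum": "1200.50"}',
    '{"acct_id": "A2", "type": "Joint", "aum": 300}',
]


def load_as_is(lines):
    """Store records exactly as received; no schema is enforced on write."""
    return list(lines)


def read_with_schema(stored):
    """Apply a schema only at read time (T), tolerating source differences."""
    rows = []
    for line in stored:
        rec = json.loads(line)
        rows.append({
            "acct_id": str(rec["acct_id"]),
            # The two hypothetical sources name this field differently.
            "acct_type": rec.get("acct_type") or rec.get("type"),
            # One source sends the amount as text, the other as a number.
            "aum": float(rec["aum"]),
        })
    return rows


stored = load_as_is(RAW_LANDING_ZONE)
rows = read_with_schema(stored)
```

Note that loading succeeds no matter how inconsistent the sources are. Every inconsistency has to be handled later, inside the read-time logic.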
There is no doubt that this philosophy, supported by technology, has helped speed up the technical implementation of acquisition of data by the data infrastructure.
On the other hand, I also wonder if this increased efficiency in certain parts of the data pipeline can lead to a blind spot in its other parts. Let me try and explain with a real life case.
It was an incredible drive...
A leading wealth management company approached us to assess their data warehousing + BI initiative which had run into a set of problems, and recommend remedial measures.
They had grown through acquisitions, and their technology ecosystem reflected that – multiple applications in use, performing the same or slightly different functions. Rather than wait for the rationalization of their ecosystem, they had decided to build a data warehouse and a BI platform so that their internal and external consumers could access the available data and quickly derive value from it.
They had chosen a big data technology product to build a data warehouse, which would pull data in from multiple applications (12 in the initial phase) and deliver it in usable form to their enterprise BI platform for reporting and other uses (like predictive analytics).
In fact, they had adopted an efficient parallel implementation strategy – once the reporting & analysis business requirements had been gathered, they initiated parallel tracks – one for building the BI components, another for data acquisition (schema-on-read style).
Data acquisition was surprisingly fast – in fact, much faster than traditional mechanisms. In some cases, they even went ahead and acquired most data from a source because it was so easy. The big data platform they had chosen was largely SQL-compliant, which meant transformation of data for use in reporting was not very onerous either. On the other side, their reports and dashboards were being rapidly developed in parallel using an off-the-shelf BI product.
The first set of dashboards started getting data in about 5 months from initiation. If you think about it, that is an incredible time to market for a data warehousing program!
...till we hit the data in the blind spot
However, the business teams started reporting problems – the dashboards / reports were working mostly as expected in terms of functionality, but there were problems with the results. Some of these problems were serious.
For example, a large percentage of their current AUM could not be attributed to specific product ‘sponsors’. Many financial advisors could not be uniquely associated with branch offices.
When they evaluated data from just two source systems as a part of their data investigations, they found 2200+ account types. Even after allowing for the 750 duplicate values across the two systems, there were 1500 ‘unique’ account types that had to be contended with. (For reference, anything more than 40 unique basic account classifications should send the data quality radars buzzing.) In other words, a large data rationalization problem.
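The kind of profiling behind such counts can be sketched in a few lines of Python. The sample values and the function name profile_account_types are made up for illustration, and I am assuming ‘duplicate’ means an exact match between the two systems, which may not be how the counts above were actually produced.

```python
def profile_account_types(source_a, source_b):
    """Count total, shared and distinct account type values in two sources."""
    a, b = set(source_a), set(source_b)
    return {
        "total": len(a) + len(b),   # values per source, added together
        "shared": len(a & b),       # exact duplicates across the sources
        "distinct": len(a | b),     # 'unique' values left to contend with
    }


# Hypothetical account type values from two source systems.
source_a = ["IRA", "Roth IRA", "Joint", "401k"]
source_b = ["IRA", "JOINT", "Trust", "401k"]

profile = profile_account_types(source_a, source_b)
```

In this sample, "Joint" and "JOINT" are counted as two distinct types because the comparison is exact. That is the rationalization problem in miniature: simple de-duplication does not reveal how many of the ‘unique’ values really mean the same thing.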
To support DOL ERISA Fiduciary rule related analysis by their Consulting & Strategy team, they had to classify account types into qualified and non-qualified. You can imagine how difficult this was with the way account type information was being made available in the dashboards. Imagine selecting from a list of 1500 ‘account types’ in a screen to run a report or see a chart.
There were more such examples. Feedback from the business strongly suggested that the experience of consuming data via the platform was far from optimal, both in terms of data accuracy and in terms of the overall data-related user experience.
In other words, the program had hit a data blind spot!
Did someone move our columns?
In the traditional ETL (‘schema-on-write’) model, we have to design for data to come into the warehouse once the requirements have been analyzed. This design phase involves inventorying the data in the various sources and building the data schema(s) in the warehouse. This phase typically requires a lot of diligence, especially in complex programs.
In addition to schema definition, this phase forces the organization to acknowledge data related issues (data quality, reference & master data management, data harmonization across sources) head-on, investigate and remediate them. This data remediation involves substantial collaboration between business and IT SMEs.
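As a rough illustration of why schema-on-write surfaces these issues early, consider the Python sketch below. The schema REQUIRED_FIELDS and the function write_with_schema are hypothetical. A real warehouse enforces its schema through table definitions and load jobs rather than hand-written checks, but the effect is similar: a non-conforming record is rejected at load time and someone has to deal with it.

```python
# Hypothetical target schema agreed during the design phase.
REQUIRED_FIELDS = {"acct_id": str, "acct_type": str, "aum": float}


def write_with_schema(record, table):
    """Append the record only if it conforms to the schema; otherwise raise."""
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected):
            raise TypeError(f"{field} must be {expected.__name__}")
    table.append(record)


warehouse = []
rejected = []

# A conforming record is written.
write_with_schema({"acct_id": "A1", "acct_type": "IRA", "aum": 1200.5}, warehouse)

# A record from a source that calls the field "type" is rejected on write.
try:
    write_with_schema({"acct_id": "A2", "type": "Joint", "aum": 300.0}, warehouse)
except ValueError as exc:
    rejected.append(str(exc))
```

The rejection is inconvenient, but it forces the naming mismatch to be investigated before any report is built on the data.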
This activity is one of the (if not the) largest contributors to any data warehouse implementation program plan in terms of schedule.
We were hypnotized by that yellow elephant
The schema-on-read approach has brought about a change in the process.
This change may seem subtle. However, in our observations, this change plays an important part in introducing a data blind spot in the program.
Because data acquisition into the warehouse has been made so easy, it often precedes data design activities in implementations. Because of the promised cost effectiveness of data storage in Hadoop (as opposed to traditional data warehouses), there is also a great temptation to bring all data from sources into the big data platform.
Because data is now acquired and ‘available’, it is very tempting to start transforming it for end-use. And this transformation often goes ahead without any data design and data remediation activities.
While the ‘acquire and transform’ approach does have certain benefits when executed well, what it is also likely to do is convey a sense of progress in the program that is artificial. I call it artificial progress because it has been achieved by ignoring / bypassing some very critical activities in the data pipeline – i.e., investigation and remediation of data problems.
In other words, the schema-on-read / ELT approach does not make the data related problems go away. They just pop up downstream in the pipeline, most likely when the data is being consumed.
Planning better for the next journey
Even after adopting a schema-on-read / ELT approach, we have observed that the true critical path of the warehouse implementation continues to include data analysis, design and remediation activities.
These activities cannot be bypassed, whether in the ETL-based approach or in the ELT-based one. If data design and remediation are not conducted before the data is made available for consumption, the data related issues are likely to manifest themselves in reports, dashboards and other business-facing experiences. In other words, the order of these activities in the critical path can be changed (often, involuntarily) but they cannot be eliminated from it.
As a part of our assessment, we have recommended that the company continue with the existing ELT approach to their data pipeline, but incorporate a split data design phase in its process.
Once the data has been acquired, it will undergo data design, investigation and remediation for multiple aspects (Design I + Remediate) – data quality issues, metadata matching across multiple sources to a common canonical data model, data value harmonization across sources – for all planned and anticipated dashboards and reports. These activities will have to be brought into the equation every time new requirements are received from data consumers.
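One small piece of this step, data value harmonization, might look like the Python sketch below. The mapping CANONICAL_ACCOUNT_TYPES, its entries and the function harmonize are assumptions made purely for illustration. In practice the mapping would be built and maintained by the business and IT SMEs, not hard-coded.

```python
# Hypothetical mapping from normalized source values to canonical values.
CANONICAL_ACCOUNT_TYPES = {
    "ira": "IRA",
    "individual retirement account": "IRA",
    "joint": "Joint",
    "jt": "Joint",
}


def harmonize(raw_values):
    """Map raw source values to canonical ones; collect the ones that fail."""
    harmonized, unmapped = {}, []
    for raw in raw_values:
        key = raw.strip().lower()  # normalize whitespace and case first
        if key in CANONICAL_ACCOUNT_TYPES:
            harmonized[raw] = CANONICAL_ACCOUNT_TYPES[key]
        else:
            unmapped.append(raw)  # left for SME investigation
    return harmonized, unmapped


harmonized, unmapped = harmonize(
    ["IRA", " jt ", "Individual Retirement Account", "XQ-17"]
)
```

The unmapped list matters as much as the mapping itself. It is the work queue for the investigation and remediation team, and it grows every time a new source or requirement is brought in.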
After the data has been remediated for quality, it can then enter a use-case specific design (Design II) and transformation process. In this process, the data is ‘designed’ for the specific requirements of each report or dashboard, and then transformed accordingly within their big data platform.
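A use-case specific transformation of this kind could be sketched as follows, taking the qualified / non-qualified analysis mentioned earlier as the example. The QUALIFIED_FLAG mapping is illustrative only and is not a legal or regulatory classification. The function name and the sample accounts are likewise hypothetical.

```python
from collections import defaultdict

# Illustrative classification of canonical account types (an assumption).
QUALIFIED_FLAG = {
    "IRA": "qualified",
    "401k": "qualified",
    "Joint": "non-qualified",
}


def aum_by_qualification(accounts):
    """Total AUM per classification; unknown types are reported separately."""
    totals = defaultdict(float)
    for acct in accounts:
        bucket = QUALIFIED_FLAG.get(acct["acct_type"], "unclassified")
        totals[bucket] += acct["aum"]
    return dict(totals)


# Hypothetical remediated accounts, already using canonical account types.
accounts = [
    {"acct_type": "IRA", "aum": 100.0},
    {"acct_type": "Joint", "aum": 50.0},
    {"acct_type": "401k", "aum": 25.0},
    {"acct_type": "Trust", "aum": 10.0},
]

report = aum_by_qualification(accounts)
```

Because the account types have already been harmonized, the report logic stays small, and anything that still cannot be classified shows up in its own bucket instead of silently distorting the totals.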
However, the availability of business and IT SME expertise in data remediation continues to be a necessary condition for success.
With this approach, the organization will continue to benefit from the efficiencies in data acquisition that the ELT philosophy makes available. In addition to increased efficiency in acquisition itself, this approach will also provide the investigation & remediation team a broader view of the data across multiple sources. This, in turn, should bring additional efficiencies in the remediation process too.
In other words, a strategy to help them eliminate their data blind spot!
- FS Solutions Group