Updated: Jan 15, 2020
In our video blog “The foundation of an ML workflow” and our upcoming podcast (to be released December 19th, 2019), we talked about the so-called Data Iceberg. We can depict this iceberg as follows:
As shown in the picture, the ‘cool’ stuff [no pun intended] happens in the visible tip of the iceberg. This is where Machine Learning (ML) ‘lives’, and it is also what organizations read about in the press. However, before you can apply ML, you need clear, consistent, and uniform data. Raw data sources almost never deliver data in that shape out of the gate; some kind of data preparation always needs to happen first. As a matter of fact, for most ML projects, roughly 80% of the time is allocated to data preparation. Most organizations only realize this the hard way, once they are in the middle of an ML project.
So, in a nutshell, data preparation is turning data sources into clear, consistent, and uniform data sets, which in turn can be used by the actual machine learning algorithms. The first step in data preparation is acquiring these data sources. If we can’t even acquire the data, then essentially all bets are off, which makes this a very important step.
There are different ways to approach the data acquisition process. One way is to ‘just get out there and get the data’. Although this sounds like a reasonable approach, it can also be inefficient, given the potential challenges you are going to face. Challenges like:
Data living in silos
Data not complete
Legislation, which prevents you from getting (some of) the data
Data residing in proprietary, closed-off systems
To make the acquisition more efficient, you can assess the Data Readiness Capabilities (DCR) of the data sources you are considering. A DCR assessment gives you a high-level idea of whether (i) the data source is worth acquiring, and if so, (ii) which potential challenges you may face during the acquisition.
For each data source, the DCR assessment looks at certain criteria (not a complete list):
The table above shows that each criterion can have different scores. For instance, for data source A, the ‘Completeness’ criterion can be ‘1’, which means that the data source contains all the necessary data. For data source B, the same criterion can be ‘0.5’, which means that only a subset of the data is available. Depending on the ML required, this may or may not be a problem. Maybe it is relatively easy to gather the missing data, or the subset is good enough. Or the subset is simply not good enough for the particular ML model chosen.
The DCR score is then the multiplication of the individual criteria scores, as follows:
DCR Score = RD * R * C * MR * NL
Given the possible values, the DCR score can be ‘0’, ‘1’, or anything in between, as shown below:
It is important to look at all the criteria at once for each data source, especially as a single criterion can render a data source useless. For instance, data source Y may have the Right Data (“RD = 1”), be very Relevant (“R = 1”), and be Complete (“C = 1”). But if ‘Y’ is not machine-readable (“MR = 0”), ‘Y’ is essentially useless (unless it is fairly easy and cost-effective to make ‘Y’ machine-readable).
The DCR score of data source ‘Y’ (see example above) would therefore be ‘0’, as DCR Y = 1 * 1 * 1 * 0 * 1. Here NL is taken as 1, but with MR = 0 the product is 0 whatever the other scores are.
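As a minimal sketch of the calculation above, the multiplication can be written in a few lines of Python. The criterion abbreviations come straight from the formula; the function name `dcr_score`, the dictionary layout, and the scores assumed for data source B (everything ‘1’ except C = 0.5) are illustrative assumptions, not part of the assessment itself.

```python
from math import prod

# Criterion abbreviations exactly as they appear in the DCR formula.
CRITERIA = ("RD", "R", "C", "MR", "NL")


def dcr_score(scores):
    """Multiply the individual criterion scores (each between 0 and 1)."""
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"missing criterion scores: {missing}")
    for name in CRITERIA:
        if not 0 <= scores[name] <= 1:
            raise ValueError(f"{name} must be between 0 and 1")
    return prod(scores[name] for name in CRITERIA)


# Data source Y from the text: every score is 1 except machine-readability.
source_y = {"RD": 1, "R": 1, "C": 1, "MR": 0, "NL": 1}

# Hypothetical data source B: only a subset of the data is available (C = 0.5);
# the other scores are assumed to be 1 purely for illustration.
source_b = {"RD": 1, "R": 1, "C": 0.5, "MR": 1, "NL": 1}

print(dcr_score(source_y))  # 0
print(dcr_score(source_b))  # 0.5
```

Because the score is a product, a single criterion of ‘0’ drives the whole score to ‘0’, which is the behaviour described for data source ‘Y’, while a partial score such as 0.5 only scales the result down.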
Data acquisition is difficult. You can make it a little easier and more efficient by using the DCR assessment.
Wilco Van Ginkel
Data and Analytics Lead