Click play on the video below, or keep reading for a discussion about data preparation between Data & Analytics Lead Wilco van Ginkel and host Nicole Hutchison.
Q1: Can you first explain what data preparation is?
Answer: Essentially, data preparation is taking raw data and turning it into clean, consistent and uniform data sets. Once you have those uniform data sets, you can apply machine learning and analytics. That all sounds very easy, but in reality it is pretty tough.
So what typically happens with data preparation is that the data goes through what we call a data pipeline. For example, let's say your raw data is a bunch of Google Sheets. First, you need access to the data, so you have to acquire it. If you go to our website, you will find more information about how data acquisition really works, because that part can be really challenging. Once you have the data, you need to normalize it. In other words, it has to be in a format that is consistent and uniform, because the columns and rows you talk about have to have the same meaning. Then you can apply filters, for example "I only want to see last year's data", or aggregations, such as "give me the sum of all the salaries within this company instead of the individual ones". Finally, you have to deliver that data set. In other words, it has to be available for consumption. Essentially, that is data preparation.
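The four steps described above (acquire, normalize, filter or aggregate, deliver) can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular pipeline is built: the two CSV strings stand in for exported Google Sheets, and the column names, the sample rows, and every function name here are assumptions made up for the example.

```python
import csv
import io
import json
from datetime import date

# Hypothetical raw exports standing in for two Google Sheets.
# Note the inconsistent headers: that is what normalization has to fix.
SHEET_A = "Name,Salary,Paid On\nAna,1000,2023-03-01\nBen,1500,2022-11-15\n"
SHEET_B = "name, salary ,paid_on\nCarla,2000.50,2023-07-20\n"


def acquire(raw_sheets):
    """Acquire: read every raw sheet into one list of dict rows."""
    rows = []
    for text in raw_sheets:
        rows.extend(csv.DictReader(io.StringIO(text)))
    return rows


def normalize(rows):
    """Normalize: give every row the same column names and types."""
    clean = []
    for row in rows:
        # "Paid On" and "paid_on" should end up as the same column.
        fixed = {k.strip().lower().replace(" ", "_"): v.strip() for k, v in row.items()}
        clean.append({
            "name": fixed["name"],
            "salary": float(fixed["salary"]),
            "paid_on": date.fromisoformat(fixed["paid_on"]),
        })
    return clean


def filter_year(rows, year):
    """Filter: keep only the rows paid in the given year."""
    return [r for r in rows if r["paid_on"].year == year]


def aggregate(rows):
    """Aggregate: the sum of all salaries instead of the individual ones."""
    return {"total_salary": sum(r["salary"] for r in rows), "row_count": len(rows)}


def deliver(summary, year):
    """Deliver: make the result available for consumption, here as JSON."""
    return json.dumps({"year": year, **summary})


def run_pipeline(raw_sheets, year):
    rows = normalize(acquire(raw_sheets))
    return deliver(aggregate(filter_year(rows, year)), year)


if __name__ == "__main__":
    print(run_pipeline([SHEET_A, SHEET_B], year=2023))
```

In a real project each step is far messier than this, which is exactly the point of the interview: the acquire and normalize steps are where most of the effort goes.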
If you go to our white papers or the blog, you will find more information about it [data preparation].
Q2: Why is data preparation so important?
Answer: Well, you may have heard me talk about the so-called "data iceberg". The data iceberg shows how a lot of people focus on the top of the iceberg, which is visible, cool and what everybody is reading about, but not a lot of people in an organization focus on the "80%". The truth of the matter is that if you don't have the "80%" done, you cannot start with the "20%". So data preparation is all about getting the 80% done. Thankfully, a lot of organizations are starting to realize that now, but a lot of organizations also have trouble with it. Why is that? Not only because of the technology but also because they underestimate the effort. They look at the data iceberg the wrong way around. They think, "Oh! The 80% is all about the machine learning models, the pipeline, the workflow", and the "20%" is, "well, we have to get the data right". No, no. It's the other way around.
Why is it 80/20 and not 20/80? The reason is that, if you look at how we have accumulated data in our society over the decades, data is in all kinds of different formats. You have so-called structured data and unstructured data. For example, videos and images are unstructured data, while structured data is what sits in a typical database. Then you have data silos. I mean, everybody has their own data living somewhere, even small companies. You have data politics, meaning people who are literally sitting on data and don't want to give access. And of course you have legislation, like in healthcare, where you want access to data but are not allowed to have it.
Finally, you also need domain expertise. Anybody who thinks that we walk in, apply machine learning and magic happens is talking complete rubbish. We strongly believe in a hybrid model where, yes, technology can do work and machine learning can help, but we still need humans in the loop, at least for some projects, where domain expertise plays a role. Sometimes, as a data scientist, you go in a certain direction and you need a sanity check like "this is complete rubbish because of this, this and this". We have seen this a few times. Not that our work has been called rubbish, but we thought it was one thing, and then someone with domain expertise said, "No, it is not, because of this and this". At MLT, for example, we work for so many different companies that we cannot be experts in every domain, so we need those domain experts with us during the journey to make sure that what we are modeling and how we are looking at the data makes sense.
Q3: What are the big challenges with data preparation?
Answer: One of the biggest challenges with data preparation is estimating the effort, simply because organizations don't realize how much effort it takes to really get there. Another challenge, related to something mentioned earlier, is working in isolation. Teams don't do their due diligence. They don't ask questions like "Is this what we really need to do?", "Is the data really there?" and "Is this the right data that we need?". There is also the politics, as mentioned earlier: people just don't want to give away their data. Another challenge is legislation. Finally, an important challenge is data that lives in closed-off proprietary systems.
So, why is this important? If you look at how we built software over the last few decades, before the cloud came in, it was all about proprietary software, enterprise software, a one-vendor, one-solution kind of thing. These software solutions were literally not built with an open mind. There were no APIs (application programming interfaces) to talk to the system. Data was not open; it was in the vendor's own format.
If you look at how systems are being built nowadays, it is much more from a decoupled point of view: we have modules that run on their own, they communicate with each other using APIs, data is becoming more and more open, and so on. That's a whole new concept. That concept is great because the technology now supports it, but the reality is that maybe 80-90% of the systems that you are going to focus on could be closed-off and proprietary. Unfortunately, the data gems that you need are in a system that is very difficult to access. This circles back to the 80%: you might have a great idea for a machine learning model and how to apply it to that data, but the problem is getting access to it. It's like seeing a beautiful castle without having the key. It's beautiful, but you cannot get in.
Q4: Whether you have the right data or not, what can people do to get in front of these challenges?
Answer: Thankfully, it is getting slightly better, but it is kind of a paradox, because part of doing the data exploration is exactly answering this question. Ideally, you would say, "I have the right data right out of the gate and we can work with it". In reality, again, this is not the case. The multi-million dollar question is "how would you know upfront whether you have the right data?" That is really what everybody is looking for, because it reduces 80% of the effort you have to put into a project. It's a paradox because you want the answer upfront, but to get to the answer you have to jump through the hoops. So it is very hard to determine upfront. Essentially, a machine learning project is a journey with a lot of trial and error. What you read in the newspapers is maybe the 0.11111% that the Googles, the Facebooks and the Amazons of this world do. It all sounds like, "Oh, this is simple, we found the holy grail, it is easy". It's really not. It's bloody hard work and a lot of trial and error.
Remember, we talked about the domain expertise that you have to include. You also have to get the stakeholders in line, and the data is just dirty most of the time, so there is a lot of work involved.
Now, that being said, can you improve it a little bit? I think you can. What we have developed at MLT is what we call a data readiness capability assessment. It's high-level, but it gives you an indication as you look at certain data sources. Remember that data preparation is taking raw data and turning it into clean, consistent, uniform data sets. So you look at the raw data, for example a bunch of Google Sheets or a database table with transactions, and you ask the data owner a few questions, like:
Is the data source complete? For example, if you have a Google Sheet with orders: are all the orders that I need included in the data set?
Is the data source machine-readable? If it is not machine-readable, it cannot be used as a data source as it is. I can make an effort to make it machine-readable, but then the question becomes whether the effort outweighs the benefits. If it is an important data source but I have an alternative, then I may want to go with the alternative. If there is no alternative and the data source is important, but the effort is still too much, then you may want to look at a different machine learning model, or maybe at a completely different way of looking at the problem overall.
Noise level. The noise level is about the integrity of the data. Relational database management systems are all about integrity, so if your data lives in one, the integrity will be quite okay. However, noise is also in the semantics. If I ask you to "write your name", you might write it as Hutchison-Nicole, while I might write mine as W. Van Ginkel. The entity is "name", but the way we each write the name is different. So how can a system know whether it is last name then first name, or first name then last name, and so on? As you can see, many data sets have inconsistencies even within the same data item. This is also noise. From a syntax point of view, this should be a name, but the format of the name is not guaranteed.
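To make the three questions above a bit more concrete, here is a rough Python sketch of how they could be checked against a small order sheet. It is not the MLT assessment itself, which the interview only describes at a high level. The column names, the sample data, the `assess` and `name_format` functions, and the simple rules for guessing a name format are all assumptions for illustration, and "machine-readable" is reduced here to "parses as CSV and has the columns we expect".

```python
import csv
import io
import re
from collections import Counter

# Hypothetical order sheet: order 3 is missing and the names use three formats.
SAMPLE = (
    "order_id,customer_name\n"
    "1,Hutchison-Nicole\n"
    "2,W. Van Ginkel\n"
    "4,Nicole Hutchison\n"
)


def name_format(value):
    """Guess a rough format label for a name (a heuristic, not a real parser)."""
    value = value.strip()
    if re.match(r"^[A-Za-z]\.\s", value):
        return "initial last"
    if "," in value:
        return "last, first"
    if "-" in value and " " not in value:
        return "hyphen-joined"
    if " " in value:
        return "space-separated"
    return "single token"


def assess(raw_csv, required_columns, expected_ids,
           id_column="order_id", name_column="customer_name"):
    """Answer the three questions for one data source."""
    report = {"machine_readable": False, "missing_ids": None, "name_formats": None}

    # Machine-readable: does it parse and contain the columns we need?
    reader = csv.DictReader(io.StringIO(raw_csv))
    rows = list(reader)
    columns = set(reader.fieldnames or [])
    if not rows or not set(required_columns) <= columns:
        return report
    report["machine_readable"] = True

    # Complete: which of the orders we expect are not in the sheet?
    seen = {row[id_column] for row in rows}
    report["missing_ids"] = sorted(set(expected_ids) - seen)

    # Noise level: how many different ways is the same data item written?
    report["name_formats"] = dict(Counter(name_format(row[name_column]) for row in rows))
    return report


if __name__ == "__main__":
    print(assess(SAMPLE, {"order_id", "customer_name"}, ["1", "2", "3", "4"]))
```

More than one entry under `name_formats` is a hint that the same data item is being written in several ways. Deciding which format is the right one still takes a conversation with the data owner.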
You run into a lot of those nitty-gritty things when you do data preparation. That's why, in this day and age, it is very difficult to know upfront unless you are working in a very clean, concise, structured data management system. The majority of companies are not that way. When you look at companies, they have Google Sheets, and they work in Teams or in Slack. They have files all over the place on network drives. It's just a bunch of different data, and it's very difficult to tell upfront whether you have the right data. So that is one of the first things we do in our process. We not only talk to the stakeholders first, but then we say, "Okay, let's look at your data".
The data is like fuel to a car. Without fuel, the car is not going anywhere. We have to make sure we have the right fuel and the right amount of fuel before we can even start driving the car.
Interview conducted by Nicole Hutchison
Expert Wilco Van Ginkel, Data & Analytics Lead.