As soon as we saw at Panorays that Cloud Dataprep was available as a Beta on Google Cloud Platform, we registered right away, and we found it really useful and time saving. Let's just say it made our Data Scientist, the AWESOME Tal Peretz, very very happy!
Let's start with the problem (there's always a "problem" :) ) that we were trying to solve.
We had lots of files (around 700 GB of them) that needed parsing, filtering and some transformations (originally CSV, SQL and plain text files, some of them in zipped formats - "tar.gz").
As you know, around 80% of a Data Scientist's day-to-day job is data preparation, data filtering and data enrichment, so of course the goal is to save as much time as possible in that phase.
TL;DR:
We’ve been able to save a lot of time and effort (and probably compute resources) using Dataprep. (No, no one paid me to write this post...)
Benefits:
- Easy to use and to ramp up.
- No need to run the “code” on any VM (Pretty much Serverless).
- Convenient GUI that lets you see the whole process, along with statistics and metadata about the data.
- Runs pretty efficiently performance-wise.
What we still couldn’t do with Dataprep:
- Run flows on the zipped files out of the box.
- Share the flows between different users in the same GCP project.
- Store the data in another data store for the rest of our Data Pipeline. (Though in my opinion that's not really Dataprep's use case anyway.)
Now to the actual process:
We had put the files on Google Cloud Storage (you know, it's fully replicated and super cheap, so why not), which fit perfectly with the way Dataprep works, since it reads its input straight from Cloud Storage.
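(For reference, getting the files there is the easy part. Here's a minimal sketch using the google-cloud-storage Python client; the bucket name and local directory are hypothetical, and gsutil works just as well.)

```python
# Minimal sketch: uploading the raw files to a Cloud Storage bucket so that
# Dataprep can pick them up as input. Bucket and path names are hypothetical.
import os
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.bucket("my-raw-data-bucket")  # hypothetical bucket name

local_dir = "raw_data"
for filename in os.listdir(local_dir):
    local_path = os.path.join(local_dir, filename)
    if os.path.isfile(local_path):
        bucket.blob(f"raw/{filename}").upload_from_filename(local_path)
        print(f"Uploaded {filename} to gs://{bucket.name}/raw/{filename}")
```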
Main goals:
- Go over all of the files, with different logic for the different types of files.
- Extract the relevant features and merge everything into a final output "file" containing all of the rows, then continue with the Data Pipeline.
- Move it to another indexed data store so we can query the data afterwards down to the resolution of a single data point.
- Be able to run analytics and build some Machine Learning models with the output data.
So, how we did it with Dataprep (with some screenshots, of course):
- We created a Flow to process all of the raw data, and within that Flow we created a "Recipe" for each file format, based on the context of the data inside it.
Everything is then merged into the same output "file" by joining on specific features.
- Here you can see the "Job" execution and the data distribution for a specific job, before it ran, on sample data taken from one of the sources. (Looks pretty cool, right?)
- Job results presented in a really convenient manner, with the ability to export the report.
We can also see the counts of all of the rows with matches and mismatches. The whole thing took around 30 minutes to run, with an output of ~12 GB (after filtering, of course). We also get an overview of the output data distribution, which is really cool. Under the hood this all runs in the background on Cloud Dataflow, and the job itself can be viewed as a Dataflow job with its whole DAG breakdown.
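(If you want to peek at those underlying Dataflow jobs programmatically rather than through the console, here's a rough sketch using the Google API Python client. The project ID is hypothetical, and this isn't something Dataprep requires; it's just a handy way to check job states.)

```python
# Rough sketch: listing the Dataflow jobs that back the Dataprep runs.
# Project ID is hypothetical; requires google-api-python-client and
# application default credentials.
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")
response = dataflow.projects().jobs().list(projectId="my-gcp-project").execute()

for job in response.get("jobs", []):
    print(job["name"], job["currentState"])
```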
- After all, it has to be saved somewhere, right?
And where if not Google Cloud Storage. You can see that the output is saved in a sharded, distributed manner in the sandbox bucket defined for the Dataprep environment.
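Since Dataprep itself doesn't push the output anywhere else (see the "what we still couldn't do" list above), moving it onward is a manual step. Purely as an illustration, here's a rough sketch of loading those shards into BigQuery; BigQuery is my assumption here, and the bucket, dataset and table names are all hypothetical.

```python
# Rough sketch: loading the sharded CSV output from the Dataprep sandbox bucket
# into BigQuery so it can be queried afterwards. BigQuery is an assumption here,
# and all bucket/dataset/table names are hypothetical.
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()
table_id = "my-gcp-project.my_dataset.dataprep_output"  # hypothetical

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,       # infer the schema from the CSV headers
    skip_leading_rows=1,
)

# The wildcard matches all of the shards that Dataprep wrote to the bucket.
uri = "gs://dataprep-sandbox-bucket/jobrun/output-*.csv"
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to finish
print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")
```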
So to conclude things,
We’ve managed to transform and filter a vast amount of the raw data in about a day of work.
If we had tried doing it with Apache Spark (my personal favourite until now), even on a managed environment like Cloud Dataproc, it would probably have taken us much longer to write the code and run all of the transformations.
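Just to give a sense of what that would have involved, here's a very rough sketch (not our actual code) of what a hand-written PySpark version might look like for the CSV sources alone; all paths and column names are hypothetical, and the real thing would need a separate branch per file format.

```python
# Rough sketch (not our actual code): a hand-written PySpark equivalent for
# just the CSV sources. Paths and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-data-prep").getOrCreate()

csv_df = (
    spark.read.csv("gs://my-raw-data-bucket/raw/*.csv", header=True)
    .filter("domain IS NOT NULL")          # drop rows missing the join key
    .select("domain", "ip", "last_seen")   # keep only the relevant features
)

text_df = spark.read.text("gs://my-raw-data-bucket/raw/*.txt")
# ...parse text_df into the same columns, then union/join on the shared key...
merged = csv_df  # placeholder for the merged result of all source types

merged.write.csv("gs://my-output-bucket/merged", header=True, mode="overwrite")
```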
The runtime was pretty good and we didn’t need to maintain or schedule any resources ourselves.
We still have some blockers in the Beta version; for example, we can't share Jobs between different users in the same GCP project, which is pretty inconvenient.
Also, we were not able to extract the zipped files as part of the Dataprep transformations, so we did the extraction phase manually beforehand and re-uploaded the extracted files to Cloud Storage.
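For anyone curious, that manual extraction step was roughly along these lines (a sketch only, with hypothetical bucket and path names):

```python
# Rough sketch of the manual extraction step: download a tar.gz archive from
# Cloud Storage, extract it locally, and re-upload the extracted files so that
# Dataprep can read them. Bucket and path names are hypothetical.
import os
import tarfile
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-raw-data-bucket")  # hypothetical

archive_blob = bucket.blob("raw/archive.tar.gz")
archive_blob.download_to_filename("archive.tar.gz")

with tarfile.open("archive.tar.gz", "r:gz") as tar:
    tar.extractall("extracted")

for root, _, files in os.walk("extracted"):
    for name in files:
        local_path = os.path.join(root, name)
        bucket.blob(f"raw/extracted/{name}").upload_from_filename(local_path)
```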
But all in all, it looks like a great tool for data preparation, with a pretty good-looking user interface, and I'm sure the GA version will be even more feature rich.
I want to thank Tal Peretz for his great work! (Much of the "fast results" is thanks to him.)
Hoping for more great features from the Cloud Dataprep team that will make the lives of Data Engineers and Data Scientists much easier.