What's Your Favorite Dinosaur? - StreamSets Pipeline to Clean Delimited Files

  • 12 March 2022
Problem Statement

You’ve been asking everybody you meet the age-old question: What is your favorite dinosaur? You’ve been saving their responses in a series of delimited files that all share the same format.



Tragically, the data you have meticulously gathered is dirty. Before you can execute your master plan to do whatever it is that you were planning to do with this information, you must clean your dataset and save it again as a single clean file.


The Pipeline

  1. You read in the Dino data with the Directory origin. 
  2. You remove duplicates on the Dinosaur column using the Record Deduplicator.
  3. You filter Dinos that have murdered people in the Jurassic Park film series, for reasons that are your own, using the Stream Selector.
  4. Finally, you save your Dino data to one single file on your local system. Your data is clean, filtered and ready for analysis.
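
For readers who want to sanity-check the pipeline's output, the four steps above can be roughly approximated in plain Python. This is only a sketch and not the StreamSets pipeline itself. It assumes the files are comma-delimited with a header row and a `.csv` extension, that duplicates are matched case-insensitively, and that "filter" means dropping the offending dinosaurs. The function name `clean_dino_files` and the `EXCLUDED` list are hypothetical; the post does not say which dinosaurs were filtered.

```python
import csv
from pathlib import Path

# Hypothetical exclusion list: the post does not name the filtered dinosaurs.
EXCLUDED = {"velociraptor", "tyrannosaurus rex"}


def clean_dino_files(input_dir, output_file, delimiter=","):
    """Merge delimited files, dedupe on Dinosaur, drop excluded dinosaurs."""
    seen = set()
    kept = []
    fieldnames = None
    # Step 1: read every file in the directory (assumes a .csv extension).
    for path in sorted(Path(input_dir).glob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle, delimiter=delimiter)
            if fieldnames is None:
                fieldnames = reader.fieldnames
            for row in reader:
                # Assumption: compare names trimmed and case-insensitively.
                key = (row.get("Dinosaur") or "").strip().lower()
                # Step 2: skip blank names and names already seen.
                if not key or key in seen:
                    continue
                seen.add(key)
                # Step 3: drop dinosaurs on the exclusion list.
                if key in EXCLUDED:
                    continue
                kept.append(row)
    # Step 4: write everything that survived to one output file.
    with open(output_file, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=fieldnames or ["Dinosaur"],
            delimiter=delimiter,
            extrasaction="ignore",
        )
        writer.writeheader()
        writer.writerows(kept)
    return kept
```

Calling `clean_dino_files("dino_data", "clean_dinos.txt")` (both paths are placeholders) would write the surviving rows to one file and also return them as a list of dictionaries. Keeping the output file outside the input directory, or giving it a different extension, avoids reading it back in on the next run.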



If you’re reading this, please comment with your favorite dinosaur. It’s uh...for science. 

