Software Engineer Intern - Fuzzy Distinct (Data Platform)
Headquartered in New York City, Dataiku was founded in Paris in 2013 and achieved unicorn status in 2019. Today, more than 1,000 employees work across the globe, in our offices and remotely. Backed by a renowned set of investors and partners including CapitalG, Tiger Global, and ICONIQ Growth, we’ve set out to build the future of AI.
Augment Dataiku's data preparation with a tool that automatically merges nearly identical data records
Today, Dataiku boasts a robust data preparation framework that functions admirably to process a vast amount of data, helping users to have clean databases with the right data (and only the right data) inside them. However, we believe that with your help, we can take it a step further!
In a world where databases can be filled by real humans, data is not always clean. Errors can happen, typos can be made, and sometimes, you want to merge two database tables containing the same information, but not quite in the same format. “Dataiku”, “dataiku”, “data\niku” refer to the same company, but will be considered different entries in your database.
The goal of this internship is to extend our “distinct” processor to support fuzzy matching (that is, matching data that looks almost the same). The new processor will help clients clean up their databases, using algorithms like Levenshtein distance, Jaro–Winkler distance, n-grams, Jaccard similarity, or Metaphone to detect duplicated information and reduce it to a single row.
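To give a feel for the problem, here is a minimal sketch of a fuzzy "distinct" built on Levenshtein distance alone. It is an illustration under stated assumptions, not how the Dataiku processor works or will work: the function names (`levenshtein`, `normalize`, `fuzzy_distinct`), the `max_distance` threshold, and the greedy "keep the first value seen" strategy are all hypothetical choices made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def normalize(value: str) -> str:
    """Hypothetical pre-processing: lowercase and drop all whitespace."""
    return "".join(value.split()).lower()


def fuzzy_distinct(values, max_distance=1):
    """Greedy de-duplication: keep a value only if it is farther than
    `max_distance` edits from every value already kept."""
    kept = []
    for value in values:
        key = normalize(value)
        if not any(levenshtein(key, normalize(k)) <= max_distance for k in kept):
            kept.append(value)
    return kept


# "data\niku" is written here with a real newline character.
print(fuzzy_distinct(["Dataiku", "dataiku", "data\niku", "Dataikuu", "Acme"]))
```

This greedy pass compares every value against every kept value, so it is quadratic in the worst case, and its result depends on input order. Choosing better clustering strategies and combining several similarity measures is exactly the kind of design question the internship covers.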
During this internship, you will:
- Get familiar with Dataiku and its data preparation recipes as well as database schemas.
- Design a new component that uses industry-standard algorithms (such as Levenshtein distance, Jaro-Winkler distance, n-grams, Jaccard similarity, and Metaphone) to automatically detect duplicate data.
- Develop the user interface that helps users understand the clusters of data, so they can check that they are not grouping too much or too little.
- Celebrate and party because our beloved users will then be able to reduce their data overload!
- Python or Java for the backend side