Key Quotes
I leverage my clinical experience daily, which is both amazing and motivating. Because of my clinical experience, I am able to provide an additional lens on the data from the perspective of a healthcare provider, and give my team the context they need to interpret and translate the data in a way that makes sense. For example, for one of our provider performance products, we work really closely with medical codes. These are designated codes that define certain diagnoses and procedures. My team is cleaning and building a model on the same codes that I used to bill for my own visits as a provider. Being able to recognize and understand the insights that we can get from these codes has been a great reminder of the value of my experience.
Data engineering is a huge part of making this data usable. It requires a lot of creativity to think about "How can you scalably ingest thousands of schemas?" For example, address data can be formatted in a number of different ways, so we need to standardize it across all the different formats we see from different data sources. We built a tool that helps with onboarding new data sources by mapping all their different fields to our own standard fields. Before, it would take us 20-30 minutes in Python to code up just one new data source, so imagine the mountain of work that creates when you have hundreds of sources. Now, we have a simple UI that even starts to guess some initial mappings for you, reducing a 20-to-30-minute data mapping process per new data source to just 10-15 seconds, which makes our operations and data onboarding processes a lot smoother and far more scalable.
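To make the field-mapping idea concrete, here is a minimal sketch of how a tool might guess initial mappings from a new source's fields to a set of standard fields. It is an illustration, not the actual tool described above: the `STANDARD_FIELDS` list, the `suggest_mappings` helper, and the use of fuzzy string matching via Python's standard `difflib` are all assumptions.

```python
# Hypothetical sketch of guessing initial field mappings for a new data
# source. STANDARD_FIELDS, suggest_mappings, and the use of difflib are
# illustrative assumptions, not the actual tool described above.
import difflib

STANDARD_FIELDS = [
    "first_name", "last_name", "npi", "street_address",
    "city", "state", "zip_code", "specialty",
]

def suggest_mappings(source_fields, standard_fields=STANDARD_FIELDS, cutoff=0.5):
    """Guess a standard field for each incoming source field."""
    suggestions = {}
    for field in source_fields:
        # Normalize the incoming name so "First Name" can match "first_name".
        normalized = field.strip().lower().replace(" ", "_").replace("-", "_")
        # Fuzzy-match against the standard schema; the cutoff is tunable.
        matches = difflib.get_close_matches(normalized, standard_fields,
                                            n=1, cutoff=cutoff)
        # Unmatched fields stay None so a human can map them in the UI.
        suggestions[field] = matches[0] if matches else None
    return suggestions

print(suggest_mappings(["First Name", "surname", "Zip", "provider specialty"]))
# e.g. {'First Name': 'first_name', 'surname': 'last_name',
#       'Zip': 'zip_code', 'provider specialty': 'specialty'}
```

In a setup like this, a human would still confirm or correct the guesses in the UI; the automation only needs to be good enough to turn a per-source coding task into a quick review task.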
Key Takeaways
Data Engineering is very valuable when it comes to the scalability of data cleaning. It’s essential to think creatively about how to solve data quality challenges so that your solutions work reliably at scale.
It's helpful to understand the context of the data, such as learning why the data was produced in the first place, who sits behind it, and what their intentions are. That context can change the entire process, starting with how you clean the data, analyze it, and how you consider anomalies and edge cases.
Having a strong and clear operating definition for what is considered good quality data can help you more effectively work with messy data, transform it into usable data, and draw meaningful insights from it.
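As a sketch of what such an operating definition can look like in practice, the quality criteria can be written down as explicit, testable rules rather than left implicit in cleaning code. The field names, formats, and rules below are illustrative assumptions, not ones taken from the interview:

```python
# Hypothetical sketch: an operating definition of "good quality" provider
# data written as explicit, testable rules. Field names, formats, and
# thresholds are illustrative assumptions, not from the interview.
import re

QUALITY_RULES = {
    # An NPI (National Provider Identifier) is exactly 10 digits.
    "npi": lambda v: bool(re.fullmatch(r"\d{10}", str(v or ""))),
    # A ZIP code is 5 digits, optionally with a +4 extension.
    "zip_code": lambda v: bool(re.fullmatch(r"\d{5}(-\d{4})?", str(v or ""))),
    # A state is a two-letter uppercase code.
    "state": lambda v: bool(re.fullmatch(r"[A-Z]{2}", str(v or ""))),
}

def failing_fields(record):
    """Return the fields of a record that violate the quality definition."""
    return [field for field, rule in QUALITY_RULES.items()
            if not rule(record.get(field))]

record = {"npi": "1234567890", "zip_code": "0210", "state": "MA"}
print(failing_fields(record))  # ['zip_code'] -- the truncated ZIP fails
```

Writing the definition down this way makes it shared and enforceable: every new data source is checked against the same rules, and disagreements about what counts as "good" become edits to the rules rather than ad hoc judgment calls.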