Skip to main content

course

Cleaning Data with PySpark

Advanced

4.1+

Updated 12/2024

Learn how to clean data with Apache Spark in Python.

Start course for free

Included for FreePremium or Teams

SparkData Preparation4 hours16 videos53 exercises4,150 XP27,670Statement of Accomplishment

Create Your Free Account

Google LinkedIn Facebook

or

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.

Training 2 or more people?

Try DataCamp for Business

Loved by learners at thousands of companies

Course Description

Working with data is tricky - working with millions or even billions of rows is worse. Did you receive some data processing code written on a laptop with fairly pristine data? Chances are you’ve probably been put in charge of moving a basic data process from prototype to production. You may have worked with real world datasets, with missing fields, bizarre formatting, and orders of magnitude more data. Even if this is all new to you, this course helps you learn what’s needed to prepare data processes using Python with Apache Spark. You’ll learn terminology, methods, and some best practices to create a performant, maintainable, and understandable data processing platform.

Prerequisites

Intermediate Python Introduction to PySpark

1

DataFrame details

Intro to data cleaning with Apache Spark

Data cleaning review

Defining a schema

Immutability and lazy processing

Immutability review

Using lazy processing

Understanding Parquet

Saving a DataFrame in Parquet format

SQL and Parquet

2

Manipulating DataFrames in the real world

DataFrame column operations

Filtering column content with Python

Filtering Question #1

Filtering Question #2

Modifying DataFrame columns

Conditional DataFrame column operations

when() example

When / Otherwise

User defined functions

Understanding user defined functions

Using user defined functions in Spark

Partitioning and lazy processing

Adding an ID Field

IDs with different partitions

More ID tricks

3

Improving Performance

Caching a DataFrame

Removing a DataFrame from cache

Improve import performance

File size optimization

File import performance

Cluster configurations

Reading Spark configurations

Writing Spark configurations

Performance improvements

Normal joins

Using broadcasting on Spark joins

Comparing broadcast vs normal joins

4

Complex processing and data pipelines

Introduction to data pipelines

Quick pipeline

Pipeline data issue

Data handling techniques

Removing commented lines

Removing invalid rows

Splitting into columns

Further parsing

Data validation

Validate rows via join

Examining invalid rows

Final analysis and delivery

Dog parsing

Per image count

Percentage dog pixels

Congratulations and next steps

Cleaning Data with PySpark

Course
Complete

Earn Statement of Accomplishment

Add this credential to your LinkedIn profile, resume, or CV
Share it on social media and in your performance review

Included withPremium or Teams

Don’t just take our word for it

*4.1

from 19 reviews

53%

21%

16%

11%

0%

Highest to Lowest
Lowest to Highest
Most recent
Top reviews

Flor S.

about 1 month

Best part for me is the interactive part where you get to apply immediately what was taught in the course through virtual coding.

Syed O.

8 months

I did learn alot from the course and it definitely talked about many pyspark features not mentioned in other courses however more explaination with examples for tougher and complicated topics in the course would have been better

André S.

9 months

Eu aprendi demais com esse curso. Gostei muito dos laboratórios também.

Douglas L.

over 1 year

Very Good Content.

Jegan D.

over 1 year

Very good course with challenging examples. The only problem is that I found it difficult to submit some of my answers or the solution provided. This happened in two different exercises.

"Best part for me is the interactive part where you get to apply immediately what was taught in the course through virtual coding."

Flor S.

"I did learn alot from the course and it definitely talked about many pyspark features not mentioned in other courses however more explaination with examples for tougher and complicated topics in the course would have been better"

Syed O.

"Eu aprendi demais com esse curso. Gostei muito dos laboratórios também."

André S.

Join over 15 million learners and start Cleaning Data with PySpark today!

Create Your Free Account

Google LinkedIn Facebook

or

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.