
Cleaning Data with PySpark

Learn how to clean data with Apache Spark in Python.


4 hours | 16 videos | 53 exercises | 27,333 learners | Statement of Accomplishment


Course Description

Working with data is tricky; working with millions or even billions of rows is worse. Did you receive some data processing code written on a laptop with fairly pristine data? Chances are you've been put in charge of moving a basic data process from prototype to production. You may have worked with real-world datasets, with missing fields, bizarre formatting, and orders of magnitude more data. Even if this is all new to you, this course helps you learn what's needed to prepare data processes using Python with Apache Spark. You'll learn terminology, methods, and some best practices to create a performant, maintainable, and understandable data processing platform.


In the following tracks

Big Data with PySpark

1
DataFrame details
Free
A review of DataFrame fundamentals and the importance of data cleaning.
Intro to data cleaning with Apache Spark
50 xp
Data cleaning review
50 xp
Defining a schema
100 xp
Immutability and lazy processing
50 xp
Immutability review
50 xp
Using lazy processing
100 xp
Understanding Parquet
50 xp
Saving a DataFrame in Parquet format
100 xp
SQL and Parquet
100 xp
2
Manipulating DataFrames in the real world
A look at various techniques to modify the contents of DataFrames in Spark.
DataFrame column operations
50 xp
Filtering column content with Python
100 xp
Filtering Question #1
50 xp
Filtering Question #2
50 xp
Modifying DataFrame columns
100 xp
Conditional DataFrame column operations
50 xp
when() example
100 xp
When / Otherwise
100 xp
User defined functions
50 xp
Understanding user defined functions
50 xp
Using user defined functions in Spark
100 xp
Partitioning and lazy processing
50 xp
Adding an ID Field
100 xp
IDs with different partitions
100 xp
More ID tricks
100 xp
3
Improving Performance
Improve data cleaning tasks by increasing performance or reducing resource requirements.
Caching
50 xp
Caching a DataFrame
100 xp
Removing a DataFrame from cache
100 xp
Improve import performance
50 xp
File size optimization
50 xp
File import performance
100 xp
Cluster configurations
50 xp
Reading Spark configurations
100 xp
Writing Spark configurations
100 xp
Performance improvements
50 xp
Normal joins
100 xp
Using broadcasting on Spark joins
100 xp
Comparing broadcast vs normal joins
100 xp
4
Complex processing and data pipelines
Learn how to process complex real-world data using Spark and the basics of pipelines.
Introduction to data pipelines
50 xp
Quick pipeline
100 xp
Pipeline data issue
50 xp
Data handling techniques
50 xp
Removing commented lines
100 xp
Removing invalid rows
100 xp
Splitting into columns
100 xp
Further parsing
100 xp
Data validation
50 xp
Validate rows via join
100 xp
Examining invalid rows
100 xp
Final analysis and delivery
50 xp
Dog parsing
100 xp
Per image count
100 xp
Percentage dog pixels
100 xp
Congratulations and next steps
50 xp

Datasets

Dallas Council Votes
Dallas Council Voters
Flights - 2014
Flights - 2015
Flights - 2016
Flights - 2017

Collaborators

Hadrien Lacroix

Hillary Green-Lerman

Prerequisites

Intermediate Python
Introduction to PySpark

Data Engineer Consultant @ Flexible Creations
