Feature Engineering with PySpark
Learn the gritty details that data scientists spend 70-80% of their time on: data wrangling and feature engineering.
4 hours | 16 videos | 60 exercises | 14,806 learners | Statement of Accomplishment
Course Description
The real world is messy, and your job is to make sense of it. Toy datasets like MTCars and Iris are the result of careful curation and cleaning; even so, their data must be transformed before machine learning algorithms can use it to extract meaning, forecast, classify, or cluster. This course covers the gritty details that data scientists spend 70-80% of their time on: data wrangling and feature engineering. With datasets growing ever larger, let's use PySpark to cut this Big Data problem down to size!
In the following Tracks
Big Data with PySpark
Exploratory Data Analysis
Free
Get to know a bit about your problem before you dive in! Then learn how to statistically and visually inspect your dataset!
Where to Begin (50 xp)
Where to begin? (50 xp)
Check Version (100 xp)
Load in the data (100 xp)
Defining A Problem (50 xp)
What are we predicting? (100 xp)
Verifying Data Load (100 xp)
Verifying DataTypes (100 xp)
Visually Inspecting Data / EDA (50 xp)
Using Corr() (100 xp)
Using Visualizations: distplot (100 xp)
Using Visualizations: lmplot (100 xp)
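To make the statistical inspection mentioned in this chapter concrete, here is a minimal sketch of a Pearson correlation between two numeric columns. It is written in plain Python rather than PySpark so that it runs without a Spark installation. The function name `pearson_corr` and the sample columns are hypothetical and do not come from the course material.

```python
import math
from typing import Sequence


def pearson_corr(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Return the Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two columns of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical columns: living area and sale price for four homes.
sqft = [1000, 1500, 2000, 2500]
price = [200000, 290000, 410000, 500000]
r = pearson_corr(sqft, price)
```

A value of `r` near 1 or -1 suggests a strong linear relationship worth exploring further with a plot; a value near 0 suggests little linear relationship.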
Wrangling with Spark Functions
Real data is rarely clean and ready for analysis. In this chapter, learn to remove unneeded information, handle missing values, and add additional data to your analysis.
Dropping data (50 xp)
Dropping a list of columns (100 xp)
Using text filters to remove records (100 xp)
Filtering numeric fields conditionally (100 xp)
Adjusting Data (50 xp)
Custom Percentage Scaling (100 xp)
Scaling your scalers (100 xp)
Correcting Right Skew Data (100 xp)
Working with Missing Data (50 xp)
Visualizing Missing Data (100 xp)
Imputing Missing Data (100 xp)
Calculate Missing Percents (100 xp)
Getting More Data (50 xp)
A Dangerous Join (100 xp)
Spark SQL Join (100 xp)
Checking for Bad Joins (100 xp)
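Two of the ideas named in this chapter, percentage scaling and measuring how much of a column is missing, can be sketched in a few lines. The example below is plain Python, not PySpark, and is only an illustration of the arithmetic. The function names `percent_scale` and `missing_percent`, the sample values, and the use of `None` to mark a missing entry are all assumptions made for this sketch.

```python
from typing import List, Optional, Sequence


def percent_scale(values: Sequence[float]) -> List[float]:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) * 100 / (hi - lo) for v in values]


def missing_percent(values: Sequence[Optional[float]]) -> float:
    """Return the percentage of entries that are missing (None)."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None)
    return 100 * missing / len(values)


# Hypothetical column of days-on-market values.
days_on_market = [10, 20, 30, 50]
scaled = percent_scale(days_on_market)

# Hypothetical column with two missing entries out of five.
lot_size = [0.5, None, 1.2, None, 0.8]
pct_missing = missing_percent(lot_size)
```

A missing percentage like this is one possible basis for deciding whether to impute a column or drop it; the threshold for that decision is a judgment call and is not specified here.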
Feature Engineering
In this chapter, learn how to create new features for your machine learning model to learn from. We'll look at generating them by combining fields, extracting values from messy columns, or encoding them for better results.
Feature Generation (50 xp)
Differences (100 xp)
Ratios (100 xp)
Deeper Features (100 xp)
Time Features (50 xp)
Time Components (100 xp)
Joining On Time Components (100 xp)
Date Math (100 xp)
Extracting Features (50 xp)
Extracting Text to New Features (100 xp)
Splitting & Exploding (100 xp)
Pivot & Join (100 xp)
Binarizing, Bucketing & Encoding (50 xp)
Binarizing Day of Week (100 xp)
Bucketing (100 xp)
One Hot Encoding (100 xp)
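The feature types listed in this chapter (differences, ratios, bucketing, and one-hot encoding) can be illustrated on a single record. This is a minimal plain-Python sketch rather than PySpark code. The record fields, bucket boundaries, category list, and the helper names `bucketize` and `one_hot` are hypothetical and chosen only for the example.

```python
from typing import List


def bucketize(value: float, splits: List[float]) -> int:
    """Return the index i of the bucket [splits[i], splits[i + 1]) containing value."""
    for i in range(len(splits) - 1):
        if splits[i] <= value < splits[i + 1]:
            return i
    raise ValueError(f"{value} falls outside the splits {splits}")


def one_hot(category: str, categories: List[str]) -> List[int]:
    """Encode category as a 0/1 vector with one position per known category."""
    return [1 if category == c else 0 for c in categories]


# Hypothetical home-sale record.
record = {
    "list_price": 250000.0,
    "sale_price": 240000.0,
    "sqft": 2000.0,
    "bedrooms": 3,
    "style": "ranch",
}

features = {
    # Difference: combine two fields by subtraction.
    "price_diff": record["list_price"] - record["sale_price"],
    # Ratio: combine two fields by division.
    "price_per_sqft": record["sale_price"] / record["sqft"],
    # Bucketing: map a numeric value to a coarse range index.
    "bedroom_bucket": bucketize(record["bedrooms"], [0, 2, 4, float("inf")]),
    # One-hot encoding: turn a category into indicator values.
    "style_onehot": one_hot(record["style"], ["colonial", "ranch", "townhouse"]),
}
```

Library encoders may use a different vector layout than this sketch, for example by omitting one category, so check the behavior of whichever encoder you use.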
Building a Model
In this chapter we'll learn how to choose which type of model we want. Then we will learn how to apply our data to the model and evaluate it. Lastly, we'll learn how to interpret the results and save the model for later!
Choosing the Algorithm (50 xp)
Which MLlib Module? (50 xp)
Creating Time Splits (100 xp)
Adjusting Time Features (100 xp)
Feature Engineering Assumptions for RFR (50 xp)
Feature Engineering For Random Forests (50 xp)
Dropping Columns with Low Observations (100 xp)
Naively Handling Missing and Categorical Values (100 xp)
Building a Model (50 xp)
Building a Regression Model (100 xp)
Evaluating & Comparing Algorithms (100 xp)
Understanding Metrics (50 xp)
Interpreting, Saving & Loading (50 xp)
Interpreting Results (100 xp)
Saving & Loading Models (100 xp)
Final Thoughts (50 xp)
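One way to read "Creating Time Splits" is as a train/test split by date, so that the model is evaluated only on records later than those it was trained on. The sketch below shows that idea in plain Python. The function name `time_split`, the `sold_on` field, the 20% test fraction, and the sample rows are assumptions for illustration and may differ from what the course does.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple


def time_split(
    rows: List[Dict], date_key: str, test_fraction: float = 0.2
) -> Tuple[List[Dict], List[Dict], date]:
    """Split rows so the last test_fraction of the date range becomes the test set."""
    dates = [row[date_key] for row in rows]
    start, end = min(dates), max(dates)
    span_days = (end - start).days
    # Rows on or after the cutoff date go to the test set.
    cutoff = end - timedelta(days=int(span_days * test_fraction))
    train = [row for row in rows if row[date_key] < cutoff]
    test = [row for row in rows if row[date_key] >= cutoff]
    return train, test, cutoff


# Hypothetical rows: one sale every 10 days, starting 2024-01-01.
rows = [
    {"sold_on": date(2024, 1, 1) + timedelta(days=10 * i), "price": 200000 + 5000 * i}
    for i in range(11)
]
train, test, cutoff = time_split(rows, "sold_on")
```

Splitting by date rather than at random keeps later records out of the training set, which matters when the target changes over time.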
Collaborators
John Hogue
Lead Data Scientist, General Mills