The real world is messy, and your job is to make sense of it. Toy datasets like mtcars and iris are the result of careful curation and cleaning; even so, data must be transformed before powerful machine learning algorithms can extract meaning, forecast, classify, or cluster. This course covers the gritty details that data scientists spend 70-80% of their time on: data wrangling and feature engineering. With datasets growing ever larger, let's use PySpark to cut this Big Data problem down to size!
Exploratory Data Analysis
Get to know a bit about your problem before you dive in! Then learn how to statistically and visually inspect your dataset!

- Where to Begin (50 xp)
- Where to begin? (50 xp)
- Check Version (100 xp)
- Load in the data (100 xp)
- Defining A Problem (50 xp)
- What are we predicting? (100 xp)
- Verifying Data Load (100 xp)
- Verifying DataTypes (100 xp)
- Visually Inspecting Data / EDA (50 xp)
- Using Corr() (100 xp)
- Using Visualizations: distplot (100 xp)
- Using Visualizations: lmplot (100 xp)
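The inspection exercises above lean on PySpark's `df.corr()` (plus Seaborn's `distplot` and `lmplot` for visuals). As a rough sketch of what `corr()` computes between two numeric columns, here is the Pearson correlation worked out in plain Python; the square-footage and price values are made up for illustration.

```python
import math

# Hypothetical listing data: living area (sq ft) vs. sale price ($).
sqft = [1100, 1400, 1425, 1550, 1600, 1700, 1875, 2350]
price = [199000, 245000, 319000, 240000, 312000, 279000, 308000, 405000]

def pearson_corr(xs, ys):
    """Pearson correlation coefficient, the statistic behind df.corr('a', 'b')."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A value near +1 suggests the feature moves with the target.
print(round(pearson_corr(sqft, price), 3))
```

In the course itself you would call `df.corr('SQFT', 'LISTPRICE')` on a Spark DataFrame rather than hand-rolling the math.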
Wrangling with Spark Functions
Real data is rarely clean and ready for analysis. In this chapter, learn to remove unneeded information, handle missing values, and add additional data to your analysis.

- Dropping data (50 xp)
- Dropping a list of columns (100 xp)
- Using text filters to remove records (100 xp)
- Filtering numeric fields conditionally (100 xp)
- Adjusting Data (50 xp)
- Custom Percentage Scaling (100 xp)
- Scaling your scalers (100 xp)
- Correcting Right Skew Data (100 xp)
- Working with Missing Data (50 xp)
- Visualizing Missing Data (100 xp)
- Imputing Missing Data (100 xp)
- Calculate Missing Percents (100 xp)
- Getting More Data (50 xp)
- A Dangerous Join (100 xp)
- Spark SQL Join (100 xp)
- Checking for Bad Joins (100 xp)
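To give a feel for the "Custom Percentage Scaling" exercise above: min-max scaling maps a numeric field onto a 0-100 range. The sketch below does the arithmetic in plain Python; in PySpark you would compute the same `min`/`max` with aggregate functions and apply the formula with `withColumn`. The price values are invented for illustration.

```python
def percent_scale(values):
    """Min-max scale a numeric field to 0-100 (custom percentage scaling)."""
    lo, hi = min(values), max(values)
    # Same formula the Spark version applies column-wise:
    # 100 * (value - min) / (max - min)
    return [100 * (v - lo) / (hi - lo) for v in values]

list_prices = [100000, 150000, 250000, 300000]
print(percent_scale(list_prices))  # smallest maps to 0.0, largest to 100.0
```

The "Correcting Right Skew Data" exercise follows the same pattern with a log transform (`pyspark.sql.functions.log`) instead of the min-max formula.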
Feature Engineering
In this chapter, learn how to create new features for your machine learning model to learn from. We'll look at generating them by combining fields, extracting values from messy columns, or encoding them for better results.

- Feature Generation (50 xp)
- Differences (100 xp)
- Ratios (100 xp)
- Deeper Features (100 xp)
- Time Features (50 xp)
- Time Components (100 xp)
- Joining On Time Components (100 xp)
- Date Math (100 xp)
- Extracting Features (50 xp)
- Extracting Text to New Features (100 xp)
- Splitting & Exploding (100 xp)
- Pivot & Join (100 xp)
- Binarizing, Bucketing & Encoding (50 xp)
- Binarizing Day of Week (100 xp)
- Bucketing (100 xp)
- One Hot Encoding (100 xp)
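One-hot encoding, the final exercise above, turns a categorical column into binary indicator columns; in PySpark this is done with `StringIndexer` followed by `OneHotEncoder` from `pyspark.ml.feature`. A minimal plain-Python sketch of the idea, with made-up garage categories:

```python
def one_hot(values):
    """Map each categorical value to a binary indicator vector."""
    categories = sorted(set(values))          # stable category -> index order
    index = {c: i for i, c in enumerate(categories)}
    # One column per category; 1 marks the row's category, 0 elsewhere.
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]

garage_types = ['Attached', 'Detached', 'None', 'Attached']
print(one_hot(garage_types))
```

Spark's encoder additionally drops the last category by default (producing a sparse vector) to avoid redundant columns; the hand-rolled version keeps every column for clarity.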
Building a Model
In this chapter, we'll learn how to choose which type of model we want. Then we will learn how to apply our data to the model and evaluate it. Lastly, we'll learn how to interpret the results and save the model for later!

- Choosing the Algorithm (50 xp)
- Which MLlib Module? (50 xp)
- Creating Time Splits (100 xp)
- Adjusting Time Features (100 xp)
- Feature Engineering Assumptions for RFR (50 xp)
- Feature Engineering For Random Forests (50 xp)
- Dropping Columns with Low Observations (100 xp)
- Naively Handling Missing and Categorical Values (100 xp)
- Building a Model (50 xp)
- Building a Regression Model (100 xp)
- Evaluating & Comparing Algorithms (100 xp)
- Understanding Metrics (50 xp)
- Interpreting, Saving & Loading (50 xp)
- Interpreting Results (100 xp)
- Saving & Loading Models (100 xp)
- Final Thoughts (50 xp)
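The "Understanding Metrics" lesson compares regression models with metrics such as RMSE and R², which PySpark exposes through `RegressionEvaluator` in `pyspark.ml.evaluation`. The underlying math, sketched in plain Python with invented actual/predicted prices:

```python
import math

# Hypothetical test-set sale prices vs. model predictions ($).
actual = [200000, 250000, 310000, 400000]
predicted = [210000, 240000, 300000, 420000]

def rmse(y_true, y_pred):
    """Root mean squared error: typical prediction miss, in the target's units."""
    n = len(y_true)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """R-squared: fraction of the target's variance explained by the model."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot

print(round(rmse(actual, predicted), 2), round(r2(actual, predicted), 3))
```

A lower RMSE and an R² closer to 1 indicate a better fit, which is how the course compares candidate algorithms on the held-out time split.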
This course is part of the Big Data with PySpark track.
John Hogue
Lead Data Scientist, General Mills
I have a strong drive for innovation and giving back. Through my work, I enjoy building out a career path and center of excellence for those in data science at General Mills. I have a passion for taking action and challenging the status quo with fact-based analysis to drive results. Outside of work, I enjoy running an organization that gives aspiring and practicing data scientists opportunities to showcase their skills in a meaningful way.