# Machine Learning with PySpark

Learn how to make predictions from data with Apache Spark, using decision trees, logistic regression, linear regression, ensembles, and pipelines.

4 hours, 16 videos, 56 exercises

## Course Description

## Learn to Use Apache Spark for Machine Learning

Spark is a powerful, general-purpose tool for working with Big Data. It transparently handles the distribution of compute tasks across a cluster, which makes operations fast and lets you focus on the analysis rather than on technical details. In this course you'll learn how to get data into Spark and then delve into three core topics in Spark machine learning: linear regression, logistic regression and other classifiers, and pipelines.

## Build and Test Decision Trees

Building your own decision trees is a great way to start exploring machine learning models. You'll use an algorithm called recursive partitioning: find the predictor in your data that gives the most informative split between the two classes, divide the data on it, and repeat the process at each resulting node. You can then use your decision tree to make predictions on new data.

## Master Logistic and Linear Regression in PySpark

Logistic and linear regression are essential machine learning techniques that are supported by PySpark. You'll learn to build and evaluate logistic regression models, before moving on to creating linear regression models to help you refine your predictors to only the most relevant options.

By the end of the course, you'll feel confident in applying your new-found machine learning knowledge, thanks to hands-on tasks and practice data sets found throughout the course.

- 1
### Introduction

**Free**

Spark is a framework for working with Big Data. In this chapter you'll cover some background about Spark and Machine Learning. You'll then find out how to connect to Spark using Python and load CSV data.

- 2
### Classification

Now that you are familiar with getting data into Spark, you'll move on to building two types of classification model: Decision Trees and Logistic Regression. You'll also find out about a few approaches to data preparation.

| Lesson or exercise | XP |
| --- | --- |
| Data Preparation | 50 |
| Removing columns and rows | 100 |
| Column manipulation | 100 |
| Categorical columns | 100 |
| Assembling columns | 100 |
| Decision Tree | 50 |
| Train/test split | 100 |
| Build a Decision Tree | 100 |
| Evaluate the Decision Tree | 100 |
| Logistic Regression | 50 |
| Build a Logistic Regression model | 100 |
| Evaluate the Logistic Regression model | 100 |
| Turning Text into Tables | 50 |
| Punctuation, numbers and tokens | 100 |
| Stop words and hashing | 100 |
| Training a spam classifier | 100 |

- 3
### Regression

Next you'll learn to create Linear Regression models. You'll also find out how to augment your data by engineering new predictors, and learn a robust approach to selecting only the most relevant ones.

| Lesson or exercise | XP |
| --- | --- |
| One-Hot Encoding | 50 |
| Encoding flight origin | 100 |
| Encoding shirt sizes | 50 |
| Regression | 50 |
| Flight duration model: Just distance | 100 |
| Interpreting the coefficients | 100 |
| Flight duration model: Adding origin airport | 100 |
| Interpreting coefficients | 100 |
| Bucketing & Engineering | 50 |
| Bucketing departure time | 100 |
| Flight duration model: Adding departure time | 100 |
| Regularization | 50 |
| Flight duration model: More features! | 100 |
| Flight duration model: Regularization! | 100 |

- 4
### Ensembles & Pipelines

In the last chapter you'll learn how to make your models more efficient. You'll find out how to use pipelines to make your code clearer and easier to maintain. Then you'll use cross-validation to better test your models and select good model parameters. Lastly, you'll dabble in two types of ensemble model.

| Lesson or exercise | XP |
| --- | --- |
| Pipeline | 50 |
| Flight duration model: Pipeline stages | 100 |
| Flight duration model: Pipeline model | 100 |
| SMS spam pipeline | 100 |
| Cross-Validation | 50 |
| Cross validating simple flight duration model | 100 |
| Cross validating flight duration model pipeline | 100 |
| Grid Search | 50 |
| Optimizing flights linear regression | 100 |
| Dissecting the best flight duration model | 100 |
| SMS spam optimised | 100 |
| How many models for grid search? | 50 |
| Ensemble | 50 |
| Delayed flights with Gradient-Boosted Trees | 100 |
| Delayed flights with a Random Forest | 100 |
| Evaluating Random Forest | 100 |
| Closing thoughts | 50 |

Collaborators

Andrew Collier

Data Scientist @ Exegetic Analytics

Andrew Collier is a Data Scientist, working mostly in R and Python but also dabbling in a wide range of other technologies. When not in front of a computer he spends time with his family and runs obsessively.
