
Big Data Fundamentals with PySpark

Rating: 4+ (19 reviews)
Level: Advanced

Learn the fundamentals of working with big data with PySpark.

4 hours, 16 videos, 55 exercises, 52,260 learners, Statement of Accomplishment



Course Description

Big Data has generated a lot of buzz over the past few years, and it has finally become mainstream at many companies. But what is Big Data? This course covers the fundamentals of Big Data via PySpark. Spark is a "lightning fast cluster computing" framework for Big Data. It provides a general data processing engine and lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop MapReduce. You’ll use PySpark, a Python package for Spark programming, and its powerful higher-level libraries such as SparkSQL and MLlib (for machine learning). You will explore the works of William Shakespeare, analyze FIFA 2018 data, and perform clustering on genomic datasets. By the end of this course, you will have gained an in-depth understanding of PySpark and its application to general Big Data analysis.

In the following Tracks

Big Data with PySpark

  1. 1

    Introduction to Big Data analysis with Spark

    Free

    This chapter introduces the exciting world of Big Data, as well as the various concepts and frameworks for processing it. You will understand why Apache Spark is considered the best framework for Big Data.

    What is Big Data?
    50 xp
    The 3 V's of Big Data
    50 xp
    PySpark: Spark with Python
    50 xp
    Understanding SparkContext
    100 xp
    Interactive Use of PySpark
    100 xp
    Loading data in PySpark shell
    100 xp
    Review of functional programming in Python
    50 xp
    Use of lambda() with map()
    100 xp
    Use of lambda() with filter()
    100 xp
  2. 4

    Machine Learning with PySpark MLlib

    PySpark MLlib is Apache Spark's scalable machine learning library in Python, consisting of common learning algorithms and utilities. In this last chapter, you'll learn important machine learning algorithms. You will build a movie recommendation engine and a spam filter, and use k-means clustering.
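
The functional-programming exercises listed under chapter 1 (lambda with map() and filter()) can be illustrated with a minimal plain-Python sketch. The sample list below is made up for illustration and is not taken from the course; no Spark installation is assumed.

```python
# Hypothetical sample data, not from the course materials.
numbers = [1, 2, 3, 4, 5, 6]

# map() applies the lambda to every element of the list.
squares = list(map(lambda x: x * x, numbers))

# filter() keeps only the elements for which the lambda returns True.
evens = list(filter(lambda x: x % 2 == 0, numbers))

print(squares)  # [1, 4, 9, 16, 25, 36]
print(evens)    # [2, 4, 6]
```

PySpark RDDs expose map() and filter() methods that accept functions in the same way, although running them requires a SparkContext, which this sketch does not set up.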


datasets

Complete Shakespeare, Movie ratings, 5000 points, FIFA 2018, People, Spam, Ham

collaborators

Hadrien Lacroix
Chester Ismay

Upendra Kumar Devisetty

Science Analyst at CyVerse

Upendra Kumar Devisetty is a Science Analyst at CyVerse, where he works with biologists, bioinformaticians, programming teams, and other members of the CyVerse team. He also coordinates development across projects and facilitates integration and cross-communication. His current work focuses mainly on integrative analysis of Big Data using high-throughput methods on advanced computing systems. As scientific computing becomes indispensable for Big Data research, he has started building a community to develop and propagate a set of best practices, including continuous testing, version control, virtualization, sharing code through notebooks, and standard data structures.

Don’t just take our word for it

Rating: 4 from 19 reviews
Rating distribution: 53%, 16%, 11%, 21%, 0%
  • rajesh k.
    3 months

    amazing content

  • Ravdeep S.
    5 months

    The course was just perfect and grew in complexity towards the end. It helped me a lot in understanding the fundamentals of big data with PySpark. I encourage Upendra to offer more courses on DataCamp.

  • Nitin G.
    9 months

    Great content

  • Marcos C.
    11 months

    Very didactic

  • Rodina 2.
    about 1 year

    The content is very good and the explanation is to the point.