Big Data Fundamentals with PySpark
Learn the fundamentals of working with big data with PySpark.
Start Course for Free · 4 hours · 16 videos · 55 exercises · 52,260 learners · Statement of Accomplishment
Course Description
There's been a lot of buzz about Big Data over the past few years, and it's finally become mainstream for many companies. But what is Big Data, exactly? This course covers the fundamentals of Big Data via PySpark. Spark is a "lightning fast cluster computing" framework for Big Data that provides a general data processing engine and lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop. You'll use PySpark, a Python package for Spark programming, along with its powerful higher-level libraries such as Spark SQL and MLlib (for machine learning). You will explore the works of William Shakespeare, analyze FIFA 2018 data, and perform clustering on genomic datasets. By the end of this course, you will have gained an in-depth understanding of PySpark and its application to general Big Data analysis.
- 1
Introduction to Big Data analysis with Spark
Free
This chapter introduces the exciting world of Big Data, as well as the various concepts and different frameworks for processing Big Data. You will understand why Apache Spark is considered the best framework for Big Data.
- 2
Programming in PySpark RDDs
The main abstraction Spark provides is the resilient distributed dataset (RDD), the fundamental data type that the engine is built on. This chapter introduces RDDs and shows how they can be created and executed using RDD Transformations and Actions.
- Abstracting Data with RDDs (50 xp)
- RDDs from Parallelized collections (100 xp)
- RDDs from External Datasets (100 xp)
- Partitions in your data (100 xp)
- Basic RDD Transformations and Actions (50 xp)
- Map and Collect (100 xp)
- Filter and Count (100 xp)
- Pair RDDs in PySpark (50 xp)
- ReduceByKey and Collect (100 xp)
- SortByKey and Collect (100 xp)
- Advanced RDD Actions (50 xp)
- CountingByKeys (100 xp)
- Create a base RDD and transform it (100 xp)
- Remove stop words and reduce the dataset (100 xp)
- Print word frequencies (100 xp)
- 3
PySpark SQL & DataFrames
In this chapter, you'll learn about Spark SQL, a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. This chapter shows how Spark SQL allows you to use DataFrames in Python.
- Abstracting Data with DataFrames (50 xp)
- RDD to DataFrame (100 xp)
- Loading CSV into DataFrame (100 xp)
- Operating on DataFrames in PySpark (50 xp)
- Inspecting data in PySpark DataFrame (100 xp)
- PySpark DataFrame subsetting and cleaning (100 xp)
- Filtering your DataFrame (100 xp)
- Interacting with DataFrames using PySpark SQL (50 xp)
- Running SQL Queries Programmatically (100 xp)
- SQL queries for filtering Table (100 xp)
- Data Visualization in PySpark using DataFrames (50 xp)
- PySpark DataFrame visualization (100 xp)
- Part 1: Create a DataFrame from CSV file (100 xp)
- Part 2: SQL Queries on DataFrame (100 xp)
- Part 3: Data visualization (100 xp)
- 4
Machine Learning with PySpark MLlib
PySpark MLlib is Apache Spark's scalable machine learning library in Python, consisting of common learning algorithms and utilities. Throughout this last chapter, you'll learn important machine learning algorithms: you will build a movie recommendation engine and a spam filter, and use k-means clustering.
- Overview of PySpark MLlib (50 xp)
- PySpark ML libraries (50 xp)
- PySpark MLlib algorithms (100 xp)
- Collaborative filtering (50 xp)
- Loading MovieLens dataset into RDDs (100 xp)
- Model training and predictions (100 xp)
- Model evaluation using MSE (100 xp)
- Classification (50 xp)
- Loading spam and non-spam data (100 xp)
- Feature hashing and LabelPoint (100 xp)
- Logistic Regression model training (100 xp)
- Clustering (50 xp)
- Loading and parsing the 5000 points data (100 xp)
- K-means training (100 xp)
- Visualizing clusters (100 xp)
- Congratulations! (50 xp)
Contributors
Prerequisites
Introduction to Python

Upendra Kumar Devisetty
Science Analyst at CyVerse
Join over 15 million learners and start Big Data Fundamentals with PySpark today!