Scalable Data Processing in R
Learn how to write scalable code for working with big data in R using the bigmemory and iotools packages.
4 hours · 15 videos · 49 exercises · 5,842 learners · Statement of Accomplishment
Course Description
Datasets are often larger than available RAM, which is a problem for R programmers because, by default, R stores all variables in memory. You'll learn tools for processing, exploring, and analyzing data directly from disk. You'll also implement the split-apply-combine approach and learn how to write scalable code using the bigmemory and iotools packages. In this course, you'll make use of the Federal Housing Finance Agency's data, a publicly available data set chronicling all mortgages that were held or securitized by the Federal National Mortgage Association (Fannie Mae) and the Federal Home Loan Mortgage Corporation (Freddie Mac) from 2009 to 2015.
In the following Tracks: Big Data in R

1. Working with increasingly large data sets
Free. In this chapter, we cover why you need to apply new techniques when data sets are larger than available RAM. We show that importing and exporting data using the base R functions can be slow, and present some easy ways to remedy this. Finally, we introduce the bigmemory package.
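As a sketch of the workflow this chapter introduces (assuming the bigmemory package is installed; the small temporary CSV and the file names below are illustrative stand-ins, not the course data):

```r
# Sketch of the bigmemory import/attach workflow (assumes the bigmemory
# package is installed; a tiny temporary CSV stands in for the mortgage file).
library(bigmemory)

csv <- tempfile(fileext = ".csv")
write.csv(data.frame(year = 2009:2013, amount = 11:15), csv, row.names = FALSE)

# Import into a file-backed big.matrix: the data live on disk, not in RAM.
x <- read.big.matrix(csv, header = TRUE, type = "double",
                     backingfile = "mortgages.bin",
                     descriptorfile = "mortgages.desc",
                     backingpath = tempdir())

# Any later R session (or another R process) can attach the same backing
# file almost instantly, with no re-import.
y <- attach.big.matrix("mortgages.desc", backingpath = tempdir())
dim(y)
```

Because the big.matrix is file-backed, attaching it in a fresh session is nearly instantaneous, whereas re-importing a multi-gigabyte CSV with `read.csv()` could take minutes.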
- What is Scalable Data Processing? (50 xp)
- Why is your code slow? (50 xp)
- How does processing time vary by data size? (100 xp)
- Working with "Out-of-Core" Objects using the Bigmemory Project (50 xp)
- Reading a big.matrix object (100 xp)
- Attaching a big.matrix object (100 xp)
- Creating tables with big.matrix objects (100 xp)
- Data summary using bigsummary (100 xp)
- References vs. Copies (50 xp)
- Copying matrices and big matrices (100 xp)

2. Processing and Analyzing Data with bigmemory
Now that you've got some experience using bigmemory, we're going to go through some simple data exploration and analysis techniques. In particular, we'll see how to create tables and implement the split-apply-combine approach.
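The split-apply-combine approach can be sketched in base R on a toy data frame (the column names here are illustrative, not the actual FHFA schema); with bigmemory, the same three steps run against a big.matrix instead:

```r
# Split-apply-combine in base R, with a toy stand-in for the mortgage data
# (illustrative columns, not the real FHFA schema).
borrowers <- data.frame(
  year   = c(2009, 2009, 2010, 2010, 2010),
  female = c(1, 0, 1, 1, 0)
)

# Split: partition the row indices by year.
rows_by_year <- split(seq_len(nrow(borrowers)), borrowers$year)

# Apply: compute the proportion of female borrowers within each group.
props <- lapply(rows_by_year, function(i) mean(borrowers$female[i]))

# Combine: collapse the per-group results into a single named vector.
result <- unlist(props)
result
```

Splitting on row indices, rather than on the values themselves, is what lets the same pattern scale: each index block can be pulled from a disk-backed big.matrix one group at a time.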
- The Bigmemory Suite of Packages (50 xp)
- Tabulating using bigtable (100 xp)
- Borrower Race and Ethnicity by Year (I) (100 xp)
- Split-Apply-Combine (50 xp)
- Female Proportion Borrowing (100 xp)
- Split (100 xp)
- Apply (100 xp)
- Combine (100 xp)
- Visualize your results using the tidyverse (50 xp)
- Visualizing Female Proportion Borrowing (100 xp)
- The Borrower Income Ratio (100 xp)
- Tidy Big Tables (100 xp)
- Limitations of bigmemory (50 xp)
- Where should you use bigmemory? (50 xp)

3. Working with iotools
We'll use the iotools package, which can process both numeric and string data, and introduce the concept of chunk-wise processing.
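The idea behind chunk-wise processing is that some operations are "foldable": they can be rebuilt exactly from per-chunk partial results. A mean, for example, is not computed per chunk and averaged; each chunk contributes a partial sum and a count, which combine exactly. A base R sketch (iotools's chunk.apply plays the same role over chunks read from disk):

```r
# A foldable operation: the overall mean reconstructed exactly from
# per-chunk partial sums and counts.
x <- 1:10000
chunks <- split(x, ceiling(seq_along(x) / 1000))  # ten chunks of 1000 values

# Apply: each chunk yields a (partial sum, count) pair.
partials <- lapply(chunks, function(ch) c(sum = sum(ch), n = length(ch)))

# Combine: fold the partials together, then finish the computation.
totals <- Reduce(`+`, partials)
chunk_mean <- totals[["sum"]] / totals[["n"]]
chunk_mean == mean(x)  # TRUE
```

By contrast, a median is not foldable: per-chunk medians cannot, in general, be combined into the exact overall median, which is why this chapter asks whether an operation can be split-compute-combined before chunking it.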
- Introduction to chunk-wise processing (50 xp)
- Can you split-compute-combine it? (50 xp)
- Foldable operations (I) (100 xp)
- Foldable operations (II) (100 xp)
- A first look at iotools: Importing data (50 xp)
- Compare read.delim() and read.delim.raw() (100 xp)
- Reading raw data and turning it into a data structure (100 xp)
- chunk.apply (50 xp)
- Reading chunks in as a matrix (100 xp)
- Reading chunks in as a data.frame (100 xp)
- Parallelizing calls to chunk.apply (100 xp)

4. Case Study: A Preliminary Analysis of the Housing Data
In the previous chapters, we introduced the housing data and showed how to compute with data that is about as big as, or bigger than, the amount of RAM on a single machine. In this chapter, we'll go through a preliminary analysis of the data, comparing various trends over time.
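A minimal sketch of the kind of adjusted comparison this chapter performs, using base R's prop.table() on toy counts (not the real FHFA figures): raw yearly counts are converted to within-year proportions, so years with different loan volumes can be compared fairly.

```r
# Toy yearly counts (illustrative values, not FHFA data): cross-tabulate
# loans by year and an urban indicator.
counts <- table(
  year  = c(2009, 2009, 2009, 2010, 2010),
  urban = c(1, 1, 0, 1, 0)
)

# Adjust for volume: convert counts to proportions within each year
# (margin = 1 normalizes across rows, i.e., within each year).
props <- prop.table(counts, margin = 1)
props
```

The same normalization applied to the real tables produced by bigtable() is what turns raw borrower counts into comparable demographic trends across 2009-2015.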
- Overview of types of analysis for this chapter (50 xp)
- Race and Ethnic Representation in the Mortgage Data (100 xp)
- Comparing the Borrower Race/Ethnicity and their Proportions (100 xp)
- Are the data missing at random? (50 xp)
- Looking for Predictable Missingness (100 xp)
- A little more about missingness (50 xp)
- Analyzing the Housing Data (50 xp)
- Borrower Race and Ethnicity by Year (II) (100 xp)
- Visualizing the Adjusted Demographic Trends (100 xp)
- Relative change in demographic trend (100 xp)
- Borrower Lending Trends: City vs. Rural (50 xp)
- Borrower Region by Year (100 xp)
- Who is securing federally guaranteed loans? (100 xp)
- Congratulations! (50 xp)
Michael Kane
Assistant Professor at Yale University
Simon Urbanek
Member of the R-Core; Lead Inventive Scientist at AT&T Labs Research
Join over 15 million learners and start Scalable Data Processing in R today!