Web Scraping in R

Learn how to efficiently collect and download data from any website using R.

4 hours | 13 videos | 45 exercises | 12,842 learners | Statement of Accomplishment

Course Description

Have you ever come across a website that displays a lot of data, such as statistics, product reviews, or prices, in a format that's not ready for data analysis? Often, authorities and other data providers publish their data in neatly formatted tables. Not all of these sites include a download button, but don't despair. In this course, you'll learn how to efficiently collect and download data from any website using R. You'll learn how to automate the scraping and parsing of Wikipedia using the rvest and httr packages. Through hands-on exercises, you'll also expand your understanding of HTML and CSS, the building blocks of web pages, as you make your data harvesting workflows less error-prone and more efficient.

In the following tracks

R Developer

1
Introduction to HTML and Web Scraping
Free
In this chapter, you'll be introduced to HyperText Markup Language (HTML), a declarative language used to structure modern websites. Using the rvest package, you'll learn how to query simple HTML elements and scrape your first table.
Introduction to HTML
50 xp
Read in HTML
100 xp
Beware of syntax errors!
50 xp
Navigating HTML
50 xp
Select all children of a list
100 xp
Parse hyperlinks into a data frame
100 xp
Scrape your first table
50 xp
The right order of table elements
100 xp
Turn a table into a data frame with html_table()
100 xp
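
As a rough illustration of what this chapter covers, the sketch below reads a small HTML snippet with rvest, turns its table into a data frame, and collects its hyperlinks. The HTML, the item names, the prices, and the example.com URLs are all invented for this example and are not taken from the course.

```r
library(rvest)

# Hypothetical HTML snippet, made up purely for illustration
html <- minimal_html('
  <table>
    <tr><th>item</th><th>price</th></tr>
    <tr><td>apple</td><td>1</td></tr>
    <tr><td>pear</td><td>2</td></tr>
  </table>
  <ul>
    <li><a href="https://example.com/a">Page A</a></li>
    <li><a href="https://example.com/b">Page B</a></li>
  </ul>')

# Select the first table and convert it to a data frame;
# the <th> row is used as the header
prices <- html %>%
  html_element("table") %>%
  html_table()

# Select all hyperlinks and collect their text and target into a data frame
links <- html %>% html_elements("a")
link_df <- data.frame(
  text = html_text(links),
  url  = html_attr(links, "href")
)
```

On a real page, the call to minimal_html() would typically be replaced by read_html() with the page's URL.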
2
Navigation and Selection with CSS
Cascading Style Sheets (CSS) describe how HTML elements are displayed on a web page, including colors, fonts, and general layout. In this chapter, you'll learn why CSS selectors and combinators are a crucial ingredient for web scraping.
Introduction to CSS
50 xp
Select multiple HTML types
100 xp
Order CSS selectors by the number of results
100 xp
CSS classes and IDs
50 xp
Identify the correct selector types
100 xp
Leverage the uniqueness of IDs
100 xp
Select the last child with a pseudo-class
100 xp
CSS combinators
50 xp
Select direct descendants with the child combinator
100 xp
How many elements get returned?
50 xp
Simply the best!
100 xp
Not every sibling is the same
100 xp
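
The selector types named in this chapter can be sketched roughly as follows. The HTML, the class name intro, and the ID menu are assumptions made up for this example; only the rvest functions are real.

```r
library(rvest)

# Hypothetical HTML snippet, made up purely for illustration
html <- minimal_html('
  <div id="menu">
    <p class="intro">first</p>
    <p>second</p>
    <div><p>nested</p></div>
    <p>last</p>
  </div>')

# Class selector: elements with class "intro"
by_class <- html %>% html_elements(".intro") %>% html_text()

# Child combinator: only p elements that are direct children of #menu
direct <- html %>% html_elements("#menu > p") %>% html_text()

# Descendant combinator: every p anywhere below #menu, nested ones included
all_p <- html %>% html_elements("#menu p") %>% html_text()

# Adjacent sibling combinator: the p immediately following p.intro
sibling <- html %>% html_elements("p.intro + p") %>% html_text()

# Pseudo-class: the direct child p that is the last child of #menu
last <- html %>% html_elements("#menu > p:last-child") %>% html_text()
```

Comparing direct and all_p shows why the choice of combinator matters: the child combinator skips the nested paragraph, while the descendant combinator includes it.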
3
Advanced Selection with XPATH
The CSS selectors you got to know in the last chapter are powerful but have their limitations, for example when you want to select nodes based on the properties of their descendants. XPath to the rescue! Using this query language, you can navigate and scrape even the most hideous HTML.
Introduction to XPATH
50 xp
Find the correct CSS equivalent
100 xp
Select by class and ID with XPATH
100 xp
Use predicates to select nodes based on their children
100 xp
XPATH functions and advanced predicates
50 xp
Find a more elegant XPATH alternative
50 xp
Get to know the position() function
100 xp
Extract nodes based on the number of their children
100 xp
The XPATH text() function
50 xp
The shortcomings of html_table() with badly structured tables
100 xp
Select directly from a parent element with XPATH's text()
100 xp
Combine extracted data into a data frame
100 xp
Scrape an element based on its text
100 xp
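
A minimal sketch of the XPath ideas listed above (predicates on children, count(), and text()) might look like this. The HTML and the class name entry are invented for the example and do not come from the course material.

```r
library(rvest)

# Hypothetical HTML snippet, made up purely for illustration
html <- minimal_html('
  <div class="entry"><h2>A</h2><p>one</p></div>
  <div class="entry"><p>two</p><p>three</p></div>
  <div class="entry">loose text<span>inside span</span></div>')

# Predicate on children: divs that have an h2 child
with_h2 <- html %>% html_elements(xpath = "//div[h2]")

# count(): divs with exactly two p children
two_p <- html %>% html_elements(xpath = "//div[count(p) = 2]")

# text(): text nodes sitting directly inside the divs,
# ignoring text wrapped in child elements such as the span
direct_text <- html %>%
  html_elements(xpath = "//div[@class = 'entry']/text()") %>%
  html_text()

# Select an element based on its own text
by_text <- html %>% html_elements(xpath = "//p[text() = 'three']")
```

The first two queries select parents by looking at their children, which is the kind of selection the chapter description says CSS selectors cannot express.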
4
Scraping Best Practices
Now that you know how to extract content from web pages, it's time to look behind the curtain. In this final chapter, you'll learn why HTTP requests are the foundation of every scraping action and how they can be customized to comply with best practices in web scraping.
The nature of HTTP requests
50 xp
Which of these statements about HTTP is false?
50 xp
Do it the httr way
100 xp
Houston, we got a 404!
100 xp
Telling who you are with custom user agents
50 xp
Check out your user agent
100 xp
Add a custom user agent
100 xp
How to be gentle and slow down your requests
50 xp
Custom arguments for throttled functions
50 xp
Apply throttling to a multi-page crawler
100 xp
Recap: Web Scraping in R
50 xp
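
The practices named in this chapter (checking the HTTP status, sending a custom user agent, and throttling requests) could be combined along these lines. This is only a sketch: the helper names fetch_page and fetch_slowly, the user agent string, the three-second delay, and the example.com URLs are assumptions chosen for illustration, and throttling is done here with purrr's slowly().

```r
library(httr)
library(purrr)

# Hypothetical user agent string; replace with your own contact details
ua <- user_agent("my-scraper (contact: me@example.com)")

# Hypothetical helper: request a page, identify ourselves,
# and stop with an error on an HTTP error status such as 404
fetch_page <- function(url) {
  response <- GET(url, ua)
  if (http_error(response)) {
    stop("Request failed with status ", status_code(response))
  }
  content(response, as = "text", encoding = "UTF-8")
}

# Throttled variant: pauses about 3 seconds between consecutive calls
fetch_slowly <- slowly(fetch_page, rate = rate_delay(3))

# Usage on a multi-page crawl (placeholder URLs, not run here):
# urls  <- c("https://example.com/page1", "https://example.com/page2")
# pages <- map(urls, fetch_slowly)
```

Wrapping the request logic in one function and throttling that function keeps the delay in a single place, so every page in the crawl is fetched at the same gentle pace.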

Collaborators

Maggie Matsui

Amy Peterson

Prerequisites

Intermediate R

Introduction to the Tidyverse

Timo Grossenbacher

Head of Newsroom Automation at Tamedia
