

Feature Engineering for NLP in Python

Learn techniques to extract useful information from text and process it into a format suitable for machine learning.


4 hours, 15 videos, 52 exercises, 25,042 learners, Statement of Accomplishment


Course Description

In this course, you will learn techniques for extracting useful information from text and processing it into a format suitable for machine learning models. More specifically, you will learn about POS tagging, named entity recognition, readability scores, and the n-gram and tf-idf models, and how to implement them using scikit-learn and spaCy. You will also learn to compute how similar two documents are to each other. In the process, you will predict the sentiment of movie reviews and build movie and TED Talk recommenders. After the course, you will be able to engineer critical features out of any text and solve some of the most challenging problems in data science!


In the following tracks

Machine Learning Scientist with Python

Natural Language Processing in Python

1
Basic features and readability scores
Free
Learn to compute basic features such as number of words, number of characters, average word length and number of special characters (such as Twitter hashtags and mentions). You will also learn to compute readability scores and determine the amount of education required to comprehend a piece of text.
Introduction to NLP feature engineering
50 xp
Data format for ML algorithms
50 xp
One-hot encoding
100 xp
Basic feature extraction
50 xp
Character count of Russian tweets
100 xp
Word count of TED talks
100 xp
Hashtags and mentions in Russian tweets
100 xp
Readability tests
50 xp
Readability of 'The Myth of Sisyphus'
100 xp
Readability of various publications
100 xp
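
As a rough illustration of the basic features this chapter describes, the sketch below computes word count, character count, average word length, and the number of hashtags and mentions for a single string. The function name `basic_features` and the sample sentence are made up for this example and are not taken from the course exercises; splitting on whitespace is a simplifying assumption.

```python
def basic_features(text):
    """Return simple count-based features for one piece of text."""
    words = text.split()  # naive whitespace tokenization (an assumption)
    num_words = len(words)
    num_chars = len(text)  # includes spaces and punctuation
    avg_word_length = sum(len(w) for w in words) / num_words if words else 0.0
    # A lone "#" or "@" is not counted as a hashtag or mention.
    hashtags = [w for w in words if w.startswith("#") and len(w) > 1]
    mentions = [w for w in words if w.startswith("@") and len(w) > 1]
    return {
        "num_words": num_words,
        "num_chars": num_chars,
        "avg_word_length": avg_word_length,
        "num_hashtags": len(hashtags),
        "num_mentions": len(mentions),
    }


# Hypothetical sample text, not from the course datasets.
sample = "Loving the #NLP course by @DataCamp so far"
features = basic_features(sample)
print(features)
```

Applied to a dataframe column, a function like this would typically be mapped over every row to produce one numeric column per feature.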
2
Text preprocessing, POS tagging and NER
In this chapter, you will learn about tokenization and lemmatization. You will then learn how to perform text cleaning, part-of-speech tagging, and named entity recognition using the spaCy library. After mastering these concepts, you will make the Gettysburg Address machine-friendly, analyze noun usage in fake news, and identify people mentioned in a TechCrunch article.
Tokenization and Lemmatization
50 xp
Identifying lemmas
50 xp
Tokenizing the Gettysburg Address
100 xp
Lemmatizing the Gettysburg Address
100 xp
Text cleaning
50 xp
Cleaning a blog post
100 xp
Cleaning TED talks in a dataframe
100 xp
Part-of-speech tagging
50 xp
POS tagging in Lord of the Flies
100 xp
Counting nouns in a piece of text
100 xp
Noun usage in fake news
100 xp
Named entity recognition
50 xp
Named entities in a sentence
100 xp
Identifying people mentioned in a news article
100 xp
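
The chapter uses spaCy for these steps. The sketch below is a plain-Python stand-in, not spaCy's API: it shows only tokenization and a simple form of text cleaning (lowercasing, then dropping punctuation and stopwords). The regex tokenizer, the tiny `STOPWORDS` set, and the sample sentence are assumptions made for illustration; lemmatization, POS tagging, and named entity recognition need a trained language model and are not reproduced here.

```python
import re

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "that", "on", "this"}


def tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def clean(text):
    """Lowercase, tokenize, and keep only alphabetic non-stopword tokens."""
    tokens = tokenize(text.lower())
    return [t for t in tokens if t.isalpha() and t not in STOPWORDS]


# Illustrative sentence, not a course exercise.
sentence = "Four score and seven years ago, our fathers brought forth a new nation."
tokens = tokenize(sentence)
cleaned = clean(sentence)
print(tokens)
print(cleaned)
```

A library tokenizer handles contractions, abbreviations, and URLs far better than this regex, which is one reason to prefer it in practice.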
3
N-Gram models
Learn about n-gram modeling and use it to perform sentiment analysis on movie reviews.
Building a bag of words model
50 xp
Word vectors with a given vocabulary
50 xp
BoW model for movie taglines
100 xp
Analyzing dimensionality and preprocessing
100 xp
Mapping feature indices with feature names
100 xp
Building a BoW Naive Bayes classifier
50 xp
BoW vectors for movie reviews
100 xp
Predicting the sentiment of a movie review
100 xp
Building n-gram models
50 xp
n-gram models for movie taglines
100 xp
Higher order n-grams for sentiment analysis
100 xp
Comparing performance of n-gram models
100 xp
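
To make the bag-of-words and n-gram ideas concrete, here is a minimal sketch written with the standard library only. It is not scikit-learn's vectorizer; the helper names `ngrams` and `bag_of_ngrams` and the two-sentence corpus are invented for this example. Each document becomes a vector of counts over a shared, alphabetically sorted vocabulary.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(corpus, n_min=1, n_max=1):
    """Return (vocabulary, count vectors) for n-grams of size n_min..n_max."""
    docs = []
    for doc in corpus:
        tokens = doc.lower().split()  # naive whitespace tokenization
        grams = []
        for n in range(n_min, n_max + 1):
            grams.extend(ngrams(tokens, n))
        docs.append(Counter(grams))
    vocabulary = sorted(set().union(*docs))
    # Counter returns 0 for terms a document does not contain.
    vectors = [[counts[term] for term in vocabulary] for counts in docs]
    return vocabulary, vectors


# Hypothetical mini-corpus of two reviews.
corpus = ["the movie was good", "the movie was not good"]
vocab_uni, bow = bag_of_ngrams(corpus)
vocab_bi, bigram_vectors = bag_of_ngrams(corpus, n_min=1, n_max=2)
print(vocab_uni, bow)
print(len(vocab_bi))
```

With unigrams alone the two reviews differ only in the count for "not"; adding bigrams introduces features such as "not good", at the cost of a larger vocabulary.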
4
TF-IDF and similarity scores
Learn how to compute tf-idf weights and the cosine similarity score between two vectors. You will use these concepts to build a movie recommender and a TED Talk recommender. Finally, you will learn about word embeddings and use word vector representations to compute similarities between various Pink Floyd songs.
Building tf-idf document vectors
50 xp
tf-idf weight of commonly occurring words
50 xp
tf-idf vectors for TED talks
100 xp
Cosine similarity
50 xp
Range of cosine scores
50 xp
Computing dot product
100 xp
Cosine similarity matrix of a corpus
100 xp
Building a plot line based recommender
50 xp
Comparing linear_kernel and cosine_similarity
100 xp
Plot recommendation engine
100 xp
The recommender function
100 xp
TED talk recommender
100 xp
Beyond n-grams: word embeddings
50 xp
Generating word vectors
100 xp
Computing similarity of Pink Floyd songs
100 xp
Congratulations!
50 xp
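
The tf-idf and cosine similarity ideas from this chapter can be sketched without any libraries. The code below uses the textbook weight tf × log(N / df), which is an assumption for this example; scikit-learn applies a smoothed and normalized variant, so its numbers will differ. The function names and the three-sentence corpus are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(corpus):
    """Return (vocabulary, vectors) using the weight tf * log(N / df)."""
    docs = [Counter(doc.lower().split()) for doc in corpus]
    vocabulary = sorted(set().union(*docs))
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = {term: sum(1 for d in docs if term in d) for term in vocabulary}
    vectors = [
        [d[term] * math.log(n_docs / doc_freq[term]) for term in vocabulary]
        for d in docs
    ]
    return vocabulary, vectors


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical mini-corpus.
corpus = [
    "the lion is the king of the jungle",
    "the lion is an endangered species",
    "the tiger lives in the jungle",
]
vocabulary, vectors = tfidf_vectors(corpus)
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(sim_01, sim_02)
```

Because "the" appears in every document, its weight is zero everywhere, which illustrates why commonly occurring words contribute little under tf-idf. Since the weights are non-negative, the cosine scores here fall between 0 and 1, and a recommender would rank documents by this score.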

Datasets

Russian Troll Tweets
Movie Overviews and Taglines
Preprocessed Movie Reviews
TED Talk Transcripts
Real and Fake News Headlines

Collaborators

Adrián Soto

Hillary Green-Lerman

Prerequisites

Introduction to Natural Language Processing in Python
Supervised Learning with scikit-learn

Data Scientist at Fractal Analytics
