Feature Engineering for NLP in Python
Learn techniques to extract useful information from text and process it into a format suitable for machine learning.
4 hours | 15 videos | 52 exercises | 25,042 learners | Statement of Accomplishment
Course Description
In this course, you will learn techniques for extracting useful information from text and processing it into a format suitable for machine learning models. More specifically, you will learn about POS tagging, named entity recognition, readability scores, and the n-gram and tf-idf models, and how to implement them using scikit-learn and spaCy. You will also learn to compute how similar two documents are to each other. In the process, you will predict the sentiment of movie reviews and build movie and TED Talk recommenders. After the course, you will be able to engineer critical features out of any text and solve some of the most challenging problems in data science.
In the following tracks
Machine Learning Scientist with Python
Natural Language Processing in Python
Chapter 1
Basic features and readability scores
Learn to compute basic features such as number of words, number of characters, average word length, and number of special characters (such as Twitter hashtags and mentions). You will also learn to compute readability scores and determine the amount of education required to comprehend a piece of text.
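To make these features concrete, here is a minimal sketch using only the Python standard library. The function names `basic_features`, `count_syllables`, and `flesch_reading_ease` are hypothetical and not taken from the course. The readability part uses the standard Flesch reading ease formula, but the syllable counter is a crude vowel-group heuristic chosen for illustration; a dedicated readability library would be more accurate.

```python
import re


def basic_features(text):
    """Compute simple surface features of a text (whitespace tokenization)."""
    words = text.split()
    return {
        "num_chars": len(text),
        "num_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        # A hashtag or mention is a token starting with '#' or '@' plus at least one character.
        "num_hashtags": sum(1 for w in words if w.startswith("#") and len(w) > 1),
        "num_mentions": sum(1 for w in words if w.startswith("@") and len(w) > 1),
    }


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels (assumption)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch reading ease: higher scores indicate easier text."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))


tweet = "Loving the #NLP course by @DataCamp today"
features = basic_features(tweet)

simple_score = flesch_reading_ease("The cat sat on the mat.")
complex_score = flesch_reading_ease(
    "Comprehensive computational methodologies necessitate considerable deliberation."
)
```

In this sketch the short, monosyllabic sentence scores higher (easier to read) than the sentence built from long words, which is the behavior a readability score is meant to capture.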
- Introduction to NLP feature engineering (50 xp)
- Data format for ML algorithms (50 xp)
- One-hot encoding (100 xp)
- Basic feature extraction (50 xp)
- Character count of Russian tweets (100 xp)
- Word count of TED talks (100 xp)
- Hashtags and mentions in Russian tweets (100 xp)
- Readability tests (50 xp)
- Readability of 'The Myth of Sisyphus' (100 xp)
- Readability of various publications (100 xp)
Chapter 2
Text preprocessing, POS tagging and NER
In this chapter, you will learn about tokenization and lemmatization. You will then learn how to perform text cleaning, part-of-speech tagging, and named entity recognition using the spaCy library. After mastering these concepts, you will make the Gettysburg Address machine-friendly, analyze noun usage in fake news, and identify people mentioned in a TechCrunch article.
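As a rough illustration of the tokenization and text-cleaning steps, the sketch below uses only the standard library rather than spaCy, so it is a stand-in and not the course's approach. The `tokenize` and `clean` helpers and the tiny `STOPWORDS` set are hypothetical. Lemmatization, POS tagging, and named entity recognition need a trained pipeline such as the ones spaCy provides, and are not attempted here.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "is", "are", "we", "that"}


def tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def clean(text):
    """Lowercase the tokens, then drop non-alphabetic tokens and stop words."""
    return [
        tok.lower()
        for tok in tokenize(text)
        if tok.isalpha() and tok.lower() not in STOPWORDS
    ]


sentence = "Four score and seven years ago, our fathers brought forth"
tokens = tokenize(sentence)
cleaned = clean(sentence)
```

Here `tokens` keeps the comma as its own token, while `cleaned` drops both the comma and the stop word "and".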
- Tokenization and Lemmatization (50 xp)
- Identifying lemmas (50 xp)
- Tokenizing the Gettysburg Address (100 xp)
- Lemmatizing the Gettysburg address (100 xp)
- Text cleaning (50 xp)
- Cleaning a blog post (100 xp)
- Cleaning TED talks in a dataframe (100 xp)
- Part-of-speech tagging (50 xp)
- POS tagging in Lord of the Flies (100 xp)
- Counting nouns in a piece of text (100 xp)
- Noun usage in fake news (100 xp)
- Named entity recognition (50 xp)
- Named entities in a sentence (100 xp)
- Identifying people mentioned in a news article (100 xp)
Chapter 3
N-Gram models
Learn about n-gram modeling and use it to perform sentiment analysis on movie reviews.
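The idea behind bag-of-words and n-gram features can be sketched without any library. The helpers `ngrams` and `bag_of_ngrams` below are hypothetical names; the course itself works with scikit-learn, whose vectorizers handle tokenization and vocabulary building more carefully than this whitespace-based version.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(docs, n_min=1, n_max=1):
    """Build a sorted vocabulary and one count vector per document."""
    counts = []
    for doc in docs:
        tokens = doc.lower().split()
        doc_counts = Counter()
        for n in range(n_min, n_max + 1):
            doc_counts.update(ngrams(tokens, n))
        counts.append(doc_counts)
    vocab = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocab] for c in counts]
    return vocab, vectors


docs = ["the movie was good", "the movie was not good"]
vocab_uni, vec_uni = bag_of_ngrams(docs)                    # unigrams only
vocab_bi, vec_bi = bag_of_ngrams(docs, n_min=1, n_max=2)    # unigrams and bigrams
```

With unigrams alone, the two reviews differ only by the token "not". Adding bigrams introduces features such as "not good", which is one reason higher-order n-grams can help with sentiment, at the cost of a larger vocabulary.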
- Building a bag of words model (50 xp)
- Word vectors with a given vocabulary (50 xp)
- BoW model for movie taglines (100 xp)
- Analyzing dimensionality and preprocessing (100 xp)
- Mapping feature indices with feature names (100 xp)
- Building a BoW Naive Bayes classifier (50 xp)
- BoW vectors for movie reviews (100 xp)
- Predicting the sentiment of a movie review (100 xp)
- Building n-gram models (50 xp)
- n-gram models for movie tag lines (100 xp)
- Higher order n-grams for sentiment analysis (100 xp)
- Comparing performance of n-gram models (100 xp)
Chapter 4
TF-IDF and similarity scores
Learn how to compute tf-idf weights and the cosine similarity score between two vectors. You will use these concepts to build a movie recommender and a TED Talk recommender. Finally, you will learn about word embeddings and use word vector representations to compute similarities between various Pink Floyd songs.
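The following sketch shows one common tf-idf variant (raw term count multiplied by log(N / document frequency)) together with cosine similarity, using only the standard library. The function names and the three toy documents are made up for illustration. scikit-learn, which the course uses, applies a smoothed idf and normalizes vectors by default, so its numbers would differ from these.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return a sorted vocabulary and one tf-idf vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for tokens in tokenized for tok in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for tokens in tokenized for tok in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "space rockets and planets",
    "planets and distant stars",
    "cooking pasta with tomatoes",
]
vocab, vectors = tfidf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])  # documents share some terms
sim_02 = cosine_similarity(vectors[0], vectors[2])  # documents share no terms
```

A recommender built on this idea would compute the similarity between one document and every other document, then return the highest-scoring ones.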
- Building tf-idf document vectors (50 xp)
- tf-idf weight of commonly occurring words (50 xp)
- tf-idf vectors for TED talks (100 xp)
- Cosine similarity (50 xp)
- Range of cosine scores (50 xp)
- Computing dot product (100 xp)
- Cosine similarity matrix of a corpus (100 xp)
- Building a plot line based recommender (50 xp)
- Comparing linear_kernel and cosine_similarity (100 xp)
- Plot recommendation engine (100 xp)
- The recommender function (100 xp)
- TED talk recommender (100 xp)
- Beyond n-grams: word embeddings (50 xp)
- Generating word vectors (100 xp)
- Computing similarity of Pink Floyd songs (100 xp)
- Congratulations! (50 xp)
Datasets
- Russian Troll Tweets
- Movie Overviews and Taglines
- Preprocessed Movie Reviews
- TED Talk Transcripts
- Real and Fake News Headlines
Contributors
Rounak Banik
Data Scientist at Fractal Analytics