Case Study: School Budgeting with Machine Learning in Python
Learn how to build a model to automatically classify items in a school budget.
4 hours · 15 videos · 51 exercises · 59,259 learners · Statement of Accomplishment
Course Description
Data science isn't just for predicting ad clicks: it's also useful for social impact! This course is a case study from a machine learning competition on DrivenData. You'll explore a problem related to school district budgeting. A model that automatically classifies items in a school's budget makes it easier and faster for schools to compare their spending with other schools. In this course, you'll begin by building a baseline model, a simple first-pass approach. In particular, you'll do some natural language processing to prepare the budgets for modeling. Next, you'll have the opportunity to try your own techniques and see how they compare to those of participants in the competition. Finally, you'll see how the winner combined a number of expert techniques to build the most accurate model.
Exploring the raw data
In this chapter, you'll be introduced to the problem you'll be solving in this course: how do you accurately classify line items in a school budget based on what that money is being used for? You will explore the raw text and numeric values in the dataset, both quantitatively and visually. And you'll learn how to measure success when trying to predict class labels for each row of the dataset.
- Introducing the challenge (50 xp)
- What category of problem is this? (50 xp)
- What is the goal of the algorithm? (50 xp)
- Exploring the data (50 xp)
- Loading the data (50 xp)
- Summarizing the data (100 xp)
- Looking at the datatypes (50 xp)
- Exploring datatypes in pandas (50 xp)
- Encode the labels as categorical variables (100 xp)
- Counting unique labels (100 xp)
- How do we measure success? (50 xp)
- Penalizing highly confident wrong answers (50 xp)
- Computing log loss with NumPy (100 xp)
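The chapter closes with log loss, the competition's error metric. A minimal sketch of the idea (the course's own exercise function may differ in details): clip predictions away from 0 and 1 so the logarithm stays finite, then average the negative log-likelihood.

```python
import numpy as np

def compute_log_loss(predicted, actual, eps=1e-14):
    """Binary log loss, averaged over observations.

    Predictions are clipped to [eps, 1 - eps] so that
    np.log never receives exactly 0 or 1.
    """
    predicted = np.clip(predicted, eps, 1 - eps)
    return -np.mean(actual * np.log(predicted)
                    + (1 - actual) * np.log(1 - predicted))

# A confidently wrong prediction is penalized far more than an unsure one:
confident_wrong = compute_log_loss(np.array([0.9]), np.array([0]))  # ~2.30
unsure_wrong = compute_log_loss(np.array([0.5]), np.array([0]))     # ~0.69
```

This asymmetry is the point of the "penalizing highly confident wrong answers" exercise: hedging toward 0.5 is safer than a bold wrong guess.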
Creating a simple first model
In this chapter, you'll build a first-pass model. You'll use numeric data only to train the model. Spoiler alert: throwing out all of the text data is bad for performance! But you'll learn how to format your predictions. Then, you'll be introduced to natural language processing (NLP) in order to start working with the large amounts of text in the data.
- It's time to build a model (50 xp)
- Setting up a train-test split in scikit-learn (100 xp)
- Training a model (100 xp)
- Making predictions (50 xp)
- Use your model to predict values on holdout data (100 xp)
- Writing out your results to a csv for submission (100 xp)
- A very brief introduction to NLP (50 xp)
- Tokenizing text (50 xp)
- Testing your NLP credentials with n-grams (50 xp)
- Representing text numerically (50 xp)
- Creating a bag-of-words in scikit-learn (100 xp)
- Combining text columns for tokenization (100 xp)
- What's in a token? (100 xp)
Improving your model
Here, you'll improve on your benchmark model using pipelines. Because the budget consists of both text and numeric data, you'll learn how to build pipelines that process multiple types of data. You'll also explore how the flexibility of the pipeline workflow makes testing different approaches efficient, even in complicated problems like this one!
- Pipelines, feature & text preprocessing (50 xp)
- Instantiate pipeline (100 xp)
- Preprocessing numeric features (100 xp)
- Text features and feature unions (50 xp)
- Preprocessing text features (100 xp)
- Multiple types of processing: FunctionTransformer (100 xp)
- Multiple types of processing: FeatureUnion (100 xp)
- Choosing a classification model (50 xp)
- Using FunctionTransformer on the main dataset (100 xp)
- Add a model to the pipeline (100 xp)
- Try a different class of model (100 xp)
- Can you adjust the model or parameters to improve accuracy? (100 xp)
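The `FunctionTransformer` + `FeatureUnion` pattern the chapter covers can be sketched as follows. The DataFrame, its column names, and the labels are hypothetical toy data, not the course dataset; the structure (select columns, process each branch, union, then classify) is the point.

```python
import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy frame: one text column, one numeric column (with a gap)
df = pd.DataFrame({
    "Description": ["Teacher salary", "Paper supplies",
                    "Bus fuel", "Teacher aide"],
    "Total": [50000.0, None, 1200.0, 21000.0],
})
y = ["Staff", "Supplies", "Transport", "Staff"]

# FunctionTransformer turns a column selection into a pipeline step
get_text = FunctionTransformer(lambda d: d["Description"], validate=False)
get_numeric = FunctionTransformer(lambda d: d[["Total"]], validate=False)

pl = Pipeline([
    ("union", FeatureUnion([
        # Numeric branch: select, impute missing values, scale
        ("numeric", Pipeline([("select", get_numeric),
                              ("impute", SimpleImputer()),
                              ("scale", StandardScaler())])),
        # Text branch: select, then bag-of-words
        ("text", Pipeline([("select", get_text),
                           ("vectorize", CountVectorizer())])),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])

pl.fit(df, y)
predictions = pl.predict(df)
```

Swapping the classifier, the vectorizer settings, or a preprocessing step means changing one line of the pipeline, which is what makes trying different approaches cheap.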
Learning from the experts
In this chapter, you will learn the tricks used by the competition winner, and implement them yourself using scikit-learn. Enjoy!
- Learning from the expert: processing (50 xp)
- How many tokens? (50 xp)
- Deciding what's a word (100 xp)
- N-gram range in scikit-learn (100 xp)
- Learning from the expert: a stats trick (50 xp)
- Which models of the data include interaction terms? (50 xp)
- Implement interaction modeling in scikit-learn (100 xp)
- Learning from the expert: the winning model (50 xp)
- Why is hashing a useful trick? (50 xp)
- Implementing the hashing trick in scikit-learn (100 xp)
- Build the winning model (100 xp)
- What tactics got the winner the best score? (50 xp)
- Next steps and the social impact of your work (50 xp)
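The hashing trick from the list above can be sketched with scikit-learn's `HashingVectorizer` (toy documents again; the `n_features` value is an illustrative choice, not the winner's setting). Instead of building a vocabulary, each token is hashed directly to a column index, so memory use stays fixed no matter how many distinct tokens the data contains.

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["Teacher salary grade 5",
        "Classroom supplies paper pencils",
        "Teacher professional development"]

# Hash each token into one of 2**10 columns; no vocabulary is stored,
# at the cost of occasional (and usually harmless) hash collisions.
vec = HashingVectorizer(n_features=2**10, norm=None, alternate_sign=False)
X = vec.transform(docs)

print(X.shape)  # (3, 1024)
```

Because there is no vocabulary to fit, `transform` alone suffices, which also means the same vectorizer can be applied to holdout data without leaking information from it.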
Collaborators
Peter Bull
Co-founder of DrivenData