Case Study: School Budgeting with Machine Learning in Python

Learn how to build a model to automatically classify items in a school budget.

4 hours · 15 videos · 51 exercises · 59,248 learners · Statement of Accomplishment

Course Description

Data science isn't just for predicting ad clicks; it's also useful for social impact! This course is a case study from a machine learning competition on DrivenData. You'll explore a problem related to school district budgeting. Building a model to automatically classify items in a school's budget makes it easier and faster for schools to compare their spending with that of other schools. In this course, you'll begin by building a baseline model that is a simple, first-pass approach. In particular, you'll do some natural language processing to prepare the budgets for modeling. Next, you'll have the opportunity to try your own techniques and see how they compare to those of participants in the competition. Finally, you'll see how the winner was able to combine a number of expert techniques to build the most accurate model.

1
Exploring the raw data
Free
In this chapter, you'll be introduced to the problem you'll be solving in this course. How do you accurately classify line-items in a school budget based on what that money is being used for? You will explore the raw text and numeric values in the dataset, both quantitatively and visually. And you'll learn how to measure success when trying to predict class labels for each row of the dataset.
Introducing the challenge
50 xp
What category of problem is this?
50 xp
What is the goal of the algorithm?
50 xp
Exploring the data
50 xp
Loading the data
50 xp
Summarizing the data
100 xp
Looking at the datatypes
50 xp
Exploring datatypes in pandas
50 xp
Encode the labels as categorical variables
100 xp
Counting unique labels
100 xp
How do we measure success?
50 xp
Penalizing highly confident wrong answers
50 xp
Computing log loss with NumPy
100 xp
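
The log loss metric named in the exercises above can be sketched in a few lines of NumPy. This is a minimal illustration for a single binary label, not the competition's official scoring code; the function name `compute_log_loss`, the clipping constant `eps`, and the toy arrays are assumptions made for this example.

```python
import numpy as np


def compute_log_loss(predicted, actual, eps=1e-14):
    """Binary log loss between predicted probabilities and 0/1 labels.

    `eps` is an illustrative clipping constant (an assumption of this sketch).
    """
    # Clip probabilities so that log(0) never occurs.
    predicted = np.clip(predicted, eps, 1 - eps)
    return -np.mean(actual * np.log(predicted)
                    + (1 - actual) * np.log(1 - predicted))


# Toy labels and three hypothetical sets of predictions.
actual = np.array([1, 1, 0, 0])
confident_right = np.array([0.95, 0.95, 0.05, 0.05])
confident_wrong = np.array([0.05, 0.05, 0.95, 0.95])
unsure = np.array([0.50, 0.50, 0.50, 0.50])

loss_right = compute_log_loss(confident_right, actual)
loss_unsure = compute_log_loss(unsure, actual)
loss_wrong = compute_log_loss(confident_wrong, actual)

print("confident and right:", loss_right)
print("unsure:             ", loss_unsure)
print("confident and wrong:", loss_wrong)
```

Lower is better. Confident wrong answers score far worse than hedged ones, which is the behavior the "Penalizing highly confident wrong answers" exercise refers to.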
2
Creating a simple first model
In this chapter, you'll build a first-pass model. You'll use numeric data only to train the model. Spoiler alert: throwing out all of the text data is bad for performance! But you'll learn how to format your predictions. Then, you'll be introduced to natural language processing (NLP) in order to start working with the large amounts of text in the data.
It's time to build a model
50 xp
Setting up a train-test split in scikit-learn
100 xp
Training a model
100 xp
Making predictions
50 xp
Use your model to predict values on holdout data
100 xp
Writing out your results to a csv for submission
100 xp
A very brief introduction to NLP
50 xp
Tokenizing text
50 xp
Testing your NLP credentials with n-grams
50 xp
Representing text numerically
50 xp
Creating a bag-of-words in scikit-learn
100 xp
Combining text columns for tokenization
100 xp
What's in a token?
100 xp
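
As a rough sketch of the bag-of-words steps listed above, the example below joins several text columns into one string per row and counts tokens with scikit-learn's `CountVectorizer`. The column names, the sample rows, and the helper `combine_text_columns` are hypothetical stand-ins for this illustration, not the course's dataset or code.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical budget rows; the column names are assumptions for this sketch.
df = pd.DataFrame({
    "Position_Type": ["Teacher", "Bus driver", None],
    "Object_Description": ["regular instruction", "student transport", "textbooks"],
})


def combine_text_columns(frame, columns):
    """Join the given text columns into a single string per row."""
    # Replace missing values with empty strings before joining.
    text = frame[columns].fillna("")
    return text.apply(lambda row: " ".join(row), axis=1)


combined = combine_text_columns(df, ["Position_Type", "Object_Description"])

# The default tokenizer lowercases text and keeps tokens of two or more
# word characters.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(combined)

print(sorted(vectorizer.vocabulary_))
print(bow.toarray())
```

Each row of the resulting matrix holds token counts for one budget line, and each column corresponds to one token in the learned vocabulary.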
3
Improving your model
Here, you'll improve on your benchmark model using pipelines. Because the budget consists of both text and numeric data, you'll learn how to build pipelines that process multiple types of data. You'll also explore how the flexibility of the pipeline workflow makes testing different approaches efficient, even in complicated problems like this one!
Pipelines, feature & text preprocessing
50 xp
Instantiate pipeline
100 xp
Preprocessing numeric features
100 xp
Text features and feature unions
50 xp
Preprocessing text features
100 xp
Multiple types of processing: FunctionTransformer
100 xp
Multiple types of processing: FeatureUnion
100 xp
Choosing a classification model
50 xp
Using FunctionTransformer on the main dataset
100 xp
Add a model to the pipeline
100 xp
Try a different class of model
100 xp
Can you adjust the model or parameters to improve accuracy?
100 xp
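
One way the pieces named in this chapter (`Pipeline`, `FunctionTransformer`, `FeatureUnion`) can fit together is sketched below. The data frame, its column names, the selector functions, and the choice of `LogisticRegression` are all assumptions made for illustration; the course's own pipeline may differ.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical mixed-type data: one numeric column, one text column.
df = pd.DataFrame({
    "amount": [120.0, np.nan, 3500.0, 80.0, 4200.0, 95.0],
    "description": ["pencils and paper", "paper notebooks", "bus fuel",
                    "pencils", "bus repair", "notebooks"],
    "label": ["supplies", "supplies", "transport",
              "supplies", "transport", "supplies"],
})


def select_numeric(frame):
    """Return the numeric column as a 2-D frame."""
    return frame[["amount"]]


def select_text(frame):
    """Return the text column as a 1-D series of strings."""
    return frame["description"]


# validate=False lets the DataFrame pass through to the selectors unchanged.
get_numeric = FunctionTransformer(select_numeric, validate=False)
get_text = FunctionTransformer(select_text, validate=False)

# Each branch preprocesses one data type; FeatureUnion stacks the results.
features = FeatureUnion([
    ("numeric", Pipeline([("select", get_numeric),
                          ("impute", SimpleImputer())])),
    ("text", Pipeline([("select", get_text),
                       ("vectorize", CountVectorizer())])),
])

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(df, df["label"])
predictions = model.predict(df)
probabilities = model.predict_proba(df)
print(predictions)
```

Because the classifier is just the last named step, trying a different class of model means replacing the `"clf"` entry and leaving the preprocessing untouched.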
4
Learning from the experts
In this chapter, you will learn the tricks used by the competition winner, and implement them yourself using scikit-learn. Enjoy!
Learning from the expert: processing
50 xp
How many tokens?
50 xp
Deciding what's a word
100 xp
N-gram range in scikit-learn
100 xp
Learning from the expert: a stats trick
50 xp
Which models of the data include interaction terms?
50 xp
Implement interaction modeling in scikit-learn
100 xp
Learning from the expert: the winning model
50 xp
Why is hashing a useful trick?
50 xp
Implementing the hashing trick in scikit-learn
100 xp
Build the winning model
100 xp
What tactics got the winner the best score?
50 xp
Next steps and the social impact of your work
50 xp
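
Three of the techniques named in this chapter (n-gram ranges, the hashing trick, and interaction terms) can be sketched with standard scikit-learn components. This is an illustrative toy, not the winning model itself; the sample texts, the tiny `n_features` value, and the use of `PolynomialFeatures` on a dense array are assumptions chosen to keep the example small.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical budget line descriptions.
texts = [
    "regular instruction teacher salary",
    "student transport bus fuel",
    "special education teacher salary",
]

# Hashing trick: tokens (unigrams and bigrams here) are mapped to a fixed
# number of columns by a hash function, so no vocabulary is stored.
# n_features is deliberately tiny so the output is easy to inspect.
hasher = HashingVectorizer(n_features=2 ** 4, ngram_range=(1, 2),
                           alternate_sign=False, norm=None)
hashed = hasher.transform(texts)

# Interaction terms: add a column for every pairwise product of features.
# The dense conversion is only reasonable for a toy example of this size.
interactions = PolynomialFeatures(degree=2, interaction_only=True,
                                  include_bias=False)
expanded = interactions.fit_transform(hashed.toarray())

print(hashed.shape)    # 16 hashed columns
print(expanded.shape)  # 16 original columns plus 120 pairwise products
```

The column count grows quadratically once interaction terms are added, which is one reason a fixed-width hashed representation is useful before that step.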

Collaborators

Hugo Bowne-Anderson

Yashas Roy

Casey Fitzpatrick

Prerequisites

Supervised Learning with scikit-learn

Peter Bull

Co-founder of DrivenData

Peter is a co-founder of DrivenData. He earned his master's in Computational Science and Engineering from Harvard’s School of Engineering and Applied Sciences. His work lies at the intersection of statistics and computer science, and he wants to help bring powerful new modeling techniques to the organizations that need them most. He previously worked as a software engineer at Microsoft and earned a BA in philosophy from Yale University.

FAQs

Is this course suitable for beginners?

Will I receive a certificate at the end of the course?

What topics will be covered in this course?

What jobs would benefit from this course?

What programming language is used in this course?

What techniques from natural language processing would I learn in this course?

How accurate did the winning model become?
