
Top articles you may have missed last month

This article covers major news and updates from the field of data science and machine learning over the last month, spanning tutorials, research, new packages, use cases, and more.

Apr 2022 · 7 min read

Data science is one of the fastest-moving industries out there. New research papers, packages, tutorials, and technologies are launched every day, which can make it hard to keep up. For any data practitioner, staying on top of the latest in data science is important to keep learning and growing. In this article, you’ll find a collection of the latest news, tutorials, research, and insights you might have missed over the last month.

Newly Released Tutorials

Two ways to create custom transformers with scikit-learn

This Towards Data Science blog post is a fairly simple tutorial on creating a custom transformer with scikit-learn. Scikit-learn is one of the most widely used packages in data science, and arguably the most popular machine learning package out there. It provides a host of pre-processing functionality, such as one-hot encoding and min-max scaling. Sometimes, however, you need custom pre-processing logic that pre-built transformers such as OneHotEncoder don’t cover. This tutorial teaches you how to create custom transformers in scikit-learn to streamline data pre-processing for machine learning.
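To make the idea concrete, here is a minimal sketch of two common ways to build a custom transformer in scikit-learn: wrapping a plain function with `FunctionTransformer`, and subclassing `BaseEstimator` and `TransformerMixin`. These may or may not be the exact two approaches the post walks through. The `QuantileClipper` class, its parameters, and the toy data are hypothetical and exist only for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# Approach 1: wrap a stateless function. Nothing is learned during fit.
log_transformer = FunctionTransformer(np.log1p)


# Approach 2: subclass BaseEstimator and TransformerMixin when the
# transformer has to learn something from the training data.
class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical example: clip each column to quantiles learned in fit."""

    def __init__(self, lower=0.05, upper=0.95):
        # Store constructor arguments unchanged so get_params/clone work.
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned attributes end with an underscore by scikit-learn convention.
        self.lower_bounds_ = np.quantile(X, self.lower, axis=0)
        self.upper_bounds_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)


# Toy data with one extreme outlier (made up for this sketch).
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# Custom and built-in transformers compose in the same pipeline.
pipeline = make_pipeline(QuantileClipper(), log_transformer, MinMaxScaler())
X_scaled = pipeline.fit_transform(X)
```

As a rule of thumb, `FunctionTransformer` is enough when the transformation needs no fitted state, while the subclass route is worth the extra code when values such as the clipping bounds must be learned from the training set and reused on new data.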

MLOps in BigQuery ML using Vertex AI

Over the past year, big tech companies have made great strides in the field of MLOps. Bringing machine learning models into production environments is one of the biggest challenges faced by data teams today. This video provides a quick tutorial on how you can use Google Cloud’s Vertex AI and BigQuery ML services together to build a production-ready Vertex AI pipeline.

DataCamp’s New Power BI Tutorials

With the launch of its new Data Analyst in Power BI Track in partnership with Microsoft, DataCamp released a few tutorials on Power BI over the past month. Power BI is one of the most widely used business intelligence tools out there, and if your company uses Microsoft Office, chances are you already have access to it. It’s also a great stepping stone for anyone looking to go beyond Excel and dive deeper into the world of data. These tutorials run the gamut of working with Power BI: Transitioning from Excel to Power BI, Power BI for Beginners, Data Modelling in Power BI, and Power BI’s DAX Formulas for Beginners. You can also download this handy cheat sheet to keep with you along your learning journey.

Trailblazing Machine Learning Research

DALLE-2 by OpenAI

Over the past month, OpenAI wowed the world with its DALLE-2 image generation algorithm. OpenAI has trained this new machine learning model to create photorealistic images and art from simple natural-language text input. DALLE-2, the successor of DALLE-1, achieves a high level of photorealism by using a process called “diffusion,” which starts from a pattern of random dots and gradually alters it toward an image that matches the text. You can read about it here, and watch Isabella Leslie Miller, DataCamp’s Data Journalist, break it down in our Weekly Roundup News Video.

Pathways Language Model (PaLM)

PaLM demonstrates the first large-scale use of the Pathways system. Google introduced the Pathways system architecture last year to efficiently train a new generation of models that can perform a variety of tasks across different domains. With PaLM, Google created a humongous 540-billion-parameter Transformer model that has achieved state-of-the-art performance on a variety of language and reasoning tasks. These tasks include code generation, joke explanation, cause-and-effect identification, and more. Check out the article for a deeper dive into Pathways and the types of use cases PaLM unlocks.

New Packages and Models

PyTorch 1.11 Release

PyTorch version 1.11 brings a variety of improvements. The latest version now offers beta releases of TorchData, a library of modular data-loading primitives, and functorch, which adds composable function transforms to PyTorch. Beyond these new libraries, PyTorch 1.11 boasts a variety of performance improvements, such as 40% faster startup time for mobile and edge deployments, and more. To learn more about PyTorch, check out this DataCamp Course for Beginners.

EpyNN 1.2.7

EpyNN stands for “Educational python for Neural Networks”. It is intended for teachers, students, scientists, and anyone with relatively minimal Python skills who wishes to understand and build neural networks from scratch. It provides a host of architecture templates and practical examples that shorten the learning curve.

BigScience Large Language Model Training (tr11-176B-ml-logs)

BigScience is a popular space on Hugging Face, on a mission to create large-scale language models in the open with thousands of researchers around the world. Training of BigScience’s main model started on March 11, 2022, and is expected to run for three to four months on 384 A100 80GB GPUs of the Jean Zay public supercomputer. This open-source model will have 176B parameters and a GPT-like decoder-only architecture. You can follow the training updates on Twitter, and you should be able to use the model sometime around June.

Data Science & Machine Learning Use Cases

La Liga Partners with Databricks for Football Analytics

La Liga’s analytics team has partnered with Databricks to deploy a data lakehouse within the league’s analytics infrastructure. This will allow La Liga Tech, the new, consolidated analytics wing of the league, to structure and manage its data much more efficiently, and to apply machine learning to interesting use cases such as injury prediction, content recommendations for fans, and more.

Using Deep Learning to Annotate the Protein Universe

Scientists have long used computational tools to infer and annotate a protein’s function directly from its sequence. Google’s AI team has successfully used deep learning to predict protein function reliably with a model they call ProtENN, which has enabled them to add about 6.8 million entries to the Pfam database. The team has released the ProtENN model along with a Distill-style interactive article for experimentation. Solving these types of problems with machine learning will enable faster, more reliable discovery of novel drugs and therapeutics.

Insights & Opinions

Empowering the Modern Data Analyst

In the latest DataFramed Podcast, Peter Fishman, CEO of Mozart Data, breaks down the state of the modern data stack, and how the latest tools in data science are designed to empower the modern data analyst. Moreover, he goes through his experience leading data teams, what makes an excellent data analyst, the importance of subject matter expertise and listening to users, and more. Listen and subscribe to DataFramed wherever you get your podcasts.

MLOps is a Mess But that’s to be Expected

In this article, Mihail Eric, founder of Confetti AI, shares his insights on the state of tooling in MLOps today. Just as the headline suggests, Mihail brilliantly breaks down the disjointed state of the machine learning tooling landscape. Because machine learning operations is still in its infancy, the standards, tools, and best practices surrounding it are still being formed, and there is no clear canonical stack that practitioners can rely on. Moreover, solving talent shortages and adopting machine learning thinking within organizational cultures will shape the adoption of MLOps in the years to come.

Learn more

We hope you enjoyed this round-up of top stories, insights, tutorials, and research. For more on the latest in data science, check out the following resources:
