Skip to main content

Importing Data in Python Cheat Sheet

With this Python cheat sheet, you'll have a handy reference guide to importing your data, from flat files to files native to other software, and relational databases.
Jun 2021  · 5 min read

Before doing any data cleaning, wrangling, visualizing, ... You'll need to know how to get data into Python. As you know, there are many ways to import data into Python, depending also on which files you're dealing with.

However, you'll most often make use of the pandas and the NumPy libraries: The pandas library is one of the most preferred tools for data scientists to do data manipulation and analysis, next to matplotlib for data visualization and NumPy, the fundamental library for scientific computing in Python on which Pandas was built.

In this importing data in Python cheat sheet, you'll find some NumPy and pandas functions, together with functions that are built in the Python programming language, that will help you to get your data in Python fast!

This quick guide helps you to learn the basics of importing data in Python that you will need to get started on cleaning and wrangling your data!

data import python cheat sheet

Have this cheat sheet at your fingertips

Download PDF

The Importing Data in Python cheat sheet will guide you through the basics of getting your data in your workspace: you'll not only learn how to import flat files such as text files, but you'll also see how you can get data from files native to other software such as Excel spreadsheets, Stata, SAS and MATLAB files and relational databases. On top of that, you'll get some more information on how to ask for help, how to navigate your filesystem and how to start exploring your data.

In short, everything that you need to kickstart your data science learning with Python!

Do you want to learn more? Start the Importing Data in Python course for free now or try out our Python Excel tutorial!

Also, don't miss out on our Python cheat sheet for data science, or the many others we have here on our Community!

Importing Data in Python 

Most of the time, you'll use either NumPy or pandas to import your data:

>>> import numpy as np
>>> import pandas as pd

Help 

>>> np.info(np.ndarray.dtype)
>>> help(pd.read_csv)

Text Files 

Plain Text Files 

>>>filename= 'huck_finn.txt'
>>>file= open(filename, mode='r') #Open the file for reading
>>>text= file.read() #Read a file's contents
>>> print(file.closed) #Check whether file is closed
>>> file.close() #Close file
>>> print( text)

Use the content manager with:

>>> with open('huck_finn.txt', 'r') as file: 
    print(file.readline()) #Read a single line 
    print(file.readline()) 
    print(file.readline())

Table Data: Flat Files 

Importing Flat Files with NumPy 

>>>filename= 'huck_finn.txt'
>>>file= open(filename, mode='r') #Open the file for reading
>>>text= file.read() #Read a file's contents
>>> print(file.closed) #Check whether file is closed
>>> file.close() #Close file
>>> print(text)

Files with one data type: 

>>>filename= 'mnist.txt'
>>>data= np.loadtxt(filename,
    delimiter=',', #String used to separate values 
    skiprows=2, #Skip the first 2 lines 
    usecols=[0,2], #Read the 1st and 3rd column 
    dtype=str) #The type of the resulting array

Files with mixed data type 

>>>filename= 'titanic.csv'
>>>data= np.genfromtxt(filename,
    delimiter=',',
    names=True, #Look for column header
    dtype=None)
>>> data_array = np.recfromcsv(filename)
#The default dtype of the np.recfromcsv() function is None

Importing Flat Files with Pandas 

>>>filename= 'winequality-red.csv'
>>>data= pd.read_csv(filename,
         nrows=5, #Number of rows of file to read 
         header=None, #Row number to use as col names 
         sep='\t', #Delimiter  to use 
         comment='#', #Character to split comments 
         na_values=[""]) #String to recognize as NA/NaN

Exploring Your Data 

NumPy Arrays 

>>> data_array.dtype #Data type of array elements
>>> data_array.shape #Array  dimensions
>>> len(data_array) #Length of array

Pandas DataFrames 

>>> df.head() #Return first DataFrame rows
>>> df.tail() #Return last DataFrame rows
>>> df.index #Describe index
>>> df.columns #Describe DataFrame columns
>>> df.info() #Info an DataFrame
>>> data_array = data.values #Convert a DataFrame to an a NumPy array

SAS File 

>>> from sas7bdat import SAS7BDAT
>>> with SAS7BDAT('urbanpop.sas7bdat') as file: df_sas = file.to_data_frame()

Stata File 

>>>data= pd.read_stata('urbanpop.dta')

Excel Spreadsheets 

>>>file= 'urbanpop.xlsx'
>>>data= pd.ExcelFile(file)
>>> df sheet2 = data.parse('1960-1966',
          skiprows=[0], 
          names=['Country', 'AAM: War(2002)'])
>>> df sheetl = data.parse(0,
          parse_cols=[0], 
          skiprows=[0], 
          names=['Country'])

To access the sheet names, use the sheet_names attribute:

>>> data.sheet_names

Relational Databases

>>> from sqlalchemy import create_engine
>>>engine= create_engine('sqlite://Northwind.sqlite')

Use the table_names() method to fetch a list of table names:

>>> table_names = engine.table_names()

Querying Relational Databases 

>>>con= engine.connect()
>>> rs= con.execute("SELECT * FROM Orders")
>>> df = pd.DataFrame(rs.fetchall())
>>> df.columns = rs.keys()
>>> con.close()

Using the context manager with

>>> with engine.connect() as con:
          rs= con.execute("SELECT OrderID FROM Orders") 
          df = pd.DataFrame(rs.fetchmany(size=5)) 
          df.columns = rs.keys()

Querying Relational Databases with Pandas 

>>> df = pd.read_sql_query("SELECT * FROM Orders", engine)

Pickled Files

>>> import pickle
>>> with open('pickled_fruit.pkl', 'rb') as file: 
          pickled_data = pickle.load(file)

Matlab Files 

>>> import scipy.io
>>>filename= 'workspace.mat'
>>>mat= scipy.io.loadmat(filename)

HDF5 Files 

>>> import h5py
>>>filename= 'H-Hl LDSC 4 v1-815411200-4096.hdf5'
>>>data= h5py.File(filename, 'r')

Exploring Dictionaries 

Querying relational databases with pandas

>>> print(mat.keys()) #Print dictionary keys
>>> for key in data.keys(): #Print dictionary keys
              print(key)
     meta 
     quality 
     strain
>>> pickled_data.values() #Return dictionary values
>>> print(mat.items()) #Returns items in list format of (key, value) tuple pairs

Accessing Data Items with Keys

>>> for key in data ['meta'].keys() #Explore the HDF5
structure
               print(key) 
     Description 
     DescriptionURL 
     Detector
     Duration 
     GPSstart 
     Observatory 
     Type 
     UTCstart
#Retrieve the value for a key
>>> print(data['meta']['Description'].value)

Magic Commands 

!ls #List directory contents of files and directories
%cd .. #Change current working directory
%pwd #Return the current working directory path

OS Library 

>>> import os
>>> path = "/usr/tmp"
>>> wd = os.getcwd() #Store the name of current directory in a string
>>> os.listdir(wd) #Output contents of the directory in a list
>>> os.chdir(path) #Change current working directory
>>> os.rename( "test1.txt", #Rename a file
               "test2.txt")
>>> os.remove("test1. txt") #Delete an existing file
>>> os.mkdir("newdir") #Create a new directory

← Back to Cheat Sheets