This pandas tutorial covers the basics and helps Python programmers get up to speed quickly with this data analysis library.
pandas is a widely used Python library in the data science field. It provides a large set of functions that make working with large data sets convenient. Some familiarity with the NumPy library is recommended before learning pandas. You can check out our numpy tutorial.
Pandas tutorial table of contents
For quick reference, here are the areas we will cover in this tutorial:
- Understanding the pandas Series
- Understanding the pandas DataFrame
- Using DataFrames for time series data
- Joining multiple DataFrames
- Slicing DataFrames
Understanding the pandas Series
A Series in the pandas library is a one-dimensional labeled array that can hold any type of data. For two-dimensional data, we’ll use the DataFrame object, which we’ll cover next.
You can create a Series in a few different ways: from a list, from a NumPy array, or from a Python dict. Let’s look at how to do this.
Note: We’ll often call the head() method on a pandas Series to print only its first five rows rather than the entire Series.
import numpy as np
import pandas as pd
# create a pandas Series from a list
mylist = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
s1 = pd.Series(mylist)
# print the first five rows of the Series
print(s1.head())
0 a
1 b
2 c
3 d
4 e
dtype: object
#create a pandas Series from a dict
mydict = {'b': 1, 'a': 0, 'c': 2}
s2 = pd.Series(mydict)
#print the Series
print(s2.head())
b 1
a 0
c 2
dtype: int64
#create a pandas Series from a numpy array
myarr = np.arange(20)
s3 = pd.Series(myarr)
#print the Series
print(s3.head())
0 0
1 1
2 2
3 3
4 4
dtype: int64
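The "labeled" part of the Series definition is easier to see with a custom index. The following is a minimal sketch; the values and the labels 'x', 'y' and 'z' are made up for illustration and do not come from the examples above.

```python
import pandas as pd

# Hypothetical data: three values labeled with strings instead of 0, 1, 2
prices = pd.Series([10.5, 20.0, 7.25], index=['x', 'y', 'z'])

# Look up a value by its label rather than by its position
y_price = prices['y']

# Select several labels at once; the result is another Series
subset = prices[['x', 'z']]

# A Series can also hold mixed types, in which case the dtype is object
mixed = pd.Series([1, 'two', 3.0])

print(y_price)
print(subset)
print(mixed.dtype)
```

When no index is passed, as in the examples above, pandas falls back to integer labels starting at 0.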
Understanding the pandas DataFrame
A pandas DataFrame is a two-dimensional labeled data structure used in data analysis. It resembles a table with rows and columns: the columns have names and the rows have index labels. The pandas library provides a number of built-in methods for the DataFrame that make a wide range of analyses easy to do.
You can create a DataFrame in a number of different ways. One very handy feature of pandas is reading a csv file directly into a newly created DataFrame.
# read in the csv
df = pd.read_csv('https://nextlevel.finance/wp-content/uploads/2018/09/test.csv', sep=',')
# print the type of the DataFrame object
print(type(df))
<class 'pandas.core.frame.DataFrame'>
#print the DataFrame
print(df)
   2  1  3  0  0.1
0  7  1  3  5    9
1  2  5  3  3    4
2  2  1  3  9    8
3  8  1  3  0    7
4  2  3  2  3    3
5  1  1  3  4    4
6  8  1  3  3    4
7  3  1  1  5    9
8  2  1  6  6    4
Notice that when we print the above DataFrame, pandas has used the first row of the file as a header row. (The last column is labeled 0.1 because pandas renames duplicate column names to keep them unique.) Since this example csv file is just raw data, we don’t want the first row to be used as column names; we want it to be actual data. So we’ll need to read in the csv file again with the header=None parameter set.
#read in the csv
df = pd.read_csv('https://nextlevel.finance/wp-content/uploads/2018/09/test.csv', sep=',', header=None)
#print the DataFrame
print(df)
   0  1  2  3  4
0  2  1  3  0  0
1  7  1  3  5  9
2  2  5  3  3  4
3  2  1  3  9  8
4  8  1  3  0  7
5  2  3  2  3  3
6  1  1  3  4  4
7  8  1  3  3  4
8  3  1  1  5  9
9  2  1  6  6  4
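To try the header behaviour without downloading anything, the same comparison can be sketched against in-memory text. The CSV content below is made up for illustration, and io.StringIO stands in for the remote file.

```python
import io
import pandas as pd

# Made-up CSV text with no header row (an assumption for illustration)
csv_text = "2,1,3,0,0\n7,1,3,5,9\n2,5,3,3,4\n"

# Default behaviour: the first line is consumed as column names
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: every line is data and the columns are numbered from 0
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.shape)
print(no_header.shape)
print(list(no_header.columns))
```

The row counts of the two results differ by one, which is a quick way to check whether a line of data was swallowed as a header.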
#we can also rename the columns
df.columns = ['a', 'b', 'c', 'd', 'e']
print(df)
   a  b  c  d  e
0  2  1  3  0  0
1  7  1  3  5  9
2  2  5  3  3  4
3  2  1  3  9  8
4  8  1  3  0  7
5  2  3  2  3  3
6  1  1  3  4  4
7  8  1  3  3  4
8  3  1  1  5  9
9  2  1  6  6  4
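Assigning to df.columns replaces every column name at once. If only some columns need new names, the rename method is one alternative. This is a small sketch on a made-up stand-in DataFrame, not the csv data used above.

```python
import pandas as pd

# Small stand-in DataFrame with default integer column labels
df_small = pd.DataFrame([[2, 1, 3], [7, 1, 3]])

# rename() maps old labels to new ones and returns a new DataFrame;
# labels not mentioned in the mapping are left as they are
renamed = df_small.rename(columns={0: 'a', 2: 'c'})

print(list(df_small.columns))
print(list(renamed.columns))
```

Because rename returns a new object by default, the original DataFrame keeps its integer labels unless the result is assigned back.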
You can also create a DataFrame using one or more pandas Series.
#create the dict of Series
d = {'col1': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'col2': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
#create the DataFrame
df = pd.DataFrame(d)
#print the DataFrame
print(df)
   col1  col2
a   1.0     1
b   2.0     2
c   3.0     3
d   NaN     4
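The output above shows two side effects of building a DataFrame from Series of different lengths: the rows are aligned on the index labels, and col1 becomes a float column because the missing value NaN is a float. The sketch below inspects both effects; the name df_aligned and the dropna step are illustrative choices rather than part of the original example.

```python
import pandas as pd

d = {
    'col1': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'col2': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
}
df_aligned = pd.DataFrame(d)

# col1 is float64 because NaN is a float; col2 stays int64
print(df_aligned.dtypes)

# isna() marks the cells that had no matching label
print(df_aligned.isna().sum())

# One possible follow-up: drop rows that contain any missing value
complete_rows = df_aligned.dropna()
print(complete_rows)
```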
Using DataFrames for time series data
Pandas has some great features for creating and analyzing time series data.
Often in time series data, you’ll want the row indexes to be dates. Let’s look at how to build this kind of DataFrame in pandas.
#create the range of dates
startdate = '2012-01-01'
enddate = '2012-12-31'
dates = pd.date_range(startdate, enddate)
#print the range of dates – note the DatetimeIndex object
print(dates)
DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03', '2012-01-04',
               '2012-01-05', '2012-01-06', '2012-01-07', '2012-01-08',
               '2012-01-09', '2012-01-10',
               ...
               '2012-12-22', '2012-12-23', '2012-12-24', '2012-12-25',
               '2012-12-26', '2012-12-27', '2012-12-28', '2012-12-29',
               '2012-12-30', '2012-12-31'],
              dtype='datetime64[ns]', length=366, freq='D')
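The date_range function also accepts a number of periods and a frequency instead of an end date. The sketch below rebuilds the same daily range that way and, as an additional illustration that is not part of the original example, builds a range of business days with freq='B'.

```python
import pandas as pd

# Same daily range as above, given as a start date and a number of periods
# (2012 is a leap year, hence 366 days)
daily = pd.date_range('2012-01-01', periods=366, freq='D')

# Business days only (Monday to Friday) over the same span
business = pd.date_range('2012-01-01', '2012-12-31', freq='B')

print(daily[-1])
print(len(business))
```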
#Let’s create an empty DataFrame where the row indexes are the dates created in the previous step
df = pd.DataFrame(index=dates)
#print the DataFrame
print(df)
Empty DataFrame
Columns: []
Index: [2012-01-01 00:00:00, 2012-01-02 00:00:00 … ]
In the above step, we still have an empty DataFrame which isn’t overly useful, but it’s now ready for joining other data to it. We’ll look at joining in the next section.
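As a small preview of why a date-indexed empty DataFrame is useful, the sketch below joins a made-up set of readings onto a short date range. The names dates_small, empty, readings and value are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# A short date range keeps the output small
dates_small = pd.date_range('2012-01-01', '2012-01-05')
empty = pd.DataFrame(index=dates_small)

# Hypothetical readings that exist for only two of the five dates
readings = pd.DataFrame(
    {'value': [1.5, 2.5]},
    index=pd.to_datetime(['2012-01-02', '2012-01-04']),
)

# join() is a left join by default, so every date in `empty` is kept
joined = empty.join(readings)
print(joined)
```

Dates with no matching reading show up as NaN, so the date index acts as the master calendar for whatever data is joined onto it.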
Before we move on to the next section, let’s look at how we might read in time series data from a csv file as well.
Again, we’ll use the read_csv method. Since the file has no header row, we pass header=None and supply the column names with the names parameter. We also set index_col='Date' to use the Date column as the index and parse_dates=True so that the index is parsed as dates.
#read in the csv file
df = pd.read_csv('https://nextlevel.finance/wp-content/uploads/2018/09/datestest.csv', header=None, parse_dates=True, index_col='Date', names=['Date', 'A', 'B', 'C', 'D', 'E'])
#print the DataFrame
print(df)
            A  B  C  D  E
Date
2013-01-01  2  1  3  0  0
2013-01-02  7  1  3  5  9
2013-01-03  2  5  3  3  4
2013-01-04  2  1  3  9  8
2013-01-05  8  1  3  0  7
2013-01-06  2  3  2  3  3
2013-01-07  1  1  3  4  4
2013-01-08  8  1  3  3  4
2013-01-09  3  1  1  5  9
2013-01-10  2  1  6  6  4
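Once the index has been parsed as dates, rows can be selected with date strings. The following is a minimal sketch using made-up in-memory CSV text in place of the remote file; the name ts and the two-column layout are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up CSV text without a header row (an assumption for illustration)
csv_text = "2013-01-01,2,1\n2013-01-02,7,1\n2013-01-03,2,5\n"

ts = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=['Date', 'A', 'B'],
    index_col='Date',
    parse_dates=True,
)

# The index is a DatetimeIndex, so rows can be selected with date strings
print(type(ts.index))
print(ts.loc['2013-01-02':'2013-01-03'])
```

Unlike positional slicing, a label-based slice with .loc includes both endpoints.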