This numpy tutorial covers the basics of the numpy library for Python. Mastering this library is key if you wish to do meaningful data analysis in Python. It’s heavily used for data analysis of all types, including the analysis of financial data.
The numpy library provides the ndarray object (an n-dimensional array) for holding data. A big reason why numpy is so popular is the advantages of the ndarray objects over typical lists in Python.
As you’ll see, key to the ndarray is that it allows vectorized operations. We’ll get into vectorized operations in this tutorial, but for now just know that it’s an important part of why we use numpy for data analysis.
Numpy tutorial table of contents
For quick reference, here are the areas we will cover in this tutorial:
- Creating numpy arrays
- Shape and size of numpy arrays
- Extracting items and slicing numpy arrays
- Vectorized operations on numpy arrays
- Statistical operations such as mean, min and max
- Conditional operations using where and take
- Loading data from a file
- Joining multiple numpy arrays
- Sorting numpy arrays
- Working with dates in numpy arrays
Creating numpy arrays
There are a few different ways you can create a numpy array.
Perhaps one of the most common ways is to create a numpy array from a Python list. You can utilize the numpy.array function (or np.array function as it’s common to refer to numpy as np).
#import numpy and create the array
import numpy as np
mylist = [0,1,2,3]
myarr = np.array(mylist)
#print the array
print(myarr)
[0 1 2 3]
#print the type
print(type(myarr))
<class 'numpy.ndarray'>
It’s worth noting that even a 1-d array like the one above is still of type ndarray. np.array is simply a function that returns an ndarray; numpy has no separate array class.
You can also create a 2-d array from a Python list of lists.
#create the array
mylist = [[0,1,2,3],[4,5,6,7]]
myarr = np.array(mylist)
#print the array
print(myarr)
[[0 1 2 3]
[4 5 6 7]]
Another consideration when creating numpy arrays is specifying the data type. Unlike lists in Python, all elements of a numpy array must be of the same data type.
You can specify the data type when creating the array as follows:
#create the array
mylist = [[0,1,2,3],[4,5,6,7]]
myarr = np.array(mylist, dtype='float')
#print the array
print(myarr)
[[0. 1. 2. 3.]
[4. 5. 6. 7.]]
Note that the above decimal point indicates that the numbers are of float data type.
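If you don’t specify a data type, numpy infers one that can hold every element. The sketch below shows what that looks like for a list mixing integers and floats; the variable names are just for illustration.

```python
import numpy as np

# Mixing ints and floats: numpy picks a single dtype that can hold every element
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)  # float64

# All-integer input gets an integer dtype (the exact width depends on the platform)
ints = np.array([1, 2, 3])
print(ints.dtype.kind)  # i, meaning signed integer
```

Checking the dtype attribute like this is a quick way to confirm what numpy decided before you start doing calculations.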
You can convert an existing array to a different data type by using the astype method, which returns a new array and leaves the original unchanged.
#create the array
mylist = [[0,1,2,3],[4,5,6,7]]
myarr = np.array(mylist, dtype='float')
myarr2 = myarr.astype('int')
#print the converted array
print(myarr2)
[[0 1 2 3]
[4 5 6 7]]
If you aren’t sure about the data type, or want to hold a variety of data types in one array, you can use dtype='object' as follows:
#create the array
mylist = [0,1,'hello']
myarr = np.array(mylist, dtype='object')
#print the array
print(myarr)
[0 1 'hello']
Since we’ve talked about lists and the differences between them and numpy arrays, it’s also worth noting that you can convert a numpy array back to a list using the tolist() method.
#create the array
mylist = [0,1,2]
myarr = np.array(mylist)
mylist2 = myarr.tolist()
#print the list
print(mylist2)
[0, 1, 2]
#print the type of the list
print(type(mylist2))
<class 'list'>
You can also create numpy arrays of a specific size filled with a default value quite easily using some numpy functions.
#create a 3×3 array with all zeroes
myarr = np.zeros([3,3])
#print the array
print(myarr)
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
#create a 3×3 array with all ones
myarr = np.ones([3,3])
#print the array
print(myarr)
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
Lastly, you can create numpy arrays with random values quickly as well.
# Create a 3×3 array with random numbers in [0, 1); your values will differ
print(np.random.rand(3,3))
[[0.5035 0.0745 0.3731]
[0.6996 0.1923 0.8273]
[0.3800 0.8746 0.7461]]
# Create a 3×3 array with random integers between [0, 10)
print(np.random.randint(0, 10, size=[3,3]))
[[5 3 9]
[7 7 8]
[3 6 3]]
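Random values change on every run, which can make results hard to reproduce. One option is numpy’s newer Generator interface, created with np.random.default_rng and given a seed. The sketch below assumes numpy 1.17 or later, and the seed value 42 is arbitrary.

```python
import numpy as np

# A seeded generator produces the same sequence on every run
rng = np.random.default_rng(seed=42)

a = rng.random((3, 3))                # floats in [0, 1)
b = rng.integers(0, 10, size=(3, 3))  # integers in [0, 10)

# A second generator with the same seed reproduces the first draw exactly
rng2 = np.random.default_rng(seed=42)
same = np.array_equal(a, rng2.random((3, 3)))
print(same)  # True
```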
Shape and size of numpy arrays
It can occasionally be important to know the shape and size of your numpy arrays. There are some built-in attributes to help you do this.
How to determine the number of elements in a numpy array:
#create the array
mylist = [[0,1,2],[3,4,5]]
myarr = np.array(mylist)
#print the number of elements
print(myarr.size)
6
How to determine the shape of a numpy array, and how to get the number of rows and columns of a 2d-array:
#create the array
mylist = [[0,1,2],[3,4,5]]
myarr = np.array(mylist)
#print the shape
print(myarr.shape)
(2, 3)
#print the number of dimensions
print(myarr.ndim)
2
#print the number of rows
print(myarr.shape[0])
2
#print the number of columns
print(myarr.shape[1])
3
Extracting items and slicing numpy arrays
Numpy arrays use zero-based indexing, so your indexing always begins at zero.
You can access a single element of an array using square brackets and indicating the index for each dimension:
#create and print the array
mylist = [[0,1,2],[3,4,5]]
myarr = np.array(mylist)
print(myarr)
[[0 1 2]
[3 4 5]]
#print the element at 1,1
print(myarr[1,1])
4
You can also take subsets of numpy arrays, often called slices, by providing a start and stop value for each dimension separated by a colon (:). Note that the start value of the slice is included in the subset, but the stop value is not.
#create and print the array
mylist = [[0,1,2],[3,4,5],[6,7,8]]
myarr = np.array(mylist)
print(myarr)
[[0 1 2]
[3 4 5]
[6 7 8]]
#Take a 2×2 subset of the array starting with the 1,1 element
print(myarr[1:3,1:3])
[[4 5]
[7 8]]
Similarly, you can leave off a starting value or stopping value to get the span of that dimension from either the zero-index on, or from a value to the end of the dimension:
#create and print the array
mylist = [[0,1,2],[3,4,5],[6,7,8]]
myarr = np.array(mylist)
print(myarr)
[[0 1 2]
[3 4 5]
[6 7 8]]
#Take the same 2×2 as in the previous example
print(myarr[1:,1:])
[[4 5]
[7 8]]
You can grab sets of rows or columns of a 2d-array very easily using this slicing convention. This lets you perform operations on rows and columns quite easily and will be something you do quite often in data analysis.
#create and print the array
mylist = [[0,1,2],[3,4,5],[6,7,8]]
myarr = np.array(mylist)
print(myarr)
[[0 1 2]
[3 4 5]
[6 7 8]]
#Take 2nd row (index 1)
print(myarr[1,:])
[3 4 5]
#Take the last two columns
print(myarr[:,1:])
[[1 2]
[4 5]
[7 8]]
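One detail worth knowing when slicing: basic slices of a numpy array are views onto the original data rather than copies. The sketch below illustrates this with the same 3×3 array; the variable names `sub` and `independent` are just for illustration.

```python
import numpy as np

myarr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

# A basic slice is a view: it shares memory with the original array
sub = myarr[1:, 1:]
sub[0, 0] = 99
print(myarr[1, 1])  # 99, the original array changed too

# Use .copy() when you need an independent array
independent = myarr[1:, 1:].copy()
independent[0, 0] = -1
print(myarr[1, 1])  # still 99, the copy does not affect the original
```

If you plan to modify a slice without touching the source array, taking a copy first is the safer choice.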
Vectorized operations on numpy arrays
Vectorized operations are a big reason why numpy arrays are so useful. The ability to use these operations is also a key difference between numpy arrays and Python lists. Vectorized operations let you apply a function on each element in a vector very quickly and with minimal lines of code. Whereas performing a calculation on each element in a list would typically require some sort of loop, the vectorized calculations push this looping mechanism into the compiled layer where it can execute much faster. As such, these operations are much less costly computationally.
In this simple example, we have a 2d-array and want to add 3 to each element.
#create the array, and add 3 to each element
mylist = [[0,1,2],[3,4,5],[6,7,8]]
myarr = np.array(mylist)
myarr2 = myarr + 3
#print the array
print(myarr2)
[[ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
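To make the contrast with Python lists concrete, here is a small sketch that performs the same "add 3" calculation both ways. The list `values` is made up for the example.

```python
import numpy as np

values = [0, 1, 2, 3, 4, 5]

# Plain Python list: an explicit loop (here a list comprehension) is needed
looped = [v + 3 for v in values]

# numpy array: the same calculation as a single vectorized expression
vectorized = np.array(values) + 3

print(looped)               # [3, 4, 5, 6, 7, 8]
print(vectorized.tolist())  # [3, 4, 5, 6, 7, 8]
```

Both give the same result; the difference is that numpy runs the loop in compiled code, which tends to matter more as the arrays get larger.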
Element-wise operations between arrays, such as adding two arrays of the same shape together, are just as easy.
#create two arrays
myarr = np.array([[0,1],[2,3]])
myarr2 = np.array([[4,5],[6,7]])
#print the element-wise sum
print(myarr + myarr2)
[[ 4  6]
 [ 8 10]]
Statistical operations such as mean, min and max
Statistical analysis of numpy arrays is quite common and there are a number of functions built in to the numpy library.
mean(), max() and min() methods can be run on the numpy arrays directly to get the mean, max and min values of the entire array.
#create an array
myarr = np.array([[0,1],[2,3]])
#print the mean, min, max of the array
print(myarr.mean())
1.5
print(myarr.min())
0
print(myarr.max())
3
Similarly, these functions can be run on slices of the array as well.
#create an array
myarr = np.array([[0,1],[2,3]])
#print the mean, min, max of the first column of the array
print(myarr[:,0].mean())
1.0
print(myarr[:,0].min())
0
print(myarr[:,0].max())
2
Quite often you might want to calculate the mean of each column, or the max value of each row. Numpy makes this easy with the axis parameter, which is common across numpy functions. For a 2d-array, axis=0 works down the rows and returns one result per column, while axis=1 works across the columns and returns one result per row.
Note that these functions return an ndarray object.
#create an array
myarr = np.array([[0,1,2],[3,4,5]])
#print the mean, min, max of each column
print(np.mean(myarr, axis=0))
[1.5 2.5 3.5]
print(np.amin(myarr, axis=0))
[0 1 2]
print(np.amax(myarr, axis=0))
[3 4 5]
#print the mean, min, max of each row
print(np.mean(myarr, axis=1))
[1. 4.]
print(np.amin(myarr, axis=1))
[0 3]
print(np.amax(myarr, axis=1))
[2 5]
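If the axis numbering is hard to remember, one way to think about it is that the axis you pass is the one that gets collapsed. The short sketch below checks the shape of each result for the same 2×3 array; the names `col_means` and `row_means` are just for illustration.

```python
import numpy as np

myarr = np.array([[0, 1, 2], [3, 4, 5]])  # shape (2, 3): 2 rows, 3 columns

# axis=0 collapses the row dimension, leaving one value per column
col_means = myarr.mean(axis=0)
print(col_means.shape)  # (3,)

# axis=1 collapses the column dimension, leaving one value per row
row_means = myarr.mean(axis=1)
print(row_means.shape)  # (2,)
```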
When you need the index of a minimum or a maximum value, rather than the value itself, the argmin() and argmax() functions can be used.
These functions can be used with or without an axis parameter. If no axis is used, then the array is flattened into a 1-dimensional array, and the index of the flattened array is returned.
#create an array
myarr = np.array([[0,1,2],[3,4,5]])
#print the index of the max value in the array
print(np.argmax(myarr))
5
#print the index of the min value in each row
print(np.argmin(myarr, axis=1))
[0 0]
Conditional operations using where and take
On occasion, you may want to grab the elements of a numpy array that satisfy a particular condition. The np.where function is very useful for this: given a condition, it returns the index values where that condition is true.
Similarly, you can use the take method to grab the values of the array at a provided set of index values.
#create an array
myarr = np.array([0,1,5,2,3,4,5,1])
#get the index of the elements that satisfy the condition of the element being larger than 2
i = np.where(myarr > 2)
print(i)
(array([2, 4, 5, 6]),)
#take the values of the array based on the given index values
v = myarr.take(i)
print(v)
[[5 3 4 5]]
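As a complement to the where/take combination above, the sketch below shows two related idioms on the same array: indexing directly with a boolean mask, and the three-argument form of np.where. The names `mask` and `capped` are just for illustration.

```python
import numpy as np

myarr = np.array([0, 1, 5, 2, 3, 4, 5, 1])

# A boolean mask selects the matching values directly, as a 1-d array
mask = myarr > 2
print(myarr[mask])  # [5 3 4 5]

# np.where(condition, x, y) picks from x where the condition is true, else from y
capped = np.where(myarr > 2, 2, myarr)
print(capped)  # [0 1 2 2 2 2 2 1]
```

Boolean indexing is often the shorter route when you only need the values and not their positions.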
Loading data from a file into a numpy array
The np.genfromtxt function is quite useful in loading in a csv file or other data files. For this example, we’ll load in the data from this csv file. The np.genfromtxt function lets you specify a file address, web address and more.
#load the data from web URL
url = 'https://nextlevel.finance/wp-content/uploads/2018/09/test.csv'
data = np.genfromtxt(url, delimiter=',', skip_header=0, filling_values=-1, dtype='int')
#Let’s look at the resulting array
print(data)
[[2 1 3 0 0]
[7 1 3 5 9]
[2 5 3 3 4]
[2 1 3 9 8]
[8 1 3 0 7]
[2 3 2 3 3]
[1 1 3 4 4]
[8 1 3 3 4]
[3 1 1 5 9]
[2 1 6 6 4]]
print(data.shape)
(10, 5)
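To experiment with np.genfromtxt without a network connection, you can feed it an in-memory text buffer instead of a URL. This is only a sketch: the CSV content below is made up for the example and is not the file used above. It also shows what filling_values does when a field is empty.

```python
import io
import numpy as np

# Hypothetical CSV text, invented for this sketch; the second row has an empty field
csv_text = "2,1,3\n7,,5\n"

# genfromtxt accepts file-like objects as well as file paths and URLs
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', filling_values=-1, dtype='int')

print(data.shape)   # (2, 3)
print(data[1, 1])   # -1, the empty field was replaced by the fill value
```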
Joining multiple numpy arrays
There are a few different ways you can concatenate numpy arrays. The most commonly used numpy functions are np.concatenate, np.vstack and np.hstack.
#create two arrays
myarr1 = np.zeros([2,2])
myarr2 = np.ones([2,2])
print(myarr1)
[[0. 0.]
[0. 0.]]
print(myarr2)
[[1. 1.]
[1. 1.]]
#Concat vertically using concatenate
myarr3 = np.concatenate([myarr1, myarr2], axis=0)
print(myarr3)
[[0. 0.]
[0. 0.]
[1. 1.]
[1. 1.]]
#Concat vertically using vstack
myarr4 = np.vstack([myarr1,myarr2])
print(myarr4)
[[0. 0.]
[0. 0.]
[1. 1.]
[1. 1.]]
#Concat horizontally using concatenate
myarr5 = np.concatenate([myarr1, myarr2], axis=1)
print(myarr5)
[[0. 0. 1. 1.]
[0. 0. 1. 1.]]
#Concat horizontally using hstack
myarr6 = np.hstack([myarr1,myarr2])
print(myarr6)
[[0. 0. 1. 1.]
[0. 0. 1. 1.]]
Sorting numpy arrays
When sorting numpy arrays, typically you want to maintain the integrity of the rows while sorting on a specific column. To do so, you want to use a two-step process using the argsort method rather than the sort method. The sort method will sort every column (or row) and thus corrupt the row (or column) integrity. We’ll look at the more complex case using argsort first since it’s the more common use case.
#create the array
myarr = np.array([[9,5,4],[2,1,6],[4,8,2]])
print(myarr)
[[9 5 4]
[2 1 6]
[4 8 2]]
#Let’s argsort the first column, and see what happens
sorted_index_col1 = myarr[:, 0].argsort()
#The output is the index numbers of the sorted values, So the 1st item is smallest, then the 2nd item, then the 0th item.
print(sorted_index_col1)
[1 2 0]
#Now we can use the index information to sort the array
sorted_arr = myarr[sorted_index_col1]
print(sorted_arr)
[[2 1 6]
[4 8 2]
[9 5 4]]
#We can also sort using the index information in a reversed or descending order
sorted_arr = myarr[sorted_index_col1[::-1]]
print(sorted_arr)
[[9 5 4]
[4 8 2]
[2 1 6]]
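The two-step argsort pattern above can be wrapped in a small function if you use it often. The helper below, `sort_rows_by_column`, is a hypothetical name made up for this sketch and is not part of numpy.

```python
import numpy as np

def sort_rows_by_column(arr, col, descending=False):
    """Hypothetical helper: return the rows of a 2-d array ordered by one column."""
    order = arr[:, col].argsort()  # row indices that sort the chosen column
    if descending:
        order = order[::-1]        # reverse the order for a descending sort
    return arr[order]              # reorder whole rows, keeping each row intact

myarr = np.array([[9, 5, 4], [2, 1, 6], [4, 8, 2]])

# Sort the rows by the last column (index 2)
by_last = sort_rows_by_column(myarr, 2)
print(by_last.tolist())  # [[4, 8, 2], [9, 5, 4], [2, 1, 6]]
```

Note that simply reversing the argsort output also reverses the relative order of rows that tie on the sort column.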
Lastly, let’s quickly look at using the numpy sort() method which does not maintain row/record integrity.
#create the array
myarr = np.array([[9,5,4],[2,1,6],[4,8,2]])
print(myarr)
[[9 5 4]
[2 1 6]
[4 8 2]]
#Let’s sort the columns of the numpy array
sorted_arr = np.sort(myarr, axis=0)
print(sorted_arr)
[[2 1 2]
 [4 5 4]
 [9 8 6]]
#Let’s sort the rows of the numpy array
sorted_arr = np.sort(myarr, axis=1)
print(sorted_arr)
[[4 5 9]
[1 2 6]
[2 4 8]]
Working with dates in numpy arrays
Numpy uses the datetime64 data type to represent dates. It comes with a number of functions for manipulating dates.
#create a date
mydate = np.datetime64('2018-08-25 15:30:30')
print(mydate)
2018-08-25T15:30:30
#create a range of dates
mydaterange = np.arange(np.datetime64('2018-08-01'), np.datetime64('2018-08-10'))
print(mydaterange)
['2018-08-01' '2018-08-02' '2018-08-03' '2018-08-04' '2018-08-05' '2018-08-06' '2018-08-07' '2018-08-08' '2018-08-09']
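As a brief illustration of manipulating dates, the sketch below subtracts two dates, adds an offset with np.timedelta64, and flags weekdays with np.is_busday (using its default Monday to Friday week). The variable names are just for illustration.

```python
import numpy as np

start = np.datetime64('2018-08-01')
end = np.datetime64('2018-08-10')

# Subtracting two dates gives a timedelta64
print(end - start)  # 9 days

# Adding a timedelta64 shifts a date
next_week = start + np.timedelta64(7, 'D')
print(next_week)  # 2018-08-08

# Flag which dates in a range fall on a weekday (Monday to Friday by default)
dates = np.arange(start, end)
is_weekday = np.is_busday(dates)
print(dates[is_weekday].size)  # 7
```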