September 13, 2018 by kduffey

Numpy Tutorial: Quickly master this key Python library for data analysis

This numpy tutorial covers the basics of the numpy Python library. Mastering this library is key if you wish to do meaningful data analysis in Python. It’s heavily used for data analysis of all types, including the analysis of financial data.

The numpy library provides the ndarray object (an n-dimensional array) for holding data. A big reason numpy is so popular is the set of advantages ndarray objects have over ordinary Python lists.

As you’ll see, key to the ndarray is that it allows vectorized operations. We’ll get into vectorized operations in this tutorial, but for now just know that it’s an important part of why we use numpy for data analysis.

Numpy tutorial table of contents

For quick reference, here are the areas we will cover in this tutorial:

  1. Creating numpy arrays
  2. Shape and size of numpy arrays
  3. Extracting items and slicing numpy arrays
  4. Vectorized operations on numpy arrays
  5. Statistical operations such as mean, min and max
  6. Conditional operations using where and take
  7. Loading data from a file
  8. Joining multiple numpy arrays
  9. Sorting numpy arrays
  10. Working with dates in numpy arrays

Creating numpy arrays

There are a few different ways you can create a numpy array.

Perhaps one of the most common ways is to create a numpy array from a Python list. You can utilize the numpy.array function (or np.array function as it’s common to refer to numpy as np).

import numpy as np

#create the array
mylist = [0,1,2,3]
myarr = np.array(mylist)

#print the array
print(myarr)
[0 1 2 3]

#print the type
print(type(myarr))
<class 'numpy.ndarray'>

It’s worth noting that even a 1-d array like the one above is of type ndarray. np.array is a function that builds and returns an ndarray; numpy has no separate array class.

You can also create a 2-d array from a Python list of lists.

import numpy as np

#create the array
mylist = [[0,1,2,3],[4,5,6,7]]
myarr = np.array(mylist)

#print the array
print(myarr)
[[0 1 2 3]
[4 5 6 7]]

Another consideration when creating numpy arrays is specifying the data type. Unlike lists in Python, all elements of a numpy array must be of the same data type.

You can specify the data type when creating the array as follows:

import numpy as np

#create the array
mylist = [[0,1,2,3],[4,5,6,7]]
myarr = np.array(mylist, dtype='float')

#print the array
print(myarr)
[[0. 1. 2. 3.]
[4. 5. 6. 7.]]

Note that the above decimal point indicates that the numbers are of float data type.
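
If you don't specify a data type, numpy chooses one for you. Here is a minimal sketch of that behavior (the sample values are made up for illustration): mixing integers and floats produces a single float data type for the whole array.

```python
import numpy as np

# Mixing ints and floats: numpy picks one dtype that can hold every value
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)   # float64

# An all-integer list gets an integer dtype (the exact width depends on the platform)
ints = np.array([1, 2, 3])
print(ints.dtype)
```

You can check the dtype attribute at any time to confirm what numpy chose.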

You can convert an existing array to a different data type by using the astype method, which returns a new array and leaves the original unchanged.

import numpy as np

#create the array
mylist = [[0,1,2,3],[4,5,6,7]]
myarr = np.array(mylist, dtype='float')
myarr2 = myarr.astype('int')

#print the converted array
print(myarr2)
[[0 1 2 3]
[4 5 6 7]]
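
One thing to watch when converting floats to integers: astype drops the fractional part rather than rounding. The sketch below uses made-up sample values to show the difference, and uses np.rint as one possible way to round first.

```python
import numpy as np

prices = np.array([1.9, 2.5, -1.9])      # hypothetical sample values

# astype drops the fractional part (truncation toward zero)
as_int = prices.astype('int')
print(as_int)    # [ 1  2 -1]

# Round to the nearest integer first (halves go to the even neighbor), then convert
rounded = np.rint(prices).astype('int')
print(rounded)   # [ 2  2 -2]
```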

Note that if you aren’t sure about data type or want to hold a variety of data types in the array, you can use dtype='object' as follows:

import numpy as np

#create the array
mylist = [0,1,'hello']
myarr = np.array(mylist, dtype='object')

#print the array
print(myarr)
[0 1 'hello']

Since we’ve talked about lists and the differences between them and numpy arrays, it’s also worth noting that you can convert a numpy array back to a list using the tolist() method.

import numpy as np

#create the array
mylist = [0,1,2]
myarr = np.array(mylist)
mylist2 = myarr.tolist()

#print the list
print(mylist2)
[0, 1, 2]

#print the type of the list
print(type(mylist2))
<class 'list'>

You can also create numpy arrays of a specific size, filled with a default value, quite easily using a few numpy functions.

import numpy as np

#create a 3×3 array with all zeroes
myarr = np.zeros([3,3])

#print the array
print(myarr)
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]

#create a 3×3 array with all ones
myarr = np.ones([3,3])

#print the array
print(myarr)
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
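
A few related creation functions follow the same pattern. This is a short sketch with arbitrary sizes and values, using np.full, np.arange and np.linspace:

```python
import numpy as np

filled = np.full([2, 3], 7)        # 2x3 array where every element is 7
steps = np.arange(0, 10, 2)        # start at 0, step by 2, stop value excluded
grid = np.linspace(0.0, 1.0, 5)    # 5 evenly spaced values, both ends included

print(filled)
print(steps)   # [0 2 4 6 8]
print(grid)    # [0.   0.25 0.5  0.75 1.  ]
```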

Lastly, you can create numpy arrays with random values quickly as well.

import numpy as np

# Create a 3×3 array with random numbers in [0,1) (your values will differ on each run)
print(np.random.rand(3,3))
[[0.5035 0.0745 0.3731]
[0.6996 0.1923 0.8273]
[0.3800 0.8746 0.7461]]

# Create a 3×3 array with random integers between [0, 10)
print(np.random.randint(0, 10, size=[3,3]))
[[5 3 9]
[7 7 8]
[3 6 3]]
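
Because random values change on every run, it can help to seed the generator when you need repeatable results. The sketch below assumes NumPy 1.17 or later, where np.random.default_rng is available; the seed 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)          # 42 is an arbitrary seed
a = rng.random((3, 3))                   # floats in [0, 1)
b = rng.integers(0, 10, size=(3, 3))     # integers in [0, 10)

# Re-creating the generator with the same seed reproduces the same numbers
rng2 = np.random.default_rng(42)
print(np.array_equal(a, rng2.random((3, 3))))   # True
```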

Shape and size of numpy arrays

It can occasionally be important to know the shape and size of your numpy arrays. A few built-in attributes of the array help you do this.

How to determine the number of elements in a numpy array:

import numpy as np

#create the array
mylist = [[0,1,2],[3,4,5]]
myarr = np.array(mylist)

#print the number of elements
print(myarr.size)
6

How to determine the shape of a numpy array, and how to get the number of rows and columns of a 2d-array:

import numpy as np

#create the array
mylist = [[0,1,2],[3,4,5]]
myarr = np.array(mylist)

#print the shape
print(myarr.shape)
(2, 3)

#print the number of dimensions
print(myarr.ndim)
2

#print the number of rows
print(myarr.shape[0])
2

#print the number of columns
print(myarr.shape[1])
3
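
Once you know the shape, you can also change it. As a brief sketch using the same sample array, the reshape method rearranges the same elements into a new shape without changing their number:

```python
import numpy as np

myarr = np.array([[0, 1, 2], [3, 4, 5]])

reshaped = myarr.reshape(3, 2)   # same 6 elements, now 3 rows x 2 columns
flat = myarr.reshape(-1)         # -1 lets numpy infer the length

print(reshaped.shape)   # (3, 2)
print(flat)             # [0 1 2 3 4 5]
```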

Extracting items and slicing numpy arrays

Numpy arrays use zero-based indexing, so your indexing always begins at zero.

You can access a single element of an array using square brackets and indicating the index for each dimension:

import numpy as np

#create and print the array
mylist = [[0,1,2],[3,4,5]]
myarr = np.array(mylist)
print(myarr)
[[0 1 2]
[3 4 5]]

#print the element at 1,1
print(myarr[1,1])
4

You can also take subsets of numpy arrays, often called slices, by providing a start and stop value for each dimension separated by a colon (:). Note that the start value of the slice is included in the subset, but the stop value is not.

import numpy as np

#create and print the array
mylist = [[0,1,2],[3,4,5],[6,7,8]]
myarr = np.array(mylist)
print(myarr)
[[0 1 2]
[3 4 5]
[6 7 8]]

#Take a 2×2 subset of the array starting with the 1,1 element
print(myarr[1:3,1:3])
[[4 5]
[7 8]]

Similarly, you can leave off a starting value or stopping value to get the span of that dimension from either the zero-index on, or from a value to the end of the dimension:

import numpy as np

#create and print the array
mylist = [[0,1,2],[3,4,5],[6,7,8]]
myarr = np.array(mylist)
print(myarr)
[[0 1 2]
[3 4 5]
[6 7 8]]

#Take the same 2×2 as in the previous example
print(myarr[1:,1:])
[[4 5]
[7 8]]

You can grab sets of rows or columns of a 2d-array very easily using this slicing convention. This lets you perform operations on rows and columns quite easily and will be something you do quite often in data analysis.

import numpy as np

#create and print the array
mylist = [[0,1,2],[3,4,5],[6,7,8]]
myarr = np.array(mylist)
print(myarr)
[[0 1 2]
[3 4 5]
[6 7 8]]

#Take 2nd row (index 1)
print(myarr[1,:])
[3 4 5]

#Take the last two columns
print(myarr[:,1:])
[[1 2]
[4 5]
[7 8]]
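
Two more slicing details are worth a quick sketch: negative indexes and step values work as they do for Python lists, and a basic slice is a view onto the original data rather than a copy. The example reuses the same 3×3 array; the values written into it are arbitrary.

```python
import numpy as np

myarr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

last_row = myarr[-1, :]         # negative index counts from the end
every_other = myarr[::2, ::2]   # step of 2 in both dimensions
print(last_row)                 # [6 7 8]
print(every_other.shape)        # (2, 2)

# Basic slices are views: writing to the slice changes the original array
view = myarr[0, :]
view[0] = 99
print(myarr[0, 0])              # 99

# Use copy() when you need an independent array
safe = myarr[1, :].copy()
safe[0] = -1
print(myarr[1, 0])              # 3
```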

Vectorized operations on numpy arrays

Vectorized operations are a big reason why numpy arrays are so useful. The ability to use these operations is also a key difference between numpy arrays and Python lists. Vectorized operations let you apply a function on each element in a vector very quickly and with minimal lines of code. Whereas performing a calculation on each element in a list would typically require some sort of loop, the vectorized calculations push this looping mechanism into the compiled layer where it can execute much faster. As such, these operations are much less costly computationally.

In this simple example, we have a 2d-array and want to add 3 to each element.

import numpy as np

#create the array, and add 3 to each element
mylist = [[0,1,2],[3,4,5],[6,7,8]]
myarr = np.array(mylist)
myarr2 = myarr + 3

#print the array
print(myarr2)
[[ 3  4  5]
[ 6  7  8]
[ 9 10 11]]

Vector operations such as adding two arrays together element by element are just as easy.

import numpy as np

#create two arrays
myarr = np.array([[0,1],[2,3]])
myarr2 = np.array([[4,5],[6,7]])

#print the element-wise sum of the two arrays
print(myarr + myarr2)
[[ 4  6]
[ 8 10]]
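
To make the contrast with Python lists concrete, here is a small sketch with made-up data. It performs the same calculation once with an explicit loop over a list and once as a vectorized expression, then shows a 1-d array being applied to every row of a 2-d array (numpy calls this broadcasting).

```python
import numpy as np

prices = [10.0, 20.0, 30.0]   # hypothetical sample data

# Plain Python: loop over the elements (here as a list comprehension)
doubled_list = [p * 2 for p in prices]

# numpy: the same calculation as a single vectorized expression
doubled_arr = np.array(prices) * 2

print(doubled_list)   # [20.0, 40.0, 60.0]
print(doubled_arr)    # [20. 40. 60.]

# Broadcasting: the 1-d offsets array is added to each row of the 2-d table
table = np.array([[0, 1, 2], [3, 4, 5]])
offsets = np.array([10, 20, 30])
shifted = table + offsets
print(shifted)
```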

Statistical operations such as mean, min and max

Statistical analysis of numpy arrays is quite common and there are a number of functions built in to the numpy library.

mean(), max() and min() methods can be run on the numpy arrays directly to get the mean, max and min values of the entire array.

import numpy as np

#create an array
myarr = np.array([[0,1],[2,3]])

#print the mean, min, max of the array
print(myarr.mean())
1.5
print(myarr.min())
0
print(myarr.max())
3

Similarly, these functions can be run on slices of the array as well.

import numpy as np

#create an array
myarr = np.array([[0,1],[2,3]])

#print the mean, min, max of the first column of the array
print(myarr[:,0].mean())
1.0
print(myarr[:,0].min())
0
print(myarr[:,0].max())
2

Quite often you might want to calculate the mean of each column, or the max value of each row. Numpy makes this easy through the axis parameter, which is common across numpy functions. With axis=0 the operation runs down the rows and returns one result per column; with axis=1 it runs across the columns and returns one result per row.

Note that when an axis is given, these functions return an ndarray rather than a single value.

import numpy as np

#create an array
myarr = np.array([[0,1,2],[3,4,5]])

#print the mean, min, max of each column
print(np.mean(myarr, axis=0))
[1.5 2.5 3.5]
print(np.amin(myarr, axis=0))
[0 1 2]
print(np.amax(myarr, axis=0))
[3 4 5]

#print the mean, min, max of each row
print(np.mean(myarr, axis=1))
[1. 4.]
print(np.amin(myarr, axis=1))
[0 3]
print(np.amax(myarr, axis=1))
[2 5]
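
One way to keep the axis parameter straight is to look at the shape of the result: the axis you name is the one that disappears. The sketch below uses the same sample array and shows a few other statistics that follow the same pattern.

```python
import numpy as np

myarr = np.array([[0, 1, 2], [3, 4, 5]])   # shape (2, 3)

col_sums = myarr.sum(axis=0)   # collapses the 2 rows, one result per column
row_sums = myarr.sum(axis=1)   # collapses the 3 columns, one result per row
print(col_sums, col_sums.shape)   # [3 5 7] (3,)
print(row_sums, row_sums.shape)   # [ 3 12] (2,)

# Other statistics accept the same axis parameter
col_medians = np.median(myarr, axis=0)
row_stds = np.std(myarr, axis=1)
print(col_medians)   # [1.5 2.5 3.5]
print(row_stds)
```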

When you need the index of a minimum or a maximum value, rather than the value itself, the argmin() and argmax() functions can be used.

These functions can be used with or without an axis parameter. If no axis is used, then the array is flattened into a 1-dimensional array, and the index of the flattened array is returned.

import numpy as np

#create an array
myarr = np.array([[0,1,2],[3,4,5]])

#print the index of the max value in the array
print(np.argmax(myarr))
5

#print the index of the min value in each row
print(np.argmin(myarr, axis=1))
[0 0]
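
When no axis is given, the flattened index is not always what you want. As a small sketch, np.unravel_index converts it back into a row and column position for the original shape:

```python
import numpy as np

myarr = np.array([[0, 1, 2], [3, 4, 5]])

flat_index = np.argmax(myarr)                          # index into the flattened array
row, col = np.unravel_index(flat_index, myarr.shape)   # convert to (row, column)

print(row, col)          # 1 2
print(myarr[row, col])   # 5
```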

Conditional operations using where and take

On occasion, you may want to grab the elements of a numpy array that satisfy a particular condition. The where function from the numpy library is very useful for this.

Similarly, you can use the take method to grab the values of the array at a provided set of index values.

import numpy as np

#create an array
myarr = np.array([0,1,5,2,3,4,5,1])

#get the index of the elements that satisfy the condition of the element being larger than 2
i = np.where(myarr > 2)
print(i)
(array([2, 4, 5, 6]),)

#take the values of the array based on the given index values
v = myarr.take(i)
print(v)
[[5 3 4 5]]
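
Two closely related idioms are worth sketching here, using the same sample array. A boolean mask selects the matching values directly, and the three-argument form of np.where chooses between two alternatives element by element (the cap of 2 is an arbitrary choice for illustration).

```python
import numpy as np

myarr = np.array([0, 1, 5, 2, 3, 4, 5, 1])

# Boolean mask: select the values larger than 2 in one step
selected = myarr[myarr > 2]
print(selected)   # [5 3 4 5]

# Three-argument where: use 2 where the condition holds, else keep the element
capped = np.where(myarr > 2, 2, myarr)
print(capped)     # [0 1 2 2 2 2 2 1]
```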

Loading data from a file into a numpy array

The np.genfromtxt function is quite useful in loading in a csv file or other data files. For this example, we’ll load in the data from this csv file. The np.genfromtxt function lets you specify a file address, web address and more.

import numpy as np

#load the data from web URL
url = 'https://nextlevel.finance/wp-content/uploads/2018/09/test.csv'
data = np.genfromtxt(url, delimiter=',', skip_header=0, filling_values=-1, dtype='int')

#Let’s look at the resulting array
print(data)
[[2 1 3 0 0]
[7 1 3 5 9]
[2 5 3 3 4]
[2 1 3 9 8]
[8 1 3 0 7]
[2 3 2 3 3]
[1 1 3 4 4]
[8 1 3 3 4]
[3 1 1 5 9]
[2 1 6 6 4]]

print(data.shape)
(10, 5)
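
If you want to try the same call without a network connection, np.genfromtxt also accepts a file-like object. The sketch below uses a made-up, in-memory CSV string with one missing value, which should come back as the filling value of -1.

```python
import io
import numpy as np

# Hypothetical CSV content; the second row has a missing middle value
csv_text = "1,2,3\n4,,6\n7,8,9"

data = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                     filling_values=-1, dtype='int')
print(data)
print(data.shape)   # (3, 3)
```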

Joining multiple numpy arrays

There are a few different ways you can concatenate numpy arrays. The numpy functions most commonly used are np.concatenate, np.vstack and np.hstack.

import numpy as np

#create two arrays
myarr1 = np.zeros([2,2])
myarr2 = np.ones([2,2])
print(myarr1)
[[0. 0.]
[0. 0.]]

print(myarr2)
[[1. 1.]
[1. 1.]]

#Concat vertically using concatenate
myarr3 = np.concatenate([myarr1, myarr2], axis=0)
print(myarr3)
[[0. 0.]
[0. 0.]
[1. 1.]
[1. 1.]]

#Concat vertically using vstack
myarr4 = np.vstack([myarr1,myarr2])
print(myarr4)
[[0. 0.]
[0. 0.]
[1. 1.]
[1. 1.]]

#Concat horizontally using concatenate
myarr5 = np.concatenate([myarr1, myarr2], axis=1)
print(myarr5)
[[0. 0. 1. 1.]
[0. 0. 1. 1.]]

#Concat horizontally using hstack
myarr6 = np.hstack([myarr1,myarr2])
print(myarr6)
[[0. 0. 1. 1.]
[0. 0. 1. 1.]]
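
The same functions behave a little differently with 1-d arrays, which is a common source of confusion. This sketch uses two made-up arrays and compares the resulting shapes, adding np.column_stack for the case where each array should become a column.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

rows = np.vstack([a, b])          # each array becomes a row
joined = np.hstack([a, b])        # arrays are joined end to end
cols = np.column_stack([a, b])    # each array becomes a column

print(rows.shape)     # (2, 3)
print(joined.shape)   # (6,)
print(cols.shape)     # (3, 2)
```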

Sorting numpy arrays

When sorting numpy arrays, typically you want to maintain the integrity of the rows while sorting on a specific column. To do so, you want to use a two-step process using the argsort method rather than the sort method. The sort method will sort every column (or row) and thus corrupt the row (or column) integrity. We’ll look at the more complex case using argsort first since it’s the more common use case.

import numpy as np

#create the array
myarr = np.array([[9,5,4],[2,1,6],[4,8,2]])
print(myarr)
[[9 5 4]
[2 1 6]
[4 8 2]]

#Let’s argsort the first column, and see what happens
sorted_index_col1 = myarr[:, 0].argsort()

#The output is the index numbers of the sorted values, So the 1st item is smallest, then the 2nd item, then the 0th item.
print(sorted_index_col1)
[1 2 0]

#Now we can use the index information to sort the array
sorted_arr = myarr[sorted_index_col1]
print(sorted_arr)
[[2 1 6]
[4 8 2]
[9 5 4]]

#We can also sort using the index information in a reversed or descending order
sorted_arr = myarr[sorted_index_col1[::-1]]
print(sorted_arr)
[[9 5 4]
[4 8 2]
[2 1 6]]
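
If the sort column contains ties and you want a second column to break them, np.lexsort is one option. The sketch below uses made-up records; note that lexsort treats the last key you pass as the primary sort key.

```python
import numpy as np

# Hypothetical records: column 0 has ties, column 1 breaks them
myarr = np.array([[2, 9], [1, 5], [2, 3], [1, 7]])

# Sort by column 0 first, then by column 1 (the LAST key is the primary one)
order = np.lexsort((myarr[:, 1], myarr[:, 0]))
print(order)   # [1 3 2 0]

sorted_arr = myarr[order]
print(sorted_arr)
```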

Lastly, let’s quickly look at using the numpy sort() method which does not maintain row/record integrity.

import numpy as np

#create the array
myarr = np.array([[9,5,4],[2,1,6],[4,8,2]])
print(myarr)
[[9 5 4]
[2 1 6]
[4 8 2]]

#Let’s sort each column of the numpy array independently
sorted_arr = np.sort(myarr, axis=0)
print(sorted_arr)
[[2 1 2]
[4 5 4]
[9 8 6]]

#Let’s sort the rows of the numpy array
sorted_arr = np.sort(myarr, axis=1)
print(sorted_arr)
[[4 5 9]
[1 2 6]
[2 4 8]]
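
One more distinction, sketched with a small made-up array: the np.sort function returns a sorted copy, while the sort method on the array itself sorts in place.

```python
import numpy as np

myarr = np.array([3, 1, 2])

sorted_copy = np.sort(myarr)   # returns a new array; myarr is unchanged
print(myarr)                   # [3 1 2]
print(sorted_copy)             # [1 2 3]

myarr.sort()                   # sorts in place and returns None
print(myarr)                   # [1 2 3]
```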

Working with dates in numpy arrays

Numpy uses the datetime64 object to implement dates. The object comes with a number of functions for manipulating dates.

import numpy as np

#create a date
mydate = np.datetime64('2018-08-25 15:30:30')
print(mydate)
2018-08-25T15:30:30

#create a range of dates
mydaterange = np.arange(np.datetime64('2018-08-01'), np.datetime64('2018-08-10'))
print(mydaterange)
['2018-08-01' '2018-08-02' '2018-08-03' '2018-08-04' '2018-08-05' '2018-08-06' '2018-08-07' '2018-08-08' '2018-08-09']
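
Date arithmetic works much like arithmetic on numbers. This short sketch reuses the dates from the range above and assumes the default business-day calendar (Monday to Friday, no holidays) for np.is_busday and np.busday_count.

```python
import numpy as np

start = np.datetime64('2018-08-01')
end = np.datetime64('2018-08-10')

# Subtracting two dates gives a timedelta64; adding one shifts a date
span = end - start
print(span)                              # 9 days
print(start + np.timedelta64(7, 'D'))    # 2018-08-08

# Business-day helpers (the end date is not included in the count)
is_weekday = np.is_busday(np.datetime64('2018-08-04'))
print(is_weekday)                        # False (a Saturday)
n_busdays = np.busday_count(start, end)
print(n_busdays)                         # 7
```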

Copyright © 2018 · Next Level Finance