Oh, the amazing things you can do with NumPy.
NumPy is a blazing fast maths library for Python with a heavy emphasis on arrays. It allows you to do vector and matrix maths within Python and as a lot of the underlying functions are actually written in C, you get speeds that you would never reach in vanilla Python.
NumPy is an absolutely key piece in the success of scientific Python, and if you want to get into Data Science and/or Machine Learning in Python, it's a must-learn. NumPy is well built in my opinion, and getting started with it is not difficult at all.
This is the second post in a series of posts on scientific Python, so don't forget to check out the others too. An up-to-date list of posts in this series is at the bottom of this post.
ARRAY BASICS
Creation
NumPy revolves around these things called arrays. Actually ndarrays, but we don't need to worry about that. With these arrays we can do all sorts of useful things like vector and matrix maths at lightning speeds. Get your linear algebra on! (Just kidding, we won't be doing any heavy maths.)
# 1D Array
import numpy as np
a = np.array([0, 1, 2, 3, 4])
b = np.array((0, 1, 2, 3, 4))
c = np.arange(5)
d = np.linspace(0, 2*np.pi, 5)
print(a) # >>>[0 1 2 3 4]
print(b) # >>>[0 1 2 3 4]
print(c) # >>>[0 1 2 3 4]
print(d) # >>>[ 0. 1.57079633 3.14159265 4.71238898 6.28318531]
print(a[3]) # >>>3
The above code shows 4 different ways of creating an array. The most basic way is just passing a sequence to NumPy's array() function; you can pass it any sequence, not just lists like you usually see.
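As a quick sketch of how these creation functions differ (the variable names here are just for illustration and are not from the original example): arange() takes a step, while linspace() takes the number of points you want and includes the end value.

```python
import numpy as np

# array() accepts any sequence, for example a range object
from_range = np.array(range(5))

# arange(start, stop, step): stop is excluded, like Python's range()
evens = np.arange(0, 10, 2)

# linspace(start, stop, num): num evenly spaced points, stop is included
quarters = np.linspace(0, 1, 5)

print(from_range)  # [0 1 2 3 4]
print(evens)       # [0 2 4 6 8]
print(quarters)    # five values: 0, 0.25, 0.5, 0.75 and 1
```

A rough rule of thumb: reach for arange() when you care about the step size and linspace() when you care about how many points you get.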
Notice how when we print an array with numbers of different lengths, it automatically pads them out. This is useful for viewing matrices. Indexing on arrays works just like that of a list or any other of Python's sequences. You can also slice them. I won't go into slicing a 1D array here; if you want more information on slicing, check out this post.
The above array example is how you can represent a vector with NumPy, next we will take a look at how we can represent matrices and more with multidimensional arrays.
# MD Array
a = np.array([[11, 12, 13, 14, 15],
              [16, 17, 18, 19, 20],
              [21, 22, 23, 24, 25],
              [26, 27, 28, 29, 30],
              [31, 32, 33, 34, 35]])
print(a[2,4]) # >>>25
To create a 2D array we pass the array() function a list of lists (or a sequence of sequences). If we wanted a 3D array we would pass it a list of lists of lists, a 4D array would be a list of lists of lists of lists and so on.
Notice how a 2D array (with the help of our friend the space bar) is arranged in rows and columns. To index a 2D array we simply reference a row and a column.
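The same idea extends to more dimensions. Here is a minimal sketch of the "list of lists of lists" case; the name cube and its values are made up for illustration.

```python
import numpy as np

# A 3D array: a list of 2 "layers", each a list of 3 rows of 4 values
cube = np.array([[[ 0,  1,  2,  3],
                  [ 4,  5,  6,  7],
                  [ 8,  9, 10, 11]],
                 [[12, 13, 14, 15],
                  [16, 17, 18, 19],
                  [20, 21, 22, 23]]])

print(cube.shape)     # (2, 3, 4)
print(cube.ndim)      # 3
print(cube[1, 2, 3])  # 23  (layer 1, row 2, column 3)
```

You give one index per dimension, in the same order as the nesting of the lists.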
A Bit of the Maths Behind It
To understand this properly, we should really take a look at what vectors and matrices are.
A vector is a quantity that has both direction and magnitude. They are often used to represent things such as velocity, acceleration and momentum. Vectors can be written in a number of ways although the one which will be most useful to us is the form where they are written as an n-tuple such as (1, 4, 6, 9). This is how we represent them in NumPy.
A matrix is similar to a vector, except it is made up of rows and columns, much like a grid. The values within the matrix can be referenced by giving the row and the column that they reside in. In NumPy we make matrices by passing a sequence of sequences, as we did previously.
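To make that a little more concrete, here is a small sketch using the vector (1, 4, 6, 9) from above. The matrix m is a made-up example, and np.linalg.norm is NumPy's built-in way of getting a vector's magnitude, which this post doesn't otherwise cover.

```python
import numpy as np

# The vector (1, 4, 6, 9) as a 1D array
v = np.array([1, 4, 6, 9])

# Its magnitude (length) is the square root of the sum of the squares
magnitude = np.sqrt((v ** 2).sum())
print(magnitude)          # about 11.576
print(np.linalg.norm(v))  # same value, using NumPy's built-in norm

# A matrix with 2 rows and 3 columns
m = np.array([[1, 2, 3],
              [4, 5, 6]])
print(m[1, 2])  # 6  (row 1, column 2, counting from 0)
```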
Multidimensional Array Slicing
Slicing a multidimensional array is a bit more complicated than a 1D one and it's something that you will do a lot while using NumPy.
# MD slicing
print(a[0, 1:4]) # >>>[12 13 14]
print(a[1:4, 0]) # >>>[16 21 26]
print(a[::2,::2]) # >>>[[11 13 15]
# [21 23 25]
# [31 33 35]]
print(a[:, 1]) # >>>[12 17 22 27 32]
As you can see you slice a multidimensional array by doing a separate slice for each dimension separated with commas. So with a 2D array our first slice defines the slicing for rows and our second slice defines the slicing for columns.
Notice that you can select a single row or column simply by entering its number. The first example above, a[0, 1:4], takes elements from row 0 of the array, while the last one, a[:, 1], selects the whole of column 1.
The diagram below illustrates what the given example slices do.
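If you want to experiment with this yourself, here is a short sketch on the same 5x5 array. It is rebuilt with arange() and reshape() so that the snippet runs on its own, and the names block and last_row are just for illustration.

```python
import numpy as np

a = np.arange(11, 36).reshape((5, 5))  # same values as the 5x5 array above

# Rows 1-2 and columns 2-4: a slice on each axis gives a 2D block
block = a[1:3, 2:5]
print(block)     # [[18 19 20]
                 #  [23 24 25]]

# An integer on one axis drops that axis: the last row as a 1D array
last_row = a[-1, :]
print(last_row)  # [31 32 33 34 35]

print(block.shape, last_row.shape)  # (2, 3) (5,)
```

Checking the shape after a slice is a quick way to confirm that you selected what you meant to.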
Array Properties
When working with NumPy you might want to know certain things about your arrays. Luckily every array comes with lots of handy properties that give you the information that you need.
# Array properties
a = np.array([[11, 12, 13, 14, 15],
              [16, 17, 18, 19, 20],
              [21, 22, 23, 24, 25],
              [26, 27, 28, 29, 30],
              [31, 32, 33, 34, 35]])
print(type(a)) # >>><class 'numpy.ndarray'>
print(a.dtype) # >>>int64
print(a.size) # >>>25
print(a.shape) # >>>(5, 5)
print(a.itemsize) # >>>8
print(a.ndim) # >>>2
print(a.nbytes) # >>>200
As you can see in the above code, a NumPy array is actually called an ndarray. The name stands for N-dimensional array.
The shape of an array is how many rows and columns it has. The above array has 5 rows and 5 columns, so its shape is (5, 5).
The 'itemsize' property is how many bytes each item takes up. The data type of this array is int64; there are 64 bits in an int64 and 8 bits in a byte, so dividing 64 by 8 gives the number of bytes each item takes up, which in this case is 8.
The 'ndim' property is how many dimensions the array has. This one has 2. A vector, for example, has just 1.
The 'nbytes' property is how many bytes are used up by all the data in the array. You should note that this does not count the overhead of an array and so the actual space that the array takes up will be a little bit larger.
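These properties are related to each other, which the following sketch checks. The dtype is set explicitly here (an assumption on my part, since the default integer type can vary between platforms) so that the numbers come out the same everywhere.

```python
import numpy as np

# dtype is given explicitly so the sizes are the same on every platform
a = np.arange(25, dtype=np.int64).reshape((5, 5))

print(a.size)      # 25 elements (5 rows * 5 columns)
print(a.itemsize)  # 8 bytes per element (64 bits / 8)
print(a.nbytes)    # 200 bytes in total

# nbytes is just the element count times the size of each element
print(a.nbytes == a.size * a.itemsize)  # True

# A smaller dtype shrinks the data accordingly
small = a.astype(np.int8)
print(small.itemsize, small.nbytes)  # 1 25
```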
WORKING WITH ARRAYS
Basic Operators
Just being able to make arrays and retrieve elements and properties from them isn't going to get you very far; you will need to do maths on them sometimes too. You can do this using the basic operators such as +, -, /, etc.
# Basic Operators
a = np.arange(25)
a = a.reshape((5, 5))
b = np.array([10, 62, 1, 14, 2, 56, 79, 2, 1, 45,
4, 92, 5, 55, 63, 43, 35, 6, 53, 24,
56, 3, 56, 44, 78])
b = b.reshape((5,5))
print(a + b)
print(a - b)
print(a * b)
print(a / b)
print(a ** 2)
print(a < b)
print(a > b)
print(a.dot(b))
With the exception of dot() all of these operators work element-wise on the array. For example (a, b, c) + (d, e, f) would be (a+d, b+e, c+f). It will work separately on each element, pairing the corresponding elements up and doing arithmetic on them. It will then return an array of the results. Note that when using logical operators such as < and > an array of booleans will be returned, which has a very useful application which we will go through later.
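Here is a small sketch of that element-wise behaviour on two short arrays, x and y (made up for illustration), where the results are easy to check by eye.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([10, 20, 30])

print(x + y)   # [11 22 33]
print(x * y)   # [10 40 90]
print(y / x)   # [10. 10. 10.]
print(x ** 2)  # [1 4 9]
print(x < 2)   # [ True False False]
```

Note that / always gives floating point results, even when both arrays hold integers.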
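The dot() function works out the dot product of two arrays. For two 1D arrays (vectors) this does not return an array, but a scalar (a value with just magnitude and no direction). For two 2D arrays, like a and b in the code above, it returns their matrix product, which is itself a 2D array.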
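A minimal worked sketch of dot() in both cases, using small made-up arrays (u, v, m and n) so that the arithmetic can be followed by hand:

```python
import numpy as np

# 1D case: multiply matching elements and add them up
u = np.array([1, 2, 3])
v = np.array([4, 5, 6])
print(u.dot(v))       # 32  (1*4 + 2*5 + 3*6)
print((u * v).sum())  # 32, the same thing spelled out

# 2D case: each entry is the dot product of a row of m with a column of n
m = np.array([[1, 2],
              [3, 4]])
n = np.array([[5, 6],
              [7, 8]])
print(m.dot(n))  # [[19 22]
                 #  [43 50]]
```

For example, the top left entry of the 2D result is 1*5 + 2*7 = 19.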
A Bit of the Maths Behind It
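The dot() function is something called the dot product. The best way to understand this is to see how it is calculated: for two vectors (a, b, c) and (d, e, f), you multiply the corresponding elements and add up the results, giving a*d + b*e + c*f.
Array Specific Operators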
There are also some useful operators provided by NumPy for processing an array.
# sum, min, max, cumsum
a = np.arange(10)
print(a.sum()) # >>>45
print(a.min()) # >>>0
print(a.max()) # >>>9
print(a.cumsum()) # >>>[ 0 1 3 6 10 15 21 28 36 45]
The sum(), min() and max() functions are pretty obvious in what they do: they add up all the elements, and find the minimum and maximum elements.
The cumsum() function, however, is a little less obvious. Like sum() it adds the elements together, but it keeps every intermediate result: it adds the first and second elements and stores the result, adds that result to the third element and stores that too, and so on for all the elements in the array. What you get back is an array holding the running total of the sum.
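One way to convince yourself of what cumsum() does is to build the running total by hand and compare. This is only a sketch to show the idea, not how NumPy does it internally.

```python
import numpy as np

a = np.arange(10)

# Build the running total by hand with a plain Python loop
running = []
total = 0
for value in a:
    total += int(value)
    running.append(total)

print(running)     # [0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
print(a.cumsum())  # [ 0  1  3  6 10 15 21 28 36 45]

# The last running total is the same as the plain sum
print(running[-1] == a.sum())  # True
```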
Advanced Indexing
Fancy Indexing
'Fancy indexing' is a useful way of picking out specific array elements that you want to work with.
# Fancy indexing
a = np.arange(0, 100, 10)
indices = [1, 5, -1]
b = a[indices]
print(a) # >>>[ 0 10 20 30 40 50 60 70 80 90]
print(b) # >>>[10 50 90]
As you can see in the above example, we index the array with a sequence of the specific indexes that we want to retrieve. This in turn returns an array of the elements we indexed.
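A few more things worth knowing about fancy indexing, sketched below with made-up names (picks and grid): indexes may repeat, the result is a copy rather than a view, and in 2D you pass one index sequence per dimension.

```python
import numpy as np

a = np.arange(0, 100, 10)

# The index sequence can be a NumPy array, and indexes may repeat
picks = a[np.array([0, 0, 3])]
print(picks)  # [ 0  0 30]

# Fancy indexing returns a copy, so changing it leaves the original alone
picks[0] = 999
print(a[0])   # 0

# In 2D, pass one index sequence per dimension to pick (row, column) pairs
grid = np.arange(11, 36).reshape((5, 5))
print(grid[[0, 2, 4], [1, 3, 0]])  # [12 24 31]
```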
Boolean Masking
Boolean masking is a fantastic feature that allows us to retrieve elements in an array based on a condition that we specify.
# Boolean masking
import matplotlib.pyplot as plt
a = np.linspace(0, 2 * np.pi, 50)
b = np.sin(a)
plt.plot(a,b)
mask = b >= 0
plt.plot(a[mask], b[mask], 'bo')
mask = (b >= 0) & (a <= np.pi / 2)
plt.plot(a[mask], b[mask], 'go')
plt.show()
The above example shows how to do boolean masking. All you have to do is pass the array a conditional involving the array and it will give you an array of the values that return true for that condition.
The example produces the following plot:
We use the conditions to select different points on the plot. The blue points (which in the diagram also include the green points, but the green points cover up the blue ones) show all the points that have a value greater than or equal to 0. The green points show all the points that have a value greater than or equal to 0 and an x value less than or equal to half pi.
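If you don't have matplotlib to hand, the same masking idea can be seen on a small array of numbers. The values below are made up for illustration.

```python
import numpy as np

a = np.array([3, -1, 4, -1, 5, -9])

# Comparing an array with a value gives an array of booleans: the mask
mask = a >= 0
positives = a[mask]
print(mask)       # [ True False  True False  True False]
print(positives)  # [3 4 5]

# Combine conditions with & (and) or | (or); the parentheses are required
both = (a >= 0) & (a < 5)
in_range = a[both]
print(in_range)   # [3 4]

# A mask can also be used to assign to the selected elements
a[a < 0] = 0
print(a)          # [3 0 4 0 5 0]
```

The parentheses matter because & binds more tightly than comparisons such as >= in Python.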
Incomplete Indexing
Incomplete indexing is a convenient way of taking an index or slice from the first dimension of a multidimensional array. For example, if you had the array a = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]), then a[1] would give you everything at index 1 in the first dimension of the array, which here is the whole second row, [6 7 8 9 10]. It is shorthand for a[1, :].
# Incomplete Indexing
a = np.arange(0, 100, 10)
b = a[:5]
c = a[a >= 50]
print(b) # >>>[ 0 10 20 30 40]
print(c) # >>>[50 60 70 80 90]
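The example above uses a 1D array, so here is a minimal sketch of incomplete indexing on a small 2D array, where leaving out the second index is more noticeable:

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5],
              [6, 7, 8, 9, 10]])

print(a[1])     # [ 6  7  8  9 10]  (the whole of row 1)
print(a[1, :])  # the same thing written out in full
print(a[:1])    # [[1 2 3 4 5]]  (a slice keeps the result 2D)
print(a[1][3])  # 9, the same element as a[1, 3]
```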
Where
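The where() function is another useful way of retrieving elements of an array conditionally. Simply pass it a condition and it will return a tuple of arrays holding the indexes where that condition is true, one array per dimension. You can then index the original array with the result to get the elements themselves.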
# Where
a = np.arange(0, 100, 10)
b = np.where(a < 50)
c = np.where(a >= 50)[0]
print(b) # >>>(array([0, 1, 2, 3, 4]),)
print(c) # >>>[5 6 7 8 9]
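Since where() gives back indexes rather than values, a natural next step is to feed them back into the array. The sketch below also shows the three-argument form of where(), which goes slightly beyond the example above; the names idx and labels are just for illustration.

```python
import numpy as np

a = np.arange(0, 100, 10)

idx = np.where(a >= 50)  # a tuple holding one array of indexes
print(a[idx])            # [50 60 70 80 90]
print(a[a >= 50])        # the same elements via a boolean mask

# With three arguments, where() chooses between two values per element
labels = np.where(a >= 50, "high", "low")
print(labels[4], labels[5])  # low high
```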
And that's NumPy, not so hard, right? Of course, this post only covers the basics to get you going. There are many other things that you can do in NumPy, and once you are comfortable with these basics you should take a look at them.
This is the second instalment in a series of posts on scientific Python. If you want to learn more about scientific Python, you might like these posts too: