Chapter 5 — Modules and NumPy

5.0. Core Python modules

This week we’re going to learn about a bunch of modules that you can use to do different useful things.

As we’ve said, a Python module is just an extra bit of code (functions, data types, or classes) that is not loaded automatically when Python starts, but which you can import to add functionality. There are two basic kinds of Python modules:

“Core” Python modules, which are created by the same people who made Python
“Third party” modules (called that because they were not created by the Python creators or you).

Core modules come installed with Python, so all you need to do is use an import statement to use them. Third party modules need to be installed. In the rest of this section we will go over some useful core Python modules, and then we will talk about two of the most popular third party modules (NumPy and Pillow) in later sections.

The `math` module

The math module gives you a bunch of basic math operations:

import math
x = 82.2

a = math.sqrt(x)  # for calculating square roots, result 9.066
b = math.ceil(x)  # rounds a number up, result 83
c = math.floor(x)  # rounds a number down, result 82
d = math.log10(x)  # base-10 logarithm, result 1.9149
e = math.cos(x)  # cosine of x, where x is in radians, result 0.8685
f = math.pi  # result is 3.1415....

There are many, many more math functions. You can see the full list here: https://docs.python.org/3/library/math.html.
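
One detail that is easy to miss is that the trigonometric functions in math expect angles in radians, not degrees. The short sketch below converts degrees to radians first; the variable names are just for illustration.

```python
import math

angle_degrees = 60
angle_radians = math.radians(angle_degrees)  # convert degrees to radians

cos_value = math.cos(angle_radians)  # cosine of 60 degrees, approximately 0.5
hypotenuse = math.hypot(3, 4)        # hypotenuse of a right triangle with sides 3 and 4

print(round(cos_value, 3), hypotenuse)  # 0.5 5.0
```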

The `time` module

The time module is, as you would expect, for various time related functions. Some of you may have already used it to slow down the Tkinter graphics.

import time

for i in range(10):
    print(i)
    time.sleep(0.1)  # forces python to pause and not execute the next line for 0.1 seconds

Tracking how long code takes to execute (or how long a user takes to do something)

Another very common use of the time module is to keep track of how long certain operations take. This can be useful for comparing different ways of writing code to see which runs faster. It can also be useful as experimental data. Think back to the big five survey. We could keep track of how long it took people to answer each question. Maybe how much time they had to think about each question would tell us something extra about their personality?

import time
x = 0

# let's see how long it takes for Python to loop through the numbers 0 to 999
# first, we start the timer
start_time = time.time()  # retrieves the current system time in seconds

# then we do the thing we want to time (in this case, looping through the numbers 0 to 999)
for i in range(1000):
    x += 1

# then we stop the timer
stop_time = time.time()  # again retrieves the current system time in seconds

# then we calculate how long it took
elapsed_time = stop_time - start_time

The value you get from time.time() is a big number, in seconds. Technically, it is the number of seconds that have elapsed since a fixed reference point called the epoch, which is January 1, 1970, 00:00:00 (UTC). So the absolute number isn’t that important. What is useful is comparing two different times, as we did above, to get the amount of time that has elapsed. Notice that the result has many digits after the decimal point, which suggests far better than millisecond accuracy. But in reality, for various reasons having to do with computer operating systems, the time is not that trustworthy. Supposedly on Macs and Linux machines you can trust the numbers out to the third decimal place (millisecond accuracy), whereas on Windows you can only trust it out to the second decimal place (there is about 16 ms of noise in each time.time() estimate on Windows computers). Note: if your computer is fast enough, you may need to use time.perf_counter() instead of time.time() in order to see any appreciable differences for such computations. Computers just keep getting faster!
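
As a rough sketch of the same timing pattern with time.perf_counter(), the code below wraps the loop in a small helper function. The name count_to is made up for this example, and the elapsed time you see will differ from machine to machine.

```python
import time

def count_to(n):
    """Loop n times, adding 1 each time (the thing we want to time)."""
    x = 0
    for i in range(n):
        x += 1
    return x

start_time = time.perf_counter()  # high-resolution clock meant for measuring durations
result = count_to(1000)
stop_time = time.perf_counter()

elapsed_time = stop_time - start_time  # elapsed seconds, as a float
print(f"Counted to {result} in {elapsed_time:.6f} seconds")
```

Unlike time.time(), the value returned by time.perf_counter() has no meaning on its own. Only the difference between two calls is useful.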

Other time functions can be read about here: https://docs.python.org/3/library/time.html.

The `datetime` module

The datetime module can be used to get all kinds of current day and time information, and to format it various ways.

import datetime
x = datetime.datetime.now()  # gets the current date and time and stores it in a datetime object
print(x)  # prints something like 2023-02-21 20:07:05.749192
# that is year, month, day, hour, minute, second, and microsecond

# you can then convert this to a formatted string:
my_date_time_string = x.strftime("Day: %d  Month: %b   Year: %Y")

There are all kinds of different %-sign symbols you can stick in there to convert the datetime object to different formats. You can see them here: https://www.w3schools.com/python/python_datetime.asp.

The datetime module has tons of other date and time related functions, for doing things like comparing lengths of time, checking if days are weekdays, converting times, and other stuff. You can read more here: https://docs.python.org/3/library/datetime.html.
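
As a small illustration of comparing dates and checking for weekdays, here is a sketch that uses two arbitrarily chosen dates. It relies only on datetime.date and datetime.timedelta from the datetime module.

```python
import datetime

start = datetime.date(2023, 2, 21)
end = datetime.date(2023, 3, 1)

days_between = (end - start).days  # subtracting two dates gives a timedelta
is_weekday = start.weekday() < 5   # weekday() returns 0 (Monday) through 6 (Sunday)
one_week_later = start + datetime.timedelta(days=7)

print(days_between)    # 8
print(is_weekday)      # True, because February 21, 2023 was a Tuesday
print(one_week_later)  # 2023-02-28
```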

The `random` module

The random module allows us to generate pseudo random numbers and lists of numbers from various distributions. We say “pseudo” because an algorithm cannot generate truly random numbers (for that you need a physical source of randomness, such as quantum physics). Any algorithm we would use to generate the numbers will be deterministic by definition, and so pure randomness isn’t possible. So when you want a “random” number, Python starts from a hard-to-predict value (such as the current system time out to a pretty extreme decimal place, or a source of randomness provided by the operating system), and then performs an algorithmic transformation of that to give you back a number that is pretty darn close to random.

import random
a = random.random()  # generates a random float between 0 and 1
b = random.randint(4, 10)  # generates a random integer between 4 and 10 (inclusive)
c = random.uniform(4, 10)  # generates a random float between 4 and 10 (inclusive)
d = random.gauss(4, 10) # generates a number from a gaussian (normal) distribution with a mean of 4 and stdev of 10
list1 = [a, b, c, d]
random.shuffle(list1)  # randomizes the elements of a list, copy it first if you want to save the original!
random.choice(list1)  # chooses a random item from the list

Random seed

As we noted, the random numbers you can generate with Python (or any programming language) are not truly random. They are generated by an algorithm that by default takes a hard-to-predict starting value (the system time, or a randomness source provided by the operating system) as the input and sticks that in a very funky hard-to-predict equation to get a very hard-to-predict number back out. So this is good if you want very hard to predict random numbers. But sometimes we want to be able to replicate analyses we do using random numbers. This is where a random “seed” comes in. The seed is the number that is stuck into the funky equation. You can set the random seed to any number you want, and if you consistently use the same random seed, then you will consistently get back the same “random” numbers.

import random

random_int1 = random.randint(0, 1000)  # you won't be able to predict this number

random_seed = 10
random.seed(random_seed)

random_int2 = random.randint(0, 1000)  # this number will be the same every time
random_int3 = random.randint(0, 1000)  # this number will also be the same every time, but not the same as the last one

random.seed()
random_int4 = random.randint(0, 1000)  # you won't be able to predict this number

print(random_int1, random_int2, random_int3, random_int4)

One thing to note about the random seed is that once you set it to something (or if you use the default), if you ask for 10 random numbers, you get 10 different numbers, not the same random number 10 times. Why is this? The seed only sets the generator’s starting point. Each time you request a new random number, Python updates the generator’s internal state in a deterministic way, so a different number is put into the funky equation each time. That is why the random number is different each time you ask for a new one. But that’s also why the sequence of random numbers is the same every time if you set the random seed manually: each state is determined entirely by the one that came before.
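
A minimal sketch of checking this reproducibility yourself is shown below. The helper name draw_numbers is made up for this example, and no specific output values are listed because they depend on the generator.

```python
import random

def draw_numbers(seed, how_many=5):
    """Set the given seed, then return a list of random integers."""
    random.seed(seed)
    return [random.randint(0, 1000) for _ in range(how_many)]

first_run = draw_numbers(10)
second_run = draw_numbers(10)  # same seed, so the same sequence of numbers

print(first_run)
print(first_run == second_run)  # True
```

Within each list the numbers generally differ from one another, but the two lists match element by element because both started from the same seed.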

There are many more things you can do with the random module, which you can read about here: https://docs.python.org/3/library/random.html

The `sys` module

The sys module is used to interact with the Python interpreter, the actual program that executes your Python code. Below are three things you can do with sys, but there are dozens more, which you can read about in the documentation.

For each of the sys functions below, comment what it is doing.

Quitting a program

There may be times you want to force your program to quit. One common situation is when you are checking to see if some data is correct, or a user has input something wrong, and you want to stop the program if you notice a problem.

import sys

some_data = [1,2,3,4,5]
if len(some_data) < 10:
    print("ERROR: Your data was supposed to have at least 10 elements. Program terminating.")
    sys.exit()

Getting the memory size of an object

Sometimes you may be curious about how much of your memory a variable is taking up, if you are doing big data analyses and your computer is running slow.

import sys
x = 100000
big_list = list(range(x))
size = sys.getsizeof(big_list)  # about 800056 bytes on 64-bit Python; counts the list itself, not the integers stored in it

Passing command-line arguments to your program

A final thing that sys is useful for is passing command line arguments to your program. A common use here is if you want to specify some piece of information, like a file name or a subject number, at the time you run a program, without needing to hard-code it into the program itself. To do this, we use sys.argv:

import sys

argument_list = sys.argv

The way this works is that when you run a python script that uses that line, then the name of the script, and anything you type after the name of the script, get saved in the sys.argv variable. So if you typed: python script.py pizza 1 3.14 when you ran the script, then you would end up with the list:

["script.py", "pizza", "1", "3.14"]

As noted, this is a really useful way for providing information to your program that you may want to vary different times you run the program, like perhaps what file to use as input. Imagine the survey program you wrote for the last homework, but also imagine that you had many files of questions, and you wanted to specify which one was used. It is good for situations like that. (Perhaps somewhat confusingly, the script name itself is considered the first argument.)

Note that this returns a list of strings! If you entered numbers into the command line, and want them to work as numbers, you need to convert them to floats or ints.
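
The conversion step might look something like the sketch below. The function name parse_arguments and the argument layout (a word, a count, and a scale) are made up for illustration. A hand-written list stands in for sys.argv so that the example behaves the same however you run it.

```python
import sys

def parse_arguments(argument_list):
    """Convert a sys.argv-style list of strings into typed values."""
    if len(argument_list) != 4:
        sys.exit("Usage: python script.py <word> <count> <scale>")
    word = argument_list[1]          # already a string
    count = int(argument_list[2])    # "1" becomes the integer 1
    scale = float(argument_list[3])  # "3.14" becomes the float 3.14
    return word, count, scale

# In a real script you would call parse_arguments(sys.argv)
word, count, scale = parse_arguments(["script.py", "pizza", "1", "3.14"])
print(word, count + 1, scale * 2)  # pizza 2 6.28
```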

“Why isn’t my Play button working?”

At the risk of stating the obvious, this only works if you run the script from the command line. If you try to run your script from an IDE (e.g. pressing the big “Play” button in VS Code), it’s not going to sit there waiting for you to type something to input as an argument. Run the script from the command line to use this feature: python my_script.py x y z (or uv run python my_script.py x y z if using uv) where x, y, and z are placeholders for the arguments you want to pass to the script.

Why use command-line arguments instead of input statements?

You might reasonably ask why we would bother passing arguments directly in the command line instead of using a series of input statements. Beyond the obvious benefits of being easier, quicker, and less error-prone to type in one go (rather than needing to wait to be prompted for each input serially, where errors and typos can accumulate), there are also some subtle advantages. For example, if you are writing a script that needs to be run repeatedly, you can pass different arguments each time to get different results. This is useful for repeatedly running simulations or experiments with different parameters each time. It is also useful when you need to run code on a different computer you don’t have full control over, or when running code that you cannot even change because it is an executable file (so you don’t even have access to the source code).

Indeed, you have experienced some of these benefits already throughout this course, especially when running the uv command to install packages. When you enter something like uv add numpy matplotlib pandas, you are asking uv to install the packages numpy, matplotlib, and pandas, and then record all that in the project’s pyproject.toml file so that you can easily recreate the same environment later. You don’t have to type in the package names one by one, and you can review them all right in front of you to make sure they are spelled correctly.

There are many more sys functions, which you can read about here: https://docs.python.org/3/library/sys.html.

The `os` module

The final module we will discuss here is the os module. The os module is a way to do operating system dependent stuff. The most common case is using your python code to do things like make and delete folders, get lists of files in a directory, and stuff like that.

Creating and deleting folders and files

import os
os.mkdir("new folder")  # will create a new folder named "new folder" in the directory where the program was run
os.mkdir("/Users/jon/Desktop/new folder") # will create a folder called "new folder" on my desktop

os.rmdir("new folder")  # will delete a folder called "new folder" if it exists in that location
os.remove("some_file.txt") # will delete a file called "some_file.txt" if it exists in that location

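These calls can fail. os.mkdir() raises an error if the folder already exists, os.rmdir() raises an error if the folder doesn’t exist or isn’t empty, and os.remove() raises an error if the file doesn’t exist.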

Say you want to create a folder inside a folder inside a folder, and you have the name of each stored in a string.

folder1 = "music"
folder2 = "classical"
folder3 = "bach"

You could concatenate the strings to create a path.

folder_path = folder1 + "/" + folder2 + "/" + folder3
os.mkdir(folder_path)

But adding all those slashes is annoying. The os module has a nice function you can use instead.

folder_path = os.path.join(folder1, folder2, folder3)  # folder path will be "music/classical/bach" on Mac/Linux
os.makedirs(folder_path)

Note that the function here is makedirs(), not mkdir(). If you use mkdir() and the music and classical folders didn’t exist, you would get an error. But if you use makedirs() it will create any chain of folders you want to make.
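
Here is a sketch of creating a chain of folders that you can run without leaving anything behind. Working inside a temporary directory (via the core tempfile module) is an assumption made for this example; the folder names are the ones from above.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base_dir:
    folder_path = os.path.join(base_dir, "music", "classical", "bach")

    os.makedirs(folder_path)                 # creates music, classical, and bach in one call
    os.makedirs(folder_path, exist_ok=True)  # no error, even though the folder now exists

    created = os.path.isdir(folder_path)  # True if the innermost folder exists
    top_level = os.listdir(base_dir)      # names of everything directly inside base_dir

print(created, top_level)  # True ['music']
```

The temporary directory and everything inside it is deleted automatically when the with block ends.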

Tip

Alternatively, you can use the pathlib module (another core Python module) to create the path and perform operations on it. When running .mkdir(), setting parents=True will create all the folders in the path if they don’t exist, and exist_ok=True will not raise an error if the folder already exists.

from pathlib import Path
folder_path = Path(folder1, folder2, folder3)  # folder path will be "music/classical/bach"
folder_path.mkdir(parents=True, exist_ok=True)

Listing files and directories

There may be times you want to write a program that will read in a bunch of files and do something with all of them. Like what if we wanted to count and sum the frequencies of words in a bunch of different files? One (bad) way to do this is to hard-code all of the file names into the program. But that would mean you would need to change the program every time you changed the files. A better way would be to designate a folder for all the files you want to read, and then get a list of every file in that folder, and then open them.

import os

data_directory = "my_data"
directory_list = os.listdir(data_directory)

for file_name in directory_list:
    if not file_name.startswith("."):  # skip hidden files
        process_file(os.path.join(data_directory, file_name))  # process_file is a function you would write yourself

In the code above, the os.listdir() function gives you back a list of the name of every file (and folder) in that directory. One important thing to remember is that this will include hidden files, whose names on most operating systems begin with a “.”. The if statement above checks to see if the file name starts with a “.”, and if it does not, calls some function that would process that file, and passes it the path to that file so it can be opened.
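
To make the idea concrete, here is a self-contained sketch that counts the words in every non-hidden file in a folder. The file names, their contents, and the helper count_words are all invented for this example, and the files are written to a temporary directory so nothing is left on your disk.

```python
import os
import tempfile

def count_words(file_path):
    """Return the number of whitespace-separated words in a text file."""
    with open(file_path) as f:
        return len(f.read().split())

with tempfile.TemporaryDirectory() as data_directory:
    # create two visible files and one hidden file to read back
    contents = {"a.txt": "one two three", "b.txt": "four five", ".hidden": "ignore me"}
    for name, text in contents.items():
        with open(os.path.join(data_directory, name), "w") as f:
            f.write(text)

    word_counts = {}
    for file_name in sorted(os.listdir(data_directory)):
        if not file_name.startswith("."):  # skip hidden files
            word_counts[file_name] = count_words(os.path.join(data_directory, file_name))

print(word_counts)  # {'a.txt': 3, 'b.txt': 2}
```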

There is a whole lot more you can do with the os module, which you can read about here: https://docs.python.org/3/library/os.html

5.1 Installing modules with `uv`

As noted in Chapter 0.4, we’ll use uv to manage the Python packages we’ll use in our projects. That chapter covers how to install uv and create your class project, so revisit it if you need help getting started with uv.

Below we’ll go through the steps to install a new module with uv.

Step 1: Make sure `uv` is installed

Let’s verify your installation of uv. Open the terminal and run the following command:

uv --version

If that successfully prints a version number, you are good to go. Mine prints uv 0.7.3 (3c413f74b 2025-05-07) but it’s okay if you’re using a different version.

Step 2: Installing a new module with `uv`

First, ensure that you are currently in your project folder. That is, your terminal window’s current working directory should be your project folder (i.e., the one that contains a pyproject.toml file created by uv init). Then you can install a new package, like numpy, with:

uv add numpy

This installs numpy into your project environment and automatically records it as a dependency of the project, making recreating the same environment later easy and reproducible.

Step 3: Verify that your installation worked

One quick way is to run Python through uv, import numpy, and print the version number. Copy-paste and run the below code in your terminal:

uv run python -c "import numpy; print(numpy.__version__)"

Again, if you see that it successfully ran and printed a version number, you’re golden. If you get an error like ModuleNotFoundError: No module named 'numpy', double-check that you ran uv add numpy from inside the correct project folder! If you managed to successfully install numpy, feel free to move on to the next section.

Running Python code inline

We just ran a little Python script inline. The -c flag tells Python to execute the code passed to it as a string. Semicolons are used here to separate valid statements, rather than using new lines.

Troubleshooting installation issues

If you didn’t manage to install numpy, this section will go over some of the issues encountered by past students and how to fix them.

Windows: `os error 396` or `os error 32`

This can be remedied by (1) moving your project folder out of cloud storage (like OneDrive) and (2) using the --link-mode=copy argument when trying to install things with uv. So try the following to get numpy installed:

uv add numpy --link-mode=copy

Packages added but not seen in the environment

In this problem scenario, although you’ve successfully added the packages using uv add, activating your local environment and trying to import your packages continues to fail. This is likely because you have multiple python environments in folders above your project folder. Delete those .venv folders (you can always remake them anew if needed) and run uv sync to try again.

5.2. NumPy

NumPy is short for “Numerical” Python, and it is designed to implement vector and matrix algebra in a much faster and more useful manner. Matrix algebra, from calculating means, to correlations, to complicated neural networks and machine learning algorithms, is an important part of modern scientific programming and data science. So it is great to have a module that makes this easier and more efficient in Python.

The primary element of NumPy is an “array”. It’s easiest to think of NumPy arrays as just an extension of python lists, but with a couple of changes:

when NumPy arrays are created, you have to say what size they are and put some data in them (even if it’s all zeros). You cannot simply “append” to a NumPy array, because NumPy arrays cannot change size.
the syntax for indexing is a little different, as we will describe
there are a ton of attributes, functions and methods for doing operations on NumPy arrays that don’t exist for lists

Creating NumPy arrays

Let’s start by showing some ways to create a NumPy array. A common way to create a NumPy array is to call NumPy’s array function and pass it a list, as in array1 and array2 below.

import numpy as np
array1 = np.array([1, 2, 3, 4, 5])

some_list = [1, 2, 3, 4, 5]
array2 = np.array(some_list)

Remember that when we import a module, we can use an alias to shorten it, so we don’t have to type it out every time we use it. That’s what we’ve done with import numpy as np above. Now we can use np every time we use a NumPy function instead of numpy.

You can also create a NumPy array of a specific size but full of all ones or all zeros.

import numpy as np

x = np.ones(10)  # creates an array with 10 elements that are all 1
y = np.zeros(100)  # creates an array with 100 elements that are all 0
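
One thing to be aware of is that np.ones() and np.zeros() fill the array with floats by default, which is why printed arrays of ones show up as 1. rather than 1. The sketch below inspects this through the .dtype attribute and the dtype argument, neither of which has been introduced in the text so far.

```python
import numpy as np

x = np.ones(3)              # floats by default: [1. 1. 1.]
y = np.zeros(3, dtype=int)  # ask for integers instead: [0 0 0]
z = np.array([1, 2, 3])     # integers, because the list contained only integers

print(x.dtype)       # float64
print(y.dtype.kind)  # i, meaning an integer type
print(z.dtype.kind)  # i
```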

Creating multidimensional arrays

We can easily create multidimensional arrays in NumPy. This looks just like when we create a list of lists. To create multidimensional arrays of ones or zeros, you pass a list giving the size of each dimension.

import numpy as np

# an array with 2 rows and 3 columns per row
two_dim_array1 = np.array([[1, 2, 3], [10, 20, 30]])

# an array with 4 rows and 3 columns per row
two_dim_array2 = np.array([[1, 2, 3], [10, 20, 30], [4, 5, 6], [40, 50, 60]])

# a 2d array with 4 rows and 3 columns, all containing zeros
two_dim_array3 = np.zeros([4, 3])

# a 3d array with height=2, width=2, and depth=4
three_dim_array1 = np.array([[[1, 2, 3, 4], [10, 20, 30, 40]], [[1, 2, 3, 4], [10, 20, 30, 40]]])

# sometimes it is easier to visualize if you separate it over lines. It is legal to do so if you hit return after commas
three_dim_array2 = np.array([[[1, 2, 3, 4],
                              [10, 20, 30, 40]],
                             [[1, 2, 3, 4],
                              [10, 20, 30, 40]]])

# a 3d array with height=4, width=3, and depth=5, containing zeros in every cell
three_dim_array3 = np.zeros([4, 3, 5])

Getting the dimensions of NumPy arrays

When you print or define a NumPy array, you can always tell how many dimensions it is by counting how many square brackets there are at the very beginning or end. The 2d examples above have 2, and the 3d ones have 3.

But there is an easier way to get the size using code. So far we have used the len() function to get the size of a list, but on a NumPy array len() only tells you the size of the first dimension. For multidimensional arrays, you want to access the .shape attribute of a NumPy array. You can also get the total number of elements using the .size attribute.

import numpy as np

three_dim_array1 = np.array([[[1, 2, 3, 4], [10, 20, 30, 40]], [[1, 2, 3, 4], [10, 20, 30, 40]]])

array_shape = three_dim_array1.shape  # will store the tuple (2, 2, 4) in array_shape
array_size = three_dim_array1.size  # will store the value 16 in array_size, the total number of cells in the array
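
To see how these ways of measuring an array differ, here is a short sketch on a small 2D array. The variable name grid is arbitrary, and .ndim (the number of dimensions) is one more attribute that has not been mentioned yet.

```python
import numpy as np

grid = np.zeros([4, 3])

print(len(grid))   # 4, only the size of the first dimension
print(grid.shape)  # (4, 3)
print(grid.size)   # 12
print(grid.ndim)   # 2
```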

Indexing NumPy arrays

We index a 1d NumPy array the same way we do a list:

import numpy as np
array1 = np.array([10, 20, 30, 40, 50])
print(array1[2])  # will print the number 30

for i in range(array1.shape[0]):
    print(i, array1[i])

The output of the above for loop would be:

0 10
1 20
2 30
3 40
4 50

The syntax for indexing multidimensional arrays is slightly different from Python lists. Instead of putting brackets around every dimension as you would for a list of lists, you just use one set of brackets, and you use commas to state what index of each dimension you want to access.

import numpy as np

array1 = np.ones([3, 4])

array1[2, 3] = 5
array1[0, 0] = 0

print(array1)

The output of the print statement would be:

[[0. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 5.]]

If you only specify one index in a multidimensional array, it will access every element of every other dimension along the one that is specified:

import numpy as np

array1 = np.ones([3, 4])
array1[2] = 10  # will change every column of the third row to 10
array2 = array1[0]  # will create a 1D array that refers to every column of the first row of array1

All the other slicing and indexing rules for lists apply to NumPy arrays as well, but are adapted in the same way for multidimensional arrays.

import numpy as np

array1 = np.array([[1, 2, 3, 4], [5, 5, 5, 5], [50, 500, 5000, 50000]])

array1[2, -2] = 10  # will change the second to last element of the third row to 10

array2 = array1[:, 2]  # creates a reference to a slice of array1 with every row of column index 2 (the third column)

# creates a reference to a slice of array1, with every row before index 2, and every column from index 1 onward
array3 = array1[:2, 1:]
print(array3)

The output of that print would be

[[2 3 4]
 [5 5 5]]

Copies and references to NumPy arrays

The last example is an opportunity for an important reminder. NumPy arrays are like Python lists, in that when you assign an array to a second variable, or take a slice of an array and assign it to another variable, it does not create a copy. It creates a reference pointing to the original array. That means that both variables point to the same data, and so if you change one, you change them both. In NumPy, these referenced arrays are called “views” of an array. That’s a useful way to think about it: the other array is allowing you to view the array (or part of the array, or a reshaped version of the array), not making a copy of it.

import numpy as np

array1 = np.array([[1, 2, 3, 4], [5, 5, 5, 5], [50, 500, 5000, 50000]])
array2 = array1[:, 2]  # creates a view of array1, with every row, but only of column index 2 (the third column)

array1[2, 2] = -10
array2[0] = -20

print(array1)
print(array2)

The resulting prints of array1 and array2 would be:

[[    1     2   -20     4]
 [    5     5     5     5]
 [   50   500   -10 50000]]

[ -20    5 -10]

The location [2,2] in array1 is also location [2] in array2, and the location [0] in array2 is also location [0,2] in array1, because both variables are pointing to the same data.

Often this is not what we want. You can use the .copy() method to create a copy instead of a view:

import numpy as np

array1 = np.array([[1, 2, 3, 4], [5, 5, 5, 5], [50, 500, 5000, 50000]])
array2 = array1[:, 2].copy()  # creates a copy of every row of column index 2 (the third column) of array1

array1[2, 2] = -10
array2[0] = -20

print(array1)
print(array2)

The result now is what we might have wanted:

[[    1     2     3     4]
 [    5     5     5     5]
 [   50   500   -10 50000]]

[ -20    5  5000]
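
If you are ever unsure whether two arrays share data, you can ask NumPy directly. The sketch below uses np.shares_memory(), a NumPy function not otherwise covered in this chapter; the variable names are just for illustration.

```python
import numpy as np

array1 = np.array([[1, 2, 3, 4], [5, 5, 5, 5], [50, 500, 5000, 50000]])

view_of_column = array1[:, 2]         # a slice, which shares data with array1
copy_of_column = array1[:, 2].copy()  # an independent copy

is_view = np.shares_memory(array1, view_of_column)
is_copy = np.shares_memory(array1, copy_of_column)

print(is_view, is_copy)  # True False
```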

5.3. Manipulating NumPy arrays

There are many ways you can manipulate NumPy arrays.

Reshaping NumPy arrays

You can create a new array (or reference to an array) that converts the existing array to a different shape, using the .reshape() method.

import numpy as np

array1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

array2 = array1.reshape(4, 3)  # creates a view of array1 that is reshaped from 12 elements to 4x3

array3 = array1.reshape(4, 3).copy()  # creates a copy of array1 that is reshaped from 12 elements to 4x3

For a multidimensional array, the .reshape() method allows you to leave one of the sizes unspecified. It will then make that dimension the size that it needs to be. The way you tell it which size you are not specifying is by using a -1.

import numpy as np

array1 = np.ones([4, 3, 5])
array2 = array1.reshape(2, 2, -1)  # creates a view of array1 that has a height=2, a width=2, and a depth=15.

The way this works is that you can figure out what the total number of elements would be (4*3*5=60 in the above example), and so if you specify two dimensions that give you four elements (2*2 in the above example), then the remaining dimension must have 15 elements in it (60/4).
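
You can confirm the arithmetic by printing the shapes. The sketch below also shows what happens when the sizes cannot work out; the shape (7, -1) is chosen only because 60 is not divisible by 7.

```python
import numpy as np

array1 = np.ones([4, 3, 5])  # 4 * 3 * 5 = 60 elements

array2 = array1.reshape(2, 2, -1)  # NumPy works out the last size: 60 / (2 * 2) = 15
array3 = array1.reshape(-1, 10)    # the -1 can go in any position: 60 / 10 = 6 rows

print(array2.shape, array3.shape)  # (2, 2, 15) (6, 10)

try:
    array1.reshape(7, -1)  # 60 elements cannot be divided into 7 equal rows
except ValueError as error:
    print("Cannot reshape:", error)
```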

A common operation in many applications is also to take a multidimensional array and to “flatten” it into a single dimension. You do this the same way, by using reshape and with -1 as the only argument.

import numpy as np

array1 = np.ones([4, 3, 5])
array2 = array1.reshape(-1)  # creates a view of array1 that is a 1D array with 60 elements

Combining arrays

As we said, the size of an array cannot be changed after it is created, but you can create new arrays (or new views of arrays) that combine other arrays or use only parts of an array. There are many ways to do this.

Concatenating arrays

import numpy as np

array1 = np.array([1, 2, 3, 4, 5])
array2 = np.array([11, 22, 33, 44, 55])
array3 = np.array([111, 222, 333, 444, 555])

array4 = np.concatenate((array1, array2, array3))
# array([1, 2, 3, 4, 5, 11, 22, 33, 44, 55, 111, 222, 333, 444, 555])

One thing to remember is that np.concatenate() creates a copy, not a view. So in the example above, a change to array1 would not make a change to array4.

If you have a multidimensional array, you can specify what axis you want to concatenate along. The axis numbers are interpreted from outer-most to inner-most arrays if you think of them as lists within lists.

import numpy as np

array1 = np.array([[1, 2, 3, 4, 5], [11, 21, 31, 41, 51]])
array2 = np.array([[-1, -2, -3, -4, -5], [-11, -21, -31, -41, -51]])

array3 = np.concatenate((array1, array2), axis=0)
array4 = np.concatenate((array1, array2), axis=1)

print(array3)
print(array4)

The resulting output would be:

[[  1   2   3   4   5]
 [ 11  21  31  41  51]
 [ -1  -2  -3  -4  -5]
 [-11 -21 -31 -41 -51]]

[[  1   2   3   4   5  -1  -2  -3  -4  -5]
 [ 11  21  31  41  51 -11 -21 -31 -41 -51]]

If you specify axis=0, that means you are concatenating along the outside dimension (the rows), so the result has more rows. If you specify axis=1, you are concatenating along the inside dimension (the columns), so each row gets longer. Another way to think of this is that the axis you specify is the only axis whose size changes; every other axis keeps its size. If you specify axis=0, each row keeps its length of 5 while the number of rows grows from 2 to 4. If you specify axis=1, the number of rows stays at 2 while each row grows from 5 to 10 elements.
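
A quick way to check which axis changed is to compare shapes before and after. This sketch uses two small arrays of zeros and ones with the same 2x5 shape as the example above; the variable names are just for illustration.

```python
import numpy as np

array1 = np.zeros([2, 5])
array2 = np.ones([2, 5])

rows_joined = np.concatenate((array1, array2), axis=0)
columns_joined = np.concatenate((array1, array2), axis=1)

print(rows_joined.shape)     # (4, 5): axis 0 grew from 2 to 4
print(columns_joined.shape)  # (2, 10): axis 1 grew from 5 to 10
```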

Stacking arrays

In addition to concatenating, you can stack arrays. Stacking creates a new axis.

import numpy as np

array1 = np.array([1, 2, 3, 4, 5])
array2 = np.array([11, 22, 33, 44, 55])
array3 = np.array([111, 222, 333, 444, 555])

array4 = np.stack((array1, array2, array3)) # stacks along axis=0 by default
# array([[  1,   2,   3,   4,   5],
#        [ 11,  22,  33,  44,  55],
#        [111, 222, 333, 444, 555]])

array5 = np.stack((array1, array2, array3), axis=1)
# array([[  1,  11, 111],
#        [  2,  22, 222],
#        [  3,  33, 333],
#        [  4,  44, 444],
#        [  5,  55, 555]])
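
The difference between stacking and concatenating is easiest to see by comparing shapes. This is a small sketch using two of the 1D arrays from above.

```python
import numpy as np

array1 = np.array([1, 2, 3, 4, 5])
array2 = np.array([11, 22, 33, 44, 55])

joined = np.concatenate((array1, array2))  # still 1D, just longer
stacked = np.stack((array1, array2))       # 2D, with one row per input array

print(joined.shape, stacked.shape)  # (10,) (2, 5)
```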

Splitting arrays

There are also many ways you can split arrays. The np.array_split() function takes an array and the number of chunks you want to split it into, and gives you back a list with that many arrays:

import numpy as np

array1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

array_list = np.array_split(array1, 2)  # [array([1, 2, 3, 4, 5]), array([6, 7, 8, 9, 10])]

The array does not need to divide evenly. If it doesn’t, the resulting arrays will be different sizes. The np.array_split() function also works with multidimensional arrays, and by default splits along the first axis, but you can specify a different axis if you want.
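
Both behaviors are illustrated in the sketch below: an uneven split of a 1D array, and a split of a 2D array along its columns. It uses np.arange(), which builds an array from a range of numbers, only to create some test data.

```python
import numpy as np

array1 = np.arange(1, 11)           # the numbers 1 through 10
chunks = np.array_split(array1, 3)  # 10 elements cannot be split evenly into 3 chunks
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)  # [4, 3, 3]

table = np.arange(12).reshape(3, 4)
left, right = np.array_split(table, 2, axis=1)  # split the columns into two halves
print(left.shape, right.shape)  # (3, 2) (3, 2)
```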

5.4. NumPy operations

Iterating over arrays

We’ve already shown that you can iterate over an array the same way you do a list. This applies to multidimensional arrays as well. If you want to iterate through every dimension, you just need a separate nested for loop for each dimension:

import numpy as np

x = np.ones([2, 3, 4])

counter = 0

for i in range(x.shape[0]):
    counter += 100
    for j in range(x.shape[1]):
        counter += 10
        for k in range(x.shape[2]):
            counter += 1
            x[i, j, k] = counter

print(x)

With the output:

[[[111. 112. 113. 114.]
  [125. 126. 127. 128.]
  [139. 140. 141. 142.]]

 [[253. 254. 255. 256.]
  [267. 268. 269. 270.]
  [281. 282. 283. 284.]]]

Searching NumPy Arrays

You can search for values in a NumPy array and get the indexes where the value occurs:

import numpy as np

array1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10])
indexes1 = np.where(array1 == 3)  # gives you a tuple holding one array of indexes per dimension: (array([2]),)
indexes2 = np.where(array1 == 10)  # (array([ 9, 10]),), so indexes2[0] is the array with [9, 10] in it

array_list = np.array_split(array1, 2)  # split into 2 arrays: [array([1, 2, 3, 4, 5, 6]), array([7, 8, 9, 10, 10])]
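
Because np.where() returns a tuple with one index array per dimension, it helps to see how to unpack it. This sketch reuses array1 from above and adds a small made-up 2D array called grid:

```python
import numpy as np

array1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10])

result = np.where(array1 == 10)  # a tuple containing one array
indexes = result[0]              # array([ 9, 10])
matches = array1[indexes]        # use the indexes to pull the values back out

# For a 2D array you get one index array for rows and one for columns
grid = np.array([[1, 10],
                 [10, 4]])
rows, cols = np.where(grid == 10)
# the matches are at (row 0, column 1) and (row 1, column 0)
```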

You can do the same but looking for min or max values. You can get the value or the index:

import numpy as np
array1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10])

min_value = np.min(array1) # will give you 1
min_index = np.argmin(array1)  # will give you 0, the first index where the min value occurred

max_value = np.max(array1) # will give you 10
max_index = np.argmax(array1)  # will give you 9, the first index where the max value occurred
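
These functions also accept an axis argument for multidimensional arrays. A brief sketch, using a made-up 2x3 array called scores:

```python
import numpy as np

scores = np.array([[3, 9, 4],
                   [8, 2, 7]])

flat_index = np.argmax(scores)      # position of the max in the flattened array
per_row = np.argmax(scores, axis=1)  # column index of the max within each row
column_max = np.max(scores, axis=0)  # largest value in each column
```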

Finding Unique Values and Counting

NumPy’s unique function can find unique values in an array. It can also count how often they show up in the array, if you ask it to. Let’s look at an example:

import numpy as np

# An array with some repeated values
array1 = np.array([5, 2, 5, 2, 2, 8, 8, 3, 3, 3, 3, 5])

# Get unique values only
unique_values = np.unique(array1)  # array([2, 3, 5, 8])

# Get unique values *and* their counts
values, counts = np.unique(array1, return_counts=True)
# values is array([2, 3, 5, 8])
# counts is array([3, 4, 3, 2])  # the number 2 appears 3 times, 3 appears 4 times, etc.


# Works with 2D arrays too
array2 = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
unique_2d = np.unique(array2)  # array([1, 2, 3, 4, 5]) -- notice that this array is 1D, or flat

# Can also get the index where each unique value first appears, at the same time as the counts
values, indices, counts = np.unique(array1, return_index=True, return_counts=True)
# indices is array([1, 7, 0, 5])
# This means that the number 2 first appears at index 1, the number 3 first appears at index 7, etc.
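
One common use of the counts is finding the most frequent value. A minimal sketch, reusing array1 from above (the dictionary at the end is just one convenient way to view the result):

```python
import numpy as np

array1 = np.array([5, 2, 5, 2, 2, 8, 8, 3, 3, 3, 3, 5])

values, counts = np.unique(array1, return_counts=True)

# The position of the largest count tells us which unique value is most common
most_common = values[np.argmax(counts)]

# Pair each value with its count in a regular Python dictionary
count_by_value = {int(v): int(c) for v, c in zip(values, counts)}
```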

Sorting NumPy Arrays

You can sort NumPy arrays, and specify an axis that you want to sort along.

import numpy as np

array1 = np.array([1, 4, 6, 2, 10, 3, 7, 9, 5, 8])
array2 = np.array([[10, 8, 9, 1, 2], [7, 6, 3, 5, 4]])

sorted_array1 = np.sort(array1)

sorted_array2 = np.sort(array2)
# result is sorted_array2 = array([[ 1, 2, 8, 9, 10], [3, 4, 5, 6, 7]])
# sorts the inner-most dimension only, Same as axis=1

sorted_array2_rows = np.sort(array2, axis=0)
# result is sorted_array2_rows = array([[7, 6, 3, 1, 2], [10, 8, 9, 5, 4]])
# sorts the outer-most dimension only, (e.g. compared 10 vs. 7; 8 vs. 6; 9 vs 3; 1 vs. 5; 2 vs. 4)

sorted_array2_columns = np.sort(array2, axis=1)
# result is sorted_array2_columns = array([[ 1, 2, 8, 9, 10], [3, 4, 5, 6, 7]])
# sorts the inner-most dimension only

sorted_array2_flattened = np.sort(array2, axis=None)
# result is sorted_array2_flattened = array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

sorted_array2_both_dims1 = np.sort(np.sort(array2, axis=0), axis=1)
# result is sorted_array2_both_dims1 = array([[ 1, 2, 3, 6, 7], [ 4, 5, 8, 9, 10]])

sorted_array2_both_dims2 = np.sort(np.sort(array2, axis=1), axis=0)
# result is sorted_array2_both_dims2 = array([[ 1, 2, 5, 6, 7], [ 3, 4, 8, 9, 10]])

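Remember that the axis you specify is the one the sort moves along: axis=0 compares values down each column (across rows), axis=1 compares values across each row, and the other axis is effectively left alone, since values never move between columns (for axis=0) or between rows (for axis=1).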
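
One way to check what each axis value did to the example above is to test whether the values increase along that direction. This sketch uses np.diff, which is not covered in this chapter; it simply subtracts each value from its neighbor along the given axis:

```python
import numpy as np

array2 = np.array([[10, 8, 9, 1, 2], [7, 6, 3, 5, 4]])

by_axis0 = np.sort(array2, axis=0)
by_axis1 = np.sort(array2, axis=1)

# After axis=0, every column increases from top to bottom
columns_sorted = bool(np.all(np.diff(by_axis0, axis=0) >= 0))

# After axis=1, every row increases from left to right
rows_sorted = bool(np.all(np.diff(by_axis1, axis=1) >= 0))

# The rows of the axis=0 result are NOT sorted, because only the columns were reordered
rows_of_axis0_sorted = bool(np.all(np.diff(by_axis0, axis=1) >= 0))
```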

Arithmetic, matrix algebra, and statistics with NumPy

NumPy has its own built-in functions for all the same math stuff we covered in the core Python modules (e.g., sin, cos, log, ceiling, floor, etc.). We won’t cover that again, but you can see the NumPy syntax for these here: https://numpy.org/doc/1.24/reference/routines.math.html

Simple matrix arithmetic

If you add, subtract, multiply, or divide a NumPy array by a number, or raise it to a power, NumPy applies the operation to every element:

import numpy as np

array1 = np.array([1, 2, 4, 8, 16])

array2 = array1 + 10  # result is np.array([11, 12, 14, 18, 26])
array3 = array1 - 20  # result is np.array([-19, -18, -16, -12, -4])
array4 = array1 * 20  # result is np.array([20, 40, 80, 160, 320])
array5 = array1 ** 2  # result is np.array([1, 4, 16, 64, 256])

You can also add, subtract, multiply, and divide two arrays of the same shape without needing to use loops (which is much faster in terms of processing time):

import numpy as np

array1 = np.array([[1, 2, 4, 8, 16], [-1, -2, -4, -8, -16]])
array2 = np.array([[1, 2, 3, 4, 5], [-1, -2, -3, -4, -5]])

array3 = array1 + array2  # array([[2, 4, 7, 12, 21], [-2, -4, -7, -12, -21]])
array4 = array1 * array2  # array([[1, 4, 12, 32, 80], [1, 4, 12, 32, 80]])
array5 = array1 / array2  # array([[1., 1., 1.33333333, 2., 3.2], [1., 1., 1.33333333, 2., 3.2]])

Addition, subtraction, multiplication, and division take place element-by-element.
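
To see that the loop-free version gives the same answer as doing it by hand, here is a small sketch that multiplies the two arrays both ways and compares the results (the variable names looped and vectorized are just for illustration):

```python
import numpy as np

array1 = np.array([[1, 2, 4, 8, 16], [-1, -2, -4, -8, -16]])
array2 = np.array([[1, 2, 3, 4, 5], [-1, -2, -3, -4, -5]])

# The slow way: visit every position with nested loops
looped = np.zeros(array1.shape)
for i in range(array1.shape[0]):
    for j in range(array1.shape[1]):
        looped[i, j] = array1[i, j] * array2[i, j]

# The NumPy way: one expression, same element-by-element result
vectorized = array1 * array2

same = np.array_equal(looped, vectorized)
```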

Matrix algebra

In addition to basic arithmetic, you can do matrix algebra like compute inner and outer products and all that fun stuff.

import numpy as np

array1 = np.array([1, 2, 4, 8])
array2 = np.array([-1, 1, 0, 0.5])

dot_product = np.dot(array1, array2)  # result is 5.0 = (1*-1) + (2*1) + (4*0) + (8*0.5)

outer_product = np.outer(array1, array2)
'''
each column is array1 multiplied by each corresponding value of array2
column1 = [1, 2, 4, 8] * -1, column2 = [1, 2, 4, 8] * 1, column3 = [1, 2, 4, 8] * 0, column4 = [1, 2, 4, 8] * 0.5
array([[-1. ,  1. ,  0. ,  0.5],
       [-2. ,  2. ,  0. ,  1. ],
       [-4. ,  4. ,  0. ,  2. ],
       [-8. ,  8. ,  0. ,  4. ]])
'''

There is of course much much more matrix algebra you can do, which you can see here: https://numpy.org/doc/1.24/reference/routines.linalg.html

Statistics

What NumPy really excels at are all the operations you might want to compute on a vector of numbers. We can compute means and standard deviations (stdevs) just like with lists, but it’s much faster.

import numpy as np

array1 = np.array([[1, 11, 21, 31, 41],
                   [2, 22, 42, 62, 82],
                   [3,  6,  9, 12, 15],
                   [6, 12, 18, 24, 30],
                   [2,  4,  8, 16, 32],
                   [4,  8, 16, 32, 64]])

print(array1.sum())  # 636, the sum of the entire matrix
print(array1.sum(axis=0))  # [18 63 114 177 264], the sum of the columns
print(array1.sum(axis=1))  # [105 210  45  90  62 124], the sum of the rows

print(array1.mean())  # 21.2, the mean of the entire matrix
print(array1.mean(axis=0))  # [3. 10.5 19. 29.5 44.], the mean of the columns
print(array1.mean(axis=1))  # [21. 42.  9. 18. 12.4 24.8], the mean of the rows

print(array1.std())  # 19.86353442869622, the standard deviation of the entire matrix
print(array1.std(axis=0))  # [1.63299316 5.82380174 11.28420725 16.2455122 22.4870333], the stdev of the columns
print(array1.std(axis=1))  # [14.14213562 28.28427125 4.24264069 8.4852814 10.9105454 21.8210907] the stdev of the rows

print(np.corrcoef(array1))  # computes the correlation coefficient of every row with every other row in a numpy matrix
print(np.polyfit(array1[:, 0], array1[:, 1], 1))  # computes a line of best fit (column 0 as x, column 1 as y)

The np.corrcoef() function gives you back a matrix with 6 rows and 6 columns, because there were 6 rows in the original matrix. Each row of the result shows you the correlation of that row from the original matrix, with every other row.
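
A smaller example makes the shape of the result easier to see. In this sketch the three rows are made up so that the correlations are obvious: the second row is double the first, and the third row runs in the opposite direction:

```python
import numpy as np

data = np.array([[1, 2, 3, 4],
                 [2, 4, 6, 8],
                 [4, 3, 2, 1]])

r = np.corrcoef(data)  # 3 rows in, so a 3x3 matrix out

row0_vs_row1 = r[0, 1]  # close to 1.0, the rows rise together
row0_vs_row2 = r[0, 2]  # close to -1.0, one rises while the other falls
row0_vs_itself = r[0, 0]  # the diagonal is each row correlated with itself
```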

The np.polyfit() function takes two 1D arrays and calculates the line or curve that best fits them if you imagine them as x-values and y-values on a scatterplot (like a correlation). The third argument is the degree of the polynomial used to fit the data. Use 1, as above, for a line (like linear regression): y = b0 + b1*x. A 2 would be a quadratic function: y = b0 + b1*x + b2*x**2. A 3 would be a cubic function: y = b0 + b1*x + b2*x**2 + b3*x**3. Polyfit returns the values of the parameters (the set of b variables) that best fit the data, ordered from the highest power down to the constant. For a line, you get back two values (b1, then b0); for a quadratic function you get 3; for a cubic function you get 4; etc.
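
Here is a short worked sketch using made-up data that lies exactly on the line y = 2x + 1, so we know what answer to expect. Note that np.polyfit() returns the coefficients starting with the highest power, so for a line the slope comes first and the intercept second. np.polyval() (not covered above) plugs x-values back into the fitted polynomial:

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4])
y = 2 * x + 1  # points that lie exactly on a line

slope, intercept = np.polyfit(x, y, 1)  # approximately 2.0 and 1.0

# Use the fitted coefficients to predict y for each x
predicted = np.polyval([slope, intercept], x)

# A degree-2 fit returns three coefficients: for x**2, for x, and the constant
quad = np.polyfit(x, x**2, 2)  # approximately [1, 0, 0]
```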

Random numbers in NumPy

The last issue we want to cover with NumPy is how to use it to generate random numbers. We find this really useful when doing simulations of all sorts.

Remember that you can create a NumPy array full of zeros or ones? NumPy random numbers basically work the same way, except that instead of a constant value, you can use random numbers.

import numpy as np

array1 = np.random.random(20)  # generates an array of 20 random floats between 0 and 1
array2 = np.random.random(20) * 10  # generates an array of 20 random floats between 0 and 10
array3 = np.random.randint(1, 7, size=[10, 20])  # generates a 10x20 matrix of random integers 1 through 6 (the upper bound is not included)
array4 = np.random.normal(100, 10, size=[4,10])  # generates a 4x10 matrix of random floats with mean=100, std=10
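
Since the output is random, a good habit is to check the range of what you generated. This sketch (with illustrative variable names) confirms that random() never reaches 1 and that randint() excludes its upper bound. It also shows one way to get floats in an arbitrary range, here 5 up to 10:

```python
import numpy as np

floats = np.random.random(1000)            # 0 <= value < 1
dice = np.random.randint(1, 7, size=1000)  # integers 1 through 6, 7 is excluded

# Stretch and shift the 0-to-1 floats to cover a different range: low + (high - low) * value
low, high = 5, 10
scaled = low + (high - low) * np.random.random(1000)

print(floats.min(), floats.max())
print(dice.min(), dice.max())
print(scaled.min(), scaled.max())
```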

You can set the seed for the random number generator, which is useful if you want to generate the same random numbers every time you run your code.

import numpy as np

np.random.seed(12345)
array1 = np.random.random(20)
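
To convince yourself that the seed really does reproduce the same numbers, you can reset it and generate again. A minimal sketch:

```python
import numpy as np

np.random.seed(12345)
first = np.random.random(5)

np.random.seed(12345)  # reset to the same seed
second = np.random.random(5)

same = np.array_equal(first, second)  # the two arrays match exactly

third = np.random.random(5)  # no reset this time, so the sequence just continues
differs = not np.array_equal(first, third)
```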

You can also create separate random number generators for different parts of your code.

import numpy as np

rng = np.random.default_rng(12345)
array1 = rng.random(20)

rng2 = np.random.default_rng(678910)
array2 = rng2.random(20)
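
Generator objects have their own versions of the functions shown earlier. In this sketch, rng_a and rng_b are illustrative names; two generators built from the same seed produce the same numbers, and integers() plays the role of randint() (its upper bound is also excluded by default):

```python
import numpy as np

rng_a = np.random.default_rng(12345)
rng_b = np.random.default_rng(12345)

a = rng_a.random(5)
b = rng_b.random(5)
same = np.array_equal(a, b)  # same seed, same numbers

rolls = rng_a.integers(1, 7, size=10)         # random integers 1 through 6
scores = rng_a.normal(100, 10, size=(4, 10))  # 4x10 matrix with mean=100, std=10
```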

We can also “clip” the random numbers to be within a certain range, ensuring that there are no numbers that fall below a certain value or above another certain value.

import numpy as np

array1 = np.random.randint(0, 10, size=[10, 20])
array2 = np.clip(array1, 2, 7) # clips all values in array1 to be between 2 and 7
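
With fixed (non-random) input it is easier to see exactly what clipping does. In this sketch, anything below 2 becomes 2, anything above 7 becomes 7, and values in between are unchanged:

```python
import numpy as np

values = np.array([-3, 0, 2, 5, 7, 9, 12])
clipped = np.clip(values, 2, 7)  # array([2, 2, 2, 5, 7, 7, 7])
```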

There are many more random distributions and functions, which you can see here: https://numpy.org/doc/1.16/reference/routines.random.html

5.5. Image processing

In my day to day work, I do a lot of image processing. One of my go-to tools is the Python Imaging Library (PIL), now maintained as the Pillow fork. This is a powerful library for opening, manipulating, and saving image files. It supports many common image formats including PNG, JPEG, BMP, and TIFF, and generally makes working with images in Python much easier.

Installing PIL

First, you’ll need to install Pillow using uv. Copy and paste the command below into your terminal and run it:

uv add pillow

Opening and displaying images

To work with images in PIL, first import the library and open an image. Note that the library is not called “pillow” but “PIL” when importing.

from PIL import Image

# Open an image file
img = Image.open("path/to/your/image.jpg")

# Display basic image information
print(f"Format: {img.format}")
print(f"Size: {img.size}")
print(f"Mode: {img.mode}")

# Display the image (opens in your default image viewer)
img.show()

How are images represented in memory?

Your computer represents images as a grid of pixels, which are the individual squares that make up the image. Images are, of course, two-dimensional, but each pixel also has a color, which is a combination of red, green, and blue (RGB) values. This means that a given image is really just a grid of numbers that can be represented as a matrix. This matrix is an M x N x 3 tensor, where M is the height of the image, N is the width, and 3 is for the different RGB values. So an image that is 100 pixels wide and 100 pixels tall is represented as a matrix that is 100 x 100 x 3 values in size.
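
You can see this layout for yourself by converting an image to a NumPy array. This sketch creates a tiny solid red image in memory (so no file is needed) and inspects its shape. Note that PIL describes sizes as (width, height), while the array shape is (height, width, 3):

```python
from PIL import Image
import numpy as np

# A tiny image: 4 pixels wide, 2 pixels tall, every pixel pure red
img = Image.new("RGB", (4, 2), color=(255, 0, 0))

pixels = np.array(img)
print(pixels.shape)  # (2, 4, 3): height, width, RGB
print(pixels.dtype)  # uint8, meaning whole numbers from 0 to 255
print(pixels[0, 0])  # the top-left pixel: [255   0   0]
```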

Basic image operations

PIL provides many basic operations for image manipulation, including resizing, rotating, cropping, and converting to grayscale.

from PIL import Image

# Open image
img = Image.open("input.jpg")

# Resize image
resized_img = img.resize((800, 600))  # specify width, height

# Rotate image
rotated_img = img.rotate(90)  # degrees, counterclockwise

# Crop image
# Specify left, top, right, bottom coordinates
# The top left corner is (0, 0), while the bottom right corner is (width, height)
cropped_img = img.crop((100, 100, 400, 400))

# Convert to grayscale -- L means grayscale
grayscale_img = img.convert("L")

# Flip the image
flipped_horizontal = img.transpose(Image.FLIP_LEFT_RIGHT)
flipped_vertical = img.transpose(Image.FLIP_TOP_BOTTOM)

# Save modified images
resized_img.save("resized.jpg")
rotated_img.save("rotated.jpg")
cropped_img.save("cropped.jpg")
grayscale_img.save("grayscale.jpg")
flipped_horizontal.save("flipped_horizontal.jpg")
flipped_vertical.save("flipped_vertical.jpg")

Image enhancement and filters

PIL includes various filters and enhancement options that would be familiar to anyone who has used a photo editor like Photoshop or GIMP.

from PIL import Image, ImageEnhance, ImageFilter

img = Image.open("input.jpg")

# Adjust brightness
enhancer = ImageEnhance.Brightness(img)
brightened = enhancer.enhance(1.5)  # 1.0 is original, <1 darkens, >1 brightens

# Adjust contrast
enhancer = ImageEnhance.Contrast(img)
increased_contrast = enhancer.enhance(1.5)

# Apply blur filter
blurred = img.filter(ImageFilter.BLUR)

# Apply edge detection
edges = img.filter(ImageFilter.FIND_EDGES)

# Sharpen image
sharpened = img.filter(ImageFilter.SHARPEN)

# Save enhanced images
brightened.save("bright.jpg")
increased_contrast.save("contrast.jpg")
blurred.save("blur.jpg")
edges.save("edges.jpg")
sharpened.save("sharp.jpg")

Drawing on images

You can also draw shapes and text directly onto images using PIL.

from PIL import Image, ImageDraw, ImageFont

# Create a new image with white background
img = Image.new("RGB", (400, 300), color="white")

# Create drawing object
draw = ImageDraw.Draw(img)

# Draw shapes
draw.rectangle([100, 100, 300, 200], outline="black", fill="red")
draw.ellipse([150, 50, 250, 150], outline="blue", fill="yellow")
draw.line([0, 0, 400, 300], fill="green", width=3)

# Add text
try:
    # Try to load a system font
    font = ImageFont.truetype("Arial.ttf", 36)
except OSError:
    # The font file could not be found or read, so fall back to the default font
    font = ImageFont.load_default()

draw.text((50, 250), "Hello, PIL!", fill="black", font=font)

# Save the result
img.save("drawing.jpg")

You could imagine using this to add text to images in a loop, for example, to add a timestamp to each image in a series, or perform other forms of watermarking. You could even make your own meme generator — the possibilities are endless!
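
A rough sketch of that loop idea follows. To keep it self-contained, it stamps a label onto blank images created in memory rather than real photos; the labels, sizes, and text position are all made up for illustration:

```python
from PIL import Image, ImageDraw, ImageFont

labels = ["frame 1", "frame 2", "frame 3"]  # stand-ins for real timestamps
font = ImageFont.load_default()

stamped = []
for label in labels:
    # In real use you would open an existing image here instead
    frame = Image.new("RGB", (200, 100), color="white")
    draw = ImageDraw.Draw(frame)
    draw.text((10, 80), label, fill="black", font=font)  # near the bottom-left corner
    stamped.append(frame)
```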

Working with image data

If you need to make targeted edits to an image, you can do that as well by accessing and modifying image data directly.

from PIL import Image
import numpy as np

# Open image and convert to numpy array
img = Image.open("input.jpg")
img_array = np.array(img)

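# Modify pixel values
# Example: Make image darker by reducing all RGB values by 50
# The array holds unsigned 8-bit integers (0 to 255), so convert to a wider type first;
# otherwise values below 50 would wrap around to large numbers instead of going negative
darker_array = np.clip(img_array.astype(np.int16) - 50, 0, 255)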

# Convert back to PIL Image
darker_img = Image.fromarray(darker_array.astype("uint8"))
darker_img.save("darker.jpg")

# Get individual pixel values
pixel = img.getpixel((100, 100))  # Gets pixel at x=100, y=100
print(f"Pixel value at (100,100): {pixel}")
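
Because the image data is an ordinary NumPy array, slicing lets you edit just one region. This sketch uses a small gray image created in memory and paints its top-left block red; the sizes and colors are arbitrary:

```python
from PIL import Image
import numpy as np

img = Image.new("RGB", (6, 4), color=(200, 200, 200))  # 6 wide, 4 tall, light gray
arr = np.array(img)  # shape is (4, 6, 3): rows, columns, RGB

# Rows 0-1 and columns 0-2 become pure red; everything else is untouched
arr[:2, :3] = [255, 0, 0]

edited = Image.fromarray(arr)
top_left = edited.getpixel((0, 0))      # (x, y) order, so this is column 0, row 0
bottom_right = edited.getpixel((5, 3))  # column 5, row 3
```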

Common pitfalls and tips

Keep in mind that when you open() something in Python (or anything, really), you’re using system resources to keep it open. You need to close() it when you’re done with it to free up those resources so your computer doesn’t run as slow as molasses:

img = Image.open("input.jpg")
# ... process image ...
img.close()

You can use context managers to help you with this, as they automatically clean up after themselves:

with Image.open("input.jpg") as img:
    # ... process image ...
    img.save("output.jpg")

# image is automatically closed when the block is exited!

Some methods modify the original object in place (like my_list.append(new_item)), while others return a new object (like sorted(my_list)). Most PIL operations return new Image objects and don’t modify the original:

# This won't work as expected:
img = Image.open("input.jpg")
img.rotate(45)  # This creates a new image but doesn't store it in a variable, so it's lost
img.save("rotated.jpg")  # Oops! Saves the original, unrotated image

# Do this instead:
img = Image.open("input.jpg")
rotated = img.rotate(45)  # Store the new image in a new variable
rotated.save("rotated.jpg")  # This saves the rotated image
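
You can confirm this behavior without any files by checking sizes before and after. In this sketch, expand=True asks rotate() to grow the canvas to fit the rotated image, which makes the difference between the two objects easy to see:

```python
from PIL import Image

img = Image.new("RGB", (40, 20), "white")  # 40 wide, 20 tall

rotated = img.rotate(90, expand=True)

print(img.size)        # still (40, 20), the original is unchanged
print(rotated.size)    # (20, 40), the new image is taller than it is wide
print(rotated is img)  # False, they are two separate objects
```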

PIL is a powerful library with many more features than covered here. For more detailed information, you can refer to the official documentation: https://pillow.readthedocs.io/

5.6. Lab 5

As you go along with this lab, note that some of the questions will require you to jump back and forth between different functions, as well as your main function. This is because an important part of writing scripts is connecting things up, which may mean working on different functions at the same time.

def q1():
    print("\n######## Question 1 ########\n")
    """
    What are three similarities, and three differences, between python lists and NumPy arrays?
    Print your answer in this function.
    """


def q2():
    print("\n######## Question 2 ########\n")
    """
    What is a random seed, and why would you want to use one? Print your answer in this function.
    """


def q3():
    print("\n######## Question 3 ########\n")
    """
    Write code that ranks each item in the list below by how much memory it takes up in your computer, and
    prints them out in order from smallest to biggest, like this:
    4.5, 24
    4.6789999, 24
    1, 28
    1981, 28
    dog, 52
    [], 56
    a big dog, 58
    {}, 64
    ['a', 'list', 'inside', 'a', 'list'], 104
    """
    list_of_many_types = [
        1,
        1981,
        4.5,
        4.6789999,
        "dog",
        "a big dog",
        [],
        ["a", "list", "inside", "a", "list"],
        {},
        [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
    ]


def q4():
    print("\n######## Question 4 ########\n")
    """
    Change the definitions of array2, array3, array4, and array5 to generate the results described below.
    Do not "hard code" the definitions, instead use code that creates a view of array1
    """
    array1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
    array2 = None  # a tuple containing the number of rows and columns in array1
    array3 = None  # only column 0 from array1
    array4 = None  # the same data as array1, but reformatted into a matrix with 6 rows and 2 columns
    array5 = None  # array1 flattened into a single 12-element array instead of a matrix of rows and columns

    print(f"array1:\n{array1}\n")
    print(f"array2:\n{array2}\n")
    print(f"array3:\n{array3}\n")
    print(f"array4:\n{array4}\n")
    print(f"array5:\n{array5}\n")


def q5():
    print("\n######## Question 5 ########\n")
    """
    - rename this function (and the function call) "check_data()"
    - write code in your main function that lets you pass two arguments into the program from the command line.
    - save the values that are passed into the main function as "num_students" and "num_assignments"
    - pass those variables to this function
    - in this function, check to make sure they are positive integers
    - if so, print "Num Students: X, Num Assignments: Y", where X and Y are the values passed in
    - if you encounter an error, meaning that the user inputs something that is not an integer
      or is not a positive integer,
      then print an error message and use code to quit the program
    """


def q6():
    print("\n######## Question 6 ########\n")
    """
        - rename this function (and the function call) "generate_data()"
        - make this function take two input arguments: num_students, and num_assignments
        - pass num_students, and num_assignments into this function from your main function
        - in this function, create a numpy matrix called student_data with:
            - num_students as the number of rows (don't hard code it, get it from the variable)
            - num_assignments as the number of columns (don't hard code it)
            - fill the matrix with randomly generated data that:
                - has a mean of 80
                - a standard deviation of 10
                - and no values that are < 0 or > 100.
            - create a print statement that says "Generated simulated data with X students and Y assignments", where
                X and Y are the values of those variables
        - return student_data to the main function
    """


def q7():
    print("\n######## Question 7 ########\n")
    """
        - rename this function (and the function call) to "calculate_means()"
        - pass in the student_data matrix from the previous question
        - use numpy, and no loops, to calculate each student's mean score, best, and worst score, save them as variables
        - combine the students' means, best, and worst scores into a single matrix with num_students rows and 3 columns,
            called "student_results"
        - use numpy, and no loops, to calculate each assignment's mean, best, and worst score, save them as variables
        - combine the assignments' means, best, and worst scores into a single matrix with num_assignments rows and 3
            columns, called "assignment_results"
        - print out the two resulting variables by uncommenting the lines below
    """
    print("Student Scores")
    # print(student_results)
    print("Assignment Results")
    # print(assignment_results)


def q8():
    print("\n######## Question 8 ########\n")
    """
    STEP 1
    - use the random module to create "list1", a list of 10_000 (that's ten thousand)
    uniformly distributed random integers between -100 and +100
    - convert list1 to a numpy array, called array1

    STEP 2
    - create "array2" using the numpy function that creates a numpy array of 10_000 (ten thousand)
    uniformly distributed random integers between -100 and +100 (i.e. the same as array1, just created a different way)

    STEP 3
    - write code using the time module that keeps track of how long STEP 1 and STEP 2 took to execute, and prints
        out the result:
            Core Python Random: X sec.
            Numpy Random: X sec.

    STEP 4:
        return list1, array1 and array2 back to the main function
    """


def q9():
    print("\n######## Question 9 ########\n")
    """
    STEP 1
        pass array1 and array2 from the previous question into this function

    STEP 2
        - write a for loop that computes the dot product of both arrays, i.e. the sum of the product of each
            corresponding pair

    STEP 3
        - use the built in numpy function to compute the dot product of both arrays

    STEP 4
    - write code using the time module that keeps track of how long STEP 2 and STEP 3 took to execute, and prints
        out the result:
            Core Python Dot Product: X sec.
            Numpy Dot Product: X sec.
    """


def q10():
    print("\n######## Question 10 ########\n")
    """
    STEP 1
        pass list1 and array1 from Q8 into this function

    STEP 2
        - use list's sort method to sort list1

    STEP 3
        - use numpy's sort function to sort array1

    STEP 4
    - write code using the time module that keeps track of how long STEP 2 and STEP 3 took to execute, and prints
        out the result:
            Core Python Sort: X sec.
            Numpy Sort: X sec.
    """


def main():
    q1()
    q2()
    q3()
    q4()
    q5()
    q6()
    q7()
    q8()
    q9()
    q10()


if __name__ == "__main__":
    main()