Chapter 14 — Data Analysis with Pandas

14.0. Data analysis and visualization

Python has two very popular modules for data analysis and visualization. The pandas module is a popular Python library that helps users manipulate and analyze data. matplotlib is a module for creating plots and visualizing data. For people who have used R, pandas and matplotlib replicate a lot of the functionality of R (except for building complex statistical models, though there are other Python modules to do that).

Installing pandas and matplotlib

Let’s start with installation. Assuming that you followed the instructions all the way back in Chapter 0.4, matplotlib and pandas are likely installed. Even if they aren’t, you can install them easily with uv. Go to your terminal window (or command prompt on Windows) and enter the following command:

uv add pandas matplotlib

Understanding pandas

The value of matplotlib is pretty obvious: making graphs and figures can be very useful. But what does pandas do? To better understand the value of pandas, let’s consider some common tasks researchers in cognitive science might encounter and how pandas can help.

The main thing to know about pandas is that it gives us a new data structure to work with, the DataFrame. A pandas DataFrame is a versatile, two-dimensional, tabular data structure. Some of its key features include the fact that both rows and columns have labels. In such a DataFrame, each column is a pandas Series, which is a one-dimensional labeled array capable of holding any data type, such as integers, floats, and strings. So what are its advantages?

  • Data Organization: So far we have dealt with many datasets that we have had to store in matrices or lists of lists. pandas DataFrames give us many methods for inputting and organizing data, so we don’t have to write the code to do it ourselves.
  • Data Cleaning: Suppose a researcher is working with a dataset containing reaction times for participants in a study, but some reaction times are missing. pandas can be used to fill in those missing values with an average reaction time (aka data imputation) or remove the rows with missing information altogether, ensuring a clean dataset for further analysis.
  • Data Combining: In another scenario, the researcher might have separate datasets with information about participants’ demographics and their performance in a memory task. pandas makes it easy to merge these two datasets, connecting participants with their respective task results.
  • Data Filtering: When it comes to filtering data, pandas is equally helpful. Imagine a dataset that includes details about participants, such as their age and scores in a cognitive test. The researcher can use pandas to extract data for participants who scored above a certain threshold while also being within a specific age range, allowing for more targeted analysis.
  • Data Summary: Data summarization is another area where pandas shines. For instance, a dataset might include responses from multiple participants in a survey about learning strategies. The researcher can use pandas to group the data by strategy and calculate the average effectiveness rating for each learning technique.
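As a quick preview of what these operations look like in practice, here is a tiny hand-built DataFrame. The column names and values are invented for illustration; we'll work with a real dataset in the next section.

```python
import pandas as pd

# Build a small DataFrame from a dictionary mapping column names to lists of values
df = pd.DataFrame({
    "Word": ["dog", "cat", "run"],
    "Lexical_Class": ["nouns", "nouns", "verbs"],
    "Score": [0.8, None, 0.5],
})

clean = df.dropna()                                  # cleaning: drop the row with the missing Score
nouns = df[df["Lexical_Class"] == "nouns"]           # filtering: keep only the nouns
means = df.groupby("Lexical_Class")["Score"].mean()  # summarizing: mean Score per class
```

Each of these one-liners would take a loop or two of hand-written code if the data were stored as a list of lists; we'll unpack all of them over the course of this chapter.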

14.1. Loading and saving DataFrames

Loading files into a pandas DataFrame

Once you have pandas installed, you’ll need a dataset to work with. We will be using a dataset collected in order to understand children’s vocabulary development. It’s stored as a single CSV file, and has a lot of different variables represented as columns (e.g., the frequencies of words, the percentage of kids who say those words). The data can be found here. Here is what the first few lines of the file look like:

Age,Word,Lexical_Class,MCDIp,LogFreq,ProKWo
16,a,function_words,0.035564854,4.432119101,0.093587561
17,a,function_words,0.043956044,4.466571072,0.106887602
18,a,function_words,0.07918552,4.535028144,0.171016303
19,a,function_words,0.1036585,4.580457373,0.220025328
20,a,function_words,0.1167883,4.635242405,0.243169211
21,a,function_words,0.1773399,4.677962637,0.273148016
22,a,function_words,0.28,4.714002658,0.379545093

Each row of data represents a single word at a single age, and the values in the columns are:

  1. Age: the age of the children, in months
  2. Word: the word that the rest of the data is about
  3. Lexical_Class: the grammatical part of speech of that word
  4. MCDIp: The percentage of children who, at that age, say that word
  5. LogFreq: the cumulative log frequency of that word in a dataset of speech to children, up to that age
  6. ProKWo: the percentage of that word’s co-occurrences that, at that age, are estimated to be with other words that children of that age already know.

The dataset has data about 500 words at 15 different ages (16 to 30 months), resulting in 7500 rows (plus the header row).

Previously, we might have used the csv module or written our own code to read that in as a list of lists, or maybe as a dictionary. The pandas module provides various functions to load different types of data, such as CSV, Excel, JSON, and SQL datasets. To load CSV data into a DataFrame, use the read_csv() function as shown below:

import pandas as pd
file_name = "/path/to/predict_mcdi.csv"  # replace with the path to the file on your computer
mcdi_df = pd.read_csv(file_name)

Here, pd.read_csv calls the read_csv function from the pandas library (which is imported with the alias pd). The CSV file predict_mcdi.csv is passed as an argument to this function, and the returned output is assigned to the variable mcdi_df.

We can see what the data actually looks like if we just print the data structure:

print(mcdi_df)

If you run the above code, you will see:

      Age    Word   Lexical_Class     MCDIp   LogFreq    ProKWo
0      16       a  function_words  0.035565  4.432119  0.093588
1      17       a  function_words  0.043956  4.466571  0.106888
2      18       a  function_words  0.079186  4.535028  0.171016
3      19       a  function_words  0.103659  4.580457  0.220025
4      20       a  function_words  0.116788  4.635242  0.243169
...   ...     ...             ...       ...       ...       ...
7495   26  zipper           nouns  0.629834  2.363612  0.571920
7496   27  zipper           nouns  0.610256  2.374748  0.595180
7497   28  zipper           nouns  0.769134  2.385606  0.724979
7498   29  zipper           nouns  0.671717  2.406540  0.668059
7499   30  zipper           nouns  0.820339  2.416641  0.775563

[7500 rows x 6 columns]

You can see that pandas shows us a line number for each row of data, gives a preview of the first and last rows, and finishes by telling us how many rows and columns are in our dataset.

Specifying load options

The read_csv function can take a bunch of other optional parameters which may be useful depending on the nature of the dataset. Here’s an example of using some of these options with read_csv:

mcdi_df = pd.read_csv(
    "predict_mcdi.csv",  # the input file name, the only required argument
    sep=",",             # the delimiter separating fields in the file; "," is the default
    header=0,            # the row number to use as column names; the default is 0 (the first row)
    index_col=None,      # the column to use as the index (row labels) of the DataFrame
    skiprows=None,       # the number of rows to skip from the start of the file
    nrows=150,           # the number of rows to read from the file
    names=None,          # a list of column names to use if the file doesn't contain a header row
    dtype={'Word': str, 'MCDIp': float, 'LogFreq': float, 'ProKWo': float}  # explicit column data types
)

For a complete list of options, you can refer to the pandas documentation on read_csv: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Handling different file formats

pandas can load data from a variety of file formats, such as Excel, JSON, and SQL. To load data from these formats, you can use the corresponding functions, as shown below:

my_df = pd.read_excel("some_excel_file.xlsx") # Loading Excel data
my_df = pd.read_json("some_json_file.json") # Loading JSON data

For more information on loading data from different formats, refer to the pandas documentation on I/O tools: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Saving a DataFrame to file

After processing and analyzing the data, you may want to save the resulting DataFrame to a file. pandas provides the to_csv(), to_excel(), and other methods for this purpose.

# Save DataFrame to a CSV file
mcdi_df.to_csv('output.csv', index=False) # index=False means we don't want to save the row numbers themselves as a column

# Save DataFrame to an Excel file
mcdi_df.to_excel('output.xlsx', index=False)

14.2. Exploring DataFrames

Looking at the DataFrame

Now that we have loaded our dataset, let’s start exploring it. One of the first steps in any data analysis begins with understanding the nature of the dataset. If you recall, we loaded our dataset in the last section like so:

import pandas as pd
file_name = "predict_mcdi.csv"  # path to the file, relative to where this script is located
mcdi_df = pd.read_csv(file_name)
print(mcdi_df)

With the output:

      Age    Word   Lexical_Class     MCDIp   LogFreq    ProKWo
0      16       a  function_words  0.035565  4.432119  0.093588
1      17       a  function_words  0.043956  4.466571  0.106888
2      18       a  function_words  0.079186  4.535028  0.171016
3      19       a  function_words  0.103659  4.580457  0.220025
4      20       a  function_words  0.116788  4.635242  0.243169
...   ...     ...             ...       ...       ...       ...
7495   26  zipper           nouns  0.629834  2.363612  0.571920
7496   27  zipper           nouns  0.610256  2.374748  0.595180
7497   28  zipper           nouns  0.769134  2.385606  0.724979
7498   29  zipper           nouns  0.671717  2.406540  0.668059
7499   30  zipper           nouns  0.820339  2.416641  0.775563

[7500 rows x 6 columns]

So what exactly have we created? Let’s look at the output of some other print statements:

print(type(mcdi_df))

Output:

<class 'pandas.core.frame.DataFrame'>

The output shows that the variable mcdi_df is a pandas DataFrame object.

We can use .head() to just preview the top of the data:

print(mcdi_df.head(3))

Output:

   Age Word   Lexical_Class     MCDIp   LogFreq    ProKWo
0   16    a  function_words  0.035565  4.432119  0.093588
1   17    a  function_words  0.043956  4.466571  0.106888
2   18    a  function_words  0.079186  4.535028  0.171016

The number you put inside .head() specifies how many rows you want to see. The default is 5.

We can also get a simple look at our columns with the .dtypes attribute (note: no parentheses, since it is an attribute rather than a method):

print(mcdi_df.dtypes)

Output:

Age                int64
Word              object
Lexical_Class     object
MCDIp            float64
LogFreq          float64
ProKWo           float64
dtype: object

The .dtypes attribute tells us the data type of each column. The object datatype is a catch-all type that may hold strings, numbers, or a mix of both. Remember that you can specify each column's data type when you create the DataFrame. This matters because certain operations can only be performed on data of a certain type.
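For example, if a numeric column ends up stored as object (say, because the numbers were read in as strings), you may need to convert it before computing statistics. A minimal sketch with made-up values:

```python
import pandas as pd

# A column of numbers that was read in as strings (dtype "object")
df = pd.DataFrame({"MCDIp": ["0.1", "0.2", "0.3"]})

# Convert the column to floats so numeric operations work on it
df["MCDIp"] = df["MCDIp"].astype(float)
mean_val = df["MCDIp"].mean()
```

Before the conversion, calling .mean() on the string column would not give a meaningful numeric result; afterwards it works as expected.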

We can get a more advanced look at our dataset with the .info() method:

print(mcdi_df.info())

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Age            7500 non-null   int64
 1   Word           7500 non-null   object
 2   Lexical_Class  7500 non-null   object
 3   MCDIp          7500 non-null   float64
 4   LogFreq        7485 non-null   float64
 5   ProKWo         7485 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 351.7+ KB
None

The .info() method gives us a lot more information about the DataFrame. It tells us the data structure's type (<class 'pandas.core.frame.DataFrame'>), the number of rows and columns, and the data type of each column. It also tells us whether we have any null (missing) data in any of our columns, and how much memory the DataFrame takes up. (The trailing None is just the return value of .info(), which prints its report directly rather than returning it.)

Here, the Non-Null Count column in the output shows that columns 0-3 (Age, Word, Lexical_Class, MCDIp) have 7500 non-null values, which suggests that these columns have no missing data. However, columns 4-5 (LogFreq and ProKWo) have non-null counts of only 7485, which means that these columns each have 15 missing values.

We will address how to handle missing data along with a host of operations in the next section.

Data selection

When working with DataFrame objects, you may want to focus on specific columns in order to calculate mean values or to specify values to be plotted in a graph. pandas allows you to select one or multiple columns easily, resulting in a new dataframe with only the specified data.

# Select a single column
mcdi_words = mcdi_df['Word']
print(mcdi_words.head())

# Select multiple columns
mcdi_word_fq = mcdi_df[['Word', 'LogFreq']]
print(mcdi_word_fq.head())

The first command creates a pandas Series (since a single column is one-dimensional), while the second creates a new DataFrame containing only the subsetted columns. Note the double brackets in the second command: we are passing a list of column names.

There are other methods for selecting columns using more complex rules. There might also be cases when you need to examine certain rows. The pandas module provides a simple way to select specific rows or ranges of rows by position using the .iloc[] indexer.

#select the first row
first_row = mcdi_df.iloc[0]
print(first_row)

#select the first 5 rows
first_five_rows = mcdi_df.iloc[0:5] # 0:5 means we want rows 0 through 4
print(first_five_rows)

The .loc[] function works very similarly, but you can specify row or column names instead of index numbers.
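To make the difference concrete, here is a small sketch using .loc[] with labels. The row labels here are invented; our mcdi_df just uses the default numeric index, but a DataFrame can be given any labels you like.

```python
import pandas as pd

# A small DataFrame with explicit row labels instead of the default 0, 1, ...
df = pd.DataFrame(
    {"LogFreq": [4.43, 2.42], "MCDIp": [0.04, 0.82]},
    index=["a", "zipper"],
)

row = df.loc["zipper"]            # select a row by its label
cell = df.loc["zipper", "MCDIp"]  # select a single cell by row and column label
cols = df.loc[:, ["MCDIp"]]       # select every row, but only the MCDIp column
```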

In addition to subsetting the dataframe by referencing specific columns or row indices, pandas allows you to subset rows based on a specific condition. Say we wanted to only analyze the word statistics of “nouns”. We can do so like this:

# two ways to do the same thing
mcdi_noun_data1 = mcdi_df.query('Lexical_Class == "nouns"')
mcdi_noun_data2 = mcdi_df.loc[mcdi_df['Lexical_Class'] == "nouns"]

In both cases we specify the column Lexical_Class and the value we wish to match, nouns. The first method uses what is called a “query string” to produce a new DataFrame with only the data that matches the query. The second method uses a technique called boolean indexing. The inner part (mcdi_df['Lexical_Class'] == "nouns") returns a boolean entry for each row (True or False) corresponding to whether the row matches the condition. Passing the resulting boolean Series (a single column of True and False values) to .loc[] then gives us back only the rows where the value was True.

Here are some examples of other queries and boolean indexing you might use to subset our current dataset:

# Single Condition: Select rows where word frequency is greater than 2
high_freq = mcdi_df.loc[mcdi_df['LogFreq'] > 2]

# Multiple Conditions (AND): select rows where the word frequency is greater than 2, and the lexical class is 'nouns':
high_freq_nouns = mcdi_df.loc[(mcdi_df['LogFreq'] > 2) & (mcdi_df['Lexical_Class'] == 'nouns')]

# Multiple Conditions (OR): select rows where the lexical class is either 'nouns' or 'verbs':
nouns_or_verbs = mcdi_df.loc[(mcdi_df['Lexical_Class'] == 'nouns') | (mcdi_df['Lexical_Class'] == 'verbs')]

# Negation: Select rows where the lexical class is not 'nouns':
not_nouns = mcdi_df.loc[mcdi_df['Lexical_Class'] != 'nouns']

# Select rows where the lexical class is in a specified list of values:
selected_classes = mcdi_df.loc[mcdi_df['Lexical_Class'].isin(['nouns', 'verbs', 'adjectives'])]

Notice that we can’t use Python’s standard logical operators and and or in boolean indexing expressions. Instead, we use the & and | bitwise operators (the kind you’d use to do bitwise operations on binary numbers) because pandas overloads them, using them to mean “and” and “or” for each row individually. So (mcdi_df['Lexical_Class'] == 'nouns') | (mcdi_df['Lexical_Class'] == 'verbs') is interpreted as: “for each row, True if the lexical class is ‘nouns’ or ‘verbs’”. The parentheses around each comparison are required, because & and | bind more tightly than == in Python.
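You can see this overloading at work by building the boolean Series directly. A small sketch with made-up values:

```python
import pandas as pd

classes = pd.Series(["nouns", "verbs", "adjectives"])

is_noun = classes == "nouns"  # element-wise comparison: True, False, False
is_verb = classes == "verbs"  # element-wise comparison: False, True, False

# pandas overloads | to combine the two boolean Series row by row
either = is_noun | is_verb    # True, True, False
```

A Series like either is exactly what goes inside .loc[] when you filter rows.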

You can read more about all the different ways you can filter a dataset here: https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#data-selection.

14.3. Pandas operations

Now that we have a basic understanding of what a DataFrame is, let’s review basic operations you can perform on DataFrames. These operations will help you manipulate and analyze your data more efficiently.

Data aggregation

Once you have selected the data you need, you may want to perform aggregation operations like calculating sums, means, or other statistical measures. pandas makes it easy to perform these operations on the entire DataFrame or specific columns.

# Calculate the mean of the word frequency column 'LogFreq'
mean_freq = mcdi_df["LogFreq"].mean()
print("Mean word frequency:", mean_freq)

# Calculate the mean of the word frequency column 'LogFreq' for only 'verbs'
verb_mcdi_data = mcdi_df.loc[mcdi_df["Lexical_Class"] == "verbs"]
verbs_mean_freq = verb_mcdi_data["LogFreq"].mean()
print("Mean frequency of verbs:", verbs_mean_freq)

Output:

Mean word frequency: 2.7667932046101535
Mean frequency of verbs: 2.8549415738659865

pandas also provides a powerful groupby function, which allows you to group your data based on the values in one or more columns and perform aggregate functions on each group.

# Group the data by 'Lexical_Class' and calculate the mean of the 'LogFreq' column for each group
mean_freq_by_class = mcdi_df.groupby("Lexical_Class")["LogFreq"].mean()
print(mean_freq_by_class)

Output:

Lexical_Class
adjectives        2.745108
function_words    3.630915
nouns             2.430455
verbs             2.854942
Name: LogFreq, dtype: float64

Note again that, as always, the operation creates a new pandas object holding the result — here a Series, since we aggregated a single column. Because we grouped by lexical class, the values from that column in the original DataFrame became the index labels of the result.
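If you'd rather work with the grouped result as an ordinary DataFrame, with the group labels as a regular column, you can call .reset_index() on it. A small self-contained sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "Lexical_Class": ["nouns", "nouns", "verbs"],
    "LogFreq": [2.0, 3.0, 4.0],
})

grouped = df.groupby("Lexical_Class")["LogFreq"].mean()  # a Series indexed by Lexical_Class
flat = grouped.reset_index()  # an ordinary DataFrame with Lexical_Class as a regular column
```

This is handy when you want to save the summary to a CSV file or merge it back into another DataFrame.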

You can even chain multiple aggregate functions together using the agg() method:

# Group the data by 'Lexical_Class' and calculate multiple aggregates for the 'LogFreq' column
agg_freq_by_class = mcdi_df.groupby("Lexical_Class")["LogFreq"].agg(["count", "sum", "mean", "std"])
print(agg_freq_by_class)

Output:

                count          sum      mean       std
Lexical_Class
adjectives        810  2223.537441  2.745108  0.518036
function_words   1365  4956.199297  3.630915  0.696591
nouns            3840  9332.946285  2.430455  0.618607
verbs            1470  4196.764114  2.854942  0.667298

In this command, instead of just calling .mean() at the end like in the previous example, we used .agg() with a list of the statistics we wanted to calculate for each group, and got the results back as a new DataFrame with one column per statistic.

Calculating correlation coefficients

Sometimes it’s helpful to look at the relationships between numerical columns in your DataFrame. The .corr() method computes pairwise correlations of columns, excluding NA/null values. This is particularly useful when you want to understand relationships between different variables in your dataset.

# Calculate correlations between numerical columns
correlations = mcdi_df[["Age", "LogFreq", "MCDIp", "ProKWo"]].corr()
print(correlations)

Output:

              Age   LogFreq     MCDIp    ProKWo
Age      1.000000  0.261773  0.761078  0.980153
LogFreq  0.261773  1.000000  0.292132  0.244772
MCDIp    0.761078  0.292132  1.000000  0.813212
ProKWo   0.980153  0.244772  0.813212  1.000000

The resulting correlation matrix shows how each variable relates to every other variable. Values range from -1 to 1, where:

  • 1 indicates a perfect positive correlation
  • -1 indicates a perfect negative correlation
  • 0 indicates no linear relationship

(You can also visualize these correlations using a heatmap.)
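One minimal way to build such a heatmap with matplotlib (a sketch, not the only approach) is to pass the correlation matrix to imshow(). The small DataFrame here is made up, standing in for our real dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

# A small made-up DataFrame standing in for mcdi_df
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 5, 9], "z": [9, 7, 5, 2]})
correlations = df.corr()

fig, ax = plt.subplots()
im = ax.imshow(correlations, cmap="coolwarm", vmin=-1, vmax=1)  # one colored cell per pair
ax.set_xticks(range(len(correlations.columns)), labels=correlations.columns)
ax.set_yticks(range(len(correlations.columns)), labels=correlations.columns)
fig.colorbar(im)  # a legend mapping colors back to correlation values
plt.show()
```

Fixing vmin and vmax to -1 and 1 ensures the color scale always spans the full range of possible correlations.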

Alternatively, if you only care about one particular correlation, you can compute that directly for each pair of columns:

# Calculate the correlation between 'ProKWo' and 'LogFreq'
prokwo_logfreq_corr = mcdi_df["ProKWo"].corr(mcdi_df["LogFreq"])
print(f"Correlation between ProKWo and LogFreq: {prokwo_logfreq_corr:.6f}")

Output:

Correlation between ProKWo and LogFreq: 0.244772

Data transformation

Up to this point we have reviewed ways of selecting specific sections of our dataset by reference and the use of conditionals. We have also shown how you can calculate summary statistics. Next we will review instances where we may want to change existing data or even create new columns in our dataset using existing information.

Let’s say you wanted to change the names of the columns in our dataset.

# Rename columns using a dictionary
mcdi_df_renamed = mcdi_df.rename(columns={"Word": "Vocabulary", "LogFreq": "Frequency"})
print(mcdi_df_renamed.head())

Output (Show renamed columns):

   Age Vocabulary   Lexical_Class     MCDIp  Frequency    ProKWo
0   16          a  function_words  0.035565   4.432119  0.093588
1   17          a  function_words  0.043956   4.466571  0.106888
2   18          a  function_words  0.079186   4.535028  0.171016
3   19          a  function_words  0.103659   4.580457  0.220025
4   20          a  function_words  0.116788   4.635242  0.243169

Here, we pass a dictionary to the .rename() method with the original column names as the keys and the new names as the values. Note that this was just an ordinary Python dictionary. You could have defined or created the dictionary in a separate line of code, and then referenced it in the columns parameter.
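For instance, defining the mapping ahead of time works exactly the same way (the one-row DataFrame here is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Word": ["a"], "LogFreq": [4.43]})

# An ordinary Python dictionary, defined on its own line
new_names = {"Word": "Vocabulary", "LogFreq": "Frequency"}
renamed = df.rename(columns=new_names)
```

Note that .rename() returns a new DataFrame; the original df keeps its old column names.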

Often, you’ll need to create new columns based on existing ones. Say you wanted to create a new column, AgeInDays, that converts the original Age column from months to days (here we estimate the child’s age using the average number of days in a month, 30.44).

mcdi_df["AgeInDays"] = mcdi_df["Age"] * 30.44  # Convert months to days
print(mcdi_df.head())

Output (show new column):

   Age Word   Lexical_Class     MCDIp   LogFreq    ProKWo  AgeInDays
0   16    a  function_words  0.035565  4.432119  0.093588     487.04
1   17    a  function_words  0.043956  4.466571  0.106888     517.48
2   18    a  function_words  0.079186  4.535028  0.171016     547.92
3   19    a  function_words  0.103659  4.580457  0.220025     578.36
4   20    a  function_words  0.116788  4.635242  0.243169     608.80

Depending on the nature of the dataset and your goals you may need to create additional columns. Here are some additional examples:

# Creating a column using basic arithmetic operations
mcdi_df["LogFreq_Squared"] = mcdi_df["LogFreq"] ** 2

# Creating a new column using string manipulation
mcdi_df["Word_Upper"] = mcdi_df["Word"].str.upper()

# Creating a new column by combining two or more columns (e.g. "16_4.432119101"):
mcdi_df["Age_and_LogFreq"] = mcdi_df["Age"].astype(str) + "_" + mcdi_df["LogFreq"].astype(str)

# Creating a new column using conditional expressions:
mcdi_df["Dev_Stage"] = mcdi_df["Age"].apply(lambda x: "Toddler" if x >= 18 else "Child")

Missing data

Missing values can cause issues when analyzing data, such as preventing you from calculating summary statistics like means and standard deviations. In order to prevent these issues, we want to be able to identify and deal with missing data. You might want to drop rows with missing values or replace them with a specific value.

# Drop rows with missing data
mcdi_df_no_missing = mcdi_df.dropna()
print(mcdi_df_no_missing.head())

# Fill missing data with a specified value (e.g., 0)
mcdi_df_filled = mcdi_df.fillna(0)
print(mcdi_df_filled.head())

Here, we use the .dropna() method to remove rows with missing values and the .fillna() method to replace missing values with a specified value, such as 0.
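The data-imputation idea mentioned at the start of the chapter — filling in missing values with an average — is just .fillna() with a computed value. A small sketch with made-up reaction times:

```python
import pandas as pd

# Hypothetical reaction times with two missing values
df = pd.DataFrame({"RT": [350.0, None, 410.0, None, 380.0]})

mean_rt = df["RT"].mean()            # NaNs are skipped when computing the mean
df["RT"] = df["RT"].fillna(mean_rt)  # replace each NaN with the column mean
```

Whether dropping or imputing is the right choice depends on your analysis; imputation keeps the sample size up but can dampen real variability.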

Merging and joining DataFrames

Suppose you have another DataFrame, mcdi_additional_df, which contains additional information for each of the words in our dataset, such as the word’s syllable count and whether it is a common word or not. We can merge these two DataFrame objects into one as follows:

merged_df = pd.merge(mcdi_df, mcdi_additional_df, on="Word", how="inner")
print(merged_df.head())

In this case, we want to merge mcdi_df and mcdi_additional_df based on the common column Word. This will result in a combined DataFrame containing information from both DataFrame objects for each word. The how parameter specifies the type of join:

  • inner only keeps rows whose key values (here, Word) appear in both DataFrame objects
  • outer keeps all rows from both DataFrame objects and fills in missing values with NaN
  • left and right keep all rows from the left or right DataFrame, respectively, and fill in missing values with NaN
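A tiny self-contained sketch showing the difference between inner and outer joins; the words and syllable counts here are invented, standing in for the hypothetical mcdi_additional_df:

```python
import pandas as pd

left = pd.DataFrame({"Word": ["a", "dog", "zipper"], "LogFreq": [4.4, 3.1, 2.4]})
right = pd.DataFrame({"Word": ["dog", "zipper", "xylophone"], "Syllables": [1, 2, 3]})

inner = pd.merge(left, right, on="Word", how="inner")  # only "dog" and "zipper" appear in both
outer = pd.merge(left, right, on="Word", how="outer")  # every word; gaps become NaN
```

In the outer result, "a" has a NaN syllable count and "xylophone" has a NaN frequency, since each appears in only one of the two DataFrames.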