2.3 Reading and Manipulating Tabular Data in Pandas

Questions we’ll cover

  • How do I use Pandas to load a simple CSV data set.
  • How do I get some basic information about a Pandas DataFrame.


Use the Pandas library to do statistics on tabular data.

  • Pandas is a widely-used Python library for statistics, particularly on tabular data.
  • Borrows many features from R’s dataframes.
    • A 2-dimensional table whose columns have names and potentially have different data types.
  • Read a Comma Separated Values (CSV) data file with pandas.read_csv.
    • Argument is the name of the file to be read.
    • Returns a dataframe that you can assign to a variable
import pandas

data_clcd = pandas.read_csv('2026-01_canadian-library-challenges-database_SUBSET.csv')
print(data_clcd)
          Item Challenged  ... Action Taken
0               Being you  ...     Retained
1               Being you  ...     Retained
2               Being you  ...     Retained
3    Clifford's halloween  ...     Retained
4    Clifford's halloween  ...      Removed
..                    ...  ...          ...
105      This book is gay  ...     Retained
106      This book is gay  ...     Retained
107      This book is gay  ...     Retained
108      This book is gay  ...      Removed
109      This book is gay  ...     Retained

[110 rows x 11 columns]
  • The columns in a dataframe are the observed variables, and the rows are the observations (books).
  • Using descriptive dataframe names helps us distinguish between multiple dataframes so we won’t accidentally overwrite a dataframe or read from the wrong one. ‘clcd’ stands for “Canadian Library Challenges Database”

Use the DataFrame.info() method to find out more about a dataframe

data_clcd.info()
<class 'pandas.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Item Challenged         110 non-null    str  
 1   Subtitle                55 non-null     str  
 2   Creator                 110 non-null    str  
 3   Form of Complaint       110 non-null    str  
 4   Library                 110 non-null    str  
 5   Year                    110 non-null    int64
 6   Object Category         110 non-null    str  
 7   Reason A for complaint  110 non-null    str  
 8   Reason B for complaint  110 non-null    str  
 9   Action requested        110 non-null    str  
 10  Action Taken            110 non-null    str  
dtypes: int64(1), str(10)
memory usage: 9.6 KB
  • This is a DataFrame
  • 110 rows named with indices 0 to 109
  • 10 columns, most of which contain strings; ‘Year’ contains integers
  • Only the ‘Subtitle’ column contains null values
  • Uses 9.6 KB of memory.

The DataFrame.columns variable stores information about the dataframe’s columns

  • Note that this is data, not a method (it doesn’t have parentheses).
    • Like math.pi.
    • So do not use () to try to call it.
  • Called a member variable, or just member.
print(data_clcd.columns)
Index(['Item Challenged', 'Subtitle', 'Creator', 'Form of Complaint',
       'Library', 'Year', 'Object Category', 'Reason A for complaint',
       'Reason B for complaint', 'Action requested', 'Action Taken'],
      dtype='str')

Use DataFrame.T to transpose a dataframe

  • Sometimes want to treat columns as rows and vice versa.
  • Transpose (written .T) doesn’t copy the data, just changes the program’s view of it.
  • Like columns, it is a member variable.
print(data_clcd.T)
                                                                      0    ...                             109
Item Challenged                                                 Being you  ...                This book is gay
Subtitle                                A first conversation about gender  ...                             NaN
Creator                 Madison, Megan; Ralli, Jessica; Passchier, Ann...  ...                    Dawson, Juno
Form of Complaint                                                  Direct  ...                          Direct
Library                                         Surrey School District 36  ...  Pembina Trails School Division
Year                                                                 2023  ...                            2025
Object Category                                                Collection  ...                      Collection
Reason A for complaint                                     Pro LGBTQIA2S+  ...                Explicit Content
Reason B for complaint                                                     ...                  Pro LGBTQIA2S+
Action requested                                                   Remove  ...                          Remove
Action Taken                                                     Retained  ...                        Retained

[11 rows x 110 columns]

Use DataFrame.describe() to get summary statistics about data

DataFrame.describe() gets the summary statistics of only the columns that have numerical data. All other columns are ignored, unless you use the argument include='all'.

print(data_clcd.describe())
              Year
count   110.000000
mean   2022.581818
std       1.199249
min    2021.000000
25%    2021.000000
50%    2023.000000
75%    2023.000000
max    2025.000000

This is not very useful because the only numerical column in this dataset is the Year. However, it could be helpful for the dataset we used in the previous episode.

Challenge: Reading Other Data

Read the data in 2026-02-18_google-trends-search-1.csv and display its summary statistics.

Solution

data_gt = pandas.read_csv("2026-02-18_google-trends-search-1.csv")

print(data_gt.describe())
       2021-02-01  2021-03-01  ...  2026-01-01  2026-02-01
count    8.000000    8.000000  ...    8.000000    8.000000
mean     0.320000    0.360000  ...    0.240000    0.420000
std      0.905097    1.018234  ...    0.678823    0.460062
min      0.000000    0.000000  ...    0.000000    0.000000
25%      0.000000    0.000000  ...    0.000000    0.240000
50%      0.000000    0.000000  ...    0.000000    0.320000
75%      0.000000    0.000000  ...    0.000000    0.400000
max      2.560000    2.880000  ...    1.920000    1.440000

Challenge 1: Inspecting Data

After reading the Google Trends data book searches over time, use help(data_gt.head) and help(data_gt.tail) to find out what DataFrame.head and DataFrame.tail do.

  1. What method call will display the first three rows of this data?
  2. What method call will display the last three columns of this data? (Hint: you may need to change your view of the data.)

Solution

  1. We can check out the first five rows of data_gt by executing data_gt.head() which lets us view the beginning of the DataFrame. We can specify the number of rows we wish to see by specifying the parameter n in our call to data_gt.head(). To view the first three rows, execute:
print(data_gt.head(n=3))
                                          Unnamed: 0  ...  2026-02-01
0                                Irreversible Damage  ...        0.64
1  Sex Is a Funny Word: A Book about Bodies, Feel...  ...        0.32
2                              It's Perfectly Normal  ...        0.32

[3 rows x 62 columns]
  1. To check out the last three rows of data_gt, we would use the command, data_gt.tail(n=3), analogous to head() used above. However, here we want to look at the last three columns so we need to change our view and then use tail(). To do so, we create a new DataFrame in which rows and columns are switched:
data_gt_flipped = data_gt.T

We can then view the last three columns of data_gt by viewing the last three rows of data_gt_flipped`:

print(data_gt_flipped.tail(n=3))
               0     1     2     3    4     5    6     7
2025-12-01     0     0     0     0    0     0    0     0
2026-01-01   0.0   0.0   0.0  1.92  0.0   0.0  0.0   0.0
2026-02-01  0.64  0.32  0.32  1.44  0.0  0.32  0.0  0.32

Challenge 2: Reading Files in Other Directories

  • create a subfolder called data in your project folder
  • move the data files into data
  • create a subfolder called scripts in your project folder (alongside the data folder, not in it)
  • move your script into scripts

What value(s) should you pass to read_csv to read 2026-01_canadian-library-challenges-database_SUBSET.csv in your script?

Solution

We need to specify the path to the file of interest in the call to pandas.read_csv. We first need to ‘jump’ out of the scripts folder using ../ and then into the folder data using data/. Then we can specify the filename 2026-01_canadian-library-challenges-database_SUBSET.csv. The result is as follows:

data_microbes = pandas.read_csv('../data/2026-01_canadian-library-challenges-database_SUBSET.csv')

Challenge 3: Writing Data

As well as the read_csv function for reading data from a file, Pandas provides a to_csv function to write dataframes to files. Applying what you’ve learned about reading from files, write one of your dataframes to a file called data_gt_processed.csv. You can use help to get information on how to use to_csv.

Solution

In order to write the DataFrame data_gt to a file called processed.csv, execute the following command:

data_gt.to_csv('data_gt_processed.csv')

For help on read_csv or to_csv, you could execute, for example:

help(data_gt.to_csv)
help(pandas.read_csv)

Note that help(to_csv) or help(pandas.to_csv) throws an error! This is due to the fact that to_csv is not a global Pandas function, but a member function of DataFrames. This means you can only call it on an instance of a DataFrame e.g., data_gt.to_csv or data_oceania.to_csv

Key Points

  • Use the Pandas library to get basic statistics out of tabular data.
  • Use index_col to specify that a column’s values should be used as row headings.
  • Use DataFrame.info to find out more about a dataframe.
  • The DataFrame.columns variable stores information about the dataframe’s columns.
  • Use DataFrame.T to transpose a dataframe.
  • Use DataFrame.describe to get summary statistics about data.

View in GitHub

Loading last updated date...