2.3 Reading and Manipulating Tabular Data in Pandas
Questions we’ll cover
- How do I use Pandas to load a simple CSV data set.
- How do I get some basic information about a Pandas DataFrame.
Use the Pandas library to do statistics on tabular data.
- Pandas is a widely-used Python library for statistics, particularly on tabular data.
- Borrows many features from R’s dataframes.
- A 2-dimensional table whose columns have names and potentially have different data types.
- Read a Comma Separated Values (CSV) data file with
pandas.read_csv.- Argument is the name of the file to be read.
- Returns a dataframe that you can assign to a variable
import pandas
data_clcd = pandas.read_csv('2026-01_canadian-library-challenges-database_SUBSET.csv')
print(data_clcd)
Item Challenged ... Action Taken
0 Being you ... Retained
1 Being you ... Retained
2 Being you ... Retained
3 Clifford's halloween ... Retained
4 Clifford's halloween ... Removed
.. ... ... ...
105 This book is gay ... Retained
106 This book is gay ... Retained
107 This book is gay ... Retained
108 This book is gay ... Removed
109 This book is gay ... Retained
[110 rows x 11 columns]
- The columns in a dataframe are the observed variables, and the rows are the observations (books).
- Using descriptive dataframe names helps us distinguish between multiple dataframes so we won’t accidentally overwrite a dataframe or read from the wrong one. ‘clcd’ stands for “Canadian Library Challenges Database”
Use the DataFrame.info() method to find out more about a dataframe
data_clcd.info()
<class 'pandas.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Item Challenged 110 non-null str
1 Subtitle 55 non-null str
2 Creator 110 non-null str
3 Form of Complaint 110 non-null str
4 Library 110 non-null str
5 Year 110 non-null int64
6 Object Category 110 non-null str
7 Reason A for complaint 110 non-null str
8 Reason B for complaint 110 non-null str
9 Action requested 110 non-null str
10 Action Taken 110 non-null str
dtypes: int64(1), str(10)
memory usage: 9.6 KB
- This is a
DataFrame - 110 rows named with indices 0 to 109
- 10 columns, most of which contain strings; ‘Year’ contains integers
- Only the ‘Subtitle’ column contains null values
- Uses 9.6 KB of memory.
The DataFrame.columns variable stores information about the dataframe’s columns
- Note that this is data, not a method (it doesn’t have parentheses).
- Like
math.pi. - So do not use
()to try to call it.
- Like
- Called a member variable, or just member.
print(data_clcd.columns)
Index(['Item Challenged', 'Subtitle', 'Creator', 'Form of Complaint',
'Library', 'Year', 'Object Category', 'Reason A for complaint',
'Reason B for complaint', 'Action requested', 'Action Taken'],
dtype='str')
Use DataFrame.T to transpose a dataframe
- Sometimes want to treat columns as rows and vice versa.
- Transpose (written
.T) doesn’t copy the data, just changes the program’s view of it. - Like
columns, it is a member variable.
print(data_clcd.T)
0 ... 109
Item Challenged Being you ... This book is gay
Subtitle A first conversation about gender ... NaN
Creator Madison, Megan; Ralli, Jessica; Passchier, Ann... ... Dawson, Juno
Form of Complaint Direct ... Direct
Library Surrey School District 36 ... Pembina Trails School Division
Year 2023 ... 2025
Object Category Collection ... Collection
Reason A for complaint Pro LGBTQIA2S+ ... Explicit Content
Reason B for complaint ... Pro LGBTQIA2S+
Action requested Remove ... Remove
Action Taken Retained ... Retained
[11 rows x 110 columns]
Use DataFrame.describe() to get summary statistics about data
DataFrame.describe() gets the summary statistics of only the columns that have numerical data. All other columns are ignored, unless you use the argument include='all'.
print(data_clcd.describe())
Year
count 110.000000
mean 2022.581818
std 1.199249
min 2021.000000
25% 2021.000000
50% 2023.000000
75% 2023.000000
max 2025.000000
This is not very useful because the only numerical column in this dataset is the Year. However, it could be helpful for the dataset we used in the previous episode.
Challenge: Reading Other Data
Read the data in 2026-02-18_google-trends-search-1.csv and display its summary statistics.
Solution
data_gt = pandas.read_csv("2026-02-18_google-trends-search-1.csv")
print(data_gt.describe())
2021-02-01 2021-03-01 ... 2026-01-01 2026-02-01
count 8.000000 8.000000 ... 8.000000 8.000000
mean 0.320000 0.360000 ... 0.240000 0.420000
std 0.905097 1.018234 ... 0.678823 0.460062
min 0.000000 0.000000 ... 0.000000 0.000000
25% 0.000000 0.000000 ... 0.000000 0.240000
50% 0.000000 0.000000 ... 0.000000 0.320000
75% 0.000000 0.000000 ... 0.000000 0.400000
max 2.560000 2.880000 ... 1.920000 1.440000
Challenge 1: Inspecting Data
After reading the Google Trends data book searches over time, use help(data_gt.head) and help(data_gt.tail) to find out what DataFrame.head and DataFrame.tail do.
- What method call will display the first three rows of this data?
- What method call will display the last three columns of this data? (Hint: you may need to change your view of the data.)
Solution
- We can check out the first five rows of
data_gtby executingdata_gt.head()which lets us view the beginning of the DataFrame. We can specify the number of rows we wish to see by specifying the parameternin our call todata_gt.head(). To view the first three rows, execute:
print(data_gt.head(n=3))
Unnamed: 0 ... 2026-02-01
0 Irreversible Damage ... 0.64
1 Sex Is a Funny Word: A Book about Bodies, Feel... ... 0.32
2 It's Perfectly Normal ... 0.32
[3 rows x 62 columns]
- To check out the last three rows of
data_gt, we would use the command,data_gt.tail(n=3), analogous tohead()used above. However, here we want to look at the last three columns so we need to change our view and then usetail(). To do so, we create a new DataFrame in which rows and columns are switched:
data_gt_flipped = data_gt.T
We can then view the last three columns of data_gt by viewing the last three rows of data_gt_flipped`:
print(data_gt_flipped.tail(n=3))
0 1 2 3 4 5 6 7
2025-12-01 0 0 0 0 0 0 0 0
2026-01-01 0.0 0.0 0.0 1.92 0.0 0.0 0.0 0.0
2026-02-01 0.64 0.32 0.32 1.44 0.0 0.32 0.0 0.32
Challenge 2: Reading Files in Other Directories
- create a subfolder called
datain your project folder - move the data files into
data - create a subfolder called
scriptsin your project folder (alongside thedatafolder, not in it) - move your script into
scripts
What value(s) should you pass to read_csv to read 2026-01_canadian-library-challenges-database_SUBSET.csv in your script?
Solution
We need to specify the path to the file of interest in the call to pandas.read_csv. We first need to ‘jump’ out of the scripts folder using ../ and then into the folder data using data/. Then we can specify the filename 2026-01_canadian-library-challenges-database_SUBSET.csv. The result is as follows:
data_microbes = pandas.read_csv('../data/2026-01_canadian-library-challenges-database_SUBSET.csv')
Challenge 3: Writing Data
As well as the read_csv function for reading data from a file, Pandas provides a to_csv function to write dataframes to files. Applying what you’ve learned about reading from files, write one of your dataframes to a file called data_gt_processed.csv. You can use help to get information on how to use to_csv.
Solution
In order to write the DataFrame data_gt to a file called processed.csv, execute the following command:
data_gt.to_csv('data_gt_processed.csv')
For help on read_csv or to_csv, you could execute, for example:
help(data_gt.to_csv)
help(pandas.read_csv)
Note that help(to_csv) or help(pandas.to_csv) throws an error! This is due to the fact that to_csv is not a global Pandas function, but a member function of DataFrames. This means you can only call it on an instance of a DataFrame e.g., data_gt.to_csv or data_oceania.to_csv
Key Points
- Use the Pandas library to get basic statistics out of tabular data.
- Use
index_colto specify that a column’s values should be used as row headings. - Use
DataFrame.infoto find out more about a dataframe. - The
DataFrame.columnsvariable stores information about the dataframe’s columns. - Use
DataFrame.Tto transpose a dataframe. - Use
DataFrame.describeto get summary statistics about data.
Loading last updated date...