Data Structures (cont.)
Answer to exercise at arrays:
Input
cnt_2019 = 0
with open('G20-2019.txt', 'r') as f:
data = f.read().split('\n')
for line in data:
if 'global' in line and 'warming' in line:
cnt_2019 += 1
cnt_2016 = 0
with open('G20-2016.txt', 'r') as f:
data = f.read().split('\n')
for line in data:
if 'global' in line and 'warming' in line:
cnt_2016 += 1
print("Was global warming a concern in 2016? ", cnt_2016)
print("Was global warming a concern in 2019? ", cnt_2019)
Do not underestimate this code snippet. Had we have all world leader speeches, and enough computational power, we could check things like who discusses more about the environment, economy, or war. You can apply this idea to your interview data, numerical data, etc.
Functions
It may be useful to wrap code around functions. You can read more about that here
Look at the answer for the array exercise. The only thing that changes is the open file line. We can define a function to avoid code repetition.
Input
def discusses_global_warming(speech):
cnt = 0
with open(speech, 'r') as f:
data = f.read().split('\n')
for line in data:
if 'global' in line and 'warming' in line:
cnt += 1
return cnt
cnt_greta = discusses_global_warming('greta.txt')
print(cnt_greta)
What if we also want to control the words that we look for?
Similarly to how we can split the file by lines .split('\n')
we can split a sentence by word using .split(' ')
We can then check if any word in our list of words are in the list of words that we look for, a.k.a topics
Input
def discusses_topic(speech, topic):
cnt = 0
with open(speech, 'r') as f:
data = f.read().split('\n')
for line in data:
words_in_sentence = line.split(' ')
if any(word in topic for word in words_in_sentence):
cnt += 1
return cnt
global_warming_topic = ['global', 'warming', 'environment', 'crisis']
cnt_greta = discusses_topic('greta.txt', global_warming_topic)
print(cnt_greta)
Dictionaries or Maps
Additional information available here:
Imagine that you have an excel or csv file, where each column contains some data. Certain columns have more rows than others and you want quick access to the data of a single column.
This analogy simplifies the meaning of a dictionary:
Input
data = dict(
speecher=['Greta Thunberg', 'Barack Obama', 'UN Secretary-General'],
transcript=['greta.txt', 'G20-2016.txt', 'G20-2019.txt'],
year=[2020, 2016, 2019],
updated_on=2020,
updated_by="Arthur"
)
Input
data
Output
{'speecher': ['Greta Thunberg', 'Barack Obama', 'UN Secretary-General'], 'transcript': ['greta.txt', 'G20-2016.txt', 'G20-2019.txt'], 'year': [2020, 2016, 2019], 'updated_on': 2020, 'updated_by': 'Arthur'}
We can also make the output more readable using the json
library
Input
print(json.dumps(data, indent=4, sort_keys=True))
Output
{
"updated_on": 2020,
"speecher": [
"Greta Thunberg",
"Barack Obama",
"UN Secretary-General"
],
"transcript": [
"greta.txt",
"G20-2016.txt",
"G20-2019.txt"
],
"updated_by": "Arthur",
"year": [
2020,
2016,
2019
]
}
Dictionary operations
Input
data['transcript']
Input
data['speecher']
Input
data['conference']
Input
data['updated_on'] = 2019 # overrides key with new value
Input
del data['year'] # deletes a key
Input
# creates a new key with a list as its value
data['conference'] = ['Climate Summit', 'G20', 'G20']
Input
# Modify existing keys
# We need to know the type of each field.
# Otherwise, problems might occur
data['conference'].append('UN annual summit')
data["updated_by"] += " Matty"
Input
data['conference'] += 'UN annual summit'
print(data['conference'])
Input
data['updated_by'] += 1
print(data['updated_by'])
Why do I need to know about Dictionaries? Back to our practical problem
Thus far, we used Python mostly for textual data. Let’s move to some numerical data.
Suppose we have at our disposal a csv file with three columns, year
, average temperature
, and average carbon dioxide emission
. We want to extract this data from the file, put it into memory, and check whether there is correlation between carbon dioxide emission earth’s average temperature.
- First create and empty dictionary with 3 keys. All our keys will be lists
Input
data = dict(
year=[],
temperature=[],
carbon_emission=[]
)
- Then, we will read the file line by line using
readlines()
. Ignore theif
conditional for now.
Input
current_line = 0
with open('year_temperature_carbon_emission.csv', 'r') as f:
lines = f.readlines()
for line in lines:
print(line)
current_line += 1
if current_line > 5:
break
Output
"","year","avg_temp","AverageCarbonEmission"
"1",1958,18.9421732093664,315.33
"2",1959,18.825205922865,315.981666666667
"3",1960,18.8870130853994,316.908333333333
"4",1961,18.9235654269972,317.645
"5",1962,18.6938474517906,318.453333333333
- It seems we have some pre-processing steps ahead:
- The extra line between each print means that there is a
\n
in the line variable. Remove it with.replace('\n', '')
- We need to split lines by comma
.split(',')
and store the result in a variableaux
- Create three variables year, temp, carbon whose value will be equal to
aux[some index]
- Append the variables to our dictionary keys, e.g.,
data['temperature'].append(temp)
- How do we skip the first line with the header?
- How do we convert strings to numerical data?
- The extra line between each print means that there is a
If we are on track, try to do the first steps on your own.
Don’t spoil the fun. The stick figure is watching you