Dataframes
Part 1: Pandas dataframes
Reading tabular data
In this section we will be reading datasets from data-python/. If you have not downloaded it in the previous section, run the following in the terminal:
wget http://bit.ly/pythfiles -O pfiles.zip
unzip pfiles.zip && rm pfiles.zip # this should unpack into the directory data-python/
Pandas is a widely-used Python library for working with tabular data. It borrows heavily from R’s dataframes and is built on top of NumPy. We will be reading the data we downloaded a minute ago into a pandas dataframe:
import pandas as pd
data = pd.read_csv('data-python/gapminder_gdp_oceania.csv')
print(data)
data          # this prints out the table nicely in Jupyter Notebook!
data.shape    # shape is a *member variable inside data*
data.info()   # info is a *member method inside data*
Try reading a much bigger Jeopardy dataset from https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv. There are two ways to do this:
1. you can first download it using wget and then read a local file, or
2. you can try doing this without downloading the file.
How many lines and columns does it have? What are the column names?
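If you get stuck, here is one possible sketch of both options (the variable name jeopardy is just a suggestion):
# option 1: download the file first (in the terminal), then read the local copy
#   wget https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv
# option 2: pass the URL directly to pandas
jeopardy = pd.read_csv('https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv')
jeopardy.shape     # (number of rows, number of columns)
jeopardy.columns   # the column names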
Use dir(data) to list all of the dataframe's variables and methods. Then call one of them without (); if it turns out to be a method, Python will tell you so, and you will need to add () to actually call it.
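For example (head is just one of the many names you will see in the dir() output):
dir(data)      # long list of the dataframe's attributes and methods
data.head      # without (): Python reports a bound method instead of calling it
data.head()    # with (): actually calls the method, returning the first 5 rows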
You can think of rows as observations, and columns as the observed variables. You can add new observations at any time.
Currently the rows are indexed by number. Let’s index by country:
data = pd.read_csv('data-python/gapminder_gdp_oceania.csv', index_col='country')
data
data.shape            # now 12 columns
data.info()           # it's a dataframe! show row/column names, precision, memory usage
print(data.columns)   # list all the columns
print(data.T)         # this will transpose the dataframe; curiously this is a variable
data.describe()       # will print some statistics of numerical columns (very useful for 1000s of rows!)
How would you list all country names?
Hint: try data.T.columns
Read the data in gapminder_gdp_americas.csv (which should be in the same directory as gapminder_gdp_oceania.csv) into a variable called americas and display its summary statistics.
Write a command to display the first three rows of the americas data frame. What about the last three columns of this data frame?
The data for your current project is stored in a file called microbes.csv, which is located in a folder called field_data. You are doing analysis in a notebook called analysis.ipynb in a sibling folder called thesis:
your_home_directory
├── field_data
│   └── microbes.csv
└── thesis
    └── analysis.ipynb
What value(s) should you pass to read_csv() to read microbes.csv in analysis.ipynb?
As well as the pd.read_csv() function for reading data from a file, Pandas provides a to_csv() function to write data frames to files. Applying what you've learned about reading from files, write one of your data frames to a file called processed.csv. You can use help to get information on how to use to_csv().
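A minimal sketch, assuming you use the americas dataframe from the earlier exercise:
help(americas.to_csv)              # built-in documentation for the method
americas.to_csv('processed.csv')   # by default this also writes the row index as the first column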
Subsetting
data = pd.read_csv('data-python/gapminder_gdp_europe.csv', index_col='country')
data.head()
Let’s rename the first column:
data.rename(columns={'gdpPercap_1952': 'y1952'})   # this renames only one but does not change `data`
Note: we could also name the column ‘1952’, but some Pandas operations don’t work with purely numerical column names.
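For illustration only (tmp is a throwaway dataframe, not part of the lesson's data):
tmp = pd.DataFrame({'1952': [1, 2], 'y1952': [3, 4]})
tmp['1952']    # bracket notation works with any column name
tmp.y1952      # attribute access works only for names that are valid Python identifiers
# tmp.1952     # SyntaxError: a name cannot start with a digit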
Let’s go through all columns and assign the new names:
for col in data.columns:
    print(col, col[-4:])
    data = data.rename(columns={col: 'y'+col[-4:]})

data
Pandas lets you subset elements using either their numerical indices or their row/column names. A long time ago Pandas had a single function for both; now there are two separate functions, iloc() and loc(). Let's print one element:
data.iloc[0,0]                # the very first element by position
data.loc['Albania','y1952']   # exactly the same; the very first element by label
Printing a row:
data.loc['Albania',:]   # usual Python's slicing notation - show all columns in that row
data.loc['Albania']     # exactly the same
data.loc['Albania',]    # exactly the same
Printing a column:
data['y1952']         # show all rows in that column
data.loc[:,'y1952']   # exactly the same; single index refers to columns
data.y1952            # most compact notation; does not work with numerical-only names
Printing a range:
data.loc['Italy':'Poland','y1952':'y1967']   # select multiple rows/columns
data.iloc[0:2,0:3]                           # a similar selection by position (note: iloc end points are exclusive)
The result of slicing can be used in further operations:
data.loc['Italy':'Poland','y1952':'y1967'].max()   # max for each column
data.loc['Italy':'Poland','y1952':'y1967'].min()   # min for each column
Use comparisons to select data based on value:
subset = data.loc['Italy':'Poland', 'y1962':'y1972']
print(subset)
print(subset > 1e4)
Use a Boolean mask to print values (meeting the condition) or NaN (not meeting the condition):
mask = (subset > 1e4)
print(mask)
print(subset[mask]) # will print numerical values only if the corresponding elements in mask are True
NaNs are ignored by statistical operations, which is handy:
subset[mask].max()
subset[mask].describe()
Assume Pandas has been imported into your notebook and the Gapminder GDP data for Europe has been loaded:
df = pd.read_csv('data-python/gapminder_gdp_europe.csv', index_col='country')
Write an expression to find the per capita GDP of Serbia in 2007.
Explain what each line in the following short program does, e.g. what is in the variables first, second, and so on:
first = pd.read_csv('data-python/gapminder_all.csv', index_col='country')
second = first[first['continent'] == 'Americas']
third = second.drop('Puerto Rico')
fourth = third.drop('continent', axis = 1)
fourth.to_csv('result.csv')
Explain in simple terms what idxmin() and idxmax() do in the short program below. When would you use these methods?
data = pd.read_csv('data-python/gapminder_gdp_europe.csv', index_col='country')
print(data.idxmin())
print(data.idxmax())
How do you create a dataframe from scratch? There are many ways; perhaps the easiest is by defining columns:
col1 = [1,2,3]
col2 = [4,5,6]
pd.DataFrame({'a': col1, 'b': col2})   # dataframe from a dictionary
We can index (assign names to) the rows with this syntax:
pd.DataFrame({'a': col1, 'b': col2}, index=['a1','a2','a3'])
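Another way (just a sketch) is to build the same dataframe from a list of rows:
rows = [[1, 4], [2, 5], [3, 6]]   # one inner list per row
pd.DataFrame(rows, columns=['a','b'], index=['a1','a2','a3'])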
Example with a larger dataframe
Let's try reading some public-domain data about Jeopardy questions with pandas (31MB file, so it might take a while):
import pandas as pd
data = pd.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
data.shape       # 216930 rows, 7 columns
data.head(10)    # first 10 rows
data.tail()      # last 5 rows
data.iloc[2:5]   # rows 2-4
data.columns     # names of the columns
data['Category']
data['Category']=='HISTORY'
data.loc[data['Category']=='HISTORY'].shape                   # 349 matches
data.loc[data['Category']=='HISTORY'].to_csv("history.csv")   # write to a file
Let's check what time period is covered by these data:
data["Air Date"]
data["Air Date"][0][-2:]   # first row, last two digits is the year
year = data["Air Date"].apply(lambda x: x[-2:])   # last two digits of the year from all rows
year.min(); year.max()     # '00' and '99' - not very informative, wraps at the turn of the century

for y in range(100):
    twoDigits = str(y).zfill(2)
    print(twoDigits, sum(year==twoDigits))
This shows that the table covers the years from 1984 to 2012.
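As an aside, a more direct approach (a sketch, assuming pandas can parse the "Air Date" strings) is to convert them to datetimes and read off four-digit years:
dates = pd.to_datetime(data["Air Date"])          # parse the full date strings
print(dates.dt.year.min(), dates.dt.year.max())   # four-digit years, so no wrap-around at the century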
Three solutions to a classification problem
In this section we will see that there are different techniques to process a Pandas dataframe, and some of them are more efficient than others. First, however, we need to learn how to time the execution of a block of Python code:
- Inside a Jupyter notebook, you can use %%timeit to time an entire cell.
- Inside your code, you can call the time.time() function:
import time
start = time.time()
...
end = time.time()
print("Time in seconds:", round(end-start,3))
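For example, a complete (made-up) timing of a small calculation might look like this:
import time

start = time.time()
total = sum(i*i for i in range(1_000_000))   # the block of code we want to time (a made-up example)
end = time.time()
print("Time in seconds:", round(end-start, 3))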
Now, let's turn to our computational problem. "Fizz buzz" is a children's game to practice division. Players take turns counting out loud while replacing:
- any number divisible by 3 with the word "Fizz",
- any number divisible by 5 with the word "Buzz",
- any number divisible by both 3 and 5 with the word "FizzBuzz".
Let’s implement this in pandas! First, create a simple dataframe from scratch:
import pandas as pd
import numpy as np   # needed below for np.arange

df = pd.DataFrame()
size = 10_000
df['number'] = np.arange(1, size+1)
Define a function for pretty printing:
def show(frame):
    print(frame.tail(15).to_string(index=False))   # print the last 15 rows without the row index

show(df)
Let's build a new column response containing either "Fizz" or "Buzz" or "FizzBuzz" or the original number, based on the number value in that row. We start by processing a single row:
def count(row):
    if (row['number'] % 3 == 0) and (row['number'] % 5 == 0):
        return 'FizzBuzz'
    elif row['number'] % 3 == 0:
        return 'Fizz'
    elif row['number'] % 5 == 0:
        return 'Buzz'
    else:
        return str(row['number'])
Here is how you would use this function:
print(df.iloc[2])
print(count(df.iloc[2]))
- We can apply this function to each row in a loop:
%%timeit
for index, row in df.iterrows():
    df.loc[index, 'response'] = count(row)
413 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
show(df)
- We can use df.apply() to apply this function to each row:
%%timeit
df['response'] = df.apply(count, axis=1)
69.1 ms ± 380 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
show(df)
- Or we could use a mask to only assign correct responses to the corresponding rows:
%%timeit
df['response'] = df['number'].astype(str)
df.loc[df['number'] % 3==0, 'response'] = 'Fizz'
df.loc[df['number'] % 5==0, 'response'] = 'Buzz'
df.loc[(df['number'] % 3==0) & (df['number'] % 5==0), 'response'] = 'FizzBuzz'
718 µs ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
show(df)
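For comparison (this is not one of the three solutions above), the same masking idea can also be written as a single vectorized expression with NumPy's np.where, reusing the np import from earlier; you can time it with %%timeit just like the other versions:
n = df['number']
df['response'] = np.where(n % 15 == 0, 'FizzBuzz',        # divisible by both 3 and 5
                 np.where(n % 3 == 0, 'Fizz',             # divisible by 3 only
                 np.where(n % 5 == 0, 'Buzz',             # divisible by 5 only
                          n.astype(str))))                # everything else: the number as a string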
Looping over data sets
Let's say we want to read several files in data-python/. We can use a for loop to go through them:
for name in ['africa.csv', 'asia.csv']:
    data = pd.read_csv('data-python/gapminder_gdp_'+name, index_col='country')
    print(name+'\n', data.min(), sep='')   # print min for each column
If we have many (tens or hundreds of) files, we want to specify them with a pattern:
from glob import glob
print('all csv files in data-python:', glob('data-python/*.csv')) # returns a list
print('all text files in data-python:', glob('data-python/*.txt')) # empty list
filenames = glob('data-python/*.csv')   # avoid naming this variable `list`: that would shadow the built-in
len(filenames)
for filename in glob('data-python/gapminder*.csv'):
    data = pd.read_csv(filename)
    print(filename, data.gdpPercap_1952.min())
Which of these files is not matched by the expression glob('data-python/*as*.csv')?
A. data-python/gapminder_gdp_africa.csv
B. data-python/gapminder_gdp_americas.csv
C. data-python/gapminder_gdp_asia.csv
D. A and B are not matched
Modify this program so that it prints the number of records in the file that has the fewest records.
fewest = ____
for filename in glob('data-python/*.csv'):
    fewest = ____
print('smallest file has', fewest, 'records')