Pandas dataframes
Reading CSV tabular data
Pandas is a widely-used Python library for working with tabular data, borrows heavily from R’s dataframes, built on top of NumPy. Let’s try reading some public-domain data about Jeopardy questions with pandas (31MB file, so it might take a while):
import pandas as pd
data = pd.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
data.shape # shape is a member variable inside data => 216930 rows, 7 columns
print(data)
data # this prints out the table nicely in Jupyter Notebook!
data.info() # info is a *member method inside data*
data.head(10) # first 10 rows
data.tail() # last 5 rows
data.columns # names of the columnsLet’s download the same online data as a file into our current directory:
!wget https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csvThis time let’s read from this file naming the rows by their respective answer:
data = pd.read_csv("jeopardy.csv", index_col='Answer')
data
data.shape # one fewer column
data.columns
data.index # row namesSubsetting
Pandas lets you subset elements using either their numerical indices or their row/column names. Long time ago Pandas used to have a single function to do both. Now there are two separate functions, .iloc and .loc. Let’s print one element:
data.iloc[0,5] # using row/column numbers
data.loc["Sinking of the Titanic", "Question"] # using row/column namesPrinting a row:
data.loc['Sinking of the Titanic',:] # usual Python's slicing notation - show all columns in that row
data.loc['Sinking of the Titanic'] # exactly the same
data.loc['Sinking of the Titanic',] # exactly the samePrinting a column:
data.loc[:,'Category'] # show all rows in that column
data['Category'] # exactly the same; single index refers to columns
data.Category # most compact notation; does not work with numerical-only namesCombining .iloc and .loc – let’s say we want to retrieve a cell by its row number and its column name:
data.iloc[0] # returns a pandas series (with many rows) => use .loc to select its row
data.iloc[0].loc["Question"] # combining numbers and names
data.iloc[0]["Question"] # the samePrinting a range:
data.iloc[10:15] # rows 10,11,12,13,14
data.columns
data.iloc[10:15].loc[:,'Category':'Question'] # rows 10-14, columns 'Category' - 'Question'
data.iloc[10:15].loc[:,['Category','Question']] # rows 10-14, columns 'Category' and 'Question'
data.loc['Copernicus', 'Air Date':'Category'] # a range of columns ... why so many lines?
data.loc['Copernicus', ['Air Date','Question']] # print two selected columnsdata['Category'] # many different categories, truncated output ...
list(data['Category']) # print all categories ... huge list
len(set(data['Category'])) # 27,983 unique categories
data['Category']=='HISTORY' # return either True or False for each rowYou can use the last expression as a mask to return only those rows where Category is “HISTORY”:
data.loc[data['Category']=='HISTORY']
data.loc[data['Category']=='HISTORY'].shape # 349 matches
data.loc[data['Category']=='HISTORY'].to_csv("history.csv") # write to a fileLet’s take a look at the value column:
list(data['Value']) # some of these contain nan
data.shape # original table: 216,930 rows
clean = data.dropna() # drop rows with nan's
clean.shape # after dropping rows with missing values: 213,144 rows
clean['Value'] # one column
clean['Value'].apply(lambda x: type(x)) # show the type of each element (fixed for each column)
clean['Value'].apply(lambda x: x.replace('$','')) # remove all $ signs
values = clean['Value'].apply(lambda x: int(x.replace('$','').replace(',','')))
mask = values>5000
clean[mask] # only show rows with Value > $5000
clean[mask].shape # thera are 345 such rowsLet’s replace the “Values” column in-place – here using the same expression on the right-hand side as before:
clean.loc[:,'Value'] = clean['Value'].apply(lambda x: int(x.replace('$','').replace(',','')))
cleanNow clean’s Value column is all numerical.
Explain in simple terms what .idxmin() and .idxmax() do in the short program below. When would you use these methods?
clean.idxmin()
clean.idxmax()Hint: Try running these:
clean.loc["Freddie And The Dreamers"]
clean.loc["Suriname"]or use help pages. This simpler example could also help:
col1 = ["1","a","A"]
col2 = ["4","b","B"]
data = pd.DataFrame({'a': col1, 'b': col2}) # dataframe from a dictionary
data.idxmin()Finally, let’s check what time period is covered by these data:
data["Air Date"]
data["Air Date"][0][-2:] # first row, last two digits is the year
year = data["Air Date"].apply(lambda x: x[-2:]) # last two digits of the year from all rows
year.min(); year.max() # '00' and '99' - not very informative, wraps at the turn of the century
for y in range(100):
twoDigits = str(y).zfill(2)
print(twoDigits, sum(year==twoDigits))This shows that this table covers years from 1984 to 2012.
Creating a dataframe from scratch
How do you create a dataframe from scratch? There are many ways; perhaps, the easiest is by defining columns (as you saw in the last exercise):
col1 = [1,2,3]
col2 = [4,5,6]
pd.DataFrame({'a': col1, 'b': col2}) # dataframe from a dictionaryWe can index (assign names to) the rows with this syntax:
pd.DataFrame({'a': col1, 'b': col2}, index=['a1','a2','a3'])Three solutions to a classification problem
Fizz buzz is a children’s game to practice divisions. Players take turn counting out loud while replacing: - any number divisible by 3 with the word “Fizz”, - any number divisible by 5 with the word “Buzz”, - any number divisible by both 3 and 5 with the word “FizzBuzz”.
Let’s implement this in pandas! First, create a simple dataframe from scratch:
import pandas as pd
import numpy as np
df = pd.DataFrame()
size = 10_000
df['number'] = np.arange(1, size+1) # create a column called "number" containing 1,2,...,sizeDefine for pretty printing:
def show(frame):
print(df.tail(15).to_string(index=False)) # print last 15 rows without the row index
show(df)Let’s built a new column response containing either “Fizz” or “Buzz” or “FizzBuzz” or the original number, based on the number value in that row. Let’s start by defining a function to process a row:
def count(row):
if (row['number'] % 3 == 0) and (row['number'] % 5 == 0):
return 'FizzBuzz'
elif row['number'] % 3 == 0:
return 'Fizz'
elif row['number'] % 5 == 0:
return 'Buzz'
else:
return str(row['number'])This is how you would use this function:
count(df.iloc[2]) # returns 'Fizz'
count(df.iloc[14]) # returns 'FizzBuzz'- We can apply this function to each row in a loop:
%%timeit
for index, row in df.iterrows():
df.loc[index, 'response'] = count(row)413 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
show(df)- We can use
df.apply()to apply this function to each row:
%%timeit
df['response'] = df.apply(count, axis=1) # axis=1 means apply along the column69.1 ms ± 380 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
show(df)- Or we could use a mask to only assign correct responses to the corresponding rows:
%%timeit
df['response'] = df['number'].astype(str)
df.loc[df['number'] % 3==0, 'response'] = 'Fizz'
df.loc[df['number'] % 5==0, 'response'] = 'Buzz'
df.loc[(df['number'] % 3==0) & (df['number'] % 5==0), 'response'] = 'FizzBuzz'718 µs ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
show(df)