More and more of my research involves some degree of ‘Big Data’, typically datasets with a million or so tweets. Getting these data prepped for analysis can involve massive amounts of data manipulation: anything from aggregating data to the daily or organizational level, to merging in additional variables, to generating the data required for social network analysis. For all of these steps I now almost exclusively use Python’s PANDAS library (the ‘Python Data Analysis Library’). Used in conjunction with the IPython Notebook interactive computing framework and NetworkX, it provides a powerful set of analysis tools.
Given that I now do almost all of my dataset manipulation, and much of my analysis, in PANDAS, and given how new I am to the library, I created this page mostly as a handy reference for all those PANDAS commands I tend to forget or find particularly useful. If it proves helpful to others, great!
IPython Notebook Settings
Set width of columns for display:
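For example, to allow up to 200 characters per column (the 200 is arbitrary):
pd.set_option('display.max_colwidth', 200)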
Set cell width:
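One common approach in the classic notebook is to inject a bit of CSS; the 95% width here is arbitrary:
from IPython.display import display, HTML
display(HTML("<style>.container { width: 95% !important; }</style>"))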
Analysis
Cross-tab (can be saved as dataframe):
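For example, with hypothetical columns ticker and year:
crosstab_df = pd.crosstab(df['ticker'], df['year'])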
Generating New Variables, Arrays, etc.
Create list from dataframe column:
Create list of unique column values in DF:
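Sketches for the two items above, assuming a hypothetical column ticker:
ticker_list = df['ticker'].tolist()
unique_tickers = df['ticker'].unique().tolist()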
Convert string variable to float:
Convert float column to int (only works if there are no missing values):
Convert column to string:
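Sketches for the three conversions above (column names are hypothetical):
df['followers'] = df['followers'].astype(float)
df['followers'] = df['followers'].astype(int)
df['fips'] = df['fips'].astype(str)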
Create new variable as a slice of an existing one:
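For example, taking the first four characters of a hypothetical string column date_str:
df['year'] = df['date_str'].str[:4]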
Replace a word within a column with another word:
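A sketch, with a hypothetical column and words:
df['content'] = df['content'].str.replace('colour', 'color')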
Fill in missing values for one column with zero:
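For example, with a hypothetical column retweet_count:
df['retweet_count'] = df['retweet_count'].fillna(0)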
Get new list of unique items in a list:
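For example, for a plain Python list my_list:
unique_items = list(set(my_list))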
Create dummy variable based on whether another column contains specific text (values will be ‘True’ and ‘False’):
Then convert to float (will convert ‘True’ and ‘False’ categories of above variable into ‘1’ and ‘0’, respectively):
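A sketch covering both steps above, with hypothetical column names and search text:
df['mentions_ceo'] = df['content'].str.contains('CEO')
df['mentions_ceo'] = df['mentions_ceo'].astype(float)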
Replace values (here, replace ‘None’ with ‘0’):
Replace values (here, replace np.nan values with ‘0’):
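Sketches for the two replacements above, assuming numpy is imported as np and a hypothetical column retweet_count:
import numpy as np
df['retweet_count'] = df['retweet_count'].replace('None', '0')
df['retweet_count'] = df['retweet_count'].replace(np.nan, '0')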
Switch values of ‘0’ and ‘1’:
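For example, for a hypothetical numeric 0/1 column dummy:
df['dummy'] = df['dummy'].map({0: 1, 1: 0})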
Create binary variable from count variable (if old var=0, assign value of 0; otherwise 1):
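A sketch, with hypothetical columns tweet_count and any_tweets:
df['any_tweets'] = (df['tweet_count'] > 0).astype(int)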
Change each unicode element in a list to string:
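For example, for a plain Python list my_list:
my_list = [str(item) for item in my_list]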
Change column values to upper case:
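For example, with a hypothetical column ticker:
df['ticker'] = df['ticker'].str.upper()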
Find number of unique values:
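For example, with a hypothetical column ticker:
df['ticker'].nunique()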
Add leading zeros to string variable (as many as needed to reach 10 characters):
Convert column to string and add leading zeros:
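Sketches for the two items above, with a hypothetical column cusip:
df['cusip'] = df['cusip'].str.zfill(10)
df['cusip'] = df['cusip'].astype(str).str.zfill(10)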
Get a list of values from a DF column:
Find number of cases with fewer than the mean value:
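For example, counting rows below the mean of a hypothetical column followers:
len(df[df['followers'] < df['followers'].mean()])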
I/O
Read in a pickled dataframe:
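For example (the file name is hypothetical):
df = pd.read_pickle('tweets.pkl')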
Read in JSON file:
import simplejson
f = open('valid_tickers.json', 'r')     # hypothetical file name
valid_tickers = simplejson.load(f)
f.close()
Read in JSON file — method 2:
import json
with open('my_list.json', 'r') as fp:   # hypothetical file name
    my_list = json.load(fp)
Save a list as JSON file:
import simplejson
f = open('valid_twitter_accounts.json', 'w')   # hypothetical file name
simplejson.dump(valid_twitter_accounts, f)
f.close()
Save a list as JSON file — method 2:
import json
with open('my_list.json', 'w') as fp:   # hypothetical file name
    json.dump(my_list, fp)
Read in Excel file:
Write Excel file:
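Sketches for both items above (file names are hypothetical):
df = pd.read_excel('firm_data.xlsx')
df.to_excel('firm_data_out.xlsx', index=False)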
Looping
Looping over rows (note how to select a slice of rows):
for index, row in df[:10].iterrows():   # df[:10] selects just the first ten rows (example slice)
    print(index, row['content'])
Loop over rows and update existing variable:
count = 0
for index, row in df.iterrows():
    count += 1                      # hypothetical per-row update
df['count'] = count
Loop over rows and create new variable, method 1:
count = 0
for index, row in df.iterrows():
    count += 1                                  # hypothetical per-row value
    #df.ix[df.index==index, 'count'] = count    #LONGER VERSION
    df.ix[index, 'count'] = count               #SHORTER VERSION
Loop over rows and create new variable, method 2:
count = 0
for index, row in df.iterrows():
    count += 1                      # hypothetical per-row value
    df.loc[index, 'count'] = count
Time Series
Weekdays:
by_day3['weekday'] = by_day['date'].apply(lambda x: x.weekday())
Add missing days (with zeros) for every day in a dataframe:
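One approach, assuming the dataframe has a datetime column date and daily rows:
df = df.set_index('date')
df = df.reindex(pd.date_range(df.index.min(), df.index.max(), freq='D'), fill_value=0)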
Change specific columns’ values to missing based on value in another column:
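A sketch, assuming numpy is imported as np (column names are hypothetical):
import numpy as np
df.loc[df['has_account'] == 0, ['followers', 'following']] = np.nan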
Set column to datetime:
Convert datetime column to date:
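Sketches for the two items above, with a hypothetical column date:
df['date'] = pd.to_datetime(df['date'])
df['day'] = df['date'].dt.date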
Generate lagged variable with multi-index DF:
Generate variable aggregated over 3-day window lagged one day:
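Sketches for the two items above, assuming a dataframe indexed by (date, firm) with a hypothetical count column mentions; this mirrors the unstack/shift/stack pattern in the custom functions below, using the .rolling API from more recent pandas versions:
df['mentions[t-1]'] = df['mentions'].unstack().shift(1).stack()
df['mentions[sum t-1:t-3]'] = df['mentions'].unstack().shift(1).rolling(window=3).sum().stack()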
Select date range for DF:
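For example, assuming a datetime column date (the dates are hypothetical):
df = df[(df['date'] >= '2015-01-01') & (df['date'] <= '2015-06-30')]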
Indexing and Sorting
Set the index:
Reset Index:
Set Multi-Index:
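Sketches for the three items above, with hypothetical columns ticker and date:
df = df.set_index('date')
df = df.reset_index()
df = df.set_index(['ticker', 'date'])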
Sort dataframe:
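For example, in more recent pandas versions (column names are hypothetical):
df = df.sort_values(by=['ticker', 'date'])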
Name Existing Multi-index columns:
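For example (the level names are hypothetical):
df.index.names = ['ticker', 'date']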
With multi-index df — get unique items per index level:
len(firm_day_counts_firm_tweets.index.levels[1])
Swap levels on multi-index dataframe:
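For example (the level names are hypothetical):
df = df.swaplevel('ticker', 'date').sort_index()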
Get minimum date in DF:
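For example, with a hypothetical column date:
df['date'].min()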
Complex conditional row selection during loop:
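A sketch with hypothetical columns and conditions:
for index, row in df.iterrows():
    if (row['tweet_count'] > 0) and (pd.isnull(row['ticker']) or row['followers'] < 100):
        df.loc[index, 'flag'] = 1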
Missing Data
Interpolation with backfill and forward fill [n.b., does not respect a multi-index]:
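A sketch, with a hypothetical column followers:
df['followers'] = df['followers'].interpolate()
df['followers'] = df['followers'].ffill().bfill()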
Find rows with empty column:
Fill missing values in one column with values from another column:
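Sketches for the two items above (column names are hypothetical):
missing = df[df['ticker'].isnull()]
df['name'] = df['name'].fillna(df['screen_name'])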
Custom Functions
Function for generating a series of new one-day lag variables in a dataframe:
#df_cols = df.columns
df_cols = [u'Number of Firm Mentions', u'Number of Firm Tweets']
for i in df_cols:
    col = str(i)+str('[t-1]')
    if '[t-1]' not in str(i):
        df[col] = df[i].unstack().shift(1).stack()
Function for generating a series of new dataframe variables that aggregate over a multi-day period:
#df_cols = df.columns
df_cols = [u'Number of Firm Mentions', u'Number of Firm Tweets']
for i in df_cols:
    col = str(i)+str('[sum t-1:t-3]')
    if '[t-1]' not in str(i):
        df[col] = pd.rolling_sum(df[i].unstack().shift(), window=3).stack()
DataFrame Manipulations
Subset of DF — based on a condition in a column:
Subset of DF — specific columns
Drop a column:
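Sketches for the three items above (column names and values are hypothetical):
df_aapl = df[df['ticker'] == 'AAPL']
df_small = df[['ticker', 'date', 'content']]
df = df.drop('temp_col', axis=1)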
Intersection of two lists:
Difference between two lists (all different elements from either list):
Difference between two lists (elements from list1 missing from list2):
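Sketches for the three list operations above, for two plain Python lists list1 and list2:
in_both = list(set(list1) & set(list2))
in_either_only = list(set(list1) ^ set(list2))
in_list1_only = list(set(list1) - set(list2))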
Create DF based on whether column value is in a list:
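For example, with a hypothetical column ticker and list sp500_tickers:
df_sp500 = df[df['ticker'].isin(sp500_tickers)]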
Create an empty dataframe:
df = pd.DataFrame(columns=columns)
Add row (row 0) to empty dataframe:
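For example, assuming the empty dataframe above was created with hypothetical columns ticker and date:
df.loc[0] = ['AAPL', '2015-01-01']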
Change cell column values for a specific row (index=16458):
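A sketch (the column names and values are hypothetical):
df.loc[16458, ['ticker', 'name']] = ['AAPL', 'Apple Inc.']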
Create dataframe from list/dictionary:
data = {'month': months}
df = pd.DataFrame(data, columns=['month', 'degree'])
Add rows to a dataframe:
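One way to do this, assuming a second dataframe new_rows_df with the same columns (hypothetical name):
df = pd.concat([df, new_rows_df], ignore_index=True)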
Create dataframe from list of column names:
from pandas import DataFrame
d = list(df.columns)    # list of the dataframe's column names
variables = DataFrame(d)
Create dataframe by deleting all rows with null values in one column:
Rename a single column:
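Sketches for the two items above (column names are hypothetical):
df = df.dropna(subset=['ticker'])
df = df.rename(columns={'old_name': 'new_name'})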
Create dataframe based on column value appearing in a list:
Look for duplicates in a dataframe:
df2 = pd.DataFrame(counts, columns=['size'])
df2 = df2[df2['size'] > 1]
df2
Create version of dataframe with conditions on two variables (for removing a duplicate firm):
Select partial dataframe — complex conditions:
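Sketches for the two items above (columns, values, and conditions are hypothetical):
df = df[~((df['ticker'] == 'GOOG') & (df['exchange'] == 'NYSE'))]
subset = df[(df['followers'] > 1000) & ((df['tweet_count'] > 0) | (df['retweet_count'] > 0))]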
Merge two dataframes:
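For example, with hypothetical dataframes and key column:
df = pd.merge(df_tweets, df_firms, on='ticker', how='left')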
Deep copy of DF:
Get a slice of dataframe (here, the two rows with the given indexes):
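Sketches for the two items above (the index labels are hypothetical):
df2 = df.copy(deep=True)
two_rows = df.loc[[16458, 16459]]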
Custom Twitter Variables
Time on Twitter:
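A sketch of one way to compute account age in days at the time of each tweet; the column names are hypothetical:
df['days_on_twitter'] = (pd.to_datetime(df['tweet_created_at']) - pd.to_datetime(df['user_created_at'])).dt.days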