This page contains brief (generally one-liner) blocks of code for working with Python and PANDAS for data analytics. I created it as a handy reference for PANDAS commands I tended to forget when I was learning. I hope it proves useful to you, too! I also have a page with longer data analytics tutorials.
Table of Contents
Jupyter Notebook Settings
Set width of columns for display:
pd.set_option('display.max_columns', None)
Set cell width:
pd.set_option('max_colwidth', 150)
Working with Python Lists
Break list into chunks of 4 (needed for some APIs, for example)
mylist_chunked = [mylist[i:i+4] for i in xrange(0, len(mylist), 4)]
Finding duplicates in a list:
import collections
mylist = ['twitter', 'facebook','instagram', 'twitter']
[item for item, count in collections.Counter(mylist).items() if count > 1]
mylist = ['twitter', 'facebook','instagram', 'twitter']
[item for item, count in collections.Counter(mylist).items() if count > 1]
Remove list element:
mylist.remove('revenue')
Working with Python Dictionaries
Delete a key from a dictionary:
del dict[key]['key_to_delete']
Create sub-dictionary (from sub-set of keys):
subdict = {k: dict[k] for k in ('key1', 'key2')}
Analysis
Cross-tabulation:
pd.crosstab(df['var1'], df['var2'])
Generating New Variables, Arrays, etc.
Create list from dataframe column:
screen_names = df[‘screen_name’].tolist()
Create list of unique column values in DF:
tickers_in_df = pd.unique(df.ticker.ravel()).tolist()
Convert string variable to float:
df[‘count_V2’] = df[‘count’].convert_objects(convert_numeric=True)
Convert float column to int (only works if there are no missing values):
df_2014[‘rt_count’] = df_2014[‘rt_count’].astype(int)
Convert column to string:
df.index = df.index.astype(str)
Create new variable as a slice of an existing one:
df[‘cusip8’] = df[‘cusip’].apply(lambda x: x[:8])
Replace a word within a column with another word:
df[‘8-K’] = df[‘8-K’].str.replace(‘Item’, ‘Section’)
Fill in missing values for one column with zero:
df_2014[‘rt_count’] = df_2014[‘rt_count’].fillna(0)
Get new list of unique items in a list:
screen_names_unique = list(set(screen_names))
Create dummy variable based on whether another column contains specific text (values will be 'True' and 'False'):
df[‘retweeted_status_dummy’] = df[‘retweeted_status’].str.contains(‘THIS’, na=False)
Then convert to float (will convert 'True' and 'False' categories of above variable into '1' and '0', respectively):
df[‘retweeted_status_dummy’] = df[‘retweeted_status_dummy’].astype(float)
Replace values (here, replace 'None' with '0'):
df[‘reply_message’] = df[‘in_reply_to_screen_name’].replace([None], [‘0’])
Replace values (here, replace np.nan values with '0'):
df.ix[df.Number_of_retweets.isnull(), ‘Number_of_retweets’] = 0
Switch values of '0' and '1':
df[‘reply_message’] = df[‘reply_message’].replace([1], [99]).replace([0],[1]).replace([99],[0])
Create binary variable from count variable (if old var=0, assign value of 0; otherwise 1):
df[‘urls’] = np.where(df[‘entities_urls_count’]==0, 0, 1)
Change each unicode element in a list to string:
tickers_in_df = [x.encode(‘UTF8’) for x in tickers_in_df]
Change column values to upper case:
df[‘Name’] = df[‘Name’].apply(lambda x: x.upper())
Change column values to upper case:
sec[‘8-K Item’] = sec[‘8-K Item’].str.upper()
Find number of unique values:
len(pd.unique(compustat_2013.cusip.ravel()))
Add leading zeros to string variable (as many as needed to reach 10 characters):
df[‘cik_x’] = df[‘cik_x’].apply(lambda x: x.zfill(10))
Convert column to string and add leading zeros:
df[‘CIK_10’] = df[‘CIK’].astype(str).apply(lambda x: x.zfill(10))
Get a list of values from a DF column:
values_list = df[‘8-K’].value_counts().keys().tolist()
Find number of cases with fewer than the mean value:
sum(df[‘followers_count’] < df['followers_count'].mean())
I/O
Read in a pickled dataframe:
ticker_master = pd.read_pickle(‘valid_tickers_317.pkl’)
Read in JSON file:
f = open(‘valid_tickers_list_317.json’, ‘r’)
valid_tickers = simplejson.load(f)
valid_tickers = simplejson.load(f)
Read in JSON file -- method 2:
with open(‘my_list.json’, ‘r’) as fp:
my_list = json.load(fp)
my_list = json.load(fp)
Save a list as JSON file:
f = open(‘all_valid_screen_names.json’, ‘w’)
simplejson.dump(valid_twitter_accounts, f)
f.close()
simplejson.dump(valid_twitter_accounts, f)
f.close()
Save a list as JSON file -- method 2:
import json
with open(‘my_list.json’, ‘w’) as fp:
json.dump(my_list, fp)
json.dump(my_list, fp)
Read in Excel file:
twitter_accounts = pd.read_excel(‘Twitter Account-level Database — non-excluded tickers and accounts only.xlsx’, ‘accounts’, header=0)
Write Excel file:
df.to_excel(‘df.xls’, sheet_name=’Sheet1′)
Looping
Looping over rows (note how to select a slice of rows):
for index, row in df[:10].iterrows():
print index, row[‘content’]
print index, row[‘content’]
Loop over rows and update existing variable:
for index, row in df.iterrows():
df[‘count’] = count
df[‘count’] = count
Loop over rows and create new variable, method 1:
for index, row in df.iterrows():
#df.ix[df.index==index, ‘count’] = count #LONGER VERSION
df.ix[index, ‘count’] = count #SHORTER VERSION
#df.ix[df.index==index, ‘count’] = count #LONGER VERSION
df.ix[index, ‘count’] = count #SHORTER VERSION
Loop over rows and create new variable, method 2:
for index, row in df.iterrows():
df.loc[index,’count’] = count
df.loc[index,’count’] = count
Time Series
Weekdays:
weekdays_only = by_day2[by_day2[‘weekday’] < 5 ]
weekday_count = df.groupby(ts.index.weekday).apply(f)
by_day3['weekday'] = by_day['date'].apply(lambda x: x.weekday())
Add missing days (with zeros) for every day in a dataframe:
df_filled = df.unstack().fillna(0).stack()
Change specific columns
df_filled.loc[df_filled[‘Number of Firm Tweets’] == 0, [‘Number of Firm Followers (start)’, ‘Number of Lists for Firm (start)’]] = np.nan
Set column to datetime:
df[‘created_at’] = pd.to_datetime(df[‘created_at’])
Convert datetime column to date:
df[‘date’] = [t.date() for t in df[‘datetime’]]
Generate lagged variable with multi-index DF:
df[‘Mentions [t-1]’] = df[‘Mentions’].unstack().shift(1).stack()
Generate variable aggregated over 3-day window lagged one day:
df[‘Mentions [sum of t-1:t-3]’] = pd.rolling_sum(df[‘Mentions’].unstack().shift(), window=3).stack()
Select date range for DF:
df_2014 = df[‘2014-1-1′:’2014-12-31’]
Indexing and Sorting
Set the index:
df = df.set_index([‘created_at’])
Reset Index:
df.reset_index(inplace=True)
Set Multi-Index:
df = df.set_index([‘ticker’, ‘day’])
Sort dataframe:
df.sort([‘ticker’, ‘day’], inplace=True)
Name Existing Multi-index columns:
df.index.names = [‘day’, ‘ticker’]
With multi-index df -- get unique items per index level:
len(firm_day_counts_firm_tweets.index.levels[0])
len(firm_day_counts_firm_tweets.index.levels[1])
len(firm_day_counts_firm_tweets.index.levels[1])
Swap levels on multi-index dataframe:
df_swapped = df.swaplevel(‘ticker’, ‘day’)
Get minimum date in DF:
df.index.min()
Complex conditional row selection during loop:
if (row[‘Form Type’] == ‘8-K’) and (row[‘8-K Item’]==”) and (row[‘8-K Item — V2’]==”) or (row[‘8-K Item’]==’Item that’) or (row[‘8-K Item’]==’Item fo’):
Missing Data
Interpolation with backfill and forward fill (n.b. - does not respect multi-index):
df[‘F_x)’]=df[‘F’].interpolate(method=’linear’).bfill().ffill()
Find rows with empty column:
missing = df[df[‘ticker’].isnull()]
Fill missing values in one column with values from another column:
sec.loc[sec[‘8-K Item_v2′].isnull(),’8-K Item_v2’] = sec[‘8-K Item’]
Custom Functions
Function for generating a series of new one-day lag variables in a dataframe:
def lag_one_day(df):
#df_cols = df.columns
df_cols = [u’Number of Firm Mentions’, u’Number of Firm Tweets’] for i in df_cols:
col = str(i)+str(‘[t-1]’)
if ‘[t-1]’ not in str(i):
df[col] = df[i].unstack().shift(1).stack()
#df_cols = df.columns
df_cols = [u’Number of Firm Mentions’, u’Number of Firm Tweets’] for i in df_cols:
col = str(i)+str(‘[t-1]’)
if ‘[t-1]’ not in str(i):
df[col] = df[i].unstack().shift(1).stack()
Function for generating a series of new dataframe variables that aggregate over a multi-day period:
def lag_one_day(df):
#df_cols = df.columns
df_cols = [u’Number of Firm Mentions’, u’Number of Firm Tweets’] for i in df_cols:
col = str(i)+str(‘[sum t-1:t-3]’)
if ‘[t-1]’ not in str(i):
df[col] = pd.rolling_sum(df[i].unstack().shift(),
window=3).stack()
#df_cols = df.columns
df_cols = [u’Number of Firm Mentions’, u’Number of Firm Tweets’] for i in df_cols:
col = str(i)+str(‘[sum t-1:t-3]’)
if ‘[t-1]’ not in str(i):
df[col] = pd.rolling_sum(df[i].unstack().shift(),
window=3).stack()
DataFrame Manipulations
Subset of DF -- based on a condition in a column:
df[df[‘reply_message’] == 1]
Subset of DF -- specific columns:
df[[‘in_reply_to_screen_name’, ‘in_reply_to_status_id’, ‘reply_message’]].head()
Drop a column:
df = df.drop(‘entities_urls’,1)
Intersection of two lists:
pd.Series(np.intersect1d(pd.Series(tickers_in_df), pd.Series(valid_tickers)))
Difference between two lists (all different elements from either list):
set(tickers_in_df).symmetric_difference(valid_tickers_with_twitter)
Difference between two lists (elements from list1 missing from list2):
set(list1) – set(list2)
Create DF based on whether column value is in a list:
df = df[df.ticker.isin(mylist)]
Creat an empty dataframe:
columns = [‘ticker’, ‘month’, ‘degree’]
df = pd.DataFrame(columns=columns)
Add row (row 0) to empty dataframe:
df.loc[0] = pd.Series({‘ticker’:’Z’, ‘month’:’12’, ‘degree’: ”})
Change cell column values for a specific row (index=16458):
sec.ix[16458, “8-K Item — V2”] = ‘Item 2.01’
Create dataframe from list/dictionary:
months = [‘2014-1’, ‘2014-2’, ‘2104-3’]
data = {‘month’: months}
df = pd.DataFrame(data, columns=[‘month’, ‘degree’])
Add rows to a dataframe:
df = df.append(df_month)
Create dataframe from list of column names:
d = {‘Variable Name’: list(df.columns.values)}
variables = DataFrame(d)
variables = DataFrame(d)
Create dataframe by deleting all rows with null values in one column:
df2 = df[df[‘keep’].notnull()]
Rename a single column:
df.rename(columns={‘cusip’:’cusip_COMPUSTAT’}, inplace=True)
Create dataframe based on column value appearing in a list:
missing = df[df[‘cusip’].isin(mylist)]
Look for duplicates in a dataframe:
counts = merged.groupby(‘cusip’).size()
df2 = pd.DataFrame(counts, columns = [‘size’])
df2 = df2[df2.size>1]
df2
df2 = pd.DataFrame(counts, columns = [‘size’])
df2 = df2[df2.size>1]
df2
Create version of dataframe with conditions on two variables (for removing a duplicate firm):
merged = merged[~((merged[‘fyear’]==2012) & (merged[‘gvkey’]==’176103′))]
Select partial dataframe -- complex conditions:
sec[(sec[‘Form Type’] == ‘8-K’) & (sec[‘8-K Item’]==”) & (sec[‘8-K Item — V2’]==”) | (sec[‘8-K Item’]==’Item fo’) & (sec[‘8-K Item — V2’]==”)| (sec[‘8-K Item’]==’Item that’) & (sec[‘8-K Item — V2’]==”)]
Merge two dataframes:
merged = pd.merge(fb, org_data, left_on=’feed_id’, right_on=’Org_ID’)
Deep copy of DF:
df2 = df.copy(deep=True)
Get a slice of dataframe (here, the two rows with the given indexes):
df.loc[11516:11517]
Custom Twitter Variables
Time on Twitter:
import datetime
df[‘start’]=np.datetime64(‘2016-04-01’)
df[‘created_at’] = pd.to_datetime(df[‘created_at’])
df[‘time_on_twitter’] = df[‘start’] – df[‘created_at’] df[‘time_on_twitter_days’] = df[‘time_on_twitter’].astype(‘timedelta64[D]’)
df[‘time_on_twitter_days’] = df.time_on_twitter_days.astype(‘int’)
df[‘start’]=np.datetime64(‘2016-04-01’)
df[‘created_at’] = pd.to_datetime(df[‘created_at’])
df[‘time_on_twitter’] = df[‘start’] – df[‘created_at’] df[‘time_on_twitter_days’] = df[‘time_on_twitter’].astype(‘timedelta64[D]’)
df[‘time_on_twitter_days’] = df.time_on_twitter_days.astype(‘int’)
Working with MongoDB in PANDAS
Show first 2 documents:
for user in users.find()[:2]:
print user, ‘\n’
print user, ‘\n’
Get frequency counts:
from bson.son import SON
pipeline = [ {“$group”: {“_id”: “$FormType”, “count”: {“$sum”: 1}}} ] list(file_list.aggregate(pipeline))
pipeline = [ {“$group”: {“_id”: “$FormType”, “count”: {“$sum”: 1}}} ] list(file_list.aggregate(pipeline))
Loop over random sample of 100 with filtering:
for file in file_list.aggregate([
{ ‘$match’: {‘FormType’: { ‘$in’: [‘990PF’]}} },
{ ‘$sample’: {‘size’: 100} }
]):
print file
{ ‘$match’: {‘FormType’: { ‘$in’: [‘990PF’]}} },
{ ‘$sample’: {‘size’: 100} }
]):
print file
I hope you have found this helpful. If so, please spread the word, and happy coding!