• Skip to primary navigation
  • Skip to main content

Social Metrics

Social Media, Organizations, and Academic Research

  • Home
  • Blog
  • Publications
  • Teaching
  • Python
  • About Me
  • Show Search
Hide Search

Python PANDAS Code Bytes

This page contains brief (generally one-liner) blocks of code for working with Python and PANDAS for data analytics. I created it as a handy reference for PANDAS commands I tended to forget when I was learning. I hope it proves useful to you, too! I also have a page with longer data analytics tutorials.

Table of Contents

  • Jupyter Notebook Settings
  • Working with Python Lists
  • Working with Python Dictionaries
  • Analysis
  • Generating New Variables, Arrays, etc.
  • I/O
  • Looping
  • Time Series
  • Indexing and Sorting
  • Missing Data
  • Custom Functions
  • DataFrame Manipulations
  • Custom Twitter Variables
  • Working with MongoDB in Python

Jupyter Notebook Settings

Set width of columns for display:
pd.set_option('display.max_columns', None)
Set cell width:
pd.set_option('max_colwidth', 150)

Working with Python Lists

Break list into chunks of 4 (needed for some APIs, for example)
mylist_chunked = [mylist[i:i+4] for i in xrange(0, len(mylist), 4)]
Finding duplicates in a list:
import collections
mylist = ['twitter', 'facebook','instagram', 'twitter']
[item for item, count in collections.Counter(mylist).items() if count > 1]
Remove list element:
mylist.remove('revenue')

Working with Python Dictionaries

Delete a key from a dictionary:
del dict[key]['key_to_delete']
Create sub-dictionary (from sub-set of keys):
subdict = {k: dict[k] for k in ('key1', 'key2')}

Analysis

Cross-tabulation:
pd.crosstab(df['var1'], df['var2'])

Generating New Variables, Arrays, etc.

Create list from dataframe column:
screen_names = df[‘screen_name’].tolist()
Create list of unique column values in DF:
tickers_in_df = pd.unique(df.ticker.ravel()).tolist()
Convert string variable to float:
df[‘count_V2’] = df[‘count’].convert_objects(convert_numeric=True)
Convert float column to int (only works if there are no missing values):
df_2014[‘rt_count’] = df_2014[‘rt_count’].astype(int)
Convert column to string:
df.index = df.index.astype(str)
Create new variable as a slice of an existing one:
df[‘cusip8’] = df[‘cusip’].apply(lambda x: x[:8])
Replace a word within a column with another word:
df[‘8-K’] = df[‘8-K’].str.replace(‘Item’, ‘Section’)
Fill in missing values for one column with zero:
df_2014[‘rt_count’] = df_2014[‘rt_count’].fillna(0)
Get new list of unique items in a list:
screen_names_unique = list(set(screen_names))
Create dummy variable based on whether another column contains specific text (values will be 'True' and 'False'):
df[‘retweeted_status_dummy’] = df[‘retweeted_status’].str.contains(‘THIS’, na=False)
Then convert to float (will convert 'True' and 'False' categories of above variable into '1' and '0', respectively):
df[‘retweeted_status_dummy’] = df[‘retweeted_status_dummy’].astype(float)
Replace values (here, replace 'None' with '0'):
df[‘reply_message’] = df[‘in_reply_to_screen_name’].replace([None], [‘0’])
Replace values (here, replace np.nan values with '0'):
df.ix[df.Number_of_retweets.isnull(), ‘Number_of_retweets’] = 0
Switch values of '0' and '1':
df[‘reply_message’] = df[‘reply_message’].replace([1], [99]).replace([0],[1]).replace([99],[0])
Create binary variable from count variable (if old var=0, assign value of 0; otherwise 1):
df[‘urls’] = np.where(df[‘entities_urls_count’]==0, 0, 1)
Change each unicode element in a list to string:
tickers_in_df = [x.encode(‘UTF8’) for x in tickers_in_df]
Change column values to upper case:
df[‘Name’] = df[‘Name’].apply(lambda x: x.upper())
Change column values to upper case:
sec[‘8-K Item’] = sec[‘8-K Item’].str.upper()
Find number of unique values:
len(pd.unique(compustat_2013.cusip.ravel()))
Add leading zeros to string variable (as many as needed to reach 10 characters):
df[‘cik_x’] = df[‘cik_x’].apply(lambda x: x.zfill(10))
Convert column to string and add leading zeros:
df[‘CIK_10’] = df[‘CIK’].astype(str).apply(lambda x: x.zfill(10))
Get a list of values from a DF column:
values_list = df[‘8-K’].value_counts().keys().tolist()
Find number of cases with fewer than the mean value:
sum(df[‘followers_count’] < df['followers_count'].mean())

I/O

Read in a pickled dataframe:
ticker_master = pd.read_pickle(‘valid_tickers_317.pkl’)
Read in JSON file:
f = open(‘valid_tickers_list_317.json’, ‘r’)
valid_tickers = simplejson.load(f)
Read in JSON file -- method 2:
with open(‘my_list.json’, ‘r’) as fp:
    my_list = json.load(fp)
Save a list as JSON file:
f = open(‘all_valid_screen_names.json’, ‘w’)
simplejson.dump(valid_twitter_accounts, f)
f.close()
Save a list as JSON file -- method 2:
import json

with open(‘my_list.json’, ‘w’) as fp:
    json.dump(my_list, fp)
Read in Excel file:
twitter_accounts = pd.read_excel(‘Twitter Account-level Database — non-excluded tickers and accounts only.xlsx’, ‘accounts’, header=0)
Write Excel file:
df.to_excel(‘df.xls’, sheet_name=’Sheet1′)

Looping

Looping over rows (note how to select a slice of rows):
for index, row in df[:10].iterrows():
        print index, row[‘content’]
Loop over rows and update existing variable:
for index, row in df.iterrows():
        df[‘count’] = count
Loop over rows and create new variable, method 1:
for index, row in df.iterrows():
        #df.ix[df.index==index, ‘count’] = count #LONGER VERSION
        df.ix[index, ‘count’] = count #SHORTER VERSION
Loop over rows and create new variable, method 2:
for index, row in df.iterrows():
        df.loc[index,’count’] = count

Time Series

Weekdays:
weekdays_only = by_day2[by_day2[‘weekday’] < 5 ] weekday_count = df.groupby(ts.index.weekday).apply(f) by_day3['weekday'] = by_day['date'].apply(lambda x: x.weekday())
Add missing days (with zeros) for every day in a dataframe:
df_filled = df.unstack().fillna(0).stack()
Change specific columns
df_filled.loc[df_filled[‘Number of Firm Tweets’] == 0, [‘Number of Firm Followers (start)’, ‘Number of Lists for Firm (start)’]] = np.nan
Set column to datetime:
df[‘created_at’] = pd.to_datetime(df[‘created_at’])
Convert datetime column to date:
df[‘date’] = [t.date() for t in df[‘datetime’]]
Generate lagged variable with multi-index DF:
df[‘Mentions [t-1]’] = df[‘Mentions’].unstack().shift(1).stack()
Generate variable aggregated over 3-day window lagged one day:
df[‘Mentions [sum of t-1:t-3]’] = pd.rolling_sum(df[‘Mentions’].unstack().shift(), window=3).stack()
Select date range for DF:
df_2014 = df[‘2014-1-1′:’2014-12-31’]

Indexing and Sorting

Set the index:
df = df.set_index([‘created_at’])
Reset Index:
df.reset_index(inplace=True)
Set Multi-Index:
df = df.set_index([‘ticker’, ‘day’])
Sort dataframe:
df.sort([‘ticker’, ‘day’], inplace=True)
Name Existing Multi-index columns:
df.index.names = [‘day’, ‘ticker’]
With multi-index df -- get unique items per index level:
len(firm_day_counts_firm_tweets.index.levels[0])
len(firm_day_counts_firm_tweets.index.levels[1])
Swap levels on multi-index dataframe:
df_swapped = df.swaplevel(‘ticker’, ‘day’)
Get minimum date in DF:
df.index.min()
Complex conditional row selection during loop:
if (row[‘Form Type’] == ‘8-K’) and (row[‘8-K Item’]==”) and (row[‘8-K Item — V2’]==”) or (row[‘8-K Item’]==’Item that’) or (row[‘8-K Item’]==’Item fo’):

Missing Data

Interpolation with backfill and forward fill (n.b. - does not respect multi-index):
df[‘F_x)’]=df[‘F’].interpolate(method=’linear’).bfill().ffill()
Find rows with empty column:
missing = df[df[‘ticker’].isnull()]
Fill missing values in one column with values from another column:
sec.loc[sec[‘8-K Item_v2′].isnull(),’8-K Item_v2’] = sec[‘8-K Item’]

Custom Functions

Function for generating a series of new one-day lag variables in a dataframe:
def lag_one_day(df):
    #df_cols = df.columns
    df_cols = [u’Number of Firm Mentions’, u’Number of Firm Tweets’]     for i in df_cols:
        col = str(i)+str(‘[t-1]’)
        if ‘[t-1]’ not in str(i):
            df[col] = df[i].unstack().shift(1).stack()
Function for generating a series of new dataframe variables that aggregate over a multi-day period:
def lag_one_day(df):
    #df_cols = df.columns
    df_cols = [u’Number of Firm Mentions’, u’Number of Firm Tweets’]     for i in df_cols:
        col = str(i)+str(‘[sum t-1:t-3]’)
        if ‘[t-1]’ not in str(i):
            df[col] = pd.rolling_sum(df[i].unstack().shift(),
            window=3).stack()

DataFrame Manipulations

Subset of DF -- based on a condition in a column:
df[df[‘reply_message’] == 1]
Subset of DF -- specific columns:
df[[‘in_reply_to_screen_name’, ‘in_reply_to_status_id’, ‘reply_message’]].head()
Drop a column:
df = df.drop(‘entities_urls’,1)
Intersection of two lists:
pd.Series(np.intersect1d(pd.Series(tickers_in_df), pd.Series(valid_tickers)))
Difference between two lists (all different elements from either list):
set(tickers_in_df).symmetric_difference(valid_tickers_with_twitter)
Difference between two lists (elements from list1 missing from list2):
set(list1) – set(list2)
Create DF based on whether column value is in a list:
df = df[df.ticker.isin(mylist)]
Creat an empty dataframe:
columns = [‘ticker’, ‘month’, ‘degree’]

df = pd.DataFrame(columns=columns)

Add row (row 0) to empty dataframe:
df.loc[0] = pd.Series({‘ticker’:’Z’, ‘month’:’12’, ‘degree’: ”})
Change cell column values for a specific row (index=16458):
sec.ix[16458, “8-K Item — V2”] = ‘Item 2.01’
Create dataframe from list/dictionary:
months = [‘2014-1’, ‘2014-2’, ‘2104-3’]

data = {‘month’: months}
df = pd.DataFrame(data, columns=[‘month’, ‘degree’])

Add rows to a dataframe:
df = df.append(df_month)
Create dataframe from list of column names:
d = {‘Variable Name’: list(df.columns.values)}
variables = DataFrame(d)
Create dataframe by deleting all rows with null values in one column:
df2 = df[df[‘keep’].notnull()]
Rename a single column:
df.rename(columns={‘cusip’:’cusip_COMPUSTAT’}, inplace=True)
Create dataframe based on column value appearing in a list:
missing = df[df[‘cusip’].isin(mylist)]
Look for duplicates in a dataframe:
counts = merged.groupby(‘cusip’).size()
df2 = pd.DataFrame(counts, columns = [‘size’])
df2 = df2[df2.size>1]
df2
Create version of dataframe with conditions on two variables (for removing a duplicate firm):
merged = merged[~((merged[‘fyear’]==2012) & (merged[‘gvkey’]==’176103′))]
Select partial dataframe -- complex conditions:
sec[(sec[‘Form Type’] == ‘8-K’) & (sec[‘8-K Item’]==”) & (sec[‘8-K Item — V2’]==”) | (sec[‘8-K Item’]==’Item fo’) & (sec[‘8-K Item — V2’]==”)| (sec[‘8-K Item’]==’Item that’) & (sec[‘8-K Item — V2’]==”)]
Merge two dataframes:
merged = pd.merge(fb, org_data, left_on=’feed_id’, right_on=’Org_ID’)
Deep copy of DF:
df2 = df.copy(deep=True)
Get a slice of dataframe (here, the two rows with the given indexes):
df.loc[11516:11517]

Custom Twitter Variables

Time on Twitter:
import datetime
df[‘start’]=np.datetime64(‘2016-04-01’)
df[‘created_at’] = pd.to_datetime(df[‘created_at’])
df[‘time_on_twitter’] = df[‘start’] – df[‘created_at’] df[‘time_on_twitter_days’] = df[‘time_on_twitter’].astype(‘timedelta64[D]’)
df[‘time_on_twitter_days’] = df.time_on_twitter_days.astype(‘int’)

Working with MongoDB in PANDAS

Show first 2 documents:
for user in users.find()[:2]:
    print user, ‘\n’
Get frequency counts:
from bson.son import SON
pipeline = [ {“$group”: {“_id”: “$FormType”, “count”: {“$sum”: 1}}} ] list(file_list.aggregate(pipeline))
Loop over random sample of 100 with filtering:
for file in file_list.aggregate([
    { ‘$match’: {‘FormType’: { ‘$in’: [‘990PF’]}} },
    { ‘$sample’: {‘size’: 100} }
    ]):
    print file

I hope you have found this helpful. If so, please spread the word, and happy coding!

Share Button

Written by:
Gregory Saxton
Published on:
March 24, 2018
Thoughts:
No comments yet

Categories: Big Data, Data Analytics, jupyter_notebook, pandas, research, TwitterTags: Big Data, Data Analytics, PANDAS, python

Explore more

Contact Information

Gregory D. Saxton
Schulich School of Business
York University
Toronto, ON
gsaxton@yorku.ca

Footer

Copyright © 2023 · Navigation Pro on Genesis Framework · WordPress · Log in

Keep in Touch

  • Email
  • Facebook
  • GitHub
  • LinkedIn
  • Twitter
  • About Me
  • Media Mentions
  • Contact