More and more of my research involves some degree of ‘Big Data’, typically datasets with a million or so tweets. Getting these data prepped for analysis can involve massive amounts of data manipulation: anything from aggregating data to the daily or organizational level, to merging in additional variables, to generating the data required for social network analysis. For all of these steps I now almost exclusively use Python’s PANDAS library (the ‘Python Data Analysis Library’). Used in conjunction with the IPython Notebook interactive computing framework and NetworkX, it provides a powerful set of analysis tools.
Given that I now do almost all of my dataset manipulation, and much of my analysis, in PANDAS, and given how new I am to the library, I created this page mostly as a handy reference for all those PANDAS commands I tend to forget or find particularly useful. If it proves helpful to others, great!
IPython Notebook Settings
Set width of columns for display:
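For example, to allow up to 200 characters per column (the 200 is arbitrary):
pd.set_option('display.max_colwidth', 200)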
Set cell width:
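One common approach in the classic notebook is to inject a bit of CSS; the 95% width here is arbitrary:
from IPython.display import display, HTML
display(HTML("<style>.container { width: 95% !important; }</style>"))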
Analysis
Cross-tab (can be saved as dataframe):
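For example, with hypothetical columns ticker and year:
crosstab_df = pd.crosstab(df['ticker'], df['year'])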
Generating New Variables, Arrays, etc.
Create list from dataframe column:
Create list of unique column values in DF:
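Sketches for the two items above, assuming a hypothetical column ticker:
ticker_list = df['ticker'].tolist()
unique_tickers = df['ticker'].unique().tolist()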
Convert string variable to float:
Convert float column to int (only works if there are no missing values):
Convert column to string:
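Sketches for the three conversions above (column names are hypothetical):
df['followers'] = df['followers'].astype(float)
df['followers'] = df['followers'].astype(int)
df['fips'] = df['fips'].astype(str)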
Create new variable as a slice of an existing one:
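For example, taking the first four characters of a hypothetical string column date_str:
df['year'] = df['date_str'].str[:4]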
Replace a word within a column with another word:
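A sketch, with a hypothetical column and words:
df['content'] = df['content'].str.replace('colour', 'color')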
Fill in missing values for one column with zero:
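For example, with a hypothetical column retweet_count:
df['retweet_count'] = df['retweet_count'].fillna(0)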
Get new list of unique items in a list:
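For example, for a plain Python list my_list:
unique_items = list(set(my_list))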
Create dummy variable based on whether another column contains specific text (values will be ‘True’ and ‘False’):
Then convert to float (will convert ‘True’ and ‘False’ categories of above variable into ‘1’ and ‘0’, respectively):
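A sketch covering both steps above, with hypothetical column names and search text:
df['mentions_ceo'] = df['content'].str.contains('CEO')
df['mentions_ceo'] = df['mentions_ceo'].astype(float)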
Replace values (here, replace ‘None’ with ‘0’):
Replace values (here, replace np.nan values with ‘0’):
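Sketches for the two replacements above, assuming numpy is imported as np and a hypothetical column retweet_count:
import numpy as np
df['retweet_count'] = df['retweet_count'].replace('None', '0')
df['retweet_count'] = df['retweet_count'].replace(np.nan, '0')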
Switch values of ‘0’ and ‘1’:
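For example, for a hypothetical numeric 0/1 column dummy:
df['dummy'] = df['dummy'].map({0: 1, 1: 0})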
Create binary variable from count variable (if old var=0, assign value of 0; otherwise 1):
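A sketch, with hypothetical columns tweet_count and any_tweets:
df['any_tweets'] = (df['tweet_count'] > 0).astype(int)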
Change each unicode element in a list to string:
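For example, for a plain Python list my_list:
my_list = [str(item) for item in my_list]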
Change column values to upper case:
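For example, with a hypothetical column ticker:
df['ticker'] = df['ticker'].str.upper()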
Find number of unique values:
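For example, with a hypothetical column ticker:
df['ticker'].nunique()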
Add leading zeros to string variable (as many as needed to reach 10 characters):
Convert column to string and add leading zeros:
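Sketches for the two items above, with a hypothetical column cusip:
df['cusip'] = df['cusip'].str.zfill(10)
df['cusip'] = df['cusip'].astype(str).str.zfill(10)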
Get a list of values from a DF column:
Find number of cases with fewer than the mean value:
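For example, counting rows below the mean of a hypothetical column followers:
len(df[df['followers'] < df['followers'].mean()])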
I/O
Read in a pickled dataframe:
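For example (the file name is hypothetical):
df = pd.read_pickle('tweets.pkl')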
Read in JSON file:
import simplejson
f = open('valid_tickers.json', 'r')     # hypothetical file name
valid_tickers = simplejson.load(f)
f.close()
Read in JSON file — method 2:
import json
with open('my_list.json', 'r') as fp:   # hypothetical file name
    my_list = json.load(fp)
Save a list as JSON file:
import simplejson
f = open('valid_twitter_accounts.json', 'w')   # hypothetical file name
simplejson.dump(valid_twitter_accounts, f)
f.close()
Save a list as JSON file — method 2:
import json
with open('my_list.json', 'w') as fp:   # hypothetical file name
    json.dump(my_list, fp)
Read in Excel file:
Write Excel file:
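Sketches for both items above (file names are hypothetical):
df = pd.read_excel('firm_data.xlsx')
df.to_excel('firm_data_out.xlsx', index=False)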
Looping
Looping over rows (note how to select a slice of rows):
for index, row in df[:10].iterrows():   # df[:10] selects just the first ten rows (example slice)
    print(index, row['content'])
Loop over rows and update existing variable:
count = 0
for index, row in df.iterrows():
    count += 1                      # hypothetical per-row update
df['count'] = count
Loop over rows and create new variable, method 1:
count = 0
for index, row in df.iterrows():
    count += 1                                  # hypothetical per-row value
    #df.ix[df.index==index, 'count'] = count    #LONGER VERSION
    df.ix[index, 'count'] = count               #SHORTER VERSION
Loop over rows and create new variable, method 2:
count = 0
for index, row in df.iterrows():
    count += 1                      # hypothetical per-row value
    df.loc[index, 'count'] = count
Time Series
Weekdays:
by_day3['weekday'] = by_day['date'].apply(lambda x: x.weekday())
Add missing days (with zeros) for every day in a dataframe:
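One approach, assuming the dataframe has a datetime column date and daily rows:
df = df.set_index('date')
df = df.reindex(pd.date_range(df.index.min(), df.index.max(), freq='D'), fill_value=0)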
Change specific columns’ values to missing based on value in another column:
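A sketch, assuming numpy is imported as np (column names are hypothetical):
import numpy as np
df.loc[df['has_account'] == 0, ['followers', 'following']] = np.nan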
Set column to datetime:
Convert datetime column to date:
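Sketches for the two items above, with a hypothetical column date:
df['date'] = pd.to_datetime(df['date'])
df['day'] = df['date'].dt.date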
Generate lagged variable with multi-index DF:
Generate variable aggregated over 3-day window lagged one day:
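Sketches for the two items above, assuming a dataframe indexed by (date, firm) with a hypothetical count column mentions; this mirrors the unstack/shift/stack pattern in the custom functions below, using the .rolling API from more recent pandas versions:
df['mentions[t-1]'] = df['mentions'].unstack().shift(1).stack()
df['mentions[sum t-1:t-3]'] = df['mentions'].unstack().shift(1).rolling(window=3).sum().stack()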
Select date range for DF:
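For example, assuming a datetime column date (the dates are hypothetical):
df = df[(df['date'] >= '2015-01-01') & (df['date'] <= '2015-06-30')]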
Indexing and Sorting
Set the index:
Reset Index:
Set Multi-Index:
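Sketches for the three items above, with hypothetical columns ticker and date:
df = df.set_index('date')
df = df.reset_index()
df = df.set_index(['ticker', 'date'])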
Sort dataframe:
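For example, in more recent pandas versions (column names are hypothetical):
df = df.sort_values(by=['ticker', 'date'])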
Name Existing Multi-index columns:
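For example (the level names are hypothetical):
df.index.names = ['ticker', 'date']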
With multi-index df — get unique items per index level:
len(firm_day_counts_firm_tweets.index.levels[1])
Swap levels on multi-index dataframe:
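For example (the level names are hypothetical):
df = df.swaplevel('ticker', 'date').sort_index()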
Get minimum date in DF:
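For example, with a hypothetical column date:
df['date'].min()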
Complex conditional row selection during loop:
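A sketch with hypothetical columns and conditions:
for index, row in df.iterrows():
    if (row['tweet_count'] > 0) and (pd.isnull(row['ticker']) or row['followers'] < 100):
        df.loc[index, 'flag'] = 1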
Missing Data
Interpolation with backfill and forward fill [n.b., does not respect a multi-index]:
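A sketch, with a hypothetical column followers:
df['followers'] = df['followers'].interpolate()
df['followers'] = df['followers'].ffill().bfill()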
Find rows with empty column:
Fill missing values in one column with values from another column:
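Sketches for the two items above (column names are hypothetical):
missing = df[df['ticker'].isnull()]
df['name'] = df['name'].fillna(df['screen_name'])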
Custom Functions
Function for generating a series of new one-day lag variables in a dataframe:
#df_cols = df.columns
df_cols = [u'Number of Firm Mentions', u'Number of Firm Tweets']
for i in df_cols:
    col = str(i)+str('[t-1]')
    if '[t-1]' not in str(i):
        df[col] = df[i].unstack().shift(1).stack()
Function for generating a series of new dataframe variables that aggregate over a multi-day period:
#df_cols = df.columns
df_cols = [u'Number of Firm Mentions', u'Number of Firm Tweets']
for i in df_cols:
    col = str(i)+str('[sum t-1:t-3]')
    if '[t-1]' not in str(i):
        df[col] = pd.rolling_sum(df[i].unstack().shift(), window=3).stack()
DataFrame Manipulations
Subset of DF — based on a condition in a column:
Subset of DF — specific columns
Drop a column:
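Sketches for the three items above (column names and values are hypothetical):
df_aapl = df[df['ticker'] == 'AAPL']
df_small = df[['ticker', 'date', 'content']]
df = df.drop('temp_col', axis=1)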
Intersection of two lists:
Difference between two lists (all different elements from either list):
Difference between two lists (elements from list1 missing from list2):
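Sketches for the three list operations above, for two plain Python lists list1 and list2:
in_both = list(set(list1) & set(list2))
in_either_only = list(set(list1) ^ set(list2))
in_list1_only = list(set(list1) - set(list2))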
Create DF based on whether column value is in a list:
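For example, with a hypothetical column ticker and list sp500_tickers:
df_sp500 = df[df['ticker'].isin(sp500_tickers)]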
Create an empty dataframe:
df = pd.DataFrame(columns=columns)
Add row (row 0) to empty dataframe:
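For example, assuming the empty dataframe above was created with hypothetical columns ticker and date:
df.loc[0] = ['AAPL', '2015-01-01']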
Change cell column values for a specific row (index=16458):
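A sketch (the column names and values are hypothetical):
df.loc[16458, ['ticker', 'name']] = ['AAPL', 'Apple Inc.']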
Create dataframe from list/dictionary:
data = {'month': months}
df = pd.DataFrame(data, columns=['month', 'degree'])
Add rows to a dataframe:
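One way to do this, assuming a second dataframe new_rows_df with the same columns (hypothetical name):
df = pd.concat([df, new_rows_df], ignore_index=True)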
Create dataframe from list of column names:
from pandas import DataFrame
d = list(df.columns)    # list of the dataframe's column names
variables = DataFrame(d)
Create dataframe by deleting all rows with null values in one column:
Rename a single column:
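Sketches for the two items above (column names are hypothetical):
df = df.dropna(subset=['ticker'])
df = df.rename(columns={'old_name': 'new_name'})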
Create dataframe based on column value appearing in a list:
Look for duplicates in a dataframe:
df2 = pd.DataFrame(counts, columns=['size'])
df2 = df2[df2['size'] > 1]
df2
Create version of dataframe with conditions on two variables (for removing a duplicate firm):
Select partial dataframe — complex conditions:
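Sketches for the two items above (columns, values, and conditions are hypothetical):
df = df[~((df['ticker'] == 'GOOG') & (df['exchange'] == 'NYSE'))]
subset = df[(df['followers'] > 1000) & ((df['tweet_count'] > 0) | (df['retweet_count'] > 0))]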
Merge two dataframes:
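For example, with hypothetical dataframes and key column:
df = pd.merge(df_tweets, df_firms, on='ticker', how='left')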
Deep copy of DF:
Get a slice of dataframe (here, the two rows with the given indexes):
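Sketches for the two items above (the index labels are hypothetical):
df2 = df.copy(deep=True)
two_rows = df.loc[[16458, 16459]]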
Custom Twitter Variables
Time on Twitter:
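A sketch of one way to compute account age in days at the time of each tweet; the column names are hypothetical:
df['days_on_twitter'] = (pd.to_datetime(df['tweet_created_at']) - pd.to_datetime(df['user_created_at'])).dt.days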