IRS 990 e-File Data (8) — Data Wrangling and Export to Stata




IRS 990 e-File Data (7) — Generate Codebook




IRS 990 e-File Data (6) — Download IRS 990, 990EZ, and 990PF Filings and Associated Schedules into MongoDB




IRS 990 e-File Data (part 5) — Download IRS 990 Filings and Associated Schedules into MongoDB




IRS 990 e-File Data (part 4) — Download IRS 990 Filings into MongoDB




IRS 990 e-File Data (part 3) — Load Index Files into PANDAS




IRS 990 e-File Data (part 2) — Load Index Data and Insert into MongoDB




IRS 990 e-File Data (part 1) — Set up AWS CLI credentials and grab index files




Tutorials for Sarbanes-Oxley Paper Data

Dan Neely (from University of Milwaukee-Wisconsin) and I just had the following article published at the Journal of Business Ethics:

Saxton, G. D., & Neely, D. G. (2018). The Relationship Between Sarbanes–Oxley Policies and Donor Advisories in Nonprofit Organizations. Journal of Business Ethics.

This page contains tutorials on how to download the IRS 990 e-file data that was used for the control variables in our study.

Tutorials

I hope you have found this helpful. If so, please spread the word, and happy coding!




Python PANDAS Code Bytes

This page contains brief (generally one-liner) blocks of code for working with Python and PANDAS for data analytics. I created it as a handy reference for PANDAS commands I tended to forget when I was learning. I hope it proves useful to you, too! I also have a page with longer data analytics tutorials.

Table of Contents

Jupyter Notebook Settings

Set width of columns for display:

Set cell width:

Working with Python Lists

Break list into chunks of 4 (needed for some APIs, for example)

Finding duplicates in a list:

Remove list element:

Working with Python Dictionaries

Delete a key from a dictionary:

Create sub-dictionary (from sub-set of keys):

Analysis

Cross-tabulation:

Generating New Variables, Arrays, etc.

Create list from dataframe column:

Create list of unique column values in DF:

Convert string variable to float:

Convert float column to int (only works if there are no missing values):

Convert column to string:

Create new variable as a slice of an existing one:

Replace a word within a column with another word:

Fill in missing values for one column with zero:

Get new list of unique items in a list:

Create dummy variable based on whether another column contains specific text (values will be 'True' and 'False'):

Then convert to float (will convert 'True' and 'False' categories of above variable into '1' and '0', respectively):

Replace values (here, replace 'None' with '0'):

Replace values (here, replace np.nan values with '0'):

Switch values of '0' and '1':

Create binary variable from count variable (if old var=0, assign value of 0; otherwise 1):

Change each unicode element in a list to string:

Change column values to upper case:

Change column values to upper case:

Find number of unique values:

Add leading zeros to string variable (as many as needed to reach 10 characters):

Convert column to string and add leading zeros:

Get a list of values from a DF column:

Find number of cases with fewer than the mean value:

I/O

Read in a pickled dataframe:

Read in JSON file:

Read in JSON file -- method 2:

Save a list as JSON file:

Save a list as JSON file -- method 2:

Read in Excel file:

Write Excel file:

Looping

Looping over rows (note how to select a slice of rows):

Loop over rows and update existing variable:

Loop over rows and create new variable, method 1:

Loop over rows and create new variable, method 2:

Time Series

Weekdays:

Add missing days (with zeros) for every day in a dataframe:

Change specific columns

Set column to datetime:

Convert datetime column to date:

Generate lagged variable with multi-index DF:

Generate variable aggregated over 3-day window lagged one day:

Select date range for DF:

Indexing and Sorting

Set the index:

Reset Index:

Set Multi-Index:

Sort dataframe:

Name Existing Multi-index columns:

With multi-index df -- get unique items per index level:

Swap levels on multi-index dataframe:

Get minimum date in DF:

Complex conditional row selection during loop:

Missing Data

Interpolation with backfill and forward fill (n.b. - does not respect multi-index):

Find rows with empty column:

Fill missing values in one column with values from another column:

Custom Functions

Function for generating a series of new one-day lag variables in a dataframe:

Function for generating a series of new dataframe variables that aggregate over a multi-day period:

DataFrame Manipulations

Subset of DF -- based on a condition in a column:

Subset of DF -- specific columns:

Drop a column:

Intersection of two lists:

Difference between two lists (all different elements from either list):

Difference between two lists (elements from list1 missing from list2):

Create DF based on whether column value is in a list:

Creat an empty dataframe:

Add row (row 0) to empty dataframe:

Change cell column values for a specific row (index=16458):

Create dataframe from list/dictionary:

Add rows to a dataframe:

Create dataframe from list of column names:

Create dataframe by deleting all rows with null values in one column:

Rename a single column:

Create dataframe based on column value appearing in a list:

Look for duplicates in a dataframe:

Create version of dataframe with conditions on two variables (for removing a duplicate firm):

Select partial dataframe -- complex conditions:

Merge two dataframes:

Deep copy of DF:

Get a slice of dataframe (here, the two rows with the given indexes):

Custom Twitter Variables

Time on Twitter:

Working with MongoDB in PANDAS

Show first 2 documents:

Get frequency counts:

Loop over random sample of 100 with filtering:

I hope you have found this helpful. If so, please spread the word, and happy coding!