IRS 990 e-File Data (part 3) — Load Index Files into PANDAS




IRS 990 e-File Data (part 2) — Load Index Data and Insert into MongoDB




IRS 990 e-File Data (part 1) — Set up AWS CLI credentials and grab index files




Tutorials for Sarbanes-Oxley Paper Data

Dan Neely (from the University of Wisconsin-Milwaukee) and I just had the following article published at the Journal of Business Ethics:

Saxton, G. D., & Neely, D. G. (2018). The Relationship Between Sarbanes–Oxley Policies and Donor Advisories in Nonprofit Organizations. Journal of Business Ethics.

This page contains tutorials on how to download the IRS 990 e-file data that was used for the control variables in our study.

Tutorials

IRS 990 e-File Data (part 1) — Set up AWS CLI credentials and grab index files
IRS 990 e-File Data (part 2) — Load Index Data and Insert into MongoDB
IRS 990 e-File Data (part 3) — Load Index Files into PANDAS

I hope you have found this helpful. If so, please spread the word, and happy coding!




Python PANDAS Code Bytes

This page contains brief (generally one-liner) blocks of code for working with Python and PANDAS for data analytics. I created it as a handy reference for PANDAS commands I tended to forget when I was learning. I hope it proves useful to you, too! I also have a page with longer data analytics tutorials.

Table of Contents

Jupyter Notebook Settings
Working with Python Lists
Working with Python Dictionaries
Analysis
Generating New Variables, Arrays, etc.
I/O
Looping
Time Series
Indexing and Sorting
Missing Data
Custom Functions
DataFrame Manipulations
Custom Twitter Variables
Working with MongoDB in PANDAS

Jupyter Notebook Settings

Set width of columns for display:

Set cell width:
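
One way to do these two settings (a sketch; the 200-character column width is just an example value):

# Set width of columns for display
import pandas as pd
pd.set_option('display.max_colwidth', 200)

# Set cell width -- stretch notebook cells to the full browser width
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))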

Working with Python Lists

Break list into chunks of 4 (needed for some APIs, for example):

Finding duplicates in a list:

Remove list element:
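
Minimal sketches of these three list operations (the example list is made up):

from collections import Counter

mylist = ['a', 'b', 'c', 'd', 'e', 'b', 'f']   # hypothetical example list

# Break list into chunks of 4
chunks = [mylist[i:i + 4] for i in range(0, len(mylist), 4)]

# Finding duplicates in a list
duplicates = [item for item, count in Counter(mylist).items() if count > 1]

# Remove list element (first occurrence, by value)
mylist.remove('b')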

Working with Python Dictionaries

Delete a key from a dictionary:

Create sub-dictionary (from sub-set of keys):
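
A quick sketch of both dictionary operations (the dictionary and keys are hypothetical):

d = {'name': 'ACME FOUNDATION', 'state': 'NY', 'revenue': 100000}

# Delete a key from a dictionary
del d['revenue']

# Create sub-dictionary (from sub-set of keys)
keys_to_keep = ['name', 'state']
sub_d = {k: d[k] for k in keys_to_keep if k in d}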

Analysis

Cross-tabulation:
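
For example, a cross-tab of two columns (assuming a dataframe df; the column names here are hypothetical):

import pandas as pd

pd.crosstab(df['SOX_policy'], df['donor_advisory'])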

Generating New Variables, Arrays, etc.

Create list from dataframe column:

Create list of unique column values in DF:

Convert string variable to float:

Convert float column to int (only works if there are no missing values):

Convert column to string:

Create new variable as a slice of an existing one:

Replace a word within a column with another word:

Fill in missing values for one column with zero:
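
Hedged sketches of the snippets above, assuming a dataframe df; all column names are hypothetical:

# Create list from dataframe column
col_list = df['total_revenue'].tolist()

# Create list of unique column values in DF
unique_vals = df['state'].unique().tolist()

# Convert string variable to float
df['total_revenue'] = df['total_revenue'].astype(float)

# Convert float column to int (only works if there are no missing values)
df['num_employees'] = df['num_employees'].astype(int)

# Convert column to string
df['EIN'] = df['EIN'].astype(str)

# Create new variable as a slice of an existing one (here, the first four characters)
df['year'] = df['fiscal_period'].str[:4]

# Replace a word within a column with another word
df['org_name'] = df['org_name'].str.replace('INCORPORATED', 'INC')

# Fill in missing values for one column with zero
df['num_employees'] = df['num_employees'].fillna(0)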

Get new list of unique items in a list:

Create dummy variable based on whether another column contains specific text (values will be 'True' and 'False'):

Then convert to float (will convert 'True' and 'False' categories of above variable into '1' and '0', respectively):

Replace values (here, replace 'None' with '0'):

Replace values (here, replace np.nan values with '0'):

Switch values of '0' and '1':

Create binary variable from count variable (if old var=0, assign value of 0; otherwise 1):
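
Possible versions of the above, again with a hypothetical df and made-up column names:

import numpy as np

# Get new list of unique items in a list
unique_items = list(set(mylist))

# Create dummy variable based on whether another column contains specific text
df['SOX_mention'] = df['mission_text'].str.contains('Sarbanes', na=False)

# Then convert to float ('True' and 'False' become 1.0 and 0.0)
df['SOX_mention'] = df['SOX_mention'].astype(float)

# Replace values (here, replace 'None' with '0')
df['num_board_members'] = df['num_board_members'].replace('None', '0')

# Replace values (here, replace np.nan values with '0')
df['num_board_members'] = df['num_board_members'].replace(np.nan, '0')

# Switch values of '0' and '1'
df['flag'] = df['flag'].map({0: 1, 1: 0})

# Create binary variable from count variable (if old var=0, assign 0; otherwise 1)
df['any_complaints'] = np.where(df['num_complaints'] == 0, 0, 1)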

Change each unicode element in a list to string:

Change column values to upper case:

Find number of unique values:

Add leading zeros to string variable (as many as needed to reach 10 characters):

Convert column to string and add leading zeros:

Get a list of values from a DF column:

Find number of cases with fewer than the mean value:
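
Sketches for this last group, with hypothetical column names:

# Change each unicode element in a list to string
string_list = [str(x) for x in mylist]

# Change column values to upper case
df['org_name'] = df['org_name'].str.upper()

# Find number of unique values
df['state'].nunique()

# Add leading zeros to string variable (as many as needed to reach 10 characters)
df['EIN'] = df['EIN'].str.zfill(10)

# Convert column to string and add leading zeros
df['EIN'] = df['EIN'].astype(str).str.zfill(10)

# Get a list of values from a DF column
values = df['total_revenue'].tolist()

# Find number of cases with fewer than the mean value
len(df[df['total_revenue'] < df['total_revenue'].mean()])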

I/O

Read in a pickled dataframe:

Read in JSON file:

Read in JSON file -- method 2:

Save a list as JSON file:

Save a list as JSON file -- method 2:

Read in Excel file:

Write Excel file:
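
One way to do each of these I/O operations (the file names are placeholders):

import json
import pandas as pd

# Read in a pickled dataframe
df = pd.read_pickle('data.pkl')

# Read in JSON file
df = pd.read_json('data.json')

# Read in JSON file -- method 2 (into a plain Python object)
with open('data.json', 'r') as f:
    data = json.load(f)

# Save a list as JSON file
with open('output.json', 'w') as f:
    json.dump(mylist, f)

# Save a list as JSON file -- method 2 (write the serialized string yourself)
with open('output.json', 'w') as f:
    f.write(json.dumps(mylist))

# Read in Excel file
df = pd.read_excel('data.xlsx')

# Write Excel file
df.to_excel('output.xlsx', index=False)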

Looping

Looping over rows (note how to select a slice of rows):

Loop over rows and update existing variable:

Loop over rows and create new variable, method 1:

Loop over rows and create new variable, method 2:
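
Illustrative versions of the looping patterns above (a hypothetical df and columns; sketches rather than the exact code from the original post):

# Looping over rows (note how to select a slice of rows)
for index, row in df[:10].iterrows():
    print(index, row['org_name'])

# Loop over rows and update existing variable
for index, row in df.iterrows():
    df.loc[index, 'org_name'] = row['org_name'].strip().upper()

# Loop over rows and create new variable, method 1 (.loc creates the column as it goes)
for index, row in df.iterrows():
    df.loc[index, 'revenue_millions'] = row['total_revenue'] / 1000000.0

# Loop over rows and create new variable, method 2 (build a list, then assign it)
new_values = []
for index, row in df.iterrows():
    new_values.append(row['total_revenue'] / 1000000.0)
df['revenue_millions'] = new_values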

Time Series

Weekdays:

Add missing days (with zeros) for every day in a dataframe:

Change specific columns

Set column to datetime:

Convert datetime column to date:

Generate lagged variable with multi-index DF:

Generate variable aggregated over 3-day window lagged one day:

Select date range for DF:
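
Hedged sketches of the time-series snippets above. df is assumed to have a 'date' column, and df_multi stands in for a dataframe with a (firm, date) multi-index; treat these as independent snippets rather than one sequential script:

import pandas as pd

# Set column to datetime
df['date'] = pd.to_datetime(df['date'])

# Convert datetime column to date
df['date_only'] = df['date'].dt.date

# Weekdays (0 = Monday, 6 = Sunday)
df['weekday'] = df['date'].dt.dayofweek

# Add missing days (with zeros) for every day in a dataframe
df_daily = df.set_index('date').resample('D').sum().fillna(0)

# Select date range for DF (with a DatetimeIndex)
df_first_half = df_daily.loc['2017-01-01':'2017-06-30']

# Generate lagged variable with multi-index DF (index levels: firm, date)
df_multi['num_tweets_lag1'] = df_multi.groupby(level=0)['num_tweets'].shift(1)

# Generate variable aggregated over 3-day window, lagged one day
df_multi['num_tweets_3day_lag1'] = (df_multi.groupby(level=0)['num_tweets']
                                    .apply(lambda x: x.rolling(window=3).sum().shift(1)))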

Indexing and Sorting

Set the index:

Reset Index:

Set Multi-Index:

Sort dataframe:

Name Existing Multi-index columns:

With multi-index df -- get unique items per index level:

Swap levels on multi-index dataframe:

Get minimum date in DF:

Complex conditional row selection during loop:
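
Possible versions of the indexing and sorting snippets, treated as independent examples with hypothetical column and level names:

# Set the index
df = df.set_index('EIN')

# Reset Index
df = df.reset_index()

# Set Multi-Index
df = df.set_index(['firm', 'date'])

# Sort dataframe
df = df.sort_values(by=['firm', 'date'])

# Name Existing Multi-index columns
df.index.names = ['firm', 'date']

# With multi-index df -- get unique items per index level
firms = df.index.get_level_values('firm').unique()
dates = df.index.get_level_values('date').unique()

# Swap levels on multi-index dataframe
df = df.swaplevel('firm', 'date')

# Get minimum date in DF
df['date'].min()

# Complex conditional row selection during loop
for index, row in df.iterrows():
    if (row['total_revenue'] > 1000000) and (row['state'] in ['NY', 'CA']):
        df.loc[index, 'big_coastal'] = 1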

Missing Data

Interpolation with backfill and forward fill (n.b. - does not respect multi-index):

Find rows with empty column:

Fill missing values in one column with values from another column:
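
For example (hypothetical columns):

# Forward fill, then backfill, remaining gaps (n.b., does not respect a multi-index)
df['num_employees'] = df['num_employees'].fillna(method='ffill').fillna(method='bfill')

# Find rows with empty column
missing_rows = df[df['org_name'].isnull()]

# Fill missing values in one column with values from another column
df['contact_name'] = df['contact_name'].fillna(df['org_name'])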

Custom Functions

Function for generating a series of new one-day lag variables in a dataframe:

Function for generating a series of new dataframe variables that aggregate over a multi-day period:
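
The original functions are not reproduced here; below is a sketch of what such helpers might look like for a dataframe with an (entity, date) multi-index. The function names, column names, and defaults are all hypothetical:

def make_lag_variables(df, columns, group_level=0):
    # For each column, create a new column holding the prior day's value,
    # grouping by the first index level so lags do not cross entities
    for col in columns:
        df[col + '_lag1'] = df.groupby(level=group_level)[col].shift(1)
    return df

def make_rolling_sum_variables(df, columns, window=3, group_level=0):
    # For each column, sum over the prior `window` days and lag the result one day
    for col in columns:
        new_col = '%s_%dday_sum_lag1' % (col, window)
        df[new_col] = (df.groupby(level=group_level)[col]
                       .apply(lambda x: x.rolling(window=window).sum().shift(1)))
    return df

# Hypothetical usage on a dataframe indexed by (organization, date)
df = make_lag_variables(df, ['num_tweets', 'num_retweets'])
df = make_rolling_sum_variables(df, ['num_tweets'], window=3)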

DataFrame Manipulations

Subset of DF -- based on a condition in a column:

Subset of DF -- specific columns:

Drop a column:

Intersection of two lists:

Difference between two lists (all different elements from either list):

Difference between two lists (elements from list1 missing from list2):

Create DF based on whether column value is in a list:

Create an empty dataframe:

Add row (row 0) to empty dataframe:

Change cell column values for a specific row (index=16458):

Create dataframe from list/dictionary:

Add rows to a dataframe:
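
Sketches of the manipulations above (df, the lists, and all column names are hypothetical):

import pandas as pd

# Subset of DF -- based on a condition in a column
df_ny = df[df['state'] == 'NY']

# Subset of DF -- specific columns
df_small = df[['EIN', 'org_name', 'total_revenue']]

# Drop a column
df = df.drop('unneeded_column', axis=1)

# Intersection of two lists
common = list(set(list1) & set(list2))

# Difference between two lists (all different elements from either list)
symmetric_diff = list(set(list1) ^ set(list2))

# Difference between two lists (elements from list1 missing from list2)
missing = list(set(list1) - set(list2))

# Create DF based on whether column value is in a list
df_sample = df[df['EIN'].isin(ein_list)]

# Create an empty dataframe
df_new = pd.DataFrame(columns=['EIN', 'org_name'])

# Add row (row 0) to empty dataframe
df_new.loc[0] = ['123456789', 'HYPOTHETICAL FOUNDATION']

# Change cell column values for a specific row (index=16458)
df.loc[16458, 'org_name'] = 'NEW NAME'

# Create dataframe from list/dictionary
df_from_dict = pd.DataFrame({'EIN': ein_list})

# Add rows to a dataframe (concatenate a second dataframe underneath)
df = pd.concat([df, df_new], ignore_index=True)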

Create dataframe from list of column names:

Create dataframe by deleting all rows with null values in one column:

Rename a single column:

Create dataframe based on column value appearing in a list:

Look for duplicates in a dataframe:

Create version of dataframe with conditions on two variables (for removing a duplicate firm):

Select partial dataframe -- complex conditions:

Merge two dataframes:

Deep copy of DF:

Get a slice of dataframe (here, the two rows with the given indexes):
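
And sketches for the remaining manipulations (again with made-up names; df_other stands in for a second dataframe to merge):

import pandas as pd

# Create dataframe from list of column names
column_names = ['EIN', 'org_name', 'total_revenue']
df_empty = pd.DataFrame(columns=column_names)

# Create dataframe by deleting all rows with null values in one column
df_complete = df.dropna(subset=['total_revenue'])

# Rename a single column
df = df.rename(columns={'total_revenue': 'revenue'})

# Create dataframe based on column value appearing in a list
df_big_states = df[df['state'].isin(['NY', 'CA', 'TX'])]

# Look for duplicates in a dataframe (keep=False flags every copy)
dups = df[df.duplicated(subset=['EIN'], keep=False)]

# Create version of dataframe with conditions on two variables (for removing a duplicate firm)
df_deduped = df[~((df['firm'] == 'ACME') & (df['fiscal_year'] == 2016))]

# Select partial dataframe -- complex conditions
df_subset = df[(df['revenue'] > 1000000) & ((df['state'] == 'NY') | (df['state'] == 'CA'))]

# Merge two dataframes
merged = pd.merge(df, df_other, on='EIN', how='left')

# Deep copy of DF
df_copy = df.copy(deep=True)

# Get a slice of dataframe (here, the two rows with the given indexes)
df_two_rows = df.loc[[16458, 16459]]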

Custom Twitter Variables

Time on Twitter:
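
One plausible reading of this variable is account age at the time of the tweet; a sketch, assuming hypothetical created_at and user_created_at columns:

import pandas as pd

# Time on Twitter: days from account creation to when the tweet was sent
df['created_at'] = pd.to_datetime(df['created_at'])
df['user_created_at'] = pd.to_datetime(df['user_created_at'])
df['time_on_twitter_days'] = (df['created_at'] - df['user_created_at']).dt.days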

Working with MongoDB in PANDAS

Show first 2 documents:

Get frequency counts:
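
A sketch using pymongo (the database, collection, and column names are hypothetical):

import pandas as pd
from pymongo import MongoClient

collection = MongoClient()['twitter_db']['tweets']   # hypothetical database and collection

# Show first 2 documents
for doc in collection.find().limit(2):
    print(doc)

# Get frequency counts (here, tweets per language) after reading into PANDAS
df = pd.DataFrame(list(collection.find()))
df['lang'].value_counts()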

I hope you have found this helpful. If so, please spread the word, and happy coding!




SQLite vs. MongoDB for Big Data

In my latest tutorial I walked readers through a Python script designed to download tweets by a set of Twitter users and insert them into an SQLite database. In this post I will provide my own thoughts on the pros and cons of using a relational database such as SQLite vs. a “noSQL” database such as MongoDB. These are my two go-to databases for downloading and managing Big Data and there are definite advantages and disadvantages to each.

The caveat is that this discussion is for researchers. Businesses will almost definitely not want to use SQLite for anything but simple applications.

The Pros and Cons of SQLite

SQLite has a lot going for it. I much prefer SQLite over, say, MySQL. SQLite is the easiest of all relational databases to work with. Accordingly, for someone gathering data for research, SQLite is a great option.

For one thing, it is pre-installed when you install Anaconda Python (my recommended installation). There is none of the typical set-up that a MySQL installation requires, such as creating users and setting passwords. With Anaconda Python you're good to go.

Moreover, SQLite is portable. Everything is contained in a single file that can be moved around your own computer or shared with others. There’s nothing complicated about it. Your SQLite database is just a regular file. Not so with MySQL, for instance, which would need to be installed separately, have user permissions set up, etc., and is definitely not so readily portable.

So, what’s the downside? Two things. One, there is the set-up. To get the most out of your SQLite database, you need to pre-define every column (variable) you’re going to use in the database. Every tweet, for instance, will need to have the exact same variables or else your code will break. For an example of this see my recent tutorial on downloading tweets into an SQLite database.
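
To give a flavor of that pre-defining, here is a minimal sqlite3 sketch; the table schema and values are made up for illustration, not the schema from the tutorial:

import sqlite3

conn = sqlite3.connect('tweets.db')   # hypothetical database file
cur = conn.cursor()

# Every column has to be declared up front ...
cur.execute('''CREATE TABLE IF NOT EXISTS tweets (
                   tweet_id TEXT PRIMARY KEY,
                   created_at TEXT,
                   screen_name TEXT,
                   text TEXT,
                   retweet_count INTEGER)''')

# ... and every INSERT has to supply exactly those columns, or the code breaks
cur.execute('INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?, ?)',
            ('123456789', '12:30:15_03/07/2017', 'hypothetical_user',
             'hypothetical tweet text', 0))
conn.commit()
conn.close()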

The other shortcoming flows from the pre-defining process. Some social media platforms, such as Twitter, have relatively stable APIs, which means you access the same variables the same way year in and year out. Other platforms, though (that’s you, Facebook), seem to change their API constantly, which means your code to insert Facebook posts into your SQLite database will also constantly break.

Here’s a screenshot of what your SQLite database might look like:

As you can see, it's set up like a typical flat database, similar to an Excel spreadsheet or a PANDAS or R dataframe. The columns are all pre-defined.

The Pros and Cons of MongoDB

The SQLite approach contrasts starkly with the “noSQL” approach represented by MongoDB. A primary benefit is that MongoDB is tailor-made for inserting the types of data returned by a social media platform’s API — particularly JSON.

For instance, the Twitter API returns a JSON object for each tweet. In a prior tutorial I provide an overview of this. The code block below shows the first five lines of JSON (one line per variable) for a typical tweet object returned by the Twitter API:

{
"_id" : ObjectId("595a71173ffc5a01d8f27de7"),
"contributors" : null,
"quoted_status_id" : NumberLong(880805966375202816),
"text" : "RT @FL_Bar_Found: Thank you for your support, Stephanie! https://t.co/2vxXe3VnTU",
"time_date_inserted" : "12:30:15_03/07/2017",
...
}

The full JSON for a single tweet object runs to 416 lines.

Here is where MongoDB excels. All we need to do is grab the tweet object and tell MongoDB to insert it into our database. Do you have different columns in each tweet? MongoDB doesn’t care — it will just take whatever JSON you throw at it and insert it into your database. So if you are working with JSON objects that have different variables or different numbers of columns — or if Facebook changes its API again — you will not need to update your code and your script will not break because of it.
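
A minimal pymongo sketch of that insert step (the database, collection, and api_results variable are all hypothetical):

from pymongo import MongoClient

client = MongoClient()
tweets = client['twitter_db']['tweets']   # hypothetical database and collection names

# api_results stands in for the list of tweet dictionaries returned by the Twitter API
api_results = [{'id_str': '123', 'text': 'hypothetical tweet', 'lang': 'en'}]

# No schema to define -- whatever JSON the API returns can be inserted as-is
for tweet in api_results:
    tweets.insert_one(tweet)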

Here’s a screenshot of what the first 40 objects (tweets) in your MongoDB database might look like. You can see that the number of fields (variables) is not the same for each tweet — some have 29, some have 30, or 31, or 32:

And here’s what the first tweet looks like after expanding the first object:

As you can see, it looks like the JSON object returned by the Twitter API.

In effect, MongoDB is great in situations where you want to grab all of the available data and quickly throw it into a database. The downside of this approach is that you will have to do the defining of your data later, before you can analyze it. I find this less and less problematic, however, now that PANDAS has come around. I would much rather extract my data from MongoDB (one line of code) and do my data and variable manipulations in PANDAS than mess around with SQLAlchemy before even downloading the data into an SQLite database.
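
That one-line extraction can look like this (a sketch with hypothetical database and collection names):

import pandas as pd
from pymongo import MongoClient

tweets = MongoClient()['twitter_db']['tweets']

# Pull every document in the collection into a PANDAS dataframe
df = pd.DataFrame(list(tweets.find()))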

A final benefit of MongoDB is its scalability. You have 10 million tweets to download? What about 100 million? No issues with MongoDB. With SQLite, in contrast, somewhere around a million tweets is a reasonable upper limit before performance starts to drag considerably.

MongoDB does have its downsides, though. Much like MySQL, MongoDB needs to be "running" before you insert data into it. If your server is running 24/7 that is no issue. Otherwise you'll have to remember to restart your MongoDB server each time you want to either insert data into your database or extract data you've already inserted. MongoDB also has higher "start-up" costs; it is not as easy to install as SQLite, and you may or may not run into disk permissions issues, username and password issues, etc. Cross your fingers and it will only take you half an hour (once), and then you're good to go from then on.

Finally, a MongoDB database is not a single "file" like an SQLite database, which makes moving or sharing your database more troublesome. It is not terribly onerous, just a few extra steps. Again, if you are importing your MongoDB database into PANDAS and then using PANDAS for your data manipulations, this should not be an issue. You can easily share or move your PANDAS dataframes or export them to CSV or Excel.

Summary

Here is a summary of the pros and cons of SQLite and MongoDB for use as a Big Data-downloading database.

Portability
SQLite: Easy to share or move; the database is a single file.
MongoDB: Considerably more complicated. May not be an issue for you if your workflow is to export your data into PANDAS.

Ease of use
SQLite: Simple. The database is just a single file that does not need to be 'running' 24/7.
MongoDB: More complicated than SQLite. The MongoDB server needs to be running before your Python script can insert the data.

Ease of set-up
SQLite: Very easy. If you have installed Anaconda Python you are good to go.
MongoDB: Considerably more complicated set-up, but it is a one-time process. If you are lucky or are comfortable with the Terminal, this one-time set-up should not take more than an hour.

Scalability
SQLite: Beyond a certain limit your SQLite database will become unwieldy. I've had up to a million tweets without too much difficulty, however.
MongoDB: Can be as small or as big as you'd like.

Setting up code to insert tweets
SQLite: Needs to be detailed. Every column needs to be defined in your code and accounted for in each tweet.
MongoDB: Easy. MongoDB will take whatever JSON Twitter throws at it and insert it into the database.

Robustness to API changes
SQLite: Not robust. The Facebook API, for instance, changes almost constantly, and your database code will have to be updated each time or it will break when it tries to insert into SQLite.
MongoDB: Extremely robust. MongoDB will take whatever JSON you throw at it and insert it into the database.

If you’ve found this post helpful please share on your favorite social media site.

In my next post I will provide a tutorial on how to download tweets into a MongoDB database. Until then, happy coding!