Tutorials for Sarbanes-Oxley Paper Data

Dan Neely (from the University of Wisconsin-Milwaukee) and I just had the following article published in the Journal of Business Ethics:

Saxton, G. D., & Neely, D. G. (2018). The Relationship Between Sarbanes–Oxley Policies and Donor Advisories in Nonprofit Organizations. Journal of Business Ethics.

This page contains tutorials on how to download the IRS 990 e-file data that was used for the control variables in our study.


I hope you have found this helpful. If so, please spread the word, and happy coding!

Using Your Twitter API Key

Below is an embedded version of an iPython notebook I have made publicly available on nbviewer. To download a copy of the code, click on the icon with three horizontal lines at the top right of the notebook (just below this paragraph) and select “Download Notebook.” I hope you find it helpful. If so, please share, and happy coding!

Setting up Access to the Twitter API


The Twitter API (application programming interface) is your gateway to accessing Twitter data. The image above shows a screenshot of Twitter’s Search API, just one of the key parts of the API you might be interested in. To access any of them you’ll need to have a password. So, in this post I’m going to walk you through getting access to the Twitter API. By the end you’ll have a password that you’ll use in your Python code to access Twitter data.

Sign up for a Twitter Account and Create an App

Most social media platforms follow a similar set of steps that you’ll go through here: you’ll go to the Developer page, set up an ‘app’, and then generate a set of passwords that grant you access to the API.

On Twitter, the first thing you’ll need is a Twitter account. Once you have that, go to Twitter’s ‘developer’ page: https://dev.twitter.com. Once you’re logged into the developer page you’ll then have to create an ‘app’. Click on the My Apps link or go directly to: https://apps.twitter.com. This will take you to the following screen, where you can see I have already created three apps. Click on ‘Create New App’.


Create your App

You’ll then be taken to the screen shown in the following image. You might be wondering why it’s called an ‘app’ and why you have to create one. The short answer is that Twitter and other social media platforms allow access to their data mainly for developers, or people creating apps that interact with the Twitter data. Academics and researchers are not the main targets but we access the data the same way.

You’ll need to fill in three things as shown in the image. For the ‘Name’ just put in anything you want — I chose ‘ARNOVA2016’ here. As long as it makes sense to you you’re fine. You’ll also have to put in a brief description. Here I typically put in something about academic or not-for-profit research. Finally, you’ll put in a website address (hopefully you have something you can use) and click ‘Create your Twitter Application.’


Successfully Created App

You’ll then be taken to the following screen, which indicates a successfully created app. Click on Keys and Access Tokens:


Generate Access Tokens

On this screen you’ll see the first two parts of your four-part password: the API KEY and the API SECRET. You still need to generate the final two parts, so click on ‘Create my access token.’


Copy the Four Parts of Your Password

You’ll then be taken to the final page as shown in the image below. You now have all four parts of your ‘password’ for accessing the Twitter API: the API KEY, the API SECRET, the ACCESS TOKEN, and the ACCESS SECRET (I’ve pixelated or obscured mine here). Keep these in a safe place, since you’ll be using them in any code in which you want to access the Twitter API.
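
One way to keep the four parts together and out of your scripts is a small JSON file that your code reads at run time. The following is only a minimal sketch: the file name twitter_keys.json and the four field names are assumptions made for this example, not anything Twitter requires, and the values shown are placeholders.

```python
import json
import os
import tempfile

# Hypothetical field names for the four parts of the 'password'.
REQUIRED_FIELDS = ("api_key", "api_secret", "access_token", "access_secret")


def load_twitter_keys(path):
    """Read the four credentials from a JSON file and check none is missing."""
    with open(path) as f:
        keys = json.load(f)
    missing = [name for name in REQUIRED_FIELDS if not keys.get(name)]
    if missing:
        raise ValueError("Missing credential(s): " + ", ".join(missing))
    return {name: keys[name] for name in REQUIRED_FIELDS}


# Demonstration only: write placeholder values to a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "twitter_keys.json")
with open(demo_path, "w") as f:
    json.dump({
        "api_key": "YOUR-API-KEY",
        "api_secret": "YOUR-API-SECRET",
        "access_token": "YOUR-ACCESS-TOKEN",
        "access_secret": "YOUR-ACCESS-SECRET",
    }, f)

keys = load_twitter_keys(demo_path)
print(sorted(keys))
```

Keeping the file outside any folder you share means you can pass your scripts along without passing along your keys.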


You’re done! You now have your Twitter API library card and are ready to go hunting for data. In an upcoming post I’ll show you how to actually use your password to access the data.

Analyzing Big Data with Python PANDAS

This is a series of iPython notebooks for analyzing Big Data — specifically Twitter data — using Python’s powerful PANDAS (Python Data Analysis) library. Through these tutorials I’ll walk you through how to analyze your raw social media data using a typical social science approach.

The target audience is those who are interested in covering key steps involved in taking a social media dataset and moving it through the stages needed to deliver a valuable research product. I’ll show you how to import your data, aggregate tweets by organization and by time, how to analyze hashtags, how to create new variables, how to produce a summary statistics table for publication, how to analyze audience reaction (e.g., # of retweets) and, finally, how to run a logistic regression to test your hypotheses. Collectively, these tutorials cover essential steps needed to move from the data collection to the research product stage.


I’ve put these tutorials in a GitHub repository called PANDAS. For these tutorials I am assuming you have already downloaded some data and are now ready to begin examining it. In the first notebook I will show you how to set up your iPython working environment and import the Twitter data we have downloaded. If you are new to Python, you may wish to work through, in order, a series of tutorials I have created.

If you want to skip the data download and just use the sample data, but don’t yet have Python set up on your computer, you may wish to go through the tutorial “Setting up Your Computer to Use My Python Code”.

Also note that we are using the iPython notebook interactive computing framework for running the code in this tutorial. If you’re unfamiliar with this, see the tutorial “Four Ways to Run your Code”.

For a more general set of PANDAS notebook tutorials, I’d recommend this cookbook by Julia Evans. I also have a growing list of “recipes” that contains frequently used PANDAS commands.

As you may know from my other tutorials, I am a big fan of the free Anaconda version of Python 2.7. It contains all of the prerequisites you need and will save you a lot of headaches getting your system set up.


At the GitHub site you’ll find the following chapters in the tutorial set:

Chapter 1 – Import Data, Select Cases and Variables, Save DataFrame.ipynb
Chapter 2 – Aggregating and Analyzing Data by Twitter Account.ipynb
Chapter 3 – Analyzing Twitter Data by Time Period.ipynb
Chapter 4 – Analyzing Hashtags.ipynb
Chapter 5 – Generating New Variables.ipynb
Chapter 6 – Producing a Summary Statistics Table for Publication.ipynb
Chapter 7 – Analyzing Audience Reaction on Twitter.ipynb
Chapter 8 – Running, Interpreting, and Outputting Logistic Regression.ipynb

I hope you find these tutorials helpful; please acknowledge the source in your own research papers if you’ve found them useful:

    Saxton, Gregory D. (2015). Analyzing Big Data with Python. Buffalo, NY: http://social-metrics.org

Also, please share and spread the word to help build a vibrant community of PANDAS users.

Happy coding!

Producing a Summary Statistics Table in iPython using PANDAS

Below is an embedded version of an iPython notebook I have made publicly available on nbviewer. To download a copy of the code, click on the icon with three horizontal lines at the top right of the notebook (just below this paragraph) and select “Download Notebook.” I hope you find it helpful. If so, please share, and happy coding!

Setting up Your Computer to Use My Python Code for Downloading Twitter Data

I frequently get requests for how to download social media data in general, as well as for help on how to run code I have written to download and analyze the data I analyzed for a particular piece of research. Often, these requests are from people who are excited about doing social media research but have yet to gain much experience in using computer programming. For this reason, I have created a set of tutorials designed precisely for such users.

I am always happy to share the code I’ve used in my research. That said, there are barriers to actually using someone else’s code. One of the key barriers is getting your own computer “set up” to actually run the code. The aim of this post is to walk you through the steps needed to run and modify the code I’ve written to download and analyze social media data.

Step One: Download and Install Python

As I write about here, for Unix, Windows, and Mac users alike I’d recommend you install Anaconda Python 2.7. This distribution of Python is free and easy to install. Moreover, it includes most of the add-on packages necessary for scientific computing, including Numpy, Pandas, iPython, Statsmodels, Sqlalchemy, and Matplotlib.

Go to this tutorial for instructions on how to install and run Anaconda Python.

Step Two: Install Python Add-On Packages

Anaconda Python comes pre-installed with almost everything you need. There are a couple of modules you will have to install manually:

Twython — for accessing the Twitter data


simplejson — for parsing the JSON data that is returned by the Twitter API (Application Programming Interface).

Assuming you are on a Mac and using Anaconda Python, the simplest way is to use pip. On a Mac or Linux machine, you would simply open the Terminal and type pip install Twython and pip install simplejson. If you’re on a PC, please take a look at Wayne Xu’s tutorial (see Slide #8).
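
To confirm that both installs worked, you can ask Python whether it can import the packages. This is a minimal sketch; the helper name missing_packages is made up for this example, and it relies on the two packages being imported under the lowercase names twython and simplejson.

```python
import importlib


def missing_packages(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# 'twython' and 'simplejson' are the import names of the two packages.
needed = ["twython", "simplejson"]
for name in missing_packages(needed):
    print("Not installed yet: run 'pip install " + name + "'")
```

If nothing is printed, both packages are available to the Python you are running.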

Step Three: The Database

I generally download my Twitter data into an SQLite database. SQLite is a common relational database. It is lightweight and easy to use, and comes preinstalled in Anaconda Python.

You may already know other ways of downloading social media data, such as NodeXL in Excel for Windows. Why then would you want to use SQLite? There are two reasons. First, SQLite is better plugged into the Python architecture, which will come in handy when you are ready to actually manipulate and analyze the data. Second, if you are downloading tweets more than once, a database is the much better solution for a simple reason: it allows you to write a check for duplicates in the database and stop them from being inserted. This is an almost essential feature that you cannot easily implement outside of a database.
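
As a rough illustration of the duplicate check, here is a minimal sketch using Python’s built-in sqlite3 module. The table layout and column names are assumptions made for this example rather than the schema used in my scripts; the idea is simply to make the tweet ID the primary key and let SQLite ignore any row whose ID is already stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file name instead to keep the data
conn.execute("""
    CREATE TABLE tweets (
        tweet_id TEXT PRIMARY KEY,   -- uniqueness is enforced on this column
        screen_name TEXT,
        content TEXT
    )
""")


def insert_tweets(conn, rows):
    """Insert (tweet_id, screen_name, content) rows, skipping known tweet_ids."""
    with conn:  # commits if the block succeeds
        conn.executemany(
            "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?)", rows)


# Made-up rows standing in for two overlapping downloads.
first_download = [("101", "org_a", "Hello"), ("102", "org_b", "Join us")]
second_download = [("102", "org_b", "Join us"), ("103", "org_a", "Thanks!")]

insert_tweets(conn, first_download)
insert_tweets(conn, second_download)  # tweet 102 is silently skipped

count = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
print(count)
```

Four rows were offered across the two downloads, but only three distinct tweets end up in the table.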

Also know that once you have downloaded the data into an SQLite database, you can view and edit the data in the same manner as an Excel file, and even export the data into CSV format for viewing in Excel. To do this, simply download and install Database Browser for SQLite. If you use Firefox, you can alternatively use a plug-in called SQLite Manager.
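
If you would rather do the CSV export from Python itself, something along the following lines should work. It is a sketch written for Python 3; the function name export_table_to_csv and the tiny example table are made up for illustration, and the table name is assumed to come from you rather than from untrusted input.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn, table, out_file):
    """Write every row of `table`, with a header row, to an open file object."""
    cursor = conn.execute('SELECT * FROM "{}"'.format(table))
    writer = csv.writer(out_file)
    writer.writerow([col[0] for col in cursor.description])  # column names
    writer.writerows(cursor)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (tweet_id TEXT, screen_name TEXT)")
conn.execute("INSERT INTO tweets VALUES ('101', 'org_a')")

buffer = io.StringIO()  # stands in for open("tweets.csv", "w", newline="")
export_table_to_csv(conn, "tweets", buffer)
print(buffer.getvalue())
```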

Step Four: Accessing the Twitter API

Almost all of my Twitter code grabs data from the Twitter API, which sets up procedures for reliably accessing the Twitter data. Beginning in 2013 Twitter made it more difficult to access its APIs. Now OAuth authentication is needed for almost everything. This means you need to go on Twitter and create an ‘app.’ You won’t actually use the app for anything; you just need the password and authentication code. You can create your app here. For more detailed instructions on creating the app, take a look at slides 4 through 6 of the tutorial by Wayne Xu (my excellent former PhD student).

Step Five: Start Using the Code

Once you’ve completed the above four steps you will have all the necessary building blocks for successfully running my Python scripts for downloading and parsing Twitter data. Happy coding!

#ARNOVA14 – Tag cloud of ARNOVA 2014 Tweets

Word clouds are generally not all that helpful given how the words are taken out of their context (sentences). In certain settings, however, they do provide meaningful information. Hashtags are one of those contexts — they are meant to be single words. The tags denote ideas or topics or places. By examining the hashtags, we can thus gain an appreciation for the most salient topics ARNOVANs are tweeting about. With that, here is a word cloud generated using all of the hashtags included in all 739 tweets sent with the #arnova14 hashtag as of Sunday, November 23rd @ 1pm EST.

Here is a first cloud, with the #arnova14 tag excluded.

[Image: #arnova14 tag cloud with the #arnova14 tag excluded]

The larger the tag, the more frequently it was used. You can see that two tags predominated, #nonprofit and #allianceR2P. To help see other topics, here’s a final tag cloud without these two tags.

[Image: #arnova14 tag cloud without #nonprofit and #allianceR2P]

It was also refreshing to see linguistic diversity in the hashtags — especially a large number of Arabic tags. Unfortunately, the word cloud visualizer I used (Wordle) could not recognize these (if anyone knows of a good workaround please let me know).

If anyone is interested in the Python code used to generate this just shoot me an email.
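
In the meantime, the counting step behind a tag cloud can be sketched in a few lines. This is not the script used for the clouds above, just a minimal Python 3 illustration that assumes the tweets are available as plain text strings; the sample tweets are made up.

```python
import re
from collections import Counter

# \w is Unicode-aware in Python 3, so tags in Arabic and other scripts match.
HASHTAG = re.compile(r"#(\w+)")


def count_hashtags(tweets, exclude=()):
    """Count hashtags case-insensitively, skipping any tag listed in `exclude`."""
    skip = {tag.lower() for tag in exclude}
    counts = Counter()
    for text in tweets:
        for tag in HASHTAG.findall(text):
            tag = tag.lower()
            if tag not in skip:
                counts[tag] += 1
    return counts


# Made-up tweets, for illustration only.
sample = [
    "Great panel! #arnova14 #nonprofit",
    "Slides are up #ARNOVA14 #nonprofit #allianceR2P",
    "Thanks all #arnova14 #مجتمع",
]
counts = count_hashtags(sample, exclude=["arnova14"])
print(counts.most_common())
```

Passing the conference tag in exclude mirrors the first cloud, where #arnova14 itself was left out.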

I’ll leave the analysis up to you. Some interesting things here! I hope you all had a great conference and see you next year!

#ICA14 – Tag cloud of ICA 2014 Tweets

Word clouds are generally not all that helpful given how the words are taken out of their context (sentences). In certain settings, however, they do provide meaningful information. Hashtags are one of those contexts — they are meant to be single words. With that, here is a word cloud generated using all of the hashtags included in all 9,010 tweets sent with the #ica14 hashtag as of Monday, May 26th @ 1pm EST.

Here is a first cloud, with the #ica14 tag excluded.

[Image: #ica14 tag cloud with the #ica14 tag excluded]

The larger the tag, the more frequently it was used. You can see that ICA section-related tags predominated, with #ica_cat and #ica_glbt leading the pack. (Note that, after downloading and processing the Twitter data in Python, I used Wordle to generate the word cloud. A quirk of Wordle is that it will split words with an underscore in them, so I’ve replaced underscores with hyphens. So, read “ica-cat” as “#ica_cat,” etc.)

To help highlight non-section tags, here is a version omitting any tag with “ica” in it.

[Image: #ica14 tag cloud omitting any tag with “ica” in it]

The #qualpolcomm taggers were highly active. To help see other topics, here’s a final tag cloud without #qualpolcomm.

[Image: #ica14 tag cloud without #qualpolcomm]

I’ll leave the analysis up to you. Some interesting patterns here!

If anyone is interested in the Python used to generate this let me know.
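
As a rough idea of the preparation step, here is a sketch of turning hashtag counts into text that can be pasted into a word cloud tool. It is not the original script. It assumes the tool sizes each word by how often it appears in the pasted text, and the function name and the counts are made up for illustration.

```python
def wordle_text(tag_counts, drop_containing=None):
    """Build text for a word cloud tool: each tag repeated once per use."""
    words = []
    for tag, n in sorted(tag_counts.items()):
        if drop_containing and drop_containing in tag.lower():
            continue  # e.g. leave out every tag containing "ica"
        # Swap underscores for hyphens so the tool keeps the tag as one word.
        words.extend([tag.replace("_", "-")] * n)
    return " ".join(words)


# Made-up counts, for illustration only.
tag_counts = {"ica_cat": 3, "ica_glbt": 2, "qualpolcomm": 2, "bigdata": 1}

print(wordle_text(tag_counts))
print(wordle_text(tag_counts, drop_containing="ica"))
```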

Replication Data

One of the core tenets of scientific research is replication. This gets at the reproducibility standard of scientific research. Despite calls for more replication in the social sciences, replication studies are still rather rare. In part, this is the product of journal editors’ and reviewers’ strong preferences for original research. It is also due to scholars not making their data publicly available. Many of my colleagues in academia, especially those who conduct lab experiments, do not typically make their data publicly available, though even here anonymized data should be made available.

Replication datasets are not valuable solely for replication studies. In any dataset there are unused variables. A budding scholar or a research-oriented practitioner might be interested in your “leftover” variables or data points. You can’t foresee what others will find interesting.

What You Can Do

If you have data, share it. Not only is this being generous, but there is some evidence it may even be good for your career (citations, etc.). If you don’t have the capacity to warehouse it yourself, there are archives available for you. A good choice is Gary King’s Dataverse Network Project.

My Data

In the spirit of replication and extension, I would like to let people know which data sources I have available. If you’d like any of it, shoot me a message and we’ll figure out a way to get it to you.

Spanish Nationalist Event Data, 1977-1996

First, there is a replication archive of data I used in my dissertation. If you’re interested in Spanish nationalist contentious politics — specifically, data on violent and non-violent nationalist protests — check out http://contentiouspolitics.social-metrics.org/. This site was set up to make publicly available the data used in my dissertation (2000) and subsequent publications. There you will find background information on the project, codebooks, data, and copies of articles published using the data. You can browse and search the data and view various interactive graphs. The entire dataset is also available for downloading.

Twitter Data

Twitter data are generally publicly available. However, if you have a pre-defined set of users you can only grab their latest 3,200 tweets, which in some cases is only one year’s worth of data. And in other cases, especially if you want to follow a specific hashtag or collect user mentions or retweets, you can only go back one week in time. For this reason, sometimes it can be very helpful if someone else has the historical data you may need. Here are some of the historical data I have, showing the sample of organizations, date range for which data are available, and citations for articles that used the data. If you are interested in it for your own research purposes let me know.

  • Nonprofit Times 100 organizations — 2009
  • 145 advocacy nonprofit organizations — April 2012
  • 38 US community foundations (tweets as well as mentions) — July-August 2011

Facebook Data

Facebook data for organizations is typically public and can be downloaded via the Facebook Graph API. That said, I have some data available on a sample of large nonprofit organizations.

  • Nonprofit Times 100 organizations — December 2009
  • Nonprofit Times 100 organizations — April-May 2013

Website Data

Historical website data can often be gathered from the Internet Archive Wayback Machine, but “robots exclusions” and other errors can prevent this. The following datasets are available:

  • 117 US community foundations (transparency and accountability data) – fall 2005 (Saxton, Guo, & Brown, 2007; Saxton & Guo, 2011)
  • 400 random US nonprofit organizations – fall 2007 (Saxton, Guo, & Neely, 2014)

This is only a partial list of the data I have available. I’ll add to this as more data become cleaned and available.


Why I Use Python for Academic Research


Academics and other researchers have to choose from a variety of research skills. Most social scientists do not add computer programming into their skill set. As a strong proponent of the value of learning a programming language, I will lay out how this has proven to be useful for me. A budding programmer could choose from a number of good options (including Perl, C++, Java, PHP, or others), but Python has a reputation as being one of the most accessible and intuitive. I obviously like it.

No matter your choice of language, there are a variety of ways learning programming will be useful for social scientists and other data scientists. The most important areas are data gathering, data manipulation, and data visualization and analysis.

Data Gathering

When I started learning Python four years ago, I kept a catalogue of the various scripts I wrote. Going over these scripts, I have personally written Python code to gather the following data:

  • Download lender and borrower information for thousands of donation transactions on kiva.org.
  • Download tweets from a list of 100 large nonprofit organizations.
  • Download Twitter profile information from 150 advocacy nonprofits.
  • Scrape the ‘Walls’ from 65 organizations’ Facebook accounts.
  • Download @messages sent to 38 community foundations.
  • Traverse and download html files for thousands of webpages on large accounting firms’ websites.
  • Scrape data from 1,000s of organizational profiles on a charity rating site.
  • Scrape data from several thousand organizations raising money on the crowdfunding site Indiegogo.
  • Download hundreds of YouTube videos used in Indiegogo fundraising campaigns.
  • Gather data available through the InfoChimps API.
  • Scrape pinning and re-pinning data from health care organizations’ Pinterest accounts.
  • Tap into the Facebook Graph API to download status updates and number of likes, comments and shares for 100 charities.

This is just a sample. The point is that you can use a programming language like Python to get just about any data from the Web. When the website or social media platform makes available an API (application programming interface), accessing the data is easy. Twitter is fantastic for this very reason. In other cases — including most websites — you will have to scrape the data through creative use of programming. Either way, you can gain access to valuable data.
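
To give a flavor of what scraping involves, here is a minimal Python 3 sketch that pulls the links out of a page using only the standard library. The HTML is a made-up stand-in for a downloaded page, and the class name LinkCollector is invented for this example; real sites usually need more handling than this.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# A made-up page; in practice the HTML would come from a downloaded file.
page = """
<html><body>
  <a href="/about.html">About</a>
  <a href="/services/audit.html">Audit</a>
  <a name="top">No link here</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

Collecting links like this is the first step in traversing a site: each link found becomes the next page to download and parse.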

There’s no need to be an expert to obtain real-world benefits from programming. I started learning Python four years ago (I now consider myself an intermediate-level programmer) and gained substantive benefits right from the start.

Data Manipulation

Budding researchers often seem to under-estimate how much time they will be spending on manipulating, reshaping, and processing their data. Python excels at data munging. I have recently used Python code to:

  • Loop over hundreds of thousands of tweets and modify characters, convert date formats, etc.
  • Identify and delete duplicate entries in an SQL database.
  • Loop over 74 nonprofit organizations’ Twitter friend-follower lists to create a 74 x 74 friendship network.
  • Read in and write text and CSV data.
  • Run countless grouping, merging, and aggregation functions.
  • Automatically count the number of “negative” words in thousands of online donation appeals.
  • Loop over hundreds of thousands of tweets to create an edge list for a retweet network.
  • Compute word counts for a word-document matrix from thousands of crowdfunding appeals.
  • Create text files combining all of an organization’s tweets for use in creating word clouds.
  • Download images included in a set of tweets.
  • Merge text files.
  • Count number of Facebook statuses per organization.
  • Loop over hundreds of thousands of rows of tweets in an SQLite database and create additional variables for future analysis.
  • Deal with missing data.
  • Create dummy variables.
  • Find the oldest entry for each organization in a Twitter database.
  • Use pandas (Python Data Analysis Library) to aggregate Twitter data to the daily, weekly, and monthly level.
  • Create a text file of all hashtags in a Twitter database.
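
As one concrete example from the list above, a retweet edge list can be built with a short loop. This is a simplified sketch rather than my production code: it assumes each row is a (screen_name, text) pair and that retweets begin with "RT @username", and the sample rows are made up.

```python
import re
from collections import Counter

# Old-style retweets begin with "RT @username"; this pattern is an assumption
# about how the tweet text is stored, not a rule taken from the Twitter API.
RETWEET = re.compile(r"^RT @(\w+)")


def retweet_edges(tweets):
    """Return a Counter of (retweeter, original_author) pairs."""
    edges = Counter()
    for screen_name, text in tweets:
        match = RETWEET.match(text)
        if match:
            edges[(screen_name.lower(), match.group(1).lower())] += 1
    return edges


# Made-up (screen_name, text) rows, for illustration only.
tweets = [
    ("org_a", "RT @org_b: Volunteers needed this weekend"),
    ("org_a", "RT @org_b: Thank you to our donors"),
    ("org_c", "RT @org_a: Annual report is out"),
    ("org_b", "Our gala is next month"),
]
edges = retweet_edges(tweets)
for (source, target), weight in sorted(edges.items()):
    print(source, target, weight)
```

Each printed line is one edge (who retweeted whom, and how many times), which is the form most network analysis tools expect.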

Data Visualization and Analysis

With the proliferation of scientific computing modules such as pandas, statsmodels, and scikit-learn, Python’s data analysis capabilities have gotten much more powerful over the past few years. With such tools Python can now compete in many areas with dedicated statistical programs such as Stata, which I have traditionally used for most of my data analysis and visualization. Lately I’m doing more and more of this work directly in Python. Here are some of the analyses I have run recently using Python:

  • Implement a naive Bayesian classifier to classify the sentiment in hundreds of thousands of tweets.
  • Run linguistic analysis of donation appeals and tweets using Python’s Natural Language Toolkit.
  • Create plots of number of tweets, retweets, and public reply messages per day, week, and month.
  • Run descriptive statistics and multiple regressions.
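
To make the first item less abstract, here is a toy version of a naive Bayes sentiment classifier, written for Python 3 with only the standard library. It is a teaching sketch with made-up training tweets, not the classifier I ran on real data; in practice you would more likely reach for a library such as scikit-learn or NLTK and a much larger labeled training set.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over words, with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training tweets, for illustration only.
train_texts = ["love this great event", "great work thank you",
               "terrible service very sad", "sad and terrible news"]
train_labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("what a great event"))
print(model.predict("such sad news"))
```

The classifier picks whichever label gives the words in a tweet the highest combined probability, so a tweet full of words seen mostly in positive examples is labeled positive.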


Learning a programming language is a challenge. Of that there is little doubt. Yet the payoff in improved productivity alone can be substantial. Add to that the powerful analytical and data visualization capabilities that open up to the researcher who is skilled in a programming language. Lastly, leaving aside the buzzword “Big Data,” programming opens up a world of new data found on websites, social media platforms, and online data repositories. I would thus go so far as to say that any researcher interested in social media is doing themselves a great disservice by not learning some programming. For this very reason, one of my goals on this site is to provide guidance to those who are interested in getting up and running on Python for conducting academic and social media research.