
Why I Use Python for Academic Research


Academics and other researchers must choose which research skills to invest in, and most social scientists do not add computer programming to their toolkit. As a strong proponent of the value of learning a programming language, I will lay out how doing so has proven useful for me. A budding programmer could choose from a number of good options, including Perl, C++, Java, and PHP, but Python has a reputation as one of the most accessible and intuitive languages. I obviously like it.

No matter your choice of language, there are a variety of ways in which learning to program will be useful for social scientists and other data scientists. The most important areas are data gathering, data manipulation, and data visualization and analysis.

Data Gathering

When I started learning Python four years ago, I kept a catalogue of the various scripts I wrote. Going back over those scripts, I see that I have personally written Python code to gather the following data:

  • Download lender and borrower information for thousands of donation transactions on kiva.org.
  • Download tweets from a list of 100 large nonprofit organizations.
  • Download Twitter profile information from 150 advocacy nonprofits.
  • Scrape the ‘Walls’ from 65 organizations’ Facebook accounts.
  • Download @messages sent to 38 community foundations.
  • Traverse and download HTML files for thousands of webpages on large accounting firms’ websites.
  • Scrape data from thousands of organizational profiles on a charity rating site.
  • Scrape data from several thousand organizations raising money on the crowdfunding site Indiegogo.
  • Download hundreds of YouTube videos used in Indiegogo fundraising campaigns.
  • Gather data available through the InfoChimps API.
  • Scrape pinning and re-pinning data from health care organizations’ Pinterest accounts.
  • Tap into the Facebook Graph API to download status updates and number of likes, comments and shares for 100 charities.

This is just a sample. The point is that you can use a programming language like Python to get just about any data from the Web. When the website or social media platform makes available an API (application programming interface), accessing the data is easy. Twitter is fantastic for this very reason. In other cases — including most websites — you will have to scrape the data through creative use of programming. Either way, you can gain access to valuable data.
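As a minimal sketch of the scraping route (not any particular project above; it assumes the requests and BeautifulSoup packages and a hypothetical URL):

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical page; substitute a page you are permitted to scrape.
    url = 'http://example.org/organizations'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Grab the text and target of every link on the page as a simple example.
    for a in soup.find_all('a'):
        print(a.get_text(strip=True), a.get('href'))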

There’s no need to be an expert to obtain real-world benefits from programming. I started learning Python four years ago (I now consider myself an intermediate-level programmer) and gained substantive benefits right from the start.

Data Manipulation

Budding researchers often seem to underestimate how much time they will spend manipulating, reshaping, and processing their data. Python excels at this kind of data munging. I have recently used Python code to:

  • Loop over hundreds of thousands of tweets and modify characters, convert date formats, etc.
  • Identify and delete duplicate entries in an SQL database.
  • Loop over 74 nonprofit organizations’ Twitter friend-follower lists to create a 74 x 74 friendship network.
  • Read in and write text and CSV data.
  • Run countless grouping, merging, and aggregation operations.
  • Automatically count the number of “negative” words in thousands of online donation appeals.
  • Loop over hundreds of thousands of tweets to create an edge list for a retweet network.
  • Compute word counts for a word-document matrix from thousands of crowdfunding appeals.
  • Create text files combining all of an organization’s tweets for use in creating word clouds.
  • Download images included in a set of tweets.
  • Merge text files.
  • Count the number of Facebook statuses per organization.
  • Loop over hundreds of thousands of rows of tweets in an SQLite database and create additional variables for future analysis.
  • Deal with missing data.
  • Create dummy variables.
  • Find the oldest entry for each organization in a Twitter database.
  • Use pandas (Python Data Analysis Library) to aggregate Twitter data to the daily, weekly, and monthly level (a minimal sketch follows this list).
  • Create a text file of all hashtags in a Twitter database.
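A minimal sketch of the pandas aggregation mentioned above (the file and column names are illustrative assumptions, not the original dataset’s):

    import pandas as pd

    # Hypothetical file with one row per tweet and a timestamp column.
    tweets = pd.read_csv('tweets.csv', parse_dates=['created_at'])
    tweets = tweets.set_index('created_at')

    # Count tweets (and sum retweets) at the daily, weekly, and monthly level.
    daily = tweets.resample('D').agg({'tweet_id': 'count', 'retweet_count': 'sum'})
    weekly = tweets.resample('W').agg({'tweet_id': 'count', 'retweet_count': 'sum'})
    monthly = tweets.resample('M').agg({'tweet_id': 'count', 'retweet_count': 'sum'})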

Data Visualization and Analysis

With the proliferation of scientific computing modules such as pandas, statsmodels, and scikit-learn, Python’s data analysis capabilities have become much more powerful over the past few years. With such tools Python can now compete in many areas with dedicated statistical programs such as Stata, which I have traditionally used for most of my data analysis and visualization. Lately I’m doing more and more of this work directly in Python. Here are some of the analyses I have run recently using Python:

  • Implement a naive Bayesian classifier to classify the sentiment in hundreds of thousands of tweets (a minimal sketch follows this list).
  • Linguistic analysis of donation appeals and tweets using Python’s Natural Language Toolkit (NLTK).
  • Create plots of number of tweets, retweets, and public reply messages per day, week, and month.
  • Run descriptive statistics and multiple regressions.
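A minimal sketch of a naive Bayesian sentiment classifier using scikit-learn (the training examples are invented for illustration; the original analysis may well have been implemented differently, e.g., with NLTK):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Tiny, invented training set; in practice use a labeled sample of tweets.
    train_texts = ["love this campaign", "great work everyone",
                   "this is terrible news", "very disappointed today"]
    train_labels = ["positive", "positive", "negative", "negative"]

    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_texts)

    classifier = MultinomialNB()
    classifier.fit(X_train, train_labels)

    # Classify new, unlabeled tweets.
    new_tweets = ["what a great event", "terrible turnout this year"]
    print(classifier.predict(vectorizer.transform(new_tweets)))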

Summary

Learning a programming language is a challenge. Of that there is little doubt. Yet the payoff in improved productivity alone can be substantial. Add to that the powerful analytical and data visualization capabilities that open up to the researcher who is skilled in a programming language. Lastly, leaving aside the buzzword “Big Data,” programming opens up a world of new data found on websites, social media platforms, and online data repositories. I would thus go so far as to say that any researcher interested in social media is doing themselves a great disservice by not learning some programming. For this very reason, one of my goals on this site is to provide guidance to those who are interested in getting up and running on Python for conducting academic and social media research.




How Organizations Use Social Media: Engaging the Public

The research I’ve done on organizations’ use of social media suggests there are three main types of messages that organizations send on social media: informational, community-building, and “action” (promotional & mobilizational) messages.

Each type constitutes a different way of engaging with the intended audience:

  • Informational messages serve to inform — about the organization’s activities or anything of interest to the organization’s audience. One-way communication from organization to public. The audience is in the role of learner.
  • Community-building messages serve to build a relationship with the audience through engaging in dialogue or making a network connection. Two-way communication. Audience is in the role of discussant or connector.
  • Promotional & mobilizational messages serve to ask the audience to do something for the organization — attend an event, make a donation, engage in a protest, volunteer, or serve as an advocate, etc. One-way mobilizational communication. Audience is in the role of actor.


This framework originated in a small “Cybermetrics” graduate seminar I taught several years ago, which involved inductive analyses of nonprofit organizations’ messages on Twitter (working with one PhD student, Kristen Lovejoy) and Facebook (working with another PhD student, I-hsuan Chiu). This collaborative work resulted in two publications that laid out the basic framework (Lovejoy & Saxton, 2012; Saxton, Guo, Chiu, & Feng, 2011).

Why was this framework innovative or important? Public relations theory took a “relational turn” in the late 1990s, in which the focus shifted from an emphasis on strategic one-way communications to building relationships (Broom, Casey, & Ritchey, 1997; Hon & Grunig, 1999; Kent & Taylor, 1998, 2002; Ledingham, 2003; Ledingham & Bruning, 1998). These studies were highly influential and have helped reshape the field of public relations to this day. Around the same time they were published, new media began to take off. The effect was that public relations and communication scholars began to focus on the ways organizations were employing relationship-building and dialogic strategies in their new media efforts, contrasting these co-creational and dialogic efforts with one-way “informational” communication. In brief, by the time I started this research there was a substantial body of work on the informational and community-building efforts of organizations on new media.

Yet two key things were missing. One, scholars had yet to examine and code the key tool organizations use on social media: the actual messages, the tweets and Facebook statuses the organizations were sending. Prior social media studies had looked at static profiles and the like. Two, in focusing on informational vs. dialogic communication, scholars had not recognized the considerable mobilizational element of organizations’ social media messages. Our study helped build on prior research and fill both of these gaps. Our inductive study zeroed in on the messages and revealed the substantial use of tweets as a “call to action” for the organizations’ constituents, whether this was a call for volunteers, for donations, for social action, for retweeting a message, for attending an event or, indeed, for anything where the organization asked its constituents to “do something” for the organization. We labeled these tweets “promotional and mobilizational” messages or, for short, action messages.

I think this “I-C-A” (information-community-action) framework is a useful way of examining organizations’ messages, and have continued to use it in my research on nonprofit organizations, including studies of advocacy organizations (Guo & Saxton, 2014), of the determinants of social media use (Nah & Saxton, 2013), and of the effectiveness of organizational messages (Waters & Saxton, 2014).

I am also honored that the framework is proving useful to scholars working in other fields, including the health field (Thackeray, Neiger, Burton, & Thackeray, 2013) and political communication (Xu, Sang, Blasiola, & Park, 2014).

If you’re a social media manager wondering about the practical significance of this research, the key is to understand the differences between these message types and to use an appropriate mix of each. Informational, mobilizational, and community-building messages each imply a different audience orientation, and that mix should be tailored to the needs of both the audience and the organization. Don’t rely only on the ‘megaphone’ (informational messages), and don’t ‘mobilize’ (action messages) too often. The most effective organizations will be those that actively seek to build relationships with their target audience members. Ultimately, the appropriate mix will depend heavily on the organization’s social media strategy (and if you don’t have one, you should).

I’ve created an infographic that shows the differences.





Using Python to Grab Twitter User Data


I often get requests to explain how I obtained the data I used in a particular piece of academic research. I am always happy to share my code along with my data (and frankly, I think academics who are unwilling to share should be forced to take remedial Kindergarten). The problem is, many of those who would like to use the code don’t know where to start. There are too many new steps involved for the process to be accessible. So, I’ll try to walk you through the basic steps here through periodic tutorials.

To start, Python is a great tool for grabbing data from the Web. Generally speaking, you’ll get your data by either accessing an API (Application Programming Interface) or by ‘scraping’ the data off a webpage. The easiest scenario is when a site makes available an API. Twitter is such a site. Accordingly, as an introductory example I’ll walk you through the basic steps of using Python to access the Twitter API, read and manipulate the data returned, and save the output.

In any given project I will run a number of different scripts to grab all of the relevant data. We’ll start with a simple example: a script designed to grab the profile information for a set of Twitter users. First, as stated above, the way we get the data is by tapping into the Twitter API. For our purposes, think of the Twitter API as a set of routines Twitter has set up to allow us to access specific chunks of data. I use Python for this, given its many benefits, though any programming language will work. If you are really uninterested in programming and have more limited data needs, you can use NodeXL (if you’re on a Windows machine) or other services for gathering the data. If you do go the Python route, I highly recommend you install Anaconda Python 2.7; it’s free, it works on Mac and PC, and it includes most of the add-on packages necessary for scientific computing. In short, you pick a programming language, learn some of it, and then develop code that will extract and process the data for you. Even though you can start with my code as a base, it is still useful to understand the basics, so I highly recommend doing some of the many excellent tutorials now available online for learning how to use and run Python. A great place to start is Codecademy.

Accessing the Twitter API

Almost all of my Twitter code grabs data from the Twitter API. The first step is to determine which part of the Twitter API you’ll need to access to get the type of data you want — there are different API methods for accessing information on tweets, retweets, users, following relationships, etc. The code we’re using here plugs into the users/lookup part of the Twitter API, which allows for the bulk downloading of Twitter user information. You can see a description of this part of the API here, along with definitions for the variables returned. Here is a list of the most useful of the variables returned by the API for each user (modified descriptions taken from the Twitter website):

  • created_at: The UTC datetime that the user account was created on Twitter.
  • description: The user-defined UTF-8 string describing their account.
  • entities: Entities which have been parsed out of the url or description fields defined by the user.
  • favourites_count: The number of tweets this user has favorited in the account's lifetime. British spelling used in the field name for historical reasons.
  • followers_count: The number of followers this account currently has. We can also get a list of these followers by using different parts of the API.
  • friends_count: The number of users this account is following (AKA their "followings"). We can also get a list of these friends using other API methods.
  • id: The integer representation of the unique identifier for this User. This number is greater than 53 bits and some programming languages may have difficulty/silent defects in interpreting it. Using a signed 64 bit integer for storing this identifier is safe. Use id_str for fetching the identifier to stay on the safe side. See Twitter IDs, JSON and Snowflake.
  • id_str: The string representation of the unique identifier for this User. Implementations should use this rather than the large, possibly un-consumable integer in id.
  • lang: The BCP 47 code for the user's self-declared user interface language.
  • listed_count: The number of public lists that this user is a member of.
  • location: The user-defined location for this account's profile. Not necessarily a location nor parseable.
  • name: The name of the user, as they've defined it. Not necessarily a person's name.
  • screen_name: The screen name, handle, or alias that this user identifies themselves with. screen_names are unique but subject to change. Use id_str as a user identifier whenever possible. Typically a maximum of 15 characters long, but some historical accounts may exist with longer names.
  • statuses_count: The number of tweets (including retweets) issued by the user to date.
  • time_zone: A string describing the Time Zone this user declares themselves within.
  • url: A URL provided by the user in association with their profile.
  • withheld_in_countries: When present, indicates a textual representation of the two-letter country codes this user is withheld from. See New Withheld Content Fields in API Responses.
  • withheld_scope: When present, indicates whether the content being withheld is the "status" or a "user." See New Withheld Content Fields in API Responses.

Second, beginning in 2013 Twitter made it more difficult to access the API. Now OAuth authentication is needed for almost everything. This means you need to go on Twitter and create an ‘app.’ You won’t actually use the app for anything — you just need the password and authentication code. You can create your app here. For more detailed instructions on creating the app take a look at this presentation.

Third, as a Python ‘wrapper’ around the Twitter API I use Twython, an add-on package for Python. You will need to install it, as well as simplejson (for parsing the JSON data that is returned by the API). Assuming you installed Anaconda Python, the simplest way is to use pip: on a Mac or Linux machine, you would simply open the Terminal and type pip install Twython and pip install simplejson.

The above steps can be a bit of a pain depending on your familiarity with UNIX, and they may take you a while, but you’ll only have to do them once.

Understanding the Code

At the end of this post I’ll show the entire script. For now, I’ll go over it in sections. The first line in the code is the shebang, which you’ll find at the top of most Python scripts.
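Something like the following (the exact form in the original script may differ):

    #!/usr/bin/env python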


 

Lines 3 – 10 contain the docstring, also a Python convention. This is a triple-quoted, multi-line string that describes the code. For single-line comments, use the # symbol at the start of the line.
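An illustrative docstring (not the original wording):

    """
    Grab profile information for a set of Twitter users via the
    users/lookup API method and write selected fields to a
    date-stamped, tab-delimited text file.
    """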


 

Next we’ll import several Python packages needed to run the code.
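A plausible set of imports, based on the packages discussed above (the original may differ slightly):

    import datetime                  # for the date-stamped output file name
    import simplejson as json        # for inspecting the JSON returned by the API
    from twython import Twython      # Python wrapper around the Twitter API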


 

In lines 18-22 we will create day, month, and year variables to be used for naming the output file.
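For example, using the datetime module (a sketch of what those lines might contain):

    now = datetime.datetime.now()
    day = now.day
    month = now.month
    year = now.year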


 

Modify the Code

There are two areas you’ll need to modify. First, you’ll need to add your OAuth tokens to lines 26-30.
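The four values come from the app you created on Twitter; the strings below are placeholders:

    APP_KEY = 'YOUR_APP_KEY'
    APP_SECRET = 'YOUR_APP_SECRET'
    OAUTH_TOKEN = 'YOUR_OAUTH_TOKEN'
    OAUTH_TOKEN_SECRET = 'YOUR_OAUTH_TOKEN_SECRET'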


 

Second, you’ll need to modify lines 32-35 with the ids from your set of Twitter users. If you don’t have user_ids for these, you can use screen_names and change line 39 to ‘screen_name = ids’.
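The users/lookup method accepts a comma-separated string of up to 100 IDs; the IDs below are placeholders, not real accounts:

    ids = '1234567890,2345678901,3456789012'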


 

Line 39 is where we actually access the API and grab the data. If you’ve read over the description of the users/lookup API, you know that this method allows you to grab user information on up to 100 Twitter IDs with each API call.
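With Twython, that call might look like this (lookup_user is Twython’s wrapper for the users/lookup method):

    twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
    users = twitter.lookup_user(user_id=ids)   # or screen_name=ids, as noted above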


 

Understanding JSON

Now, a key step to this is understanding the data that are returned by the API. As is increasingly common with Web data, this API call returns data in JSON format. Behind the scenes, Python has grabbed this JSON file, which has data on the 32 Twitter users listed above in the variable ids. Each user is an object in the JSON file; objects are delimited by left and right curly braces. The snippet below shows one way to inspect a single user’s object:
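A quick way to inspect that structure is to pretty-print the first record returned, using simplejson (imported above as json in the sketch):

    print(json.dumps(users[0], indent=4))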


 

JSON output can get messy, so it’s useful to bookmark a JSON viewer for formatting it. The API returns 38 different variables for each user, one per row, arranged in key: value (or variable: value) pairs. For instance, the value of the screen_name variable for this user is GPforEducation. Now, we do not always want to use all of these variables, so what we’ll do is pick and label those that are most useful for us.

So, we first initialize the output file, putting in the day/month/year in the file name, which is useful if you’re regularly downloading this user information:
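For instance (the file-name pattern here is an assumption):

    outfile_name = 'twitter_user_data_%s-%s-%s.txt' % (month, day, year)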


 

We then create a variable with the names for the variables (columns) we’d like to include in our output file, open the output file, and write the header row:
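A sketch of that step, using a subset of the variables described in the table above:

    variables = ['id', 'screen_name', 'name', 'created_at', 'followers_count',
                 'friends_count', 'statuses_count', 'description']
    f = open(outfile_name, 'w')
    f.write('\t'.join(variables) + '\n')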


 

Recall that in line 39 we grabbed the user information on the 32 users and assigned these data to the variable users. The final block of code in lines 55-90 loops over each of these IDs (each one a different object in the JSON file), creates the relevant variables, and writes a new row of output. Here are the first few rows of that block:
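The opening of that loop might look like this (a sketch in Python 2.7, per the Anaconda recommendation above; the original runs through all of the chosen variables):

    for entry in users:
        r = {}
        r['id'] = entry['id']
        r['screen_name'] = entry['screen_name']
        r['followers_count'] = entry['followers_count']
        # ...and so on for the other variables in the list above...
        row = [r['id'], r['screen_name'], r['followers_count']]
        f.write('\t'.join(unicode(v) for v in row).encode('utf-8') + '\n')
    f.close()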


 

If you compare this code with the raw JSON output, you can see what we’re doing here: creating an empty Python dictionary, which we’ll call ‘r’, to hold our data for each user; creating variables called id and screen_name; and assigning them the values held in the entry[‘id’] and entry[‘screen_name’] elements of the JSON output. This is all placed inside a Python for loop; we could have called ‘entry’ anything, so long as we’re consistent.

Now let’s put the whole thing together. To recap, this entire script loops over each of the Twitter accounts in the ids variable and, for each one, grabs its profile information and adds it as a row to the output file (a text file that can be imported into Excel, etc.). The filename given to the output file varies according to the date. Now you can download this script, modify the lines noted above, and be on your way to downloading your own Twitter data!
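Putting the snippets together, a minimal end-to-end sketch looks like this (Python 2.7; the OAuth strings and user IDs are placeholders, and the original script’s exact variables and line numbering differ):

    #!/usr/bin/env python
    """
    Grab profile information for a set of Twitter users via the
    users/lookup API method and write selected fields to a
    date-stamped, tab-delimited text file.
    """
    import datetime
    from twython import Twython

    # Date variables used to name the output file.
    now = datetime.datetime.now()
    day, month, year = now.day, now.month, now.year

    # OAuth credentials from the app you created on Twitter (placeholders).
    APP_KEY = 'YOUR_APP_KEY'
    APP_SECRET = 'YOUR_APP_SECRET'
    OAUTH_TOKEN = 'YOUR_OAUTH_TOKEN'
    OAUTH_TOKEN_SECRET = 'YOUR_OAUTH_TOKEN_SECRET'

    # Comma-separated string of up to 100 Twitter user IDs (placeholders).
    ids = '1234567890,2345678901,3456789012'

    # Access the users/lookup method of the API and grab the data.
    twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
    users = twitter.lookup_user(user_id=ids)   # or screen_name=ids

    # Initialize the output file, with the date in the file name.
    outfile_name = 'twitter_user_data_%s-%s-%s.txt' % (month, day, year)
    variables = ['id', 'screen_name', 'name', 'created_at', 'followers_count',
                 'friends_count', 'statuses_count', 'description']

    f = open(outfile_name, 'w')
    f.write('\t'.join(variables) + '\n')

    # Loop over each user object in the JSON and write one row per user.
    for entry in users:
        r = {}
        for v in variables:
            r[v] = entry.get(v, '')
        # Collapse whitespace so multi-line descriptions don't break the rows.
        r['description'] = ' '.join(unicode(r['description']).split())
        f.write('\t'.join(unicode(r[v]) for v in variables).encode('utf-8') + '\n')
    f.close()

    print 'Wrote %d users to %s' % (len(users), outfile_name)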