I often get requests to explain how I obtained the data I used in a particular piece of academic research. I am always happy to share my code along with my data (and frankly, I think academics who are unwilling to share should be forced to take remedial Kindergarten). The problem is that many of those who would like to use the code don’t know where to start. There are too many new steps involved for the process to be accessible. So I’ll try to walk you through the basic steps here in a series of periodic tutorials.
To start, Python is a great tool for grabbing data from the Web. Generally speaking, you’ll get your data either by accessing an API (Application Programming Interface) or by ‘scraping’ the data off a webpage. The easiest scenario is when a site makes an API available. Twitter is such a site. Accordingly, as an introductory example I’ll walk you through the basic steps of using Python to access the Twitter API, read and manipulate the data returned, and save the output.
In any given project I will run a number of different scripts to grab all of the relevant data. We’ll start with a simple example: a script designed to grab information on a set of Twitter users. As stated above, we get the data by tapping into the Twitter API. For our purposes, think of the Twitter API as a set of routines Twitter has set up to let us access specific chunks of data. I use Python for this, given its many benefits, though any programming language that can make web requests will work. If you are really uninterested in programming and have more limited data needs, you can use NodeXL (if you’re on a Windows machine) or other services for gathering the data. If you do go the Python route, I highly recommend you install Anaconda Python 2.7: it’s free, it works on Mac and PC, and it includes most of the add-on packages necessary for scientific computing. In short, you pick a programming language, learn some of it, and then develop code that will extract and process the data for you. Even though you can start with my code as a base, it is still useful to understand the basics, so I highly recommend working through some of the many excellent tutorials now available online for learning how to use and run Python. A great place to start is Codecademy.
Accessing the Twitter API
Almost all of my Twitter code grabs data from the Twitter API. The first step is to determine which part of the Twitter API you’ll need to access to get the type of data you want — there are different API methods for accessing information on tweets, retweets, users, following relationships, etc. The code we’re using here plugs into the users/lookup part of the Twitter API, which allows for the bulk downloading of Twitter user information. You can see a description of this part of the API here, along with definitions for the variables returned. Here is a list of the most useful of the variables returned by the API for each user (modified descriptions taken from the Twitter website):
| Field | Description |
|---|---|
| created_at | The UTC datetime that the user account was created on Twitter. |
| description | The user-defined UTF-8 string describing their account. |
| entities | Entities which have been parsed out of the url or description fields defined by the user. |
| favourites_count | The number of tweets this user has favorited in the account's lifetime. British spelling used in the field name for historical reasons. |
| followers_count | The number of followers this account currently has. We can also get a list of these followers by using different parts of the API. |
| friends_count | The number of users this account is following (AKA their "followings"). We can also get a list of these friends using other API methods. |
| id | The integer representation of the unique identifier for this User. This number is greater than 53 bits and some programming languages may have difficulty/silent defects in interpreting it. Using a signed 64 bit integer for storing this identifier is safe. Use id_str for fetching the identifier to stay on the safe side. See Twitter IDs, JSON and Snowflake. |
| id_str | The string representation of the unique identifier for this User. Implementations should use this rather than the large, possibly un-consumable integer in id. |
| lang | The BCP 47 code for the user's self-declared user interface language. |
| listed_count | The number of public lists that this user is a member of. |
| location | The user-defined location for this account's profile. Not necessarily a location nor parseable. |
| name | The name of the user, as they've defined it. Not necessarily a person's name. |
| screen_name | The screen name, handle, or alias that this user identifies themselves with. screen_names are unique but subject to change. Use id_str as a user identifier whenever possible. Typically a maximum of 15 characters long, but some historical accounts may exist with longer names. |
| statuses_count | The number of tweets (including retweets) issued by the user to date. |
| time_zone | A string describing the Time Zone this user declares themselves within. |
| url | A URL provided by the user in association with their profile. |
| withheld_in_countries | When present, indicates a textual representation of the two-letter country codes this user is withheld from. See New Withheld Content Fields in API Responses. |
| withheld_scope | When present, indicates whether the content being withheld is the "status" or a "user." See New Withheld Content Fields in API Responses. |
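The id versus id_str advice in the table is easy to check for yourself. The sketch below uses a hypothetical, trimmed user record (the values are placeholders, not real API output) to show that Python integers hold the full identifier, while a trip through a 64-bit float, which is how some other languages read JSON numbers, changes it.

```python
# Hypothetical, trimmed user record; values are placeholders, not real API output.
user = {
    "id": 2**53 + 1,             # an integer just past the 53-bit limit
    "id_str": str(2**53 + 1),    # the same identifier stored as text
    "screen_name": "example_user",
}

# Python integers are arbitrary precision, so the integer id is intact here.
ids_match = str(user["id"]) == user["id_str"]

# A 64-bit float has only 53 bits of precision, so the round trip alters the id.
survives_float = int(float(user["id"])) == user["id"]

print(ids_match)        # True
print(survives_float)   # False
```

In Python either field works, but storing id_str keeps the data safe if the output file is later opened in a tool that treats large numbers as floats.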
Second, beginning in 2013 Twitter made it more difficult to access the API. OAuth authentication is now needed for almost everything. This means you need to go on Twitter and create an ‘app.’ You won’t actually use the app for anything; you just need the credentials it generates (the app key and secret, plus the OAuth token and token secret). You can create your app here. For more detailed instructions on creating the app take a look at this presentation.
Third, as a Python ‘wrapper’ around the Twitter API I use Twython. This is a package that is an add-on to Python. You will need to install this as well as simplejson (for parsing the JSON data that is returned by the API). Assuming you installed Anaconda Python, the simplest way is to use pip. On a Mac or Linux machine, you would simply open the Terminal and type pip install Twython and pip install simplejson.
The above steps can be a bit of a pain depending on your familiarity with UNIX, and they may take you a while. But you’ll only have to do them once.
Understanding the Code
At the end of this post I’ll show the entire script. For now, I’ll go over it in sections. The first line in the code is the shebang, which you’ll find at the top of many Python scripts; it tells a Unix-like system which interpreter to use when the file is run directly.
Lines 3 – 10 contain the docstring — also a Python convention. This is a multi-line comment that describes the code. For single-line comments, use the # symbol at the start of the line.
Next we’ll import several Python packages needed to run the code.
In lines 18-22 we will create day, month, and year variables to be used for naming the output file.
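As a rough illustration of the opening pieces described above (shebang, docstring, imports, and date variables), here is a minimal sketch. It is not the original script, so its line numbers will not match the ones quoted in the text. The docstring wording and the import fallbacks are assumptions added so the sketch runs even before the add-on packages are installed, and it is written for Python 3.

```python
#!/usr/bin/env python
"""
Use the Twitter API to grab user information for a list of accounts
and export it to a text file.

(Illustrative docstring; the wording is an assumption, not the original.)
"""

import datetime

# simplejson and Twython are the add-on packages named in the post. The
# fallbacks are an addition so this sketch still runs without them installed.
try:
    import simplejson as json
except ImportError:
    import json          # standard-library module with the same core functions
try:
    from twython import Twython
except ImportError:
    Twython = None       # placeholder until the package is installed

# Day, month, and year, used later when naming the output file.
now = datetime.datetime.now()
day = now.day
month = now.month
year = now.year

print("%i/%i/%i" % (day, month, year))
```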
Modify the Code
There are two areas you’ll need to modify. First, you’ll need to add your OAuth tokens to lines 26-30.
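A minimal sketch of what the credentials step might look like with Twython follows. The four placeholder strings stand in for the values shown on your app's page; the helper names make_client and credentials_filled_in are hypothetical and exist only for this illustration.

```python
# Placeholder credentials: replace each value with the one from your Twitter app.
APP_KEY = "YOUR_APP_KEY"
APP_SECRET = "YOUR_APP_SECRET"
OAUTH_TOKEN = "YOUR_OAUTH_TOKEN"
OAUTH_TOKEN_SECRET = "YOUR_OAUTH_TOKEN_SECRET"


def credentials_filled_in():
    """Return True once every placeholder above has been replaced."""
    values = [APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET]
    return all(bool(v) and not v.startswith("YOUR_") for v in values)


def make_client():
    """Return a Twython client built from the four credentials (hypothetical helper)."""
    # Imported here so this file loads even before Twython is installed.
    from twython import Twython
    return Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)


print(credentials_filled_in())   # False until the placeholders are replaced
```

Keeping the credentials in one place near the top makes them easy to find and, just as importantly, easy to strip out before sharing the script.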
Second, you’ll need to modify lines 32-35 with the ids from your set of Twitter users. If you don’t have user_ids for these, you can use screen_names and change line 39 to ‘screen_name = ids’
Line 39 is where we actually access the API and grab the data. If you’ve read over the description of users/lookup API, you know that this method allows you to grab user information on up to 100 Twitter IDs with each API call.
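If your list is longer than 100 accounts, one way to respect the limit is to split the ids into batches and make one call per batch. The sketch below assumes an authenticated Twython client is passed in as `client`; the function names batches and lookup_all are hypothetical. The `key` argument shows the switch to screen names mentioned above.

```python
def batches(id_list, size=100):
    """Split a list of ids into chunks of at most `size` items."""
    return [id_list[i:i + size] for i in range(0, len(id_list), size)]


def lookup_all(client, id_list, key="user_id"):
    """Call client.lookup_user once per batch and collect every user record.

    `client` is assumed to be an authenticated Twython instance.
    Pass key="screen_name" when id_list holds screen names instead of ids.
    """
    users = []
    for batch in batches(id_list):
        # The API takes the ids as a single comma-separated string.
        params = {key: ",".join(batch)}
        users.extend(client.lookup_user(**params))
    return users


# 250 made-up ids become three batches.
example_ids = [str(n) for n in range(1, 251)]
print([len(b) for b in batches(example_ids)])   # [100, 100, 50]
```

This sketch ignores rate limits, so a very long list may still need pauses between calls.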
Now, a key step to this is understanding the data that are returned by the API. As is increasingly common with Web data, this API call returns data in JSON format. Behind the scenes, Python has grabbed this JSON file, which has data on the 32 Twitter users listed above in the variable ids. Each user is an object in the JSON file; objects are delimited by left and right curly braces, as shown here for one of the 32 users:
JSON output can get messy, so it’s useful to bookmark a JSON viewer for formatting JSON output. What you’re seeing above is 38 different variables returned by the API — one for each row — and arranged in key: value (or variable: value) pairs. For instance, the value for the screen_name variable for this user is GPforEducation. Now, we do not always want to use all of these variables, so what we’ll do is pick and label those that are most useful for us.
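To see how a JSON object maps onto key: value pairs in Python, here is a minimal sketch using a hypothetical, heavily trimmed record (a real one has far more fields). Only the screen_name value comes from the text above; the other values are placeholders. Twython hands back already-parsed Python objects, so json.loads appears here only to make the mapping visible.

```python
import json

# Hypothetical, trimmed response: a JSON array holding one user object.
# Only screen_name comes from the post; the other values are placeholders.
raw = """
[
  {
    "id_str": "0",
    "screen_name": "GPforEducation",
    "followers_count": 0,
    "description": "placeholder text"
  }
]
"""

users = json.loads(raw)   # JSON array -> Python list
first = users[0]          # each JSON object -> one Python dictionary

# Print every key: value pair, one per row.
for key in sorted(first):
    print("%s: %s" % (key, first[key]))

print(first["screen_name"])   # GPforEducation
```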
So, we first initialize the output file, putting in the day/month/year in the file name, which is useful if you’re regularly downloading this user information:
We then create a variable with the names for the variables (columns) we’d like to include in our output file, open the output file, and write the header row:
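The two steps just described, naming the output file and writing the header row, might look something like the sketch below. The file name pattern, the choice of columns, and the variable names outfn, fields, and outpath are assumptions for illustration. The file is written to the system's temporary directory here only so that the example leaves nothing behind in your working folder.

```python
import datetime
import os
import tempfile

now = datetime.datetime.now()

# File name carrying day, month, and year (the exact pattern is an assumption).
outfn = "twitter_user_data_%i.%i.%i.txt" % (now.day, now.month, now.year)

# Column names for the header row (an illustrative subset of the API fields).
fields = ["id", "screen_name", "name", "created_at",
          "followers_count", "friends_count", "statuses_count"]

# Open the output file and write a tab-separated header row.
outpath = os.path.join(tempfile.gettempdir(), outfn)
with open(outpath, "w") as outfp:
    outfp.write("\t".join(fields) + "\n")

print(outfn)
```

Tab-separated text opens cleanly in Excel and most statistics packages, which is why it is a convenient default for this kind of output.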
Recall that in line 39 we grabbed the user information on the 32 users and assigned these data to the variable users. The final block of code in lines 55-90 loops over each of these users (each one a different object in the JSON file), creates the relevant variables, and writes a new row of output. Here are the first few lines:
If you compare this code to the raw JSON output shown earlier, you can see what is happening. We create an empty Python dictionary, which we’ll call ‘r’, to hold our data for each user. We then create variables called id and screen_name and assign to them the values held in the entry['id'] and entry['screen_name'] elements of the JSON output. This is all placed inside a Python for loop; we could have called ‘entry’ anything so long as we’re consistent.
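Here is a minimal sketch of that loop, run against a hypothetical stand-in for the parsed API response (the two records and their values are placeholders). The names r and entry follow the text; fields and rows are assumptions. The use of .get() for an optional field is an addition that avoids an error when a record lacks that key.

```python
# Hypothetical stand-in for the parsed API response (placeholder values).
users = [
    {"id": 1, "screen_name": "example_one", "followers_count": 10},
    {"id": 2, "screen_name": "example_two"},   # followers_count left out on purpose
]

fields = ["id", "screen_name", "followers_count"]
rows = []

for entry in users:                  # 'entry' is just the loop variable's name
    r = {}                           # empty dictionary for this user's data
    r["id"] = entry["id"]
    r["screen_name"] = entry["screen_name"]
    # .get() returns a blank instead of raising KeyError when the key is absent.
    r["followers_count"] = entry.get("followers_count", "")
    rows.append("\t".join(str(r[f]) for f in fields))

print("\n".join(rows))
```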
Now let’s put the whole thing together. To recap, what this entire script does is to loop over each of the Twitter accounts in the ids variable — and for each one it will grab its profile information and add that to a row of the output file (a text file that can be imported into Excel, etc.). The filename given to the output file varies according to the date. Now you can download this script, modify the lines noted above, and be on your way to downloading your own Twitter data!
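For readers who want to see the pieces in one place, here is a compact end-to-end sketch under stated assumptions. It is not the original script: the function names, the file name pattern, and the short list of columns are all hypothetical, the credentials and ids are placeholders you must replace, and it is written for Python 3.

```python
#!/usr/bin/env python
"""Sketch: look up Twitter user profiles and save them as tab-separated text."""

import datetime


def dated_filename(now=None):
    """Build an output file name that includes the day, month, and year."""
    now = now or datetime.datetime.now()
    return "twitter_user_data_%i.%i.%i.txt" % (now.day, now.month, now.year)


def save_profiles(client, ids, fields, path):
    """Look up `ids`, write one row per user to `path`, and return the row count."""
    users = client.lookup_user(user_id=ids)
    with open(path, "w", encoding="utf-8") as outfp:
        outfp.write("\t".join(fields) + "\n")            # header row
        for entry in users:
            r = {}
            for f in fields:
                r[f] = entry.get(f, "")                  # blank if the field is absent
            outfp.write("\t".join(str(r[f]) for f in fields) + "\n")
    return len(users)


def main():
    # Imported here so the helpers above load even without Twython installed.
    from twython import Twython

    twitter = Twython("YOUR_APP_KEY", "YOUR_APP_SECRET",
                      "YOUR_OAUTH_TOKEN", "YOUR_OAUTH_TOKEN_SECRET")
    ids = "REPLACE_WITH_COMMA_SEPARATED_USER_IDS"
    fields = ["id", "screen_name", "name", "followers_count"]
    count = save_profiles(twitter, ids, fields, dated_filename())
    print("Wrote %i rows" % count)
```

Call main() once the placeholders are replaced. The columns chosen here avoid free-text fields such as description, which can contain tabs or line breaks and would need cleaning before being written to a tab-separated file.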