
Python Tutorials for Downloading Twitter Data


I often get requests to explain how I obtained the data I used in a particular piece of academic research. I am always happy to share my code along with my data. Having been through the learning process myself about 5 years ago, I understand the confusion and frustration that can go along with learning a programming language. There are a number of new concepts to absorb all at once, which can make it difficult to wrap your head around any one of them.

For this reason, I have created a number of tutorials. Below I list all of these tutorials to date. They are designed primarily for the user who is a complete neophyte to computer programming, though intermediate and some advanced users may find some of the actual code helpful. For those of you who are new to Python, I recommend going through these tutorials in order. Each introduces only one or two new concepts at a time, which will make the learning process easier.

I should also mention that these tutorials focus on downloading and analyzing Twitter data. For social media researchers this is the best place to start. Once you are comfortable with that, it is easier to branch out to studying Facebook and other social media platforms.

  • Why I Use Python for Academic Research – Overview of how I use Python for data gathering, data manipulation, and data visualization and analysis.
  • Python: Where to Start? – Overview of how to think about using and learning Python. Recommendation on which version of Python to install.
  • Running your first code – Running your first Python code using my preferred installation: Anaconda Python 2.7.
  • Four ways to run your code – Shows my preferred methods for running Python code.
  • Setting up Your Computer to Use My Python Code – Covers pre-requisites for successfully running my Twitter code -- additional modules to install, database set-up, etc.
  • Setting up Access to the Twitter API – Getting your Twitter API 'library card' (your API key password).
  • Using Python to Grab Twitter User Data – The first Twitter code you should run. Introduction to how to access the Twitter API and then read the JSON data that is returned. For users' Twitter account data (such as when the account was created, how many followers, etc.). Follow-up tutorials cover how to download tweets.
  • Tag Cloud Tutorial – How to create a tag cloud from a set of downloaded tweets.
  • Downloading Hashtag Tweets – How to download tweets with a given hashtag and save them into an SQLite database. Wayne Xu's tutorial.
  • Downloading Tweets by a List of Users – Downloading tweets sent by a number of different Twitter users. For those who are new to Python and/or downloading data from the Twitter API. Wayne Xu's tutorial.
  • Downloading Tweets-Take II – Downloading tweets sent by a number of different Twitter users: Take II. My own, more detailed explanation of SQLite, JSON, and the Twitter API, along with a full copy of the Python code.

 

I hope this helps. Happy coding!




Setting up Your Computer to Use My Python Code for Downloading Twitter Data

I frequently get requests for how to download social media data in general, as well as for help on how to run code I have written to download and analyze the data I analyzed for a particular piece of research. Often, these requests are from people who are excited about doing social media research but have yet to gain much experience in using computer programming. For this reason, I have created a set of tutorials designed precisely for such users.

I am always happy to share the code I’ve used in my research. That said, there are barriers to actually using someone else’s code. One of the key barriers is getting your own computer “set up” to actually run the code. The aim of this post is to walk you through the steps needed to run and modify the code I’ve written to download and analyze social media data.

Step One: Download and Install Python

As I write about here, for Unix, Windows, and Mac users alike I’d recommend you install Anaconda Python 2.7. This distribution of Python is free and easy to install. Moreover, it includes most of the add-on packages necessary for scientific computing, including Numpy, Pandas, iPython, Statsmodels, Sqlalchemy, and Matplotlib.

Go to this tutorial for instructions on how to install and run Anaconda Python.

Step Two: Install Python Add-On Packages

Anaconda Python comes pre-installed with almost everything you need. There are a couple of modules you will have to install manually:

Twython — for accessing the Twitter data

and 

simplejson — for parsing the JSON data that is returned by the Twitter API (Application Programming Interface).

Assuming you are using Anaconda Python, the simplest way to install these is with pip. On a Mac or Linux machine, simply open the Terminal and type pip install Twython and then pip install simplejson. If you’re on a PC, please take a look at Wayne Xu’s tutorial (see Slide #8).
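Once the installs finish, a quick way to confirm that both packages are visible to your Python installation is to import them and print their version numbers (a minimal check, nothing more):

```python
# Quick sanity check that the two add-on packages installed correctly
import twython
import simplejson

print(twython.__version__)
print(simplejson.__version__)
```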

Step Three: The Database

I generally download my Twitter data into an SQLite database. SQLite is a common relational database. It is lightweight and easy to use, and support for it (the sqlite3 module) ships with Anaconda Python, so there is nothing extra to install.

You may already know other ways of downloading social media data, such as NodeXL in Excel for Windows. Why then would you want to use SQLite? There are two reasons. First, SQLite is better plugged into the Python architecture, which will come in handy when you are ready to actually manipulate and analyze the data. Second, if you are downloading tweets more than once, a database is a much better solution for a simple reason: it allows you to check for duplicates in the database and stop them from being inserted. This is an almost essential feature that you cannot easily implement outside of a database.
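Here is a minimal sketch of that duplicate check, assuming a hypothetical tweets table keyed on the tweet id (the database file, table, and column names are placeholders, not the exact schema my scripts use):

```python
import sqlite3

# Hypothetical schema: the tweet id is the primary key, so a second attempt
# to insert the same tweet is rejected by the database
conn = sqlite3.connect('tweets.db')
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS tweets
               (tweet_id TEXT PRIMARY KEY, created_at TEXT, tweet_text TEXT)''')

def insert_tweet(tweet_id, created_at, tweet_text):
    # INSERT OR IGNORE skips any row whose primary key already exists
    cur.execute('INSERT OR IGNORE INTO tweets VALUES (?, ?, ?)',
                (tweet_id, created_at, tweet_text))

insert_tweet('123', 'Sun Nov 23 18:00:00 +0000 2014', 'example tweet')
insert_tweet('123', 'Sun Nov 23 18:00:00 +0000 2014', 'example tweet')  # duplicate, ignored
conn.commit()
print(cur.execute('SELECT COUNT(*) FROM tweets').fetchone()[0])  # prints 1
conn.close()
```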

Also know that once you have downloaded the data into an SQLite database, you can view and edit the data in the same manner as an Excel file, and even export the data into CSV format for viewing in Excel. To do this, simply download and install Database Browser for SQLite. If you use Firefox, you can alternatively use a plug-in called SQLite Manager.

Step Four: Accessing the Twitter API

Almost all of my Twitter code grabs data from the Twitter API, which sets up procedures for reliably accessing the Twitter data. Beginning in 2013 Twitter made it more difficult to access its APIs. Now OAuth authentication is needed for almost everything. This means you need to go on Twitter and create an ‘app.’ You won’t actually use the app for anything — you just need the password and authentication code. You can create your app here. For more detailed instructions on creating the app take a look at slides 4 through 6 of the tutorial by Wayne Xu (my excellent former PhD student).

Step Five: Start Using the Code

Once you’ve completed the above four steps you will have all the necessary building blocks for successfully running my Python scripts for downloading and parsing Twitter data. Happy coding!




Tag Cloud Tutorial


In this post I’ll provide a brief tutorial on how to create a tag cloud, as seen here.

First, this assumes you have downloaded a set of tweets into an SQLite database. If you are using a different database please modify accordingly. Also, to get to this stage, work through the first 8 tutorials listed here (you can skip over the seventh tutorial on downloading tweets by a list of users).

Here is the full code for processing the Twitter data you’ve downloaded so that you can generate a tag cloud:
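A stripped-down version looks like this. Note that the database file (tweets.db), the table name (tweets), and the column names (tweet_id, hashtags) are placeholders you will need to adjust to your own set-up, and that the line numbers mentioned in the walkthrough below refer to the full script rather than to this condensed sketch:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Sketch: pull the hashtags out of an SQLite database of downloaded tweets
and write them to a text file that can be pasted into Wordle.
"""

import sqlite3
import codecs

# Connect to the database and select every tweet
# (assumes a 'tweets' table with 'tweet_id' and 'hashtags' columns)
conn = sqlite3.connect('tweets.db')
cursor = conn.cursor()
tweets = cursor.execute('SELECT tweet_id, hashtags FROM tweets').fetchall()

# Dictionary that will hold the hashtags found in each tweet
all_text = {}

# Loop over the tweets, keeping each tweet's id and its hashtags
# (assumes the hashtags column stores the tags as one space-separated string)
for tweet in tweets:
    tweet_id = tweet[0]
    hashtags = tweet[1]
    if hashtags:
        all_text[tweet_id] = hashtags

# Flatten the dictionary into one long string and write it to a text file
output = ' '.join(all_text.values())
with codecs.open('hashtags_for_wordle.txt', 'w', encoding='utf-8') as f:
    f.write(output)

conn.close()
```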

Now I’ll try to walk you through the basic steps here. For those of you who are completely new to Python, you should work through some of my other tutorials.

Understanding the Code

The first line in the code above is the shebang — you’ll find this in all Python code.


 

Lines 3 – 6 contain the docstring — also a Python convention. This is a multi-line comment that describes the code. For single-line comments, use the # symbol at the start of the line.


 

Next we’ll import several Python packages needed to run the code.


 

In lines 14-16 we create a connection with the SQLite database, make a query to select all of the tweets in the database, and assign the returned tweets to the variable tweets.


 

Line 21 creates an empty dictionary in which we will place all of the hashtags from each tweet.


 

In lines 23-36 we loop over each tweet. First we identify the two specific columns in the database we’re interested in (the tweet id and the hashtags column), then add the tags to the all_text variable created earlier.


 

Finally, in lines 43-46 we translate the all_text variable from a dictionary to a string, then output it to a text file.


 

Once you’ve got this text file, open it, copy all of the text, and use it to create your own word cloud on Wordle.

I hope this helps. If you need help with actually getting the tweets into your database, take a look at some of the other tutorials I’ve posted. If you have any questions, let me know, and have fun with the data!




#ARNOVA14 – Tag cloud of ARNOVA 2014 Tweets

Word clouds are generally not all that helpful given how the words are taken out of their context (sentences). In certain settings, however, they do provide meaningful information. Hashtags are one of those contexts — they are meant to be single words. The tags denote ideas or topics or places. By examining the hashtags, we can thus gain an appreciation for the most salient topics ARNOVANs are tweeting about. With that, here is a word cloud generated using all of the hashtags included in all 739 tweets sent with the #arnova14 hashtag as of Sunday, November 23rd @ 1pm EST.

Here is a first cloud, with the #arnova14 tag excluded.

[Tag cloud of #arnova14 hashtags, excluding the #arnova14 tag itself]

The larger the tag, the more frequently it was used. You can see that two tags predominated, #nonprofit and #allianceR2P. To help see other topics, here’s a final tag cloud without these two tags.

[Tag cloud of #arnova14 hashtags, excluding #arnova14, #nonprofit, and #allianceR2P]

It was also refreshing to see linguistic diversity in the hashtags — especially a large number of Arabic tags. Unfortunately, the word cloud visualizer I used (Wordle) could not recognize these (if anyone knows of a good workaround please let me know).

If anyone is interested in the Python code used to generate this just shoot me an email.

I’ll leave the analysis up to you. Some interesting things here! I hope you all had a great conference and see you next year!




Downloading Tweets by a List of Users

This post is a brief, temporary attempt at pointing people in the right direction for a common task: downloading tweets sent by a number of different Twitter users. It is directed at those who are new to Python and/or downloading data from the Twitter API.

OK, so I am assuming you have made it through (or intend to work through) the set-up tutorials listed above, which should help you get set up and working with Python.

Another detailed tutorial I have created, Using Python to Grab Twitter User Data, is intended to serve as an introduction to how to access the Twitter API and then read the JSON data that is returned. This gets you the account-level data for a Twitter user (such as when the account was created), but not the actual tweets. That is a more difficult process, but not a huge leap once you’ve made it through all of the above steps.

I have yet to upload a tutorial showing how to use my code to download tweets by a list of Twitter users, but fortunately my PhD student Wayne Xu has. This tutorial helps fill in the blanks. It will help walk you through the steps of getting a Twitter developer account (so you can access the API), of which database to use (SQLite), and then how to access and store data from the Twitter user_timeline API, which allows you to download up to the last 3,200 tweets sent by each Twitter user.
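For orientation, here is a bare-bones sketch of a single call to the user_timeline API through Twython; the credential values and the screen name are placeholders, and the tutorial above covers the full version with database storage and paging:

```python
from twython import Twython

# Placeholders: paste in the four values from your own Twitter app
APP_KEY = 'YOUR_APP_KEY'
APP_SECRET = 'YOUR_APP_SECRET'
OAUTH_TOKEN = 'YOUR_OAUTH_TOKEN'
OAUTH_TOKEN_SECRET = 'YOUR_OAUTH_TOKEN_SECRET'

twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

# user_timeline returns up to 200 tweets per call; repeated calls that page
# backwards (via the max_id parameter) can reach roughly the last 3,200 tweets
tweets = twitter.get_user_timeline(screen_name='a_twitter_user', count=200)
for tweet in tweets:
    print(tweet['id_str'] + '\t' + tweet['text'])
```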

I hope this helps. Happy coding!




How Many Tags is Too Much?

Including a hashtag in a social media message can increase its reach. The question is, what is the ideal number of tags to include?

To answer this question, I examine 60,919 original tweets sent in 2014 by 99 for-profit and nonprofit member organizations of a large US health advocacy coalition.

First, the following table shows the distribution of the number of hashtags included in the organizations’ tweets. As shown in the table, almost a third (n = 19,747) of tweets do not have a hashtag, almost 39% (n = 23,493) have one hashtag, 19% include two hashtags (n = 11,836), 7% include three (n = 4,381), and 2% (n = 1,161) include 4. Few tweets contain more than 4 tags, though one tweet included a total of 10 different hashtags.

Frequency of Hashtags in 60,919 Original Tweets

# of Hashtags    Frequency
0                19,747
1                23,493
2                11,836
3                 4,381
4                 1,161
5                   227
6                    49
7                    13
8                     4
9                     7
10                    1
Total            60,919

Now let’s look at the effectiveness of messages with different numbers of hashtags. A good proxy for message effectiveness is retweetability, or how frequently audience members share the message with their followers. The following graph shows the average number of retweets received by tweets with different numbers of hashtags included.

[Graph: average number of retweets by number of hashtags included]

What we see is that more hashtags are generally better, but there are diminishing returns. Excluding the 25 tweets with more than 6 hashtags, the effectiveness of hashtag use peaks at 2 hashtags; tweets with more than 3 hashtags are only as effective as, or less effective than, tweets with no hashtags at all.

The evidence isn’t conclusive — especially given the anomalous findings for the few tweets with 7-10 tags — but there is strong support here for limiting your tweets to 1-2 hashtags if you want your message to reach the biggest possible audience.
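For those curious about the mechanics, the calculation behind the graph is a simple group-and-average. Here is a sketch assuming the tweets sit in a CSV file with hashtag_count and retweet_count columns (both the file name and the column names are placeholders, not the actual dataset):

```python
import pandas as pd

# Placeholder file: one row per original tweet, with the number of hashtags
# included and the number of retweets the tweet received
tweets = pd.read_csv('original_tweets.csv')

# Average retweets received, by number of hashtags included
avg_retweets = tweets.groupby('hashtag_count')['retweet_count'].mean()
print(avg_retweets)
```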




#ICA14 – Tag cloud of ICA 2014 Tweets

Word clouds are generally not all that helpful given how the words are taken out of their context (sentences). In certain settings, however, they do provide meaningful information. Hashtags are one of those contexts — they are meant to be single words. With that, here is a word cloud generated using all of the hashtags included in all 9,010 tweets sent with the #ica14 hashtag as of Monday, May 26th @ 1pm EST.

Here is a first cloud, with the #ica14 tag excluded.

[Tag cloud of #ica14 hashtags, excluding the #ica14 tag itself]

The larger the tag, the more frequently it was used. You can see that ICA section-related tags predominated, with #ica_cat and #ica_glbt leading the pack. (Note that, after downloading and processing the Twitter data in Python, I used Wordle to generate the word cloud. A quirk of Wordle is that it will split words with an underscore in them, so I’ve replaced underscores with hyphens. So, read “ica-cat” as “#ica_cat,” etc.)

To help highlight non-section tags, here is a version omitting any tag with “ica” in it.

[Tag cloud of #ica14 hashtags, omitting any tag containing "ica"]

The #qualpolcomm taggers were highly active. To help see other topics, here’s a final tag cloud without #qualpolcomm.

[Tag cloud of #ica14 hashtags, omitting "ica" tags and #qualpolcomm]

I’ll leave the analysis up to you. Some interesting patterns here!

If anyone is interested in the Python code used to generate this, let me know.




Replication Data

One of the core tenets of scientific research is replication. This gets at the reproducibility standard of scientific research. Despite calls for more replication in the social sciences, replication studies are still rather rare. In part, this is the product of journal editors’ and reviewers’ strong preferences for original research. It is also due to scholars not making their data publicly available. Many of my colleagues in academia, especially those who conduct experimental (lab experiments) research, do not typically make their data publicly available, though even here anonymized data should be available.

Replication datasets are not valuable solely for replication studies. In any dataset there are unused variables. A budding scholar or a research-oriented practitioner might be interested in your “leftover” variables or data points. You can’t foresee what others will find interesting.

What You Can Do

If you have data, share it. Not only is this being generous, but there is some evidence it may even be good for your career (citations, etc.). If you don’t have the capacity to warehouse it yourself, there are archives available for you. A good choice is Gary King’s Dataverse Network Project.

My Data

In the spirit of replication and extension, I would like to let people know which data sources I have available. If you’d like any of it, shoot me a message and we’ll figure out a way to get it to you.

Spanish Nationalist Event Data, 1977-1996

First, there is a replication archive of data I used in my dissertation. If you’re interested in Spanish nationalist contentious politics — specifically, data on violent and non-violent nationalist protests — check out http://contentiouspolitics.social-metrics.org/. This site was set up to make publicly available the data used in my dissertation (2000) and subsequent publications. There you will find background information on the project, codebooks, data, and copies of articles published using the data. You can browse and search the data and view various interactive graphs. The entire dataset is also available for downloading.

Twitter Data

Twitter data are generally publicly available. However, if you have a pre-defined set of users you can only grab their latest 3,200 tweets, which in some cases is only one year’s worth of data. And in other cases, especially if you want to follow a specific hashtag or collect user mentions or retweets, you can only go back one week in time. For this reason, it can sometimes be very helpful if someone else has the historical data you may need. Here are some of the historical data I have, showing the sample of organizations, date range for which data are available, and citations for articles that used the data. If you are interested in it for your own research purposes let me know.


  • Nonprofit Times 100 organizations — 2009
  • 145 advocacy nonprofit organizations — April 2012
  • 38 US community foundations (tweets as well as mentions) — July-August 2011

Facebook Data

Facebook data for organizations is typically public and can be downloaded via the Facebook Graph API. That said, I have some data available on a sample of large nonprofit organizations.

  • Nonprofit Times 100 organizations — December 2009
  • Nonprofit Times 100 organizations — April-May 2013

Website Data

Website data. Historical data can often be gathered from the Internet Archive Wayback Machine, but “robots exclusions” and other errors can prevent this. The following datasets are available:

  • 117 US community foundations (transparency and accountability data) – fall 2005 (Saxton, Guo, & Brown, 2007; Saxton & Guo, 2011)
  • 400 random US nonprofit organizations – fall 2007 (Saxton, Guo, & Neely, 2014)

This is only a partial list of the data I have available. I’ll add to this as more data become cleaned and available.





Does Twitter Matter?


Twitter is not the Gutenberg Press. The ‘Big Data’ revolution is over-hyped. Nevertheless, Twitter is significant in a number of ways:

  • For identifying trends.
  • For rapid, near real-time dissemination of news.
  • It has been used to track the progression of the flu and other infectious diseases.
  • It has played a mobilizational role in the Arab Spring and other social movement activities.
  • For on-the-ground reporting of news and events.
  • It allows you to decentralize research; what Nigel Cameron calls mutual curation and others call social curation.
  • For looking into the global cocktail party that is Twitter.
  • Allows one to take the pulse of the community on almost any given topic.
  • Twitter is the Big Data source.
  • It constitutes a coordination and communication tool for post-disaster mobilizations.
  • Twitter facilitates the rapid diffusion of ideas, rumors, opinion, sentiment, and news.
  • For professionals and organizations alike, it facilitates networking, relationship-building, and exposure.
  • Twitter is proving to be a powerful dialogic tool — for initiating and engaging in conversations.
  • Unlike other social media (e.g., Facebook), Twitter has a largely open model, allowing anyone to follow anyone else.
  • Social chatter has become a powerful tool in a variety of arenas:
    • Hedge fund managers listen in on social media conversations when making their decisions.
    • It is used to track and identify terrorists and extremists.
  • Facilitates the leveraging of what Granovetter (1973) calls weak ties.
  • Can be a force for good (e.g., Twestivals).

In short, Twitter is not just for sharing pictures of your lunch. In addition to all the silliness, Twitter has come to be the world’s premier message network. These messages are used in a wide variety of settings and for a broad range of purposes. And researchers are able to listen in — a boon to anyone interested in messages, conversations, networks, information, mobilization, diffusion, or any number of social science phenomena.




Using Python to Grab Twitter User Data


I often get requests to explain how I obtained the data I used in a particular piece of academic research. I am always happy to share my code along with my data (and frankly, I think academics who are unwilling to share should be forced to take remedial Kindergarten). The problem is, many of those who would like to use the code don’t know where to start. There are too many new steps involved for the process to be accessible. So, I’ll try to walk you through the basic steps here through periodic tutorials.

To start, Python is a great tool for grabbing data from the Web. Generally speaking, you’ll get your data by either accessing an API (Application Programming Interface) or by ‘scraping’ the data off a webpage. The easiest scenario is when a site makes available an API. Twitter is such a site. Accordingly, as an introductory example I’ll walk you through the basic steps of using Python to access the Twitter API, read and manipulate the data returned, and save the output.

In any given project I will run a number of different scripts to grab all of the relevant data. We’ll start with a simple example. This script is designed to grab the information on a set of Twitter users. First, as stated above, what we’re doing to get the data is tapping into the Twitter API. For our purposes, think of the Twitter API as a set of routines Twitter has set up for allowing us to access specific chunks of data. I use Python for this, given its many benefits, though any programming language will work. If you are really uninterested in programming and have more limited data needs, you can use NodeXL (if you’re on a Windows machine) or other services for gathering the data. If you do go the Python route, I highly recommend you install Anaconda Python 2.7 — it’s free, it works on Mac and PC, and includes most of the add-on packages necessary for scientific computing. In short, you pick a programming language, learn some of it, and then develop code that will extract and process the data for you. Even though you can start with my code as a base, it is still useful to understand the basics, so I highly recommend doing some of the many excellent tutorials now available online for learning how to use and run Python. A great place to start is Codecademy.

Accessing the Twitter API

Almost all of my Twitter code grabs data from the Twitter API. The first step is to determine which part of the Twitter API you’ll need to access to get the type of data you want — there are different API methods for accessing information on tweets, retweets, users, following relationships, etc. The code we’re using here plugs into the users/lookup part of the Twitter API, which allows for the bulk downloading of Twitter user information. You can see a description of this part of the API here, along with definitions for the variables returned. Here is a list of the most useful of the variables returned by the API for each user (modified descriptions taken from the Twitter website):

  • created_at – The UTC datetime that the user account was created on Twitter.
  • description – The user-defined UTF-8 string describing their account.
  • entities – Entities which have been parsed out of the url or description fields defined by the user.
  • favourites_count – The number of tweets this user has favorited in the account's lifetime. British spelling used in the field name for historical reasons.
  • followers_count – The number of followers this account currently has. We can also get a list of these followers by using different parts of the API.
  • friends_count – The number of users this account is following (AKA their "followings"). We can also get a list of these friends using other API methods.
  • id – The integer representation of the unique identifier for this User. This number is greater than 53 bits and some programming languages may have difficulty/silent defects in interpreting it. Using a signed 64 bit integer for storing this identifier is safe. Use id_str for fetching the identifier to stay on the safe side. See Twitter IDs, JSON and Snowflake.
  • id_str – The string representation of the unique identifier for this User. Implementations should use this rather than the large, possibly un-consumable integer in id.
  • lang – The BCP 47 code for the user's self-declared user interface language.
  • listed_count – The number of public lists that this user is a member of.
  • location – The user-defined location for this account's profile. Not necessarily a location nor parseable.
  • name – The name of the user, as they've defined it. Not necessarily a person's name.
  • screen_name – The screen name, handle, or alias that this user identifies themselves with. screen_names are unique but subject to change. Use id_str as a user identifier whenever possible. Typically a maximum of 15 characters long, but some historical accounts may exist with longer names.
  • statuses_count – The number of tweets (including retweets) issued by the user to date.
  • time_zone – A string describing the Time Zone this user declares themselves within.
  • url – A URL provided by the user in association with their profile.
  • withheld_in_countries – When present, indicates a textual representation of the two-letter country codes this user is withheld from. See New Withheld Content Fields in API Responses.
  • withheld_scope – When present, indicates whether the content being withheld is the "status" or a "user." See New Withheld Content Fields in API Responses.

 

Second, beginning in 2013 Twitter made it more difficult to access the API. Now OAuth authentication is needed for almost everything. This means you need to go on Twitter and create an ‘app.’ You won’t actually use the app for anything — you just need the password and authentication code. You can create your app here. For more detailed instructions on creating the app take a look at this presentation.

Third, as a Python ‘wrapper’ around the Twitter API I use Twython. This is a package that is an add-on to Python. You will need to install this as well as simplejson (for parsing the JSON data that is returned by the API). Assuming you installed Anaconda Python, the simplest way is to use pip. On a Mac or Linux machine, you would simply open the Terminal and type pip install Twython and pip install simplejson. 

The above steps can be a bit of a pain depending on your familiarity with UNIX, and they may take you a while, but you only have to do them once; after everything is set up you won’t need to do it again.

Understanding the Code

At the end of this post I’ll show the entire script. For now, I’ll go over it in sections. The first line in the code is the shebang — you’ll find this in all Python code.


 

Lines 3 – 10 contain the docstring — also a Python convention. This is a multi-line comment that describes the code. For single-line comments, use the # symbol at the start of the line.


 

Next we’ll import several Python packages needed to run the code.
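A sketch of what those imports might look like (the exact list depends on the script; these are the packages mentioned in this post):

```python
# Packages used in this walkthrough: Twython talks to the Twitter API,
# simplejson handles the JSON that comes back, and datetime stamps the output file
from twython import Twython
import simplejson
import datetime
```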


 

In lines 18-22 we will create day, month, and year variables to be used for naming the output file.
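Something along these lines (a sketch, not the literal lines 18-22 of the script):

```python
import datetime

# Grab today's date; the pieces go into the output file's name below
now = datetime.datetime.now()
day = now.day
month = now.month
year = now.year
```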


 

Modify the Code

There are two areas you’ll need to modify. First, you’ll need to add your OAuth tokens to lines 26-30.
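In sketch form, the OAuth section amounts to pasting your four credentials into something like this (the variable names here are placeholders; use whatever names your copy of the script expects):

```python
from twython import Twython

# Placeholders: replace with the values from the app you created on Twitter
APP_KEY = 'YOUR_APP_KEY'
APP_SECRET = 'YOUR_APP_SECRET'
OAUTH_TOKEN = 'YOUR_OAUTH_TOKEN'
OAUTH_TOKEN_SECRET = 'YOUR_OAUTH_TOKEN_SECRET'

# Create the connection to the Twitter API
twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
```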


 

Second, you’ll need to modify lines 32-35 with the ids from your set of Twitter users. If you don’t have user_ids for these, you can use screen_names and change line 39 to ‘screen_name = ids’
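For example (the ids below are made-up placeholders, not real accounts):

```python
# Comma-separated string of the Twitter user ids you want to look up
# (up to 100 per API call); replace with your own ids
ids = "1234567890,1234567891,1234567892,1234567893"
```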


 

Line 39 is where we actually access the API and grab the data. If you’ve read over the description of users/lookup API, you know that this method allows you to grab user information on up to 100 Twitter IDs with each API call.
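Continuing from the snippets above, that call looks something like this:

```python
# users/lookup accepts up to 100 user ids (or screen names) per call;
# the result is a list of user objects parsed from the returned JSON
users = twitter.lookup_user(user_id=ids)
print(len(users))
```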


 

Understanding JSON

Now, a key step to this is understanding the data that are returned by the API. As is increasingly common with Web data, this API call returns data in JSON format. Behind the scenes, Python has grabbed this JSON file, which has data on the 32 Twitter users listed above in the variable ids. Each user is an object in the JSON file; objects are delimited by left and right curly braces, as shown here for one of the 32 users:
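Here is a heavily trimmed illustration of what one of those objects looks like once Python has read it in; every value shown is a placeholder rather than real API output:

```python
# One user object from users/lookup, trimmed to a handful of fields
# (values are illustrative placeholders)
entry = {
    "id": 1234567890,
    "id_str": "1234567890",
    "screen_name": "GPforEducation",
    "created_at": "Mon Jan 01 00:00:00 +0000 2010",
    "description": "An example profile description",
    "followers_count": 10000,
    "friends_count": 500,
    "statuses_count": 2500,
}
```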


 

JSON output can get messy, so it’s useful to bookmark a JSON viewer for formatting JSON output. What you’re seeing above is 38 different variables returned by the API — one for each row — and arranged in key: value (or variable: value) pairs. For instance, the value for the screen_name variable for this user is GPforEducation. Now, we do not always want to use all of these variables, so what we’ll do is pick and label those that are most useful for us.

So, we first initialize the output file, putting the day/month/year in the file name, which is useful if you’re regularly downloading this user information:
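In sketch form (using the day, month, and year variables created earlier):

```python
# Date-stamped name for the output file, e.g. twitter_user_data_5_26_2014.txt
outfn = 'twitter_user_data_%s_%s_%s.txt' % (month, day, year)
```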


 

We then create a variable with the names for the variables (columns) we’d like to include in our output file, open the output file, and write the header row:
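A sketch of that step, using a subset of the fields described above (pick whichever columns you need):

```python
# Column names for the output file; a tab-delimited text file is easy to
# open in Excel or read back into Python later
variables = ['id', 'screen_name', 'created_at', 'followers_count',
             'friends_count', 'statuses_count', 'description']

f = open(outfn, 'w')
f.write('\t'.join(variables) + '\n')
```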


 

Recall that in line 39 we grabbed the user information on the 32 users and assigned these data to the variable users. The final block of code in lines 55-90 loops over each of these IDs (each one a different object in the JSON file), creates the relevant variables, and writes a new row of output. Here’s the first few rows:
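Continuing the sketch from above, the heart of that block looks something like this (written for Python 2.7, which these tutorials assume):

```python
# Loop over the user objects returned by users/lookup and write one
# tab-delimited row per user
for entry in users:
    r = {}  # dictionary holding the variables for this user
    r['id'] = entry['id']
    r['screen_name'] = entry['screen_name']
    r['created_at'] = entry['created_at']
    r['followers_count'] = entry['followers_count']
    r['friends_count'] = entry['friends_count']
    r['statuses_count'] = entry['statuses_count']
    # Descriptions can contain newlines or tabs, which would break the rows
    r['description'] = (entry['description'] or '').replace('\n', ' ').replace('\t', ' ')
    row = [unicode(r[v]) for v in variables]
    f.write('\t'.join(row).encode('utf-8') + '\n')

f.close()
```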


 

If you compare this code to the raw JSON output shown earlier, what we’re doing here is creating an empty Python dictionary, which we’ll call ‘r’, to hold our data for each user, creating variables called id and screen_name, and assigning the values held in the entry[‘id’] and entry[‘screen_name’] elements of the JSON output to those two respective variables. This is all placed inside a Python for loop — we could have called ‘entry’ anything so long as we’re consistent.

Now let’s put the whole thing together. To recap, what this entire script does is to loop over each of the Twitter accounts in the ids variable — and for each one it will grab its profile information and add that to a row of the output file (a text file that can be imported into Excel, etc.). The filename given to the output file varies according to the date. Now you can download this script, modify the lines noted above, and be on your way to downloading your own Twitter data!