Python Tutorials for Downloading Twitter Data

I often get requests to explain how I obtained the data I used in a particular piece of academic research. I am always happy to share my code along with my data. Having been through the learning process myself about 5 years ago, I understand the confusion and frustration that can go along with learning a programming language. There are a number of new concepts to take in all at once, which can make it difficult to wrap your head around.

For this reason, I have created a number of tutorials. Below I list all of these tutorials to date. They are designed primarily for the user who is a complete neophyte to computer programming, though intermediate and some advanced users may find some of the actual code helpful. For those of you who are new to Python, I recommend going through these tutorials in order: each introduces only one or two new concepts at a time, which will make the learning process easier.

I should also mention that these tutorials focus on downloading and analyzing Twitter data. For social media researchers this is the best place to start; once you are comfortable with that, it is easier to branch out to studying Facebook and other social media platforms.

Why I Use Python for Academic Research -- Overview of how I use Python for data gathering, data manipulation, and data visualization and analysis.
Python: Where to Start? -- Overview of how to think about using and learning Python, with a recommendation on which version of Python to install.
Running your first code -- Running your first Python code using my preferred installation: Anaconda Python 2.7.
Four ways to run your code -- Shows my preferred methods for running Python code.
Setting up Your Computer to Use My Python Code -- Covers prerequisites for successfully running my Twitter code: additional modules to install, database set-up, etc.
Setting up Access to the Twitter API -- Getting your Twitter API 'library card' (your API keys).
Using Python to Grab Twitter User Data -- The first Twitter code you should run: an introduction to accessing the Twitter API and reading the JSON data it returns. It covers users' Twitter account data (such as when the account was created, how many followers, etc.); follow-up tutorials cover how to download tweets.
Tag Cloud Tutorial -- How to create a tag cloud from a set of downloaded tweets.
Downloading Hashtag Tweets -- How to download tweets with a given hashtag and save them into an SQLite database. Wayne Xu's tutorial.
Downloading Tweets by a List of Users -- Downloading tweets sent by a number of different Twitter users. For those who are new to Python and/or downloading data from the Twitter API. Wayne Xu's tutorial.
Downloading Tweets, Take II -- Downloading tweets sent by a number of different Twitter users: my own, more detailed explanation of SQLite, JSON, and the Twitter API, along with a full copy of the Python code.

I hope this helps. Happy coding!




How to Download Tweets with a Specific Hashtag

While I work on my own version of the tutorial, my excellent co-author and former PhD student Weiai (Wayne) Xu has put together an informative tutorial on how to download tweets with a specific hashtag.

Slides: Five steps to search and store tweets by keywords, by Weiai Wayne Xu.

Wayne's tutorial also contains his version of my Twitter code, so if you follow his excellent instructions you will see it is compatible with the other tutorials I've written.

For more, check out his website.




Setting up Your Computer to Use My Python Code for Downloading Twitter Data

I frequently get requests for help on how to download social media data in general, as well as on how to run the code I have written to download and analyze the data for a particular piece of research. Often, these requests come from people who are excited about doing social media research but have yet to gain much experience in computer programming. For this reason, I have created a set of tutorials designed precisely for such users.

I am always happy to share the code I’ve used in my research. That said, there are barriers to actually using someone else’s code. One of the key barriers is getting your own computer “set up” to actually run the code. The aim of this post is to walk you through the steps needed to run and modify the code I’ve written to download and analyze social media data.

Step One: Download and Install Python

As I write about here, for Unix, Windows, and Mac users alike I'd recommend you install Anaconda Python 2.7. This distribution of Python is free and easy to install. Moreover, it includes most of the add-on packages necessary for scientific computing, including NumPy, pandas, IPython, statsmodels, SQLAlchemy, and Matplotlib.

Go to this tutorial for instructions on how to install and run Anaconda Python.

Step Two: Install Python Add-On Packages

Anaconda Python comes pre-installed with almost everything you need. There are, however, a couple of modules you will have to install manually:

Twython -- for accessing the Twitter data

simplejson -- for parsing the JSON data that is returned by the Twitter API (Application Programming Interface)

Assuming you are using Anaconda Python, the simplest way to install these modules is with pip. On a Mac or Linux machine, simply open the Terminal and type pip install twython and then pip install simplejson. If you're on a PC, please take a look at Wayne Xu's tutorial (see Slide #8).
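
To confirm that both modules installed correctly, open the Python interpreter and try importing them; if neither line raises an ImportError, you are set:

# Both imports succeeding means Twython and simplejson are installed
import twython
import simplejson

print twython.__version__
print simplejson.__version__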

Step Three: The Database

I generally download my Twitter data into an SQLite database. SQLite is a common relational database engine. It is lightweight and easy to use, and support for it (the sqlite3 module) comes preinstalled with Anaconda Python.

You may already know other ways of downloading social media data, such as NodeXL in Excel for Windows. Why then would you want to use SQLite? There are two reasons. First, SQLite is better plugged into the Python architecture, which will come in handy when you are ready to actually manipulate and analyze the data. Second, if you are downloading tweets more than once, a database is the much better solution for a simple reason: it allows you to check for duplicates in the database and stop them from being inserted. This is an almost essential feature that you cannot easily implement outside of a database.
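
To make the duplicate check concrete, here is a minimal sketch; the database name, table layout, and sample values are all hypothetical. Making the tweet id the table's primary key lets SQLite itself refuse duplicate rows:

import sqlite3

# Hypothetical table: the tweet id is the PRIMARY KEY, so the
# database enforces the no-duplicates rule for us
conn = sqlite3.connect('tweets.db')
conn.execute('''CREATE TABLE IF NOT EXISTS tweets
                (tweet_id TEXT PRIMARY KEY,
                 created_at TEXT,
                 tweet_text TEXT)''')

# INSERT OR IGNORE silently skips any row whose tweet_id is
# already in the table, so re-downloading a tweet does no harm
conn.execute('INSERT OR IGNORE INTO tweets VALUES (?, ?, ?)',
             ('123456', 'Sun Nov 23 13:00:00 +0000 2014', 'Hello #arnova14'))
conn.commit()
conn.close()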

Also know that once you have downloaded the data into an SQLite database, you can view and edit the data in the same manner as an Excel file, and even export the data into CSV format for viewing in Excel. To do this, simply download and install DB Browser for SQLite. If you use Firefox, you can alternatively use a plug-in called SQLite Manager.

Step Four: Accessing the Twitter API

Almost all of my Twitter code grabs data from the Twitter API, which sets up procedures for reliably accessing Twitter's data. Beginning in 2013, Twitter made it more difficult to access its APIs; OAuth authentication is now needed for almost everything. This means you need to go on Twitter and create an 'app.' You won't actually use the app for anything; you just need its authentication credentials. You can create your app here. For more detailed instructions on creating the app, take a look at slides 4 through 6 of Wayne Xu's tutorial.
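
Once the app exists, authenticating from Python with Twython looks roughly like this; the four placeholder strings below come from your app's settings page:

from twython import Twython

# Paste in the four credentials from your Twitter app (placeholders here)
APP_KEY = 'YOUR_CONSUMER_KEY'
APP_SECRET = 'YOUR_CONSUMER_SECRET'
OAUTH_TOKEN = 'YOUR_ACCESS_TOKEN'
OAUTH_TOKEN_SECRET = 'YOUR_ACCESS_TOKEN_SECRET'

twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

# A quick test of the connection: grab one account's user data
user = twitter.show_user(screen_name='twitter')
print user['followers_count']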

Step Five: Start Using the Code

Once you’ve completed the above four steps you will have all the necessary building blocks for successfully running my Python scripts for downloading and parsing Twitter data. Happy coding!




Tag Cloud Tutorial

In this post I’ll provide a brief tutorial on how to create a tag cloud, as seen here.

First, this assumes you have downloaded a set of tweets into an SQLite database. If you are using a different database please modify accordingly. Also, to get to this stage, work through the first 8 tutorials listed here (you can skip over the seventh tutorial on downloading tweets by a list of users).

Here is the full code for processing the Twitter data you’ve downloaded so that you can generate a tag cloud:
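
(A minimal sketch of the script. It assumes the tweets sit in a table named tweets inside a database file named tweets.db, with the tweet id in the first column and the hashtags stored as a space-separated string in the sixth column; adjust those names and positions to match your own database. The line numbers referenced in the walkthrough below refer to the lines of this sketch.)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Pull the hashtags from a set of tweets stored in an SQLite
database and write them to a text file for pasting into Wordle.
"""

import sqlite3
import codecs

# Note: the database name, table name, and column positions used
# below are assumptions -- edit them to match your own database.

conn = sqlite3.connect('tweets.db')
cursor = conn.execute('SELECT * FROM tweets')
tweets = cursor.fetchall()
conn.close()

# Dictionary that will hold every hashtag found in the tweets,
# along with the number of times each tag was used
all_text = {}

for tweet in tweets:
    # The two columns we care about; the positions are assumptions
    tweet_id = tweet[0]      # the tweet id column
    hashtags = tweet[5]      # the hashtags column
    if not hashtags:
        continue
    # The hashtags column is assumed to hold a space-separated
    # string of tags, e.g. '#arnova14 #nonprofit'
    for tag in hashtags.split():
        tag = tag.lower()
        if tag in all_text:
            all_text[tag] = all_text[tag] + 1
        else:
            all_text[tag] = 1


# Translate the dictionary into one long string, repeating each
# tag as many times as it was used; Wordle sizes a word by how
# often it appears in the pasted text

words = []
for tag, count in all_text.items():
    words.extend([tag] * count)
codecs.open('hashtags.txt', 'w', 'utf-8').write(' '.join(words))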

Now I’ll try to walk you through the basic steps here. For those of you who are completely new to Python, you should work through some of my other tutorials.

Understanding the Code

The first line in the code above is the shebang, a Unix convention that tells the operating system to run the script with the Python interpreter; you'll find it at the top of most Python scripts.
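
Here is that line from the sketch above:

#!/usr/bin/env python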

Lines 3-6 contain the docstring, also a Python convention. This is a multi-line comment that describes the code. For single-line comments, use the # symbol at the start of the line.
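
The docstring from the sketch above:

"""
Pull the hashtags from a set of tweets stored in an SQLite
database and write them to a text file for pasting into Wordle.
"""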

Next we’ll import several Python packages needed to run the code.
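
From the sketch above:

import sqlite3
import codecs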

In lines 14-16 we create a connection with the SQLite database, make a query to select all of the tweets in the database, and assign the returned tweets to the variable tweets.
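
The corresponding lines from the sketch above (remember, the database and table names are assumptions):

conn = sqlite3.connect('tweets.db')
cursor = conn.execute('SELECT * FROM tweets')
tweets = cursor.fetchall()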

Line 21 creates an empty dictionary, all_text, in which we will place all of the hashtags from each tweet.
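
From the sketch above:

all_text = {}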

In lines 23-36 we loop over each tweet. First we identify the two specific columns in the database we’re interested in (the tweet id and the hashtags column), then add the tags to the all_text variable created earlier.
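
The loop from the sketch above (the column positions are assumptions about your own table):

for tweet in tweets:
    # The two columns we care about; the positions are assumptions
    tweet_id = tweet[0]      # the tweet id column
    hashtags = tweet[5]      # the hashtags column
    if not hashtags:
        continue
    # The hashtags column is assumed to hold a space-separated
    # string of tags, e.g. '#arnova14 #nonprofit'
    for tag in hashtags.split():
        tag = tag.lower()
        if tag in all_text:
            all_text[tag] = all_text[tag] + 1
        else:
            all_text[tag] = 1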

Finally, in lines 43-46 we translate the all_text variable from a dictionary to a string, then output it to a text file.
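
The final lines of the sketch above:

words = []
for tag, count in all_text.items():
    words.extend([tag] * count)
codecs.open('hashtags.txt', 'w', 'utf-8').write(' '.join(words))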

Once you’ve got this text file, open it, copy all of the text, and use it to create your own word cloud on Wordle.

I hope this helps. If you need help with actually getting the tweets into your database, take a look at some of the other tutorials I’ve posted. If you have any questions, let me know, and have fun with the data!




#ARNOVA14 – Tag cloud of ARNOVA 2014 Tweets

Word clouds are generally not all that helpful given how the words are taken out of their context (sentences). In certain settings, however, they do provide meaningful information. Hashtags are one of those contexts — they are meant to be single words. The tags denote ideas or topics or places. By examining the hashtags, we can thus gain an appreciation for the most salient topics ARNOVANs are tweeting about. With that, here is a word cloud generated using all of the hashtags included in all 739 tweets sent with the #arnova14 hashtag as of Sunday, November 23rd @ 1pm EST.

Here is a first cloud, with the #arnova14 tag excluded.

[Tag cloud image: #arnova14 hashtags, with the #arnova14 tag excluded]

The larger the tag, the more frequently it was used. You can see that two tags predominated, #nonprofit and #allianceR2P. To help see other topics, here’s a final tag cloud without these two tags.

[Tag cloud image: the same hashtags, with the #nonprofit and #allianceR2P tags also excluded]

It was also refreshing to see linguistic diversity in the hashtags — especially a large number of Arabic tags. Unfortunately, the word cloud visualizer I used (Wordle) could not recognize these (if anyone knows of a good workaround please let me know).

If anyone is interested in the Python code used to generate this just shoot me an email.

I’ll leave the analysis up to you. Some interesting things here! I hope you all had a great conference and see you next year!