Downloading Tweets, Take III – MongoDB

In this tutorial I walk you through how to use Python and MongoDB to download tweets from a list of Twitter users.

This tutorial builds on several recent posts on how to use Python to download Twitter data. Specifically, in a previous post I showed you how to download tweets using Python and an SQLite database — a type of traditional relational database. More and more people are interested in noSQL databases such as MongoDB, so in a follow-up post I talked about the advantages and disadvantages of using SQLite vs MongoDB to download social media data for research purposes. Today I go into detail about how to actually use MongoDB to download your data and I point out the differences from the SQLite approach along the way.

Overview

This tutorial is directed at those who are new to Python, MongoDB, and/or downloading data from the Twitter API. We will be using Python to download the tweets and will be inserting the tweets into a MongoDB database. This code will allow you to download up to the latest 3,200 tweets sent by each Twitter user. I will not go over the script line-by-line but will instead attempt to provide you a ‘high-level’ understanding of what we are doing — just enough so that you can run the script successfully yourself.

Before running this script, you will need to:

  • Have Anaconda Python 2.7 installed
  • Have your Twitter API details handy
  • Have MongoDB installed and running
  • Have created a CSV file (e.g., in Excel) containing the Twitter handles you wish to download. Below is a sample you can download and use for this tutorial. Name it accounts.csv and place it in the same directory as the Python script.

https://gist.github.com/gdsaxton/1825a5d455e61732eff69dc8cc17dd59

If you are completely new to Python and the Twitter API, you should first make your way through the following tutorials, which will help you get set up and working with Python:

Another detailed tutorial I have created, Python Code Tutorial, is intended to serve as an introduction to how to access the Twitter API and then read the JSON data that is returned. It will be helpful for understanding what we’re doing in this script.

Also, if you are not sure you want to use MongoDB as your database, take a look at this post, which covers the advantages and disadvantages of using SQLite vs MongoDB to download social media data. As noted in that post, MongoDB has a more detailed installation process.

At the end of this post I’ll show the entire script. For now, I’ll go over it in sections. The code is divided into seven parts:

Part I: Importing Necessary Python Packages

The first line in the code is the shebang; you'll find this at the top of most Python scripts.

Lines 3 – 23 contain the docstring — also a Python convention. This is a multi-line comment that describes the code. For single-line comments, use the # symbol at the start of the line.

In lines 26 – 31 we'll import some Python packages needed to run the code. Twython can be installed by opening your Terminal and entering pip install Twython. For more details on this process see this blog post.
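To make this concrete, the opening of such a script might look something like the sketch below. The docstring text and the exact list of packages are illustrative; your own script's imports may differ slightly.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Download up to 3,200 of the most recent tweets for each Twitter handle
listed in accounts.csv and insert them into a MongoDB database.
"""

# Single-line comments use the # symbol, like this.
import csv                       # read the accounts.csv file
import json                      # work with the JSON the Twitter API returns
import time                      # pause the script when nearing the API rate limit
from datetime import datetime    # timestamp each tweet as it is inserted

from twython import Twython      # pip install Twython
from pymongo import MongoClient  # pip install pymongo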

Part II: Import Twython and Twitter App Key and Access Token

Lines 37-42 are where you will enter your Twitter App Key and Access Token (lines 40-41). If you have yet to do this, you can refer to the tutorial on Setting up access to the Twitter API.
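In sketch form, and assuming the two-credential 'application-only' style of authentication the script uses (an App Key plus an Access Token), this step looks roughly like the following. The placeholder strings are hypothetical; paste in your own values.

from twython import Twython

APP_KEY = 'YOUR_APP_KEY'            # hypothetical placeholder
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN'  # hypothetical placeholder

# Create an authenticated Twython instance we can use to query the API.
twitter = Twython(APP_KEY, access_token=ACCESS_TOKEN)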

Part III: Define a Function for Getting Twitter Data

In this block of code we are creating a Python function. The function sets up which part of the Twitter API we wish to access (specifically, it is the get user timeline API), the number of tweets we want to get per page (I have chosen the maximum of 200), and whether we want to include retweets. We will call this function later on in the code.
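A stripped-down sketch of such a function is shown below. The function name matches the one used later in the script; the page-based pagination argument and the error handling are my own assumptions about how the details are filled in.

def get_data_user_timeline_all_pages(twitter, handle, page_number):
    """Return one page (up to 200 tweets) from a user's timeline."""
    try:
        return twitter.get_user_timeline(
            screen_name=handle,     # the Twitter handle we are downloading
            count=200,              # maximum number of tweets per page
            include_rts='true',     # include retweets
            page=page_number,       # which page of the timeline to fetch
        )
    except Exception as e:
        print('Error downloading timeline for %s: %s' % (handle, e))
        return None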

Part IV: Set up MongoDB Database and Collections (Tables)

Lines 72-111 are where you set up your MongoDB database and ‘collections’ (tables).

This is where you'll see the first major differences from an SQLite implementation of this code. First, unlike SQLite, you will need to make sure MongoDB is running by typing mongod or sudo mongod in the terminal. So, that's one extra step you have to take with MongoDB. If you're running the code on a machine that is running 24/7, that is no issue; if not, you'll just have to remember.

There is a big benefit to MongoDB here, however. Unlike with the SQLite implementation, there is no need to pre-define every column in our database tables. As you can see in the SQLite version, we devoted 170 lines of code to defining and naming database columns.

Below, in contrast, we are simply making a connection to MongoDB, creating our database, then our database tables, then indexes on those tables. Note that, if this is the first time you’re running this code, the database and tables and indexes will be created; if not, the code will simply access the database and tables. Note also that MongoDB refers to database tables as ‘collections’ and refers to columns or variables as ‘fields.’

One thing that is similar to the SQLite version is that we are setting indexes on our database tables. This means that no two tweets with the same index value — the tweet’s ID string (id_str) — can be inserted into our database. This is to avoid duplicate entries.

One last point: we are setting up two tables, one for the tweets and one to hold the Twitter account names for which we wish to download tweets.
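Boiled down, the set-up looks something like this sketch (the database name is illustrative; the indexed field names follow the text above):

from pymongo import MongoClient

client = MongoClient()        # assumes MongoDB (mongod) is already running locally
db = client['twitter']        # database is created on first use if it does not exist

tweets = db['tweets']         # collection (table) for the downloaded tweets
accounts = db['accounts']     # collection (table) for the Twitter handles

# Unique indexes: no two documents may share the same value in these fields.
tweets.create_index('id_str', unique=True)
accounts.create_index('Twitter_handle', unique=True)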

Part V: Read in Twitter Accounts (and add to MongoDB database if first run)

In lines 117-139 we create a Python list of the Twitter handles for which we want to download tweets. The first part of this block (lines 119-130) checks whether this is the first time you're running the code. If so, it will read the Twitter handle data from your local CSV file and insert it into the accounts table in your MongoDB database. In all subsequent runs the script will skip over this block and go directly to line 137, which creates a list called twitter_accounts that we'll loop over in Part VI of the code.
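A rough sketch of that logic follows. It assumes accounts.csv has a header column named Twitter_handle (check the sample file), and the count() call would be count_documents({}) in newer versions of pymongo.

import csv

# First run only: load the handles from accounts.csv into the accounts collection.
if accounts.count() == 0:
    with open('accounts.csv', 'rb') as f:          # 'rb' for Python 2.7's csv module
        for row in csv.DictReader(f):
            accounts.insert_one({'Twitter_handle': row['Twitter_handle']})

# Build the list of handles we will loop over in Part VI.
twitter_accounts = [a['Twitter_handle'] for a in accounts.find()]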

Part VI: Main Loop: Loop Over Each of the Twitter Handles in the Accounts Table and Download Tweets

In lines 144-244 we are at the last important step.

This block of code is also much shorter than its SQLite counterpart. As noted in my previous post comparing SQLite to MongoDB, in MongoDB we do not need to define all of the columns we wish to insert into our database. MongoDB will just take whatever columns you throw at it and insert them. In the SQLite version, in contrast, we had to devote 290 lines of code just to specify which parts of the Twitter data we were grabbing and how they relate to our pre-defined variable names.

After stripping out all of those details, the core of this code is the same as in the SQLite version. At line 151 we begin a for loop where we are looping over each Twitter ID (as indicated by the Twitter_handle variable in our accounts database).

Note that within this for loop we have a while loop (lines 166-238). What we are doing here is, for each Twitter ID, grabbing up to 16 pages' worth of tweets; this is the maximum allowed by the Twitter API. It is in this loop (line 170) that we call our get_data_user_timeline_all_pages function, which on the first pass will grab page 1 for the Twitter ID, then page 2, then page 3, and so on, up to page 16 so long as there are data to return.

Lines 186-205 contain the code for writing the data into our MongoDB database table. We have defined our variable d to contain the result of calling our get_data_user_timeline_all_pages function, which means that, if successful, d will contain up to 200 tweets' worth of data. The for loop starting on line 187 will loop over each tweet, add three variables to each one (date_inserted, time_date_inserted, and screen_name), and then insert the tweet into our tweets collection.
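Pulling the pieces from the sketches above together, the core of the main loop looks roughly like this. The date formats and the DuplicateKeyError handling are assumptions that stand in for the details of the full script.

from datetime import datetime
from pymongo.errors import DuplicateKeyError

for handle in twitter_accounts:
    page = 1
    while page <= 16:                                  # 16 pages x 200 tweets = 3,200 maximum
        d = get_data_user_timeline_all_pages(twitter, handle, page)
        if not d:                                      # no data returned; move to the next handle
            break
        for tweet in d:
            tweet['date_inserted'] = datetime.now().strftime('%d/%m/%Y')
            tweet['time_date_inserted'] = datetime.now().strftime('%H:%M:%S_%d/%m/%Y')
            tweet['screen_name'] = handle
            try:
                tweets.insert_one(tweet)               # write the tweet to the tweets collection
            except DuplicateKeyError:
                pass                                   # tweet is already in the database; skip it
        page += 1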

One last thing I’d like to point out here is the API limit checks I’ve written in lines 221-238. What this code is doing is checking how many remaining API calls you have. If it is too low, the code will pause for 5 minutes.
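A hedged sketch of that kind of check is below. I am using Twython's get_application_rate_limit_status endpoint here; the threshold of five remaining calls is illustrative.

import time

rate_limit = twitter.get_application_rate_limit_status()
remaining = rate_limit['resources']['statuses']['/statuses/user_timeline']['remaining']

if remaining < 5:                     # illustrative threshold
    print('Nearing the API rate limit; pausing for 5 minutes')
    time.sleep(300)                   # 300 seconds = 5 minutes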

Part VII: Print out Number of Tweets in Database per Account

This final block of code will print out a summary of how many tweets there are per account in your tweets database.
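In sketch form, that summary can be produced with a simple loop over the accounts (the cursor's count() method is replaced by count_documents() in newer versions of pymongo):

for handle in twitter_accounts:
    n = tweets.find({'screen_name': handle}).count()
    print('%s: %s tweets in the database' % (handle, n))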

Now let’s put the whole thing together. To recap, what this entire script does is to loop over each of the Twitter accounts in the accounts table of your MongoDB database — and for each one it will grab up to 3,200 tweets and insert the tweets into the tweets table of your database.

Below is the entire script — download it and save it as tweets.py (or something similar) in the same directory as your accounts.csv file. Add in your Twitter API account details and you’ll be good to go! For a refresher on the different ways you can run the script see this earlier post.

If you’ve found this post helpful please share on your favorite social media site.

You’re on your way to downloading your own Twitter data! Happy coding!

https://gist.github.com/gdsaxton/0702e7c716e01c0306a3321428b7a79a




SQLite vs. MongoDB for Big Data

In my latest tutorial I walked readers through a Python script designed to download tweets by a set of Twitter users and insert them into an SQLite database. In this post I will provide my own thoughts on the pros and cons of using a relational database such as SQLite vs. a “noSQL” database such as MongoDB. These are my two go-to databases for downloading and managing Big Data and there are definite advantages and disadvantages to each.

The caveat is that this discussion is for researchers. Businesses will almost definitely not want to use SQLite for anything but simple applications.

The Pros and Cons of SQLite

SQLite has a lot going for it. I much prefer SQLite over, say, MySQL. SQLite is the easiest of all the relational databases to work with. Accordingly, for someone gathering data for research SQLite is a great option.

For one thing, it is pre-installed when you install Anaconda Python (my recommended installation). There's none of the typical set-up that comes with a MySQL installation, such as creating users and passwords. With Anaconda Python you're good to go.

Moreover, SQLite is portable. Everything is contained in a single file that can be moved around your own computer or shared with others. There’s nothing complicated about it. Your SQLite database is just a regular file. Not so with MySQL, for instance, which would need to be installed separately, have user permissions set up, etc., and is definitely not so readily portable.

So, what’s the downside? Two things. One, there is the set-up. To get the most out of your SQLite database, you need to pre-define every column (variable) you’re going to use in the database. Every tweet, for instance, will need to have the exact same variables or else your code will break. For an example of this see my recent tutorial on downloading tweets into an SQLite database.

The other shortcoming flows from the pre-defining process. Some social media platforms, such as Twitter, have relatively stable APIs, which means you access the same variables the same way year in and year out. Other platforms, though (that’s you, Facebook), seem to change their API constantly, which means your code to insert Facebook posts into your SQLite database will also constantly break.

Here’s a screenshot of what your SQLite database might look like:

As you can see, it's set up like a typical flat database, such as an Excel spreadsheet or a PANDAS or R dataframe. The columns are all pre-defined.

The Pros and Cons of MongoDB

The SQLite approach contrasts starkly with the “noSQL” approach represented by MongoDB. A primary benefit is that MongoDB is tailor-made for inserting the types of data returned by a social media platform’s API — particularly JSON.

For instance, the Twitter API returns a JSON object for each tweet. In a prior tutorial I provide an overview of this. The code block below shows the first five lines of JSON (one line per variable) for a typical tweet object returned by the Twitter API:

{
"_id" : ObjectId("595a71173ffc5a01d8f27de7"),
"contributors" : null,
"quoted_status_id" : NumberLong(880805966375202816),
"text" : "RT @FL_Bar_Found: Thank you for your support, Stephanie! https://t.co/2vxXe3VnTU",
"time_date_inserted" : "12:30:15_03/07/2017",
...
}

The full JSON for a single tweet object runs to 416 lines (one line per variable).

Here is where MongoDB excels. All we need to do is grab the tweet object and tell MongoDB to insert it into our database. Do you have different columns in each tweet? MongoDB doesn’t care — it will just take whatever JSON you throw at it and insert it into your database. So if you are working with JSON objects that have different variables or different numbers of columns — or if Facebook changes its API again — you will not need to update your code and your script will not break because of it.

Here’s a screenshot of what the first 40 objects (tweets) in your MongoDB database might look like. You can see that the number of fields (variables) is not the same for each tweet — some have 29, some have 30, or 31, or 32:

And here’s what the first tweet looks like after expanding the first object:

As you can see, it looks like the JSON object returned by the Twitter API.

In effect, MongoDB is great in situations where you would like to grab all of the available data and quickly throw it into a database. The downside of this approach is that you will have to do the defining of your data later, before you can analyze it. I find this less and less problematic, however, since PANDAS came around. I would much rather extract my data from MongoDB (one line of code) and do my data and variable manipulations in PANDAS than mess around with SQLAlchemy before even downloading the data into an SQLite database.
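For the curious, that 'one line of code' looks something like the sketch below, where tweets is a pymongo collection object such as the one created in the tutorial above.

import pandas as pd

# Pull every document in the MongoDB collection into a PANDAS dataframe.
df = pd.DataFrame(list(tweets.find()))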

A final benefit of MongoDB is its scalability. You have 10 million tweets to download? What about 100 million? No issues with MongoDB. With SQLite, in contrast, let’s say 1 million tweets would be a good upper limit before performance drags considerably.

MongoDB does have its downsides, though. Much like MySQL, MongoDB needs to be 'running' before you insert data into it. If your server is running 24/7 that is no issue. Otherwise you'll have to remember to restart your MongoDB server each time you want to either insert data into your database or extract data you've already inserted. MongoDB also has higher 'start-up' costs; it is not as easy to install as SQLite and you may run into disk permission issues, username and password issues, etc. Cross your fingers and it will only take you half an hour, once, and then you're good to go from then on.

Finally, a MongoDB database is not a single 'file' the way an SQLite database is. This makes moving or sharing your database more troublesome; not terribly onerous, but a few extra steps. Again, if you are importing your MongoDB database into PANDAS and then using PANDAS for your data manipulations, this should not be an issue. You can easily share or move your PANDAS dataframes or export them to CSV or Excel.

Summary

Here is a summary of the pros and cons of SQLite and MongoDB for use as a Big Data-downloading database.

Portability
    SQLite: Easy to share/move an SQLite database.
    MongoDB: Considerably more complicated. May not be an issue for you if your work process is to export your data into PANDAS.

Ease of use
    SQLite: SQLite is simple. The database is just a single file that does not need to be 'running' 24/7.
    MongoDB: More complicated than SQLite. The MongoDB server needs to be running before your Python script can insert the data.

Ease of set-up
    SQLite: Very easy. If you have installed Anaconda Python you are good to go.
    MongoDB: Considerably more complicated set-up, but it is a one-time process. If you are lucky or are comfortable with the Terminal this one-time set-up should not take more than an hour.

Scalability
    SQLite: Beyond a certain limit your SQLite database will become unwieldy. I've had up to a million tweets without too much difficulty, however.
    MongoDB: Can be as small or as big as you'd like.

Setting up code to insert tweets
    SQLite: Needs to be detailed. Every column needs to be defined in your code and accounted for in each tweet.
    MongoDB: Easy. MongoDB will take whatever JSON Twitter throws at it and insert it into the database.

Robustness to API changes
    SQLite: Not robust. The Facebook API, for instance, changes almost constantly. Your database code will have to be updated each time or it will break when it tries to insert into SQLite.
    MongoDB: Extremely robust. MongoDB will take whatever JSON you throw at it and insert it into the database.

If you’ve found this post helpful please share on your favorite social media site.

In my next post I will provide a tutorial on how to download tweets into a MongoDB database. Until then, happy coding!




Downloading Tweets – Take II

The goal of this post is to walk you through a Python script designed to download tweets by a set of Twitter users and insert them into an SQLite database.

In a previous post I offered a brief, preliminary overview of how to download tweets sent by a list of Twitter users, but I ended that post by pointing people to a good tutorial written by my former PhD student and now co-author Wayne Xu.

Now I am finally getting around to posting my own tutorial on how to download tweets sent by a list of different Twitter users. It is directed at those who are new to Python and/or downloading data from the Twitter API. We will be using Python to download the tweets and will be inserting the tweets into an SQLite database. This code will allow you to download up to the latest 3,200 tweets sent by each Twitter user. I will not go over the script line-by-line but will instead attempt to provide you a ‘high-level’ understanding of what we are doing — just enough so that you can run the script successfully yourself.

Before running this script, you will need to:

  • Have Anaconda Python 2.7 installed
  • Have your Twitter API details handy
  • Have created a CSV file (e.g., in Excel) containing the Twitter handles you wish to download. Below is a sample you can download and use for this tutorial. Name it accounts.csv and place it in the same directory as the Python script.

https://gist.github.com/gdsaxton/1825a5d455e61732eff69dc8cc17dd59

If you are completely new to Python and the Twitter API, you should first make your way through the following tutorials, which will help you get set up and working with Python:

Another detailed tutorial I have created, Python Code Tutorial, is intended to serve as an introduction to how to access the Twitter API and then read the JSON data that is returned. It will be helpful for understanding what we’re doing in this script.

At the end of this post I’ll show the entire script. For now, I’ll go over it in sections. The code is divided into six parts:

Part I: Overview and Importing Necessary Python Packages

The first line in the code is the shebang; you'll find this at the top of most Python scripts.

Lines 4 – 24 contain the docstring — also a Python convention. This is a multi-line comment that describes the code. For single-line comments, use the # symbol at the start of the line.

In lines 29 – 44 we'll import some Python packages needed to run the code. Twython and simplejson can be installed by opening your Terminal and entering pip install simplejson and then pip install Twython. For more details on this process see this blog post.

Part II: Import Twython and Twitter App Key and Access Token

Lines 50-55 are where you will enter your Twitter App Key and Access Token (lines 53-54). If you have yet to do this, you can refer to the tutorial on Setting up access to the Twitter API.

Part III: Define a Function for Getting Twitter Data

In this block of code we are creating a Python function. The function sets up which part of the Twitter API we wish to access (specifically, it is the get user timeline API), the number of tweets we want to get per page (I have chosen the maximum of 200), and whether we want to include retweets. We will call this function later on in the code.

Part IV: Set up SQLite Database Tables

Lines 85-255 are where you set up columns for the database to allow use of SQLite and SQLAlchemy. For the reasons why we’re using SQLite you can refer to this post.

This is a very long, mechanical block of code. We are defining two tables for our SQLite database — one for the tweets and one for the Twitter account names we want to download. Within each table we are defining variable names and variable types. That is, we are naming every variable (column) we wish to create and saying whether it is a string variable (i.e., text) or an integer variable.

Why do we need to do this? We don’t actually have to go into such detail, but doing this ‘set-up’ work now will make managing and analyzing the data later easier. When I use SQLite I specifically like to use SQLAlchemy (we imported it earlier), which makes working with an SQLite database even easier. Though it is long, it is a standard piece of code that you can cut-and-paste into any tweet download script. I usually split this block of code into a separate file and then import it, but I’m including all code here in a single script to make it easier to understand all of the moving parts.
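To give a flavor of that block, here is a heavily abridged sketch using SQLAlchemy's declarative syntax. The real script defines far more columns; only a handful of the variables discussed in this post are shown, and the database file name is illustrative.

from sqlalchemy import create_engine, Column, Integer, String, UniqueConstraint
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Tweet(Base):
    __tablename__ = 'TWEET'
    id = Column(Integer, primary_key=True)
    tweet_id_str = Column(String)                          # the tweet's unique ID string
    screen_name = Column(String)
    text = Column(String)
    num_characters = Column(Integer)
    __table_args__ = (UniqueConstraint('tweet_id_str'),)   # no duplicate tweets allowed

class Account(Base):
    __tablename__ = 'ACCOUNT'
    id = Column(Integer, primary_key=True)
    Twitter_handle = Column(String, unique=True)

engine = create_engine('sqlite:///tweets.db')              # the SQLite database is a single file
Base.metadata.create_all(engine)                           # create the tables if they don't exist
Session = sessionmaker(bind=engine)
session = Session()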

Part V: Write function for parsing and storing data returned by Twitter API

In lines 263-554 we define a long function that writes the data to the SQLite database. At this point in the code we have still not downloaded any data; we are just setting up a function that we'll call later.

I won’t go into all of the details of what we’re doing here. The ‘big picture’ is that with this function we are looping over all 200 tweets in each page. For each tweet, we are grabbing certain bits of information returned by the Twitter API and then assigning those bits of information to the database variables we set up earlier.

Now, a key step to this is understanding the data that are returned by the API. As is increasingly common with Web data, this API call returns data in JSON format. Behind the scenes, Python has grabbed a JSON file, which will have data on 200 tweets (one page of tweets). Each tweet is an object in the JSON file, and we are looping over each object. If you’re interested in more details, the code we’re using here plugs into the user_timeline part of the Twitter API; follow the link for a description of what the API does. You can also go here to see a list of the definitions for the variables returned by the API. To understand how this relates to JSON, you can take a look at my earlier post on downloading Twitter user data.

I’ll give you two examples here: In line 279 we are taking the data included in the object’s [‘id’] variable and assigning that to a variable called tweet_id. In line 358 we are creating a variable called num_characters whose value is the number of characters in the tweet. We are doing this for every variable we have set up in Part IV earlier.

After we’ve created these variables, we now need to write them to our database. In lines 534-547 we tell our SQLite database that we want to update our TWEET database and which variables to include. You’ll see these are the same variable names we used in Part IV. Line 549 contains the command to add the tweet data to our database.

Here comes a key part: lines 550-554 contain a try…except block designed to catch duplicate tweets. Simply put, if the tweet already exists in the database the script will skip over it. This is really important and one of the best reasons to use a database for downloading tweets. If you wanted, you could simply download your tweets to an Excel spreadsheet. However, each time you ran the script you would likely be downloading a truckload of duplicates. You don't want that.

The key to this lies in how we set up our TWEET database earlier; specifically, line 99 contains a unique constraint.

There can only be one entry with a given tweet_id_str value. Given that every tweet has a unique tweet ID, this is the best variable on which to make a unique constraint.
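Putting the parsing and the duplicate check together, the heart of the write function looks roughly like the sketch below. It assumes the Tweet class and session from the SQLAlchemy sketch above and extracts only a few of the many variables the full script grabs.

from sqlalchemy.exc import IntegrityError

def write_data(tweets_page, handle, session):
    for t in tweets_page:                         # loop over the (up to) 200 tweets in the page
        row = Tweet(
            tweet_id_str=t['id_str'],             # the tweet's unique ID string
            screen_name=handle,
            text=t['text'],
            num_characters=len(t['text']),        # e.g., the num_characters variable
        )
        session.add(row)
        try:
            session.commit()                      # the unique constraint rejects duplicates
        except IntegrityError:
            session.rollback()                    # tweet already in the database; skip it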

Part VI: Main Loop: Loop over each of the Twitter handles in the Accounts table and Download Tweets.

In lines 560-627 we are at the last step. In this block the first important thing we do is name our SQLite database on line 579.

Then in line 571 we query the ACCOUNT table in our database and assign the result to a variable all_ids. This will work the second time we run the script and on all subsequent runs.

The first time we run the script, though, our ACCOUNT table will be empty. Lines 576-583 check whether it is empty and, if so, will grab the details from our accounts.csv file and insert it into our database.

At line 586 we then begin a for loop — we will be looping over each Twitter ID (as indicated by the Twitter_handle variable in our ACCOUNT database).

Note that within this for loop we have a while loop (lines 596-618). What we are doing here is, for each Twitter ID, grabbing up to 16 pages' worth of tweets; this is the maximum allowed by the Twitter API. It is in this loop that we actually call our functions. On line 598 we invoke our get_data_user_timeline_all_pages function, which on the first pass will grab page 1 for the Twitter ID, then page 2, then page 3, and so on, up to page 16 so long as there are data to return. Line 606 invokes the write_data function for each page.
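Schematically, the nested loops boil down to something like this sketch, where all_ids is the result of the ACCOUNT query described above and the helper functions are the ones defined earlier; the attribute name Twitter_handle follows the text, and everything else is illustrative.

for account in all_ids:                    # one row per Twitter account in the ACCOUNT table
    handle = account.Twitter_handle
    page = 1
    while page <= 16:                      # 16 pages x 200 tweets = the 3,200-tweet maximum
        d = get_data_user_timeline_all_pages(twitter, handle, page)
        if not d:
            break
        write_data(d, handle, session)     # parse the page and write it to the SQLite database
        page += 1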

Now let’s put the whole thing together. To recap, what this entire script does is to loop over each of the Twitter accounts in the Accounts table of our database — and for each one it will grab up to 3,200 tweets and insert the tweets into an SQLite database.

Below is the entire script — download it and save it as tweets.py (or something similar) in the same directory as your accounts.csv file. Add in your Twitter API account details and you’ll be good to go! For a refresher on the different ways you can run the script see this earlier post.

If you’ve found this post helpful please share on your favorite social media site.

You’re on your way to downloading your own Twitter data! Happy coding!

https://gist.github.com/gdsaxton/b0d36c10bbdb80e26b692a1d1a3e11de




Setting up Access to the Twitter API

[Screenshot: Twitter's Search API documentation page]

The Twitter API (application programming interface) is your gateway to accessing Twitter data. The image above shows a screenshot of Twitter’s Search API, just one of the key parts of the API you might be interested in. To access any of them you’ll need to have a password. So, in this post I’m going to walk you through getting access to the Twitter API. By the end you’ll have a password that you’ll use in your Python code to access Twitter data.

Sign up for a Twitter Account and Create an App

Most social media platforms follow a similar set of steps that you’ll go through here: you’ll go to the Developer page, set up an ‘app’, and then generate a set of passwords that grant you access to the API.

On Twitter, the first thing you'll have to do is have a Twitter account. Once you have that, go to Twitter's 'developer' page: https://dev.twitter.com. Once you're logged into the developer page you'll then have to create an 'app'. Click on the My Apps link or go directly to: https://apps.twitter.com. This will take you to the following screen, where you can see I have already created three apps. Click on 'Create New App'.

[Screenshot: the Twitter developer page listing existing apps, with the 'Create New App' button]

Create your App

You’ll then be taken to the screen shown in the following image. You might be wondering why it’s called an ‘app’ and why you have to create one. The short answer is that Twitter and other social media platforms allow access to their data mainly for developers, or people creating apps that interact with the Twitter data. Academics and researchers are not the main targets but we access the data the same way.

You’ll need to fill in three things as shown in the image. For the ‘Name’ just put in anything you want — I chose ‘ARNOVA2016’ here. As long as it makes sense to you you’re fine. You’ll also have to put in a brief description. Here I typically put in something about academic or not-for-profit research. Finally, you’ll put in a website address (hopefully you have something you can use) and click ‘Create your Twitter Application.’

[Screenshot: the app creation form]

Successfully Created App

You’ll then be taken to the following screen, which indicates a successfully created app. Click on Keys and Access Tokens:

[Screenshot: the confirmation page for the newly created app, with the 'Keys and Access Tokens' tab]

Generate Access Tokens

On this screen you’ll see the first two parts of your four-part password: the API KEY and the API SECRET. You still need to generate the final two parts, so click on ‘Regenerate Consumer Key and Secret.’

[Screenshot: the 'Keys and Access Tokens' page showing the API KEY and API SECRET]

Copy the Four Parts of Your Password

You’ll then be taken to the final page as shown in the image below. You now have all four parts to your ‘password’ to accessing the Twitter API: the API KEY, the API SECRET, the ACCESS TOKEN, and the ACCESS SECRET (I’ve pixelated or obscured mine here). Keep these in a safe place — you’ll be using them in any code in which you want to access the Twitter API.

[Screenshot: the generated access tokens page showing all four credentials]

You’re done! You now have your Twitter API library card and are ready to go hunting for data. In an upcoming post I’ll show you how to actually use your password to access the data.
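As a quick preview of where those four values end up, here is a minimal sketch using the Twython package; the placeholder strings are hypothetical.

from twython import Twython

API_KEY = 'YOUR_API_KEY'                # hypothetical placeholders --
API_SECRET = 'YOUR_API_SECRET'          # paste in your own four values
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN'
ACCESS_SECRET = 'YOUR_ACCESS_SECRET'

twitter = Twython(API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
print(twitter.verify_credentials()['screen_name'])   # quick sanity check that the keys work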




Analyzing Big Data with Python PANDAS

This is a series of iPython notebooks for analyzing Big Data — specifically Twitter data — using Python’s powerful PANDAS (Python Data Analysis) library. Through these tutorials I’ll walk you through how to analyze your raw social media data using a typical social science approach.

The target audience is those who are interested in covering key steps involved in taking a social media dataset and moving it through the stages needed to deliver a valuable research product. I’ll show you how to import your data, aggregate tweets by organization and by time, how to analyze hashtags, how to create new variables, how to produce a summary statistics table for publication, how to analyze audience reaction (e.g., # of retweets) and, finally, how to run a logistic regression to test your hypotheses. Collectively, these tutorials cover essential steps needed to move from the data collection to the research product stage.

Prerequisites

I’ve put these tutorials in a GitHub repository called PANDAS. For these tutorials I am assuming you have already downloaded some data and are now ready to begin examining it. In the first notebook I will show you how to set up your ipython working environment and import the Twitter data we have downloaded. If you are new to Python, you may wish to go through a series of tutorials I have created in order.

If you want to skip the data download and just use the sample data, but don’t yet have Python set up on your computer, you may wish to go through the tutorial “Setting up Your Computer to Use My Python Code”.

Also note that we are using the iPython notebook interactive computing framework for running the code in this tutorial. If you’re unfamiliar with this see this tutorial “Four Ways to Run your Code”.

For a more general set of PANDAS notebook tutorials, I’d recommend this cookbook by Julia Evans. I also have a growing list of “recipes” that contains frequently used PANDAS commands.

As you may know from my other tutorials, I am a big fan of the free Anaconda version of Python 2.7. It contains all of the prerequisites you need and will save you a lot of headaches getting your system set up.

Chapters:

At the GitHub site you’ll find the following chapters in the tutorial set:

Chapter 1 – Import Data, Select Cases and Variables, Save DataFrame.ipynb
Chapter 2 – Aggregating and Analyzing Data by Twitter Account.ipynb
Chapter 3 – Analyzing Twitter Data by Time Period.ipynb
Chapter 4 – Analyzing Hashtags.ipynb
Chapter 5 – Generating New Variables.ipynb
Chapter 6 – Producing a Summary Statistics Table for Publication.ipynb
Chapter 7 – Analyzing Audience Reaction on Twitter.ipynb
Chapter 8 – Running, Interpreting, and Outputting Logistic Regression.ipynb

I hope you find these tutorials helpful; please acknowledge the source in your own research papers if you’ve found them useful:

    Saxton, Gregory D. (2015). Analyzing Big Data with Python. Buffalo, NY: http://social-metrics.org

Also, please share and spread the word to help build a vibrant community of PANDAS users.

Happy coding!




Do I Need to Learn Programming to Download Big Data?


You want to download and analyze “Big Data” — such as messages or network data from Twitter or Facebook or Instagram. But you’ve never done it before, and you’re wondering, “Do I need to learn computer programming?” Here are some decision rules, laid out in the form of brief case studies.

One-Shot Download with Limited Analysis

Let’s say you have one organization you’re interested in studying on Twitter and want to download all of its tweets. You are doing only basic analyses in a spreadsheet like Excel. In this case, if you have a PC, you can likely get away with something like NodeXL — an add-on to Excel. VERDICT: COMPUTER PROGRAMMING LIKELY NOT NECESSARY

One-Shot Download with Analysis in Other Software

Let’s start with the same data needs as above: a one-shot download from one (or several) organizations on Twitter. You wish to undertake extensive analyses of the data but can rely on some other software to handle the heavy lifting — maybe a qualitative analysis tool such as ATLAS or statistical software such as SAS, R, or Stata. Each of those tools has its own programming capabilities, so if you’re proficient in one of those tools — and your data-gathering needs are relatively straightforward — you might be able to get away with not learning programming. VERDICT: COMPUTER PROGRAMMING MAY BE UNNECESSARY

Anything Else

In almost any other situation, I would recommend learning a programming language. Why is this necessary? Take one case: let's say you wish to download tweets for a given hashtag over the course of an event. In this case you'll want to use a database, even a simple database like SQLite, to prevent duplicates from being downloaded. The programming language, meanwhile, helps you download the tweets and "talk" to the database. In short, if you are downloading tweets more than once for the same sample of organizations, you should probably jump to learning a programming language. Similarly, if you have any need at all for manipulating the data you download (merging, annotating, reformulating, adding new variables, collapsing by time or organization, etc.), then a programming language becomes highly desirable. Finally, if you have any interest in or need of medium- to advanced-level analysis of the data, then a programming language is similarly highly desirable. VERDICT: PICK A PROGRAMMING LANGUAGE AND LEARN IT

Conclusion

Not everyone needs to learn a programming language to accomplish their social media data downloading objectives. If your needs fall into one of the simple cases noted above then you may wish to skip it and focus on other things. On the other hand, if you are going to be doing data downloads again in the future, or if you have anything beyond basic downloading needs, or if you want to tap into sophisticated data manipulation and data analysis capabilities, then you should seriously consider learning to program.

Learning a programming language is a challenge. Of that there is little doubt. Yet the payoff in improved productivity alone can be substantial. Add to that the powerful analytical and data visualization capabilities that open up to the researcher who is skilled in a programming language. Lastly, leaving aside the buzzword “Big Data,” programming opens up a world of new data found on websites, social media platforms, and online data repositories. I would thus go so far as to say that any researcher interested in social media is doing themselves a great disservice by not learning some programming. For this very reason, one of my goals on this site is to provide guidance to those who are interested in getting up and running on Python for conducting academic and social media research. If you are a beginner, I’d recommend you work through the tutorials listed here in order.




Levels of Analysis in Big Data


So you want to download “Big Data.” You could be a social scientist wanting to take your first stab at downloading and analyzing 100 organizations’ worth of tweets. Or a marketing or public relations practitioner interested in analyzing Facebook, Instagram, Pinterest, or YouTube activity by your competitors. Or a budding data scientist interested in getting your toes wet and doing your first Big Data download.

This post is the first in a series designed to help you understand at a conceptual level the main moving parts you’ll have to grasp in order to successfully get the data you need. This one deals with levels of analysis in Big Data. It is critical to have a basic understanding of this concept if you are to understand how to correctly get the data you need.

What is a Level of Analysis?

In the abstract, the term level of analysis refers to the scale of your research project. More concretely, it refers to the level at which your analyses are conducted. For instance, a political scientist would generally conduct research at one of three levels of analysis: the individual, the state, or the system. A communication scholar, in turn, might study, among others, the individual, the message, or the conversation. And a finance scholar might study the trader, the firm, the transaction, the security, the stock exchange, or the country.

‘Big Data’ can derive from many sources, but for the purposes of this post I’m assuming you’re interested in capturing some form of social media data — such as Tumblr, Twitter, Facebook, Pinterest, or Instagram. What is important to realize is that on all social media sites there are three fundamental levels of analysis — the account, the message, and the connections — and that these correspond to the three basic building blocks of social media engagement. Importantly, the social media sites generally allow you (with limits) to access their data, and the data are organized according to the level of analysis.

[Figure: the three basic building blocks of social media engagement (account, messages, and connections)]

To demonstrate these ideas I will use the example of the Community Foundation for Greater Buffalo’s Twitter page.

[Screenshot: the Community Foundation for Greater Buffalo's Twitter page]

Level One: The Account Level

The first level is the account level. Take a look at the screenshot below. As I’ve indicated on the image, there are a variety of account-level data — the images the organization has uploaded, its description, its location, its website address, and the date it joined Twitter. You can also see how many tweets it has sent to date (194), how many other Twitter users it follows (220), how many other users are following the organization (1,526), and how many tweets the organization has ‘favorited’ or archived (93). All of these data are at the account level of analysis. They are, effectively, characteristics of the organization’s account at a snapshot in time.

[Screenshot: account-level data highlighted on the Community Foundation for Greater Buffalo's Twitter page]

If we are being strict with social scientific language, we would say that the account is the unit of observation for our data here. But we can save the distinction between unit of observation and unit/level of analysis for a future post. For now, the key takeaway is to understand that these account-level data are at a higher level than the tweet, or friendship, or conversation level. They are characteristics of the organization — or more specifically, the organization’s Twitter account.

In line with what I’ve noted earlier, you can access all of the above account-level data through a specific portion of the Twitter application programming interface (API). Specifically, the users/lookup part of the Twitter API allows for the bulk downloading of Twitter user information. You can see a description of this part of the API here, along with definitions for the variables returned. For an overview of how to gather such data, take a look at this tutorial I’ve written.

There are good reasons why you would want to gather these data. For instance, you might want to track how many followers an organization has over time. In all of my studies using Twitter data I always start the data-gathering process by downloading these account-level data. But you should note that the account-level data are typically the least interesting. It is at this level that we see what I refer to as the static architecture of an organization’s engagement efforts on social media (see the first figure in this post). Think of this as the venue in which customer or stakeholder engagement can take place. On Twitter, the ability to modify the architecture is limited: the organization can add pictures, write a compelling description, and include a link to other social media accounts, but it cannot change the nature of any interactions that take place — those are hard-coded into the Twitter platform. Other social media sites allow for more customizable architecture. For instance, Facebook allows page administrators to change options for fan commenting while allowing greater customizability in the static architecture via apps.

Level Two: The Messages

The second is the message level. Here is where we really get to the heart of the social media data. No matter the social media platform, the heart of an organization’s engagement efforts occurs not through static architectural elements but through dynamic engagement efforts — through the day-to-day messages the organization sends and the daily connecting actions it takes. The messages are the heart of any social media platform, though they go by different names according to the platform. On Twitter, they are tweets. On Facebook, it’s statuses. On YouTube, videos. On Pinterest, it’s pins. On Instagram, it’s photos. Despite the different names, the point is that at their core all social media platforms stress dynamic communication, as manifested in the discrete visual or textual messages an organization sends to its followers.

[Screenshot: message-level data (tweets) highlighted on the Community Foundation for Greater Buffalo's Twitter page]

Take a look at the above screenshot. You’ll see I’ve indicated the message-level data. These are the tweets. Through accessing Twitter’s user_timeline API, you can access all of the information seen in the screenshot — the full text of the tweet, whether it was a retweet, how many times the tweet has been retweeted and favorited, links to included photos, etc. In almost any Twitter study you will want to gather these data. And fortunately, you can generally acquire the last 3,200 tweets sent by a Twitter user. If you have 100 organizations in your sample, this means you could easily — in a single day — build a database with 320,000 tweets.

Level Three: The Connections

Finally, there is the connections level. You can’t see these immediately on the Community Foundation’s Twitter page, but by clicking on ‘Following’ or ‘Followers’ you can get details on the other users the organization is following or followed by, respectively.

[Screenshot: the 'Following' and 'Followers' links on the organization's Twitter page]

For instance, clicking on ‘Following’ you will get what is shown in the screenshot below. Here you’ll see the first six Twitter users that the Community Foundation for Greater Buffalo follows. After the messages (tweets), this is the second most-important set of data — once again, here we can see the results of the organization’s dynamic engagement efforts — the formal social network connections it is making with other Twitter users.

[Screenshot: the first six Twitter users followed by the Community Foundation for Greater Buffalo]

As with the account-level and message-level data, these data are also available for download. To get a list of users the organization follows, access Twitter’s GET friends/ids API, while to get a list of users that follow the organization, access the GET followers/ids API. This is where you would go to get the data for a social network analysis of a sample of organizations’ friend and follower networks. Future tutorials will cover how to do this.
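With Twython, for example, pulling those connection lists looks roughly like the sketch below; the screen name is illustrative, and accounts with many connections would require paging through the results with the cursor parameter.

from twython import Twython

# Assumes `twitter` is an authenticated Twython instance (see the API set-up post).
friends = twitter.get_friends_ids(screen_name='CFGBuffalo')      # users the account follows
followers = twitter.get_followers_ids(screen_name='CFGBuffalo')  # users following the account

print('%d friends, %d followers' % (len(friends['ids']), len(followers['ids'])))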

Summary

Here are the two key takeaways. One, I hope you now understand that all social media sites have three fundamental elements that individuals or organizations can employ to engage with their audience: 1) static architecture, 2) discrete messages, and 3) discrete connecting actions. The first is interesting and necessary but not terribly important for most research projects. The latter two reflect an organization’s attempts at dynamic engagement with its audience. These two levels are also the building blocks for aggregating to higher levels of analysis — notably, using tweet-level data to conduct conversation-level analyses or using connection-level data to conduct network-level analyses. I’ll cover those in future posts.

Two, Twitter, like other social media sites, organizes and grants access to its data through a series of APIs that roughly conform to the levels of analysis covered above. What I hope to have conveyed here is that to get the data you need, you first have to understand this essential differentiation of the data. Understanding the different levels of analysis is the first step to understanding the nature of social media data.




How Many Tags is Too Much?

Including a hashtag in a social media message can increase its reach. The question is, what is the ideal number of tags to include?

To answer this question, I examine 60,919 original tweets sent in 2014 by 99 for-profit and nonprofit member organizations of a large US health advocacy coalition.

First, the following table shows the distribution of the number of hashtags included in the organizations’ tweets. As shown in the table, almost a third (n = 19,747) of tweets do not have a hashtag, almost 39% (n = 23,493) have one hashtag, 19% include two hashtags (n = 11,836), 7% include three (n = 4,381), and 2% (n = 1,161) include 4. Few tweets contain more than 4 tags, though one tweet included a total of 10 different hashtags.

Frequency of Hashtags in 60,919 Original Tweets

# of Hashtags    Frequency
0                   19,747
1                   23,493
2                   11,836
3                    4,381
4                    1,161
5                      227
6                       49
7                       13
8                        4
9                        7
10                       1
Total               60,919

Now let’s look at the effectiveness of messages with different numbers of hashtags. A good proxy for message effectiveness is retweetability, or how frequently audience members share the message with their followers. The following graph shows the average number of retweets received by tweets with different numbers of hashtags included.

[Graph: average number of retweets by number of hashtags included]
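For those wondering how such a summary might be produced, here is a sketch in PANDAS; the file name and the column names num_hashtags and retweet_count are assumptions about how the tweet-level dataset is organized.

import pandas as pd

df = pd.read_csv('tweets.csv')    # hypothetical file with one row per original tweet

# Average number of retweets for each number of hashtags included.
print(df.groupby('num_hashtags')['retweet_count'].mean())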

What we see is that adding hashtags helps, but only up to a point. Excluding the 25 tweets with more than 6 hashtags, the effectiveness of hashtag use peaks at 2 hashtags, while tweets with more than 3 hashtags perform no better, and often worse, than tweets with no hashtags at all.

The evidence isn't conclusive, especially given the anomalous findings for the few tweets with 7-10 tags, but it strongly suggests that, if you want your message to reach the biggest possible audience, you should limit your tweets to 1-2 hashtags.