In this tutorial I walk you through how to use Python and MongoDB to download tweets from a list of Twitter users.
This tutorial builds on several recent posts on how to use Python to download Twitter data. Specifically, in a previous post I showed you how to download tweets using Python and an SQLite database, a type of traditional relational database. More and more people are interested in noSQL databases such as MongoDB, so in a follow-up post I talked about the advantages and disadvantages of using SQLite vs. MongoDB to download social media data for research purposes. Today I go into detail about how to actually use MongoDB to download your data, and I point out the differences from the SQLite approach along the way.
Overview
This tutorial is directed at those who are new to Python, MongoDB, and/or downloading data from the Twitter API. We will be using Python to download the tweets and will be inserting them into a MongoDB database. This code will allow you to download up to the most recent 3,200 tweets sent by each Twitter user. I will not go over the script line by line but will instead try to give you a 'high-level' understanding of what we are doing, just enough so that you can run the script successfully yourself.
Before running this script, you will need to:
- Have Anaconda Python 2.7 installed
- Have your Twitter API details handy
- Have MongoDB installed and running
- Have created a CSV file (e.g., in Excel) containing the Twitter handles you wish to download. Below is a sample you can download and use for this tutorial. Name it accounts.csv and place it in the same directory as the Python script.
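If you would rather create the file from Python than in Excel, here is a minimal sketch that writes an accounts.csv in the format the script expects (a single Twitter_handle column). The five MLB handles are purely illustrative; substitute whichever accounts you want to track.
[python]
# Minimal sketch: write an accounts.csv with the Twitter_handle column the
# script expects. The handles below are illustrative examples only.
import pandas as pd

sample = pd.DataFrame({'Twitter_handle': ['MLB', 'Yankees', 'RedSox', 'Dodgers', 'Cubs']})
sample.to_csv('accounts.csv', index=False)
[/python]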
If you are completely new to Python and the Twitter API, you should first make your way through the following tutorials, which will help you get set up and working with Python:
- Which version of Python to download
- Running your first code
- Four ways to run your code
- Setting up access to the Twitter API
Another detailed tutorial I have created, Python Code Tutorial, is intended to serve as an introduction to how to access the Twitter API and then read the JSON data that is returned. It will be helpful for understanding what we’re doing in this script.
Also, if you are not sure you want to use MongoDB as your database, take a look at this post, which covers the advantages and disadvantages of using SQLite vs MongoDB to download social media data. As noted in that post, MongoDB has a more detailed installation process.
At the end of this post I'll show the entire script. For now, I'll go over it in sections. The code is divided into seven parts.
Part I: Importing Necessary Python Packages
The first line in the code is the shebang, which you'll find at the top of most Python scripts.
[python]#!/usr/bin/env python[/python]
Lines 3 – 23 contain the docstring, another Python convention. This is a multi-line comment that describes the code. For single-line comments, use the # symbol at the start of the line.
[python firstline="3"]
"""Social_Metrics_Tutorial_Script_User_Timeline_All_Pages.py - DOWNLOADS ALL AVAILABLE RECENT
TWEETS FROM 5 MLB ACCOUNTS INTO MONGODB DATABASE
BEFORE RUNNING THIS SCRIPT, YOU WILL NEED TO:
1. HAVE ANACONDA PYTHON 2.7 INSTALLED
2. HAVE CREATED CSV FILE (E.G., IN EXCEL) CONTAINING TWITTER HANDLES YOU
WISH TO DOWNLOAD (SEE TUTORIAL FOR DETAILS)
3. HAVE MONGODB INSTALLED AND RUNNING
THE CODE IS DIVIDED INTO SEVEN PARTS:
1. Importing necessary Python packages
2. Importing Twython and Twitter app key and access token
– YOU NEED TO MODIFY THIS SECTION IN ORDER TO GET SCRIPT TO WORK (LINES 39-41)
3. Defining function for getting Twitter data
4. Set up MongoDB database and collections (tables)
5. Read in Twitter accounts (and add to MongoDB database if first run)
6. Main loop over each of the Twitter handles in the accounts table of the database.
7. Print out number of tweets in database per account
"""
[/python]
In lines 26 – 31 we'll import some Python packages needed to run the code. Twython can be installed by opening your Terminal and entering pip install Twython. For more details on this process, see this blog post.
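If you want to confirm the install worked before going any further, a quick check is to import the package in a Python shell; if no ImportError is raised, you're set.
[python]
# Quick sanity check that Twython installed correctly.
from twython import Twython
print 'Twython imported successfully'
[/python]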
[python firstline="26"]
###### PART I: IMPORT PYTHON PACKAGES (ALL BUT TWYTHON ARE INSTALLED W/ ANACONDA PYTHON) ######
import sys
import time
import json
import pandas as pd
from twython import Twython #NEEDS TO BE INSTALLED SEPARATELY ONCE: pip install Twython
[/python]
Part II: Import Twython and Twitter App Key and Access Token
Lines 37-42 are where you will enter your Twitter App Key and Access Token (lines 40-41). If you have not yet done this, you can refer to the tutorial on Setting up access to the Twitter API.
[python firstline="37"]
###### PART II: IMPORT TWYTHON, ADD TWITTER APP KEY & ACCESS TOKEN (TO ACCESS API) ######
#REPLACE 'APP_KEY' AND 'ACCESS_TOKEN' WITH YOUR APP KEY & ACCESS TOKEN IN THE NEXT 2 LINES
APP_KEY = ' '
ACCESS_TOKEN = ' '
twitter = Twython(APP_KEY, access_token=ACCESS_TOKEN)
[/python]
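Before running the full script, you may want to confirm that your key and token actually work. One low-risk way (a sketch, not part of the original script) is to make the same rate-limit call the main loop relies on; if your credentials are wrong, Twython will raise an error here.
[python]
# Optional sanity check: ask Twitter how many user_timeline calls remain in the
# current 15-minute window. A successful response means the APP_KEY and
# ACCESS_TOKEN above are valid.
status = twitter.get_application_rate_limit_status()
print status['resources']['statuses']['/statuses/user_timeline']['remaining']
[/python]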
Part III: Define a Function for Getting Twitter Data
In this block of code we are creating a Python function. The function sets up which part of the Twitter API we wish to access (specifically, it is the get user timeline API), the number of tweets we want to get per page (I have chosen the maximum of 200), and whether we want to include retweets. We will call this function later on in the code.
[python firstline="48"]
###### PART III: DEFINE TWYTHON FUNCTION FOR GETTING ALL AVAILABLE TWEETS PER USER ######
def get_data_user_timeline_all_pages(kid, page):
    try:
        '''
        'count' specifies the number of tweets to try and retrieve, up to a maximum of 200
        per distinct request. The value of count is best thought of as a limit to
        the number of tweets to return because suspended or deleted content is removed
        after the count has been applied. We include retweets in the count, even if
        include_rts is not supplied. It is recommended you always send include_rts=1 when
        using this API method.
        '''
        d = twitter.get_user_timeline(screen_name=kid, count="200", page=page, include_entities="true", include_rts="1")
    except Exception, e:
        print "Error reading id %s, exception: %s" % (kid, e)
        return None
    print len(d) #d[0] #NUMBER OF ENTRIES RETURNED, FIRST ENTRY
    #print "d.keys(): ", d[0].keys()
    return d
[/python]
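To see what the function returns, here is a one-off call you could try after running Parts I-III. The handle 'MLB' is just an example; the main loop in Part VI makes this same call for every handle and page.
[python]
# Illustrative one-off call: grab page 1 (up to 200 tweets) for a single handle.
d = get_data_user_timeline_all_pages('MLB', 1)
if d:
    print d[0]['text']   # text of the most recent tweet returned
    print d[0]['id_str'] # the tweet ID string we'll use as a unique index later
[/python]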
Part IV: Set up MongoDB Database and Collections (Tables)
Lines 72-111 are where you set up your MongoDB database and ‘collections’ (tables).
This is where you'll see the first major differences from an SQLite implementation of this code. First, unlike SQLite, you will need to make sure MongoDB is running by typing mongod or sudo mongod in the Terminal. So, that's one extra step you have to take with MongoDB. If you're running the code on a machine that is up 24/7, that is no issue; if not, you'll just have to remember to start MongoDB before running the script.
There is a big benefit to MongoDB here, however. Unlike with the SQLite implementation, there is no need to pre-define every column in our database tables. As you can see in the SQLite version, we devoted 170 lines of code to defining and naming database columns.
Below, in contrast, we are simply making a connection to MongoDB, creating our database, then our database tables, then indexes on those tables. Note that, if this is the first time you’re running this code, the database and tables and indexes will be created; if not, the code will simply access the database and tables. Note also that MongoDB refers to database tables as ‘collections’ and refers to columns or variables as ‘fields.’
One thing that is similar to the SQLite version is that we are setting indexes on our database tables. This means that no two tweets with the same index value — the tweet’s ID string (id_str) — can be inserted into our database. This is to avoid duplicate entries.
One last point: we are setting up two tables, one for the tweets and one to hold the Twitter account names for which we wish to download tweets.
[python firstline="72"]
###### PART IV: SET UP MONGODB DATABASE AND ACCOUNTS AND TWEETS TABLES ######
#MAKE CONNECTION TO MONGODB
import pymongo
from pymongo import MongoClient
client = MongoClient()
# DEFINE YOUR MONGODB DATABASE
db = client['MLB']
# CREATE ACCOUNTS COLLECTION (TABLE) IN YOUR DATABASE FOR TWITTER ACCOUNT-LEVEL DETAILS
accounts = db['accounts']
# CREATE AN INDEX ON THE COLLECTION TO AVOID INSERTION OF DUPLICATES
db.accounts.create_index([('Twitter_handle', pymongo.ASCENDING)], unique=True)
# SHOW INDEX ON ACCOUNTS TABLE
#list(db.accounts.index_information())
#SHOW NUMBER OF ACCOUNTS IN TABLE
#accounts.count()
# DEFINE COLLECTION (TABLE) WHERE YOU'LL INSERT THE TWEETS
tweets = db['tweets']
# CREATE UNIQUE INDEX FOR TABLE (TO AVOID DUPLICATES)
db.tweets.create_index([('id_str', pymongo.ASCENDING)], unique=True)
#SHOW INDEX ON TWEETS COLLECTION
#list(db.tweets.index_information())
#SHOW NUMBER OF TWEETS IN TABLE
#tweets.count()
#TO SEE LIST OF CURRENT MONGODB DATABASES
#client.database_names()
#TO SEE LIST OF COLLECTIONS IN THE *MLB* DATABASE
#db.collection_names()
[/python]
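If you want to see MongoDB's schema-less behavior for yourself, you can pull a single document back out of the tweets collection once some tweets have been inserted (i.e., after Part VI has run at least once). Every field Twitter returned will be there even though we never defined a single column.
[python]
# Optional check (run after Part VI has inserted tweets): look at one stored
# tweet to confirm MongoDB kept every field without any pre-defined schema.
sample_tweet = tweets.find_one()
if sample_tweet:
    print sorted(sample_tweet.keys()) # e.g., created_at, id_str, text, user, ...
[/python]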
Part V: Read in Twitter Accounts (and add to MongoDB database if first run)
In lines 117-139 we create a Python list of Twitter handles for which we want to download tweets. The first part of the code (lines 119-130) checks whether this is the first time you're running the code. If so, it will read the Twitter handle data from your local CSV file and insert it into the accounts table in your MongoDB database. In all subsequent runs the script will skip over this block and go directly to line 137, which creates a list called twitter_accounts that we'll loop over in Part VI of the code.
[python firstline="117"]
###### PART V: READ IN TWITTER ACCOUNTS (AND ADD TO MONGODB IF FIRST RUN)
# IF ACCOUNTS COLLECTION IS EMPTY READ IN CSV FILE AND ADD TO MONGODB
if accounts.count() < 1:
    df = pd.read_csv('accounts.csv')
    records = json.loads(df.T.to_json()).values()
    print "No account data in MongoDB, attempting to insert", len(records), "records"
    try:
        accounts.insert_many(records)
    except pymongo.errors.BulkWriteError, e:
        print e, '\n'
        #pass
else:
    print "There are already", accounts.count(), "records in the *accounts* table"
#LIST ROWS IN ACCOUNTS COLLECTION
#list(accounts.find())[:1]
# CREATE LIST OF TWITTER HANDLES FOR DOWNLOADING TWEETS
twitter_accounts = accounts.distinct('Twitter_handle')
#print len(twitter_accounts)
#twitter_accounts[:5]
[/python]
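The one line worth pausing on is the records = json.loads(df.T.to_json()).values() conversion, which turns the pandas DataFrame read from accounts.csv into a list of dictionaries that pymongo can insert directly. Here is a small stand-alone illustration (using a hypothetical two-row file) of what that conversion produces.
[python]
# Stand-alone illustration of the DataFrame-to-records conversion used above.
import json
import pandas as pd

df = pd.DataFrame({'Twitter_handle': ['MLB', 'Yankees']}) # stand-in for accounts.csv
records = json.loads(df.T.to_json()).values()
print list(records) # two dicts, one per CSV row, e.g. {u'Twitter_handle': u'MLB'}
[/python]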
Part VI: Main Loop: Loop Over Each of the Twitter Handles in the Accounts Table and Download Tweets
Lines 144-244 bring us to the last important step: the main download loop.
This code is also much shorter than the SQLite version. As noted in my previous post comparing SQLite to MongoDB, in MongoDB we do not need to define all of the columns we wish to insert into our database. MongoDB will simply take whatever fields you throw at it and insert them. In the SQLite version, in contrast, we had to devote 290 lines of code just to specify which parts of the Twitter data we were grabbing and how they related to our pre-defined variable names.
After stripping out all of those details, the core of this code is the same as in the SQLite version. At line 151 we begin a for loop where we are looping over each Twitter ID (as indicated by the Twitter_handle variable in our accounts database).
Note that within this for loop we have a while loop (lines 166-238). What we are doing here is, for each Twitter ID, we are grabbing up to 16 pages’ worth of tweets; this is the maximum allowed for by the Twitter API. It is in this loop (line 170) that we call our get_data_user_timeline_all_pages function, which on the first loop will grab page 1 for the Twitter ID, then page 2, then page 3, …. up to page 16 so long as there are data to return.
Lines 186-205 contain the code for writing the data into our MongoDB database table. We have defined our variable d to contain the result of calling our get_data_user_timeline_all_pages function, which means that, if successful, d will contain up to 200 tweets' worth of data. The for loop starting on line 187 loops over each tweet, adds three variables to each one (date_inserted, time_date_inserted, and screen_name), and then inserts the tweet into our tweets collection.
One last thing I’d like to point out here is the API limit checks I’ve written in lines 221-238. What this code is doing is checking how many remaining API calls you have. If it is too low, the code will pause for 5 minutes.
[python firstline="144"]
###### PART VI: LOOP OVER TWITTER HANDLES & DOWNLOAD TWEETS INTO MONGODB COLLECTION ######
import timeit
start_time = timeit.default_timer()
starting_count = tweets.count()

for s in twitter_accounts[:1]: #NOTE: THE [:1] SLICE LIMITS THIS RUN TO THE FIRST HANDLE; REMOVE IT TO LOOP OVER ALL HANDLES
    #SET THE DUPLICATES COUNTER FOR THIS TWITTER ACCOUNT TO ZERO
    duplicates = 0
    #CHECK FOR TWITTER API RATE LIMIT (450 CALLS/15-MINUTE WINDOW)
    rate_limit = twitter.get_application_rate_limit_status()['resources']['statuses']['/statuses/user_timeline']['remaining']
    print '\n\n', '# of remaining API calls: ', rate_limit
    #tweet_id = str(mentions.find_one( { "query_screen_name": s}, sort=[("id_str", 1)])["id_str"])
    print 'Grabbing tweets sent by: ', s, '-- index number: ', twitter_accounts.index(s)
    page = 1
    #WE CAN GET 200 TWEETS PER CALL AND UP TO 3,200 TWEETS TOTAL, MEANING 16 PAGES PER ACCOUNT
    while page < 17:
        print "------XXXXXX------ STARTING PAGE", page, '...estimated remaining API calls:', rate_limit
        d = get_data_user_timeline_all_pages(s, page)
        if not d:
            print "THERE WERE NO STATUSES RETURNED........MOVING TO NEXT ID"
            break
        if len(d) == 0: #THIS ROW IS DIFFERENT FROM THE MENTIONS AND DMS FILES
            print "THERE WERE NO STATUSES RETURNED........MOVING TO NEXT ID"
            break
        #if not d['statuses']:
        #    break
        #DECREASE rate_limit TRACKER VARIABLE BY 1
        rate_limit -= 1
        print '.......estimated remaining API rate_limit: ', rate_limit
        ##### WRITE THE DATA INTO MONGODB -- LOOP OVER EACH TWEET
        for entry in d:
            #ADD THE FOLLOWING THREE VARIABLES TO THOSE RETURNED BY TWITTER API
            entry['date_inserted'] = time.strftime("%d/%m/%Y")
            entry['time_date_inserted'] = time.strftime("%H:%M:%S_%d/%m/%Y")
            entry['screen_name'] = entry['user']['screen_name']
            #CONVERT TWITTER DATA TO PREP FOR INSERTION INTO MONGO DB
            t = json.dumps(entry)
            #print 'type(t)', type(t) #<type 'str'>
            loaded_entry = json.loads(t)
            #print type(loaded_entry), loaded_entry #<type 'dict'>
            #INSERT THE TWEET INTO THE DATABASE -- UNLESS IT IS ALREADY IN THE DB
            try:
                tweets.insert_one(loaded_entry)
            except pymongo.errors.DuplicateKeyError, e:
                #print e, '\n'
                duplicates += 1
                pass
        print '------XXXXXX------ FINISHED PAGE', page, 'FOR ORGANIZATION', s, "--", len(d), "TWEETS"
        #IF THERE ARE TOO MANY DUPLICATES THEN SKIP TO NEXT ACCOUNT
        if duplicates > 20:
            print '\n********************There are %s' % duplicates, 'duplicates....moving to next ID********************\n'
            #continue
            break
        page += 1
        if page > 16:
            print "WE'RE AT THE END OF PAGE 16"
            break
        #THIS IS A SOMEWHAT CRUDE METHOD OF PUTTING IN AN API RATE LIMIT CHECK
        #THE RATE LIMIT FOR CHECKING HOW MANY API CALLS REMAIN IS 180, WHICH MEANS WE CANNOT CHECK AFTER EVERY CALL
        if rate_limit < 5:
            print 'Estimated fewer than 5 API calls remaining...check then pause 5 minutes if necessary'
            rate_limit_check = twitter.get_application_rate_limit_status()['resources']['statuses']['/statuses/user_timeline']['remaining']
            print '.......and here is our ACTUAL remaining API rate_limit: ', rate_limit_check
            if rate_limit_check < 5:
                print 'Fewer than 5 API calls remaining...pausing for 5 minutes'
                time.sleep(300) #PAUSE FOR 300 SECONDS
                rate_limit = twitter.get_application_rate_limit_status()['resources']['statuses']['/statuses/user_timeline']['remaining']
                print '.......here is our remaining API rate_limit after pausing for 5 minutes: ', rate_limit
            #if rate_limit_check == 450:
            #    rate_limit = 450
        #if twitter.get_application_rate_limit_status()['resources']['search']['/search/tweets']['remaining']<5:
        if rate_limit < 5:
            print 'Fewer than 5 estimated API calls remaining...pausing for 5 minutes'
            time.sleep(300) #PAUSE FOR 300 SECONDS

elapsed = timeit.default_timer() - start_time
print '# of minutes: ', elapsed/60
print "Number of new tweets added this run: ", tweets.count() - starting_count
print "Number of tweets now in DB: ", tweets.count(), '\n', '\n'
[/python]
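Since the rate-limit handling is spread across several checks above, here is the same idea condensed into a small helper function. This is just a sketch of the pattern, not part of the original script, and it queries the same user_timeline endpoint the loop uses.
[python]
# Sketch: the rate-limit pattern from the loop above, wrapped in one helper.
# It asks Twitter how many user_timeline calls remain in the current 15-minute
# window and pauses if we are nearly out. Not part of the original script.
def pause_if_near_rate_limit(minimum_calls=5, wait_seconds=300):
    remaining = twitter.get_application_rate_limit_status()['resources']['statuses']['/statuses/user_timeline']['remaining']
    if remaining < minimum_calls:
        print 'Only', remaining, 'user_timeline calls left -- pausing', wait_seconds, 'seconds'
        time.sleep(wait_seconds)
    return remaining
[/python]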
Part VII: Print out Number of Tweets in Database per Account
This final block of code will print out a summary of how many tweets there are per account in your tweets database.
[python firstline="250"]
###### PART VII: PRINT OUT NUMBER OF TWEETS IN DATABASE FOR EACH ACCOUNT ######
for org in db.tweets.aggregate([
        {"$group": {"_id": "$screen_name", "sum": {"$sum": 1}}}
    ]):
    print org['_id'], org['sum']
[/python]
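If you are tracking many accounts, a small variation on the same aggregation sorts the output from most to fewest tweets (the field names are the same as above).
[python]
# Same per-account counts as above, sorted from most to fewest tweets.
pipeline = [
    {"$group": {"_id": "$screen_name", "sum": {"$sum": 1}}},
    {"$sort": {"sum": -1}}
]
for org in db.tweets.aggregate(pipeline):
    print org['_id'], org['sum']
[/python]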
Now let’s put the whole thing together. To recap, what this entire script does is to loop over each of the Twitter accounts in the accounts table of your MongoDB database — and for each one it will grab up to 3,200 tweets and insert the tweets into the tweets table of your database.
Below is the entire script — download it and save it as tweets.py (or something similar) in the same directory as your accounts.csv file. Add in your Twitter API account details and you’ll be good to go! For a refresher on the different ways you can run the script see this earlier post.
If you’ve found this post helpful please share on your favorite social media site.
You’re on your way to downloading your own Twitter data! Happy coding!