Python Data Analytics Tutorials

The bulk of my research involves some degree of ‘Big Data’ — such as datasets with a million or more tweets. Getting these data prepped for analysis can involve massive amounts of data manipulation — anything from aggregating data to the daily or organizational level, to merging in additional variables, to generating data required for social network analysis. For all such steps I now almost exclusively use Python’s PANDAS library (‘Python Data Analysis Library’). In conjunction with the Jupyter Notebook interactive computing framework and packages such as NetworkX, you will have a powerful set of analysis tools at your disposal. This page contains links to the tutorials I have created to help you learn data analytics in Python. I also have a page with shorter (typically one-liner) data analytic code bytes.

Data Collection

Data Analysis

  • Generating New Variables (coming soon)
  • Producing a Summary Statistics Table for Publication (coming soon)
  • Analyzing Audience Reaction on Twitter (coming soon)
  • Running, Interpreting, and Outputting Logistic Regression (coming soon)

I hope you have found this helpful. If so, please spread the word, and happy coding!




Downloading Tweets, Take III – MongoDB

In this tutorial I walk you through how to use Python and MongoDB to download tweets from a list of Twitter users.

This tutorial builds on several recent posts on how to use Python to download Twitter data. Specifically, in a previous post I showed you how to download tweets using Python and an SQLite database — a type of traditional relational database. More and more people are interested in noSQL databases such as MongoDB, so in a follow-up post I talked about the advantages and disadvantages of using SQLite vs MongoDB to download social media data for research purposes. Today I go into detail about how to actually use MongoDB to download your data, and I point out the differences from the SQLite approach along the way.

Overview

This tutorial is directed at those who are new to Python, MongoDB, and/or downloading data from the Twitter API. We will be using Python to download the tweets and will be inserting the tweets into a MongoDB database. This code will allow you to download up to the latest 3,200 tweets sent by each Twitter user. I will not go over the script line-by-line but will instead attempt to provide you a ‘high-level’ understanding of what we are doing — just enough so that you can run the script successfully yourself.

Before running this script, you will need to:

  • Have Anaconda Python 2.7 installed
  • Have your Twitter API details handy
  • Have MongoDB installed and running
  • Have created a CSV file (e.g., in Excel) containing the Twitter handles you wish to download. Below is a sample you can download and use for this tutorial. Name it accounts.csv and place it in the same directory as the Python script.

If you are completely new to Python and the Twitter API, you should first make your way through the following tutorials, which will help you get set up and working with Python:

Another detailed tutorial I have created, Python Code Tutorial, is intended to serve as an introduction to how to access the Twitter API and then read the JSON data that is returned. It will be helpful for understanding what we’re doing in this script.

Also, if you are not sure you want to use MongoDB as your database, take a look at this post, which covers the advantages and disadvantages of using SQLite vs MongoDB to download social media data. As noted in that post, MongoDB has a more detailed installation process.

At the end of this post I’ll show the entire script. For now, I’ll go over it in sections. The code is divided into seven parts, which I cover in turn below.

Part I: Importing Necessary Python Packages

The first line in the code is the shebang — a convention you’ll find at the top of most executable Python scripts.

Lines 3 – 23 contain the docstring — also a Python convention. This is a multi-line comment that describes the code. For single-line comments, use the # symbol at the start of the line.
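
To make this concrete, here is a minimal sketch of what the top of such a script looks like (the docstring text below is my own placeholder, not the script's actual docstring):

    #!/usr/bin/env python

    """
    Download up to 3,200 recent tweets for each Twitter handle listed in
    accounts.csv and insert them into a MongoDB database.
    """

    # Single-line comments like this one begin with the # symbol.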


 

In lines 26 – 31 we’ll import the Python packages needed to run the code. Twython can be installed by opening your Terminal and entering pip install Twython. For more details on this process see this blog post.
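
For reference, a plausible set of imports for a script like this would be the following (the exact list in lines 26-31 may differ; if you do not have pymongo yet, it installs the same way, with pip install pymongo):

    import csv
    import datetime
    import time

    from pymongo import MongoClient   # driver for talking to MongoDB
    from twython import Twython       # wrapper around the Twitter API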

Part II: Import Twython and Twitter App Key and Access Token

Lines 37-42 are where you will enter your Twitter App Key and Access Token (lines 40-41). If you have yet to do this you can refer to the tutorial on Setting up access to the Twitter API.
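
Sketched out, that section amounts to little more than the following (the placeholder strings are where your own credentials go; I am assuming app-only OAuth 2 authentication here):

    APP_KEY = 'YOUR_APP_KEY'            # from your Twitter app's settings page
    ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN'  # the app-only access token

    # The Twython object used for all subsequent API calls.
    twitter = Twython(APP_KEY, access_token=ACCESS_TOKEN)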

Part III: Define a Function for Getting Twitter Data

In this block of code we are creating a Python function. The function sets up which part of the Twitter API we wish to access (specifically, the user timeline endpoint), the number of tweets we want to get per page (I have chosen the maximum of 200), and whether we want to include retweets. We will call this function later on in the code.
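
Here is a hedged sketch of such a function. I am paging with max_id rather than page numbers, and the error handling is illustrative; the function in the actual script may differ in its details:

    def get_data_user_timeline_all_pages(screen_name, max_id=None):
        """Return one 'page' of up to 200 tweets from a user's timeline, or None on error."""
        kwargs = {'screen_name': screen_name,
                  'count': 200,            # the maximum number of tweets per request
                  'include_rts': 'true'}   # include retweets
        if max_id is not None:
            kwargs['max_id'] = max_id      # used to page backwards through the timeline
        try:
            return twitter.get_user_timeline(**kwargs)
        except Exception as e:
            print('Error downloading timeline for %s: %s' % (screen_name, e))
            return None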

Part IV: Set up MongoDB Database and Collections (Tables)

Lines 72-111 are where you set up your MongoDB database and ‘collections’ (tables).

This is where you’ll see the first major differences from an SQLite implementation of this code. First, unlike SQLite, you will need to make sure MongoDB is running by typing mongod or sudo mongod in the terminal. So, that’s one extra step you have to take with MongoDB. If you’re running the code on a machine that is up 24/7 this is no issue; if not, you’ll just have to remember to start MongoDB before running the script.

There is a big benefit to MongoDB here, however. Unlike with the SQLite implementation, there is no need to pre-define every column in our database tables. As you can see in the SQLite version, we devoted 170 lines of code to defining and naming database columns.

Below, in contrast, we are simply making a connection to MongoDB, creating our database, then our database tables, then indexes on those tables. Note that, if this is the first time you’re running this code, the database and tables and indexes will be created; if not, the code will simply access the database and tables. Note also that MongoDB refers to database tables as ‘collections’ and refers to columns or variables as ‘fields.’

One thing that is similar to the SQLite version is that we are setting indexes on our database tables. This means that no two tweets with the same index value — the tweet’s ID string (id_str) — can be inserted into our database. This is to avoid duplicate entries.

One last point: we are setting up two tables, one for the tweets and one to hold the Twitter account names for which we wish to download tweets.
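
In pymongo terms, the setup boils down to something like this (the database and collection names below are my own placeholders rather than necessarily the ones used in the script):

    # MongoDB must already be running (mongod) for the connection to succeed.
    client = MongoClient('localhost', 27017)

    db = client['twitter_db']    # the database; created automatically on first use
    tweets = db['tweets']        # collection for the downloaded tweets
    accounts = db['accounts']    # collection for the Twitter handles

    # A unique index on the tweet ID string keeps duplicate tweets out of the database.
    tweets.create_index('id_str', unique=True)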

Part V: Read in Twitter Accounts (and add to MongoDB database if first run)

In Lines 117-139 we are creating a Python list of Twitter handles for which we want to download tweets. The first part of the code (lines 119-130) is to check if this is the first time you’re running the code. If so, it will read the Twitter handle data from your local CSV file and insert it into the accounts table in your MongoDB database. In all subsequent runs of the code the script will skip over this block and go directly to line 137 — that creates a list called twitter_accounts that we’ll loop over in Part VI of the code.
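
A sketch of that logic, assuming a recent version of pymongo and a CSV file with a column named Twitter_handle:

    # On the first run only, load the handles from accounts.csv into the accounts collection.
    if accounts.count_documents({}) == 0:
        with open('accounts.csv') as f:
            for row in csv.DictReader(f):
                accounts.insert_one({'Twitter_handle': row['Twitter_handle']})

    # The list of handles we will loop over in Part VI.
    twitter_accounts = [a['Twitter_handle'] for a in accounts.find()]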

Part VI: Main Loop: Loop Over Each of the Twitter Handles in the Accounts Table and Download Tweets

In lines 144-244 we are at the last important step.

This part of the code is also much shorter than in the SQLite version. As noted in my previous post comparing SQLite to MongoDB, in MongoDB we do not need to define all of the columns we wish to insert into our database: MongoDB will just take whatever columns you throw at it and insert them. In the SQLite version, in contrast, we had to devote 290 lines of code just to specifying which parts of the Twitter data we were grabbing and how they relate to our pre-defined variable names.

After stripping out all of those details, the core of this code is the same as in the SQLite version. At line 151 we begin a for loop where we are looping over each Twitter ID (as indicated by the Twitter_handle variable in our accounts database).

Note that within this for loop we have a while loop (lines 166-238). What we are doing here is, for each Twitter ID, grabbing up to 16 pages’ worth of tweets; this is the maximum allowed by the Twitter API. It is in this loop (line 170) that we call our get_data_user_timeline_all_pages function, which on the first pass will grab page 1 for the Twitter ID, then page 2, then page 3, and so on, up to page 16, so long as there are data to return.

Lines 186-205 contain the code for writing the data into our MongoDB database table. We have defined our variable d to contain the result of calling our get_data_user_timeline_all_pages function — this means that, if successful, d will contain 200 tweets’ worth of data. The for loop starting on line 187 will loop over each tweet, add three variables to each tweet — date_inserted, time_date_inserted, and screen_name — and then insert the tweet into our tweets collection.
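
Stripped of the details, the heart of this section looks roughly like the following sketch (again, this is a sketch rather than the script itself; I am paging with max_id, and the date and time formats are illustrative):

    from pymongo.errors import DuplicateKeyError

    for screen_name in twitter_accounts:
        max_id = None
        for page in range(16):                  # 16 pages x 200 tweets = the 3,200-tweet cap
            d = get_data_user_timeline_all_pages(screen_name, max_id)
            if not d:                           # nothing returned, so move on to the next account
                break
            for tweet in d:
                now = datetime.datetime.now()
                tweet['date_inserted'] = now.strftime('%Y-%m-%d')
                tweet['time_date_inserted'] = now.strftime('%H:%M:%S %Y-%m-%d')
                tweet['screen_name'] = screen_name
                try:
                    tweets.insert_one(tweet)
                except DuplicateKeyError:       # this tweet is already in the database
                    pass
            max_id = d[-1]['id'] - 1            # start the next page just below the oldest tweet seen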

One last thing I’d like to point out here is the API limit check I’ve written in lines 221-238. This code checks how many API calls you have remaining; if the number is too low, the script pauses for 5 minutes.
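
The check can be done with Twython's rate-limit endpoint; here is a sketch (the threshold of 5 remaining calls is illustrative):

    # Ask Twitter how many user_timeline calls remain in the current window.
    rate_limit = twitter.get_application_rate_limit_status()
    remaining = rate_limit['resources']['statuses']['/statuses/user_timeline']['remaining']
    if remaining < 5:
        print('Nearing the API limit; pausing for 5 minutes')
        time.sleep(300)    # 300 seconds = 5 minutes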

Part VII: Print out Number of Tweets in Database per Account

This final block of code will print out a summary of how many tweets there are per account in your tweets database.
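
Conceptually it is just a count per account, for example:

    for screen_name in twitter_accounts:
        n = tweets.count_documents({'screen_name': screen_name})
        print('%s: %s tweets in the database' % (screen_name, n))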

Now let’s put the whole thing together. To recap, what this entire script does is to loop over each of the Twitter accounts in the accounts table of your MongoDB database — and for each one it will grab up to 3,200 tweets and insert the tweets into the tweets table of your database.

Below is the entire script — download it and save it as tweets.py (or something similar) in the same directory as your accounts.csv file. Add in your Twitter API account details and you’ll be good to go! For a refresher on the different ways you can run the script see this earlier post.

If you’ve found this post helpful please share on your favorite social media site.

You’re on your way to downloading your own Twitter data! Happy coding!




Does Twitter Matter?

Twitter is not the Gutenberg Press. The ‘Big Data’ revolution is over-hyped. Nevertheless, Twitter is significant in a number of ways:

  • For identifying trends.
  • For rapid, near real-time dissemination of news.
  • It has been used to track the progression of the flu and other infectious diseases.
  • It has played a mobilizational role in the Arab Spring and other social movement activities.
  • For on-the-ground reporting of news and events.
  • It allows you to decentralize research; what Nigel Cameron calls mutual curation and others call social curation.
  • For looking into the global cocktail party that is Twitter.
  • Allows one to take the pulse of the community on almost any given topic.
  • Twitter is the Big Data source.
  • It constitutes a coordination and communication tool for post-disaster mobilizations.
  • Twitter facilitates the rapid diffusion of ideas, rumors, opinion, sentiment, and news.
  • For professionals and organizations alike, it facilitates networking, relationship-building, and exposure.
  • Twitter is proving to be a powerful dialogic tool — for initiating and engaging in conversations.
  • Unlike other social media (e.g., Facebook), Twitter has a largely open model, allowing anyone to follow anyone else.
  • Social chatter has become a powerful monitoring tool:
    • Hedge fund managers listen in on social media conversations when making investment decisions.
    • It is used to track and identify terrorists and extremists.
  • Facilitates the leveraging of what Granovetter (1973) calls weak ties.
  • Can be a force for good (e.g., Twestivals).

In short, Twitter is not just for sharing pictures of your lunch. In addition to all the silliness, Twitter has become the world’s premier message network. These messages are used in a wide variety of settings and for a broad range of purposes. And researchers are able to listen in — a boon to anyone interested in messages, conversations, networks, information, mobilization, diffusion, or any number of social science phenomena.




Establishing a Presence: Advice for PhD Students

Being Director of Graduate Studies gives me plenty of time to reflect on what I’d like students to get out of graduate education. For budding academics, you have all likely heard (countless times!) that the ultimate “deliverable” is high-quality journal articles. Of this there is little doubt — at least in the fields I’m familiar with (social sciences and business). Beyond that, it is important to establish a presence in the field. This can involve such traditional activities as reviewing journal articles, presenting and organizing at conferences, conducting guest seminars, and being involved in sub-field specialty groups. With the spread of new and social media, there is also a new way: establishing a digital presence. If you put in the work during your PhD studies, over the course of your graduate career you will become one of the world’s experts on some area of research, and I would encourage all PhD students to explore the ways that you could make this presence known to your relevant academic community. Increasingly, knowledge and ideas are being shared online — and if you are not actively involved in influencing these knowledge networks you are missing out.

I am not talking about just having a LinkedIn or Academia.edu account. Your ultimate goal in establishing a digital presence will be to add value to the conversations that are already happening online. This can be done through microblogging on Twitter, Tumblr, or LinkedIn, through a conventional long-form blogging platform, or via original video or slide content on a site such as Slideshare. Here are several paradigms for you to consider as you mull over the digital presence that best fits your interests and talents:

  • The provocateur — push, prod, and provoke the academy in a direction you feel strongly about.
  • The curator — become the source others turn to by aggregating and re-framing relevant content.
  • The teacher — teach others how to do what you know.
  • The advice-giver — advice is cheap, but you may have something useful to add.
  • The marketer — promote your work, but in a way that is not merely self-serving. Rather, show how your work builds on and enhances existing research. Contribute to the discussion.
  • The practice whisperer — translate the findings of your research in a way that practitioners will find useful. Similarly, you could seek to be a public intellectual, as called for by Nicholas Kristof.

These are just some of the ways you can establish a presence in the field. Play around with it and find your identity. One of our graduate students, Wayne Xu, has done an excellent job in using a new blog along with Slideshare to take on a teaching role. One of UB’s long-ago graduates, Han Woo Park, has similarly become one of the most successful posters on Slideshare. Personally, my blog mixes the roles of provocateur, teacher, advice-giver, marketer, and translator (I leave the curating to others).

Establishing a digital presence is not a replacement for writing strong journal articles, but it is one of the ways you can make your ultimate impact more powerful. In the end, whether you decide to create a web presence in addition to the traditional route or not, be sure you infuse everything you do with quality. The academic world seems huge but it isn’t. Word gets around. Lastly, talk with your advisor — he or she is there to help you set a long-term strategy for not only publishing high-quality journal articles but also for making your presence in the field known.




Why I Use Python for Academic Research

Academics and other researchers have to choose which research skills to invest in. Most social scientists do not add computer programming to their skill set. As a strong proponent of the value of learning a programming language, I will lay out how it has proven useful for me. A budding programmer could choose from a number of good options — including Perl, C++, Java, PHP, or others — but Python has a reputation as one of the most accessible and intuitive. I obviously like it.

No matter your choice of language, there are a variety of ways learning to program will be useful for social scientists and other researchers who work with data. The most important areas are data gathering, data manipulation, and data visualization and analysis.

Data Gathering

When I started learning Python four years ago, I began keeping a catalogue of the various scripts I wrote. Looking back over that catalogue, I see that I have personally written Python code to gather the following data:

  • Download lender and borrower information for thousands of donation transactions on kiva.org.
  • Download tweets from a list of 100 large nonprofit organizations.
  • Download Twitter profile information for 150 advocacy nonprofits.
  • Scrape the ‘Walls’ from 65 organizations’ Facebook accounts.
  • Download @messages sent to 38 community foundations.
  • Traverse and download html files for thousands of webpages on large accounting firms’ websites.
  • Scrape data from 1,000s of organizational profiles on a charity rating site.
  • Scrape data from several thousand organizations raising money on the crowdfunding site Indiegogo.
  • Download hundreds of YouTube videos used in Indiegogo fundraising campaigns.
  • Gather data available through the InfoChimps API.
  • Scrape pinning and re-pinning data from health care organizations’ Pinterest accounts.
  • Tap into the Facebook Graph API to download status updates and number of likes, comments and shares for 100 charities.

This is just a sample. The point is that you can use a programming language like Python to get just about any data from the Web. When the website or social media platform makes available an API (application programming interface), accessing the data is easy. Twitter is fantastic for this very reason. In other cases — including most websites — you will have to scrape the data through creative use of programming. Either way, you can gain access to valuable data.

There’s no need to be an expert to obtain real-world benefits from programming. I started learning Python four years ago (I now consider myself an intermediate-level programmer) and gained substantive benefits right from the start.

Data Manipulation

Budding researchers often seem to underestimate how much time they will spend manipulating, reshaping, and processing their data. Python excels at this kind of data munging. I have recently used Python code to:

  • Loop over hundreds of thousands of tweets and modify characters, convert date formats, etc.
  • Identify and delete duplicate entries in an SQL database.
  • Loop over 74 nonprofit organizations’ Twitter friend-follower lists to create a 74 x 74 friendship network.
  • Read in and write text and CSV data.
  • Run countless grouping, merging, and aggregation operations.
  • Automatically count the number of “negative” words in thousands of online donation appeals.
  • Loop over hundreds of thousands of tweets to create an edge list for a retweet network.
  • Compute word counts for a word-document matrix from thousands of crowdfunding appeals.
  • Create text files combining all of an organization’s tweets for use in creating word clouds.
  • Download images included in a set of tweets.
  • Merge text files.
  • Count the number of Facebook statuses per organization.
  • Loop over hundreds of thousands of rows of tweets in an SQLite database and create additional variables for future analysis.
  • Deal with missing data.
  • Create dummy variables.
  • Find the oldest entry for each organization in a Twitter database.
  • Use pandas (Python Data Analysis Library) to aggregate Twitter data to the daily, weekly, and monthly level.
  • Create a text file of all hashtags in a Twitter database.
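
To give a concrete flavor of this kind of munging, here is a small pandas sketch that aggregates tweets to the daily level (the file name and column name are hypothetical):

    import pandas as pd

    # Hypothetical CSV of tweets with a 'created_at' timestamp column.
    df = pd.read_csv('tweets.csv', parse_dates=['created_at'])

    # Count the number of tweets per calendar day.
    daily_counts = df.set_index('created_at').resample('D').size()
    print(daily_counts.head())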

Data Visualization and Analysis

With the proliferation of scientific computing modules such as pandas, statsmodels, and scikit-learn, Python’s data analysis capabilities have become much more powerful over the past few years. With such tools Python can now compete in many areas with dedicated statistical packages such as Stata, which I have traditionally used for most of my data analysis and visualization. Lately I’m doing more and more of this work directly in Python. Here are some of the analyses I have run recently using Python:

  • Implement a naive Bayesian classifier to classify the sentiment in hundreds of thousands of tweets.
  • Linguistic analysis of donation appeals and tweets using Python’s Natural Language Toolkit (NLTK).
  • Create plots of number of tweets, retweets, and public reply messages per day, week, and month.
  • Run descriptive statistics and multiple regressions.
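
As an illustration of the last item, here is a minimal statsmodels sketch of a multiple regression (the data file and variable names are hypothetical):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical dataset with one row per organization.
    df = pd.read_csv('org_data.csv')

    # Regress retweet counts on follower count and tweeting frequency.
    model = smf.ols('retweets ~ followers + tweets_per_day', data=df).fit()
    print(model.summary())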

Summary

Learning a programming language is a challenge. Of that there is little doubt. Yet the payoff in improved productivity alone can be substantial. Add to that the powerful analytical and data visualization capabilities that open up to the researcher who is skilled in a programming language. Lastly, leaving aside the buzzword “Big Data,” programming opens up a world of new data found on websites, social media platforms, and online data repositories. I would thus go so far as to say that any researcher interested in social media is doing themselves a great disservice by not learning some programming. For this very reason, one of my goals on this site is to provide guidance to those who are interested in getting up and running on Python for conducting academic and social media research.




How Organizations Use Social Media: Engaging the Public

The research I’ve done on organizations’ use of social media suggests there are three main types of messages that organizations send on social media: informational, community-building, and “action” (promotional & mobilizational) messages.

Each type constitutes a different way of engaging with the intended audience:

  • Informational messages serve to inform — about the organization’s activities or anything of interest to the organization’s audience. One-way communication from organization to public. The audience is in the role of learner.
  • Community-building messages serve to build a relationship with the audience through engaging in dialogue or making a network connection. Two-way communication. Audience is in the role of discussant or connector.
  • Promotional & mobilizational messages serve to ask the audience to do something for the organization — attend an event, make a donation, engage in a protest, volunteer, or serve as an advocate, etc. One-way mobilizational communication. Audience is in the role of actor.

This framework originated in a small “Cybermetrics” graduate seminar I taught several years ago that involved inductive analyses of nonprofit organizations’ messages on Twitter (working with one PhD student, Kristen Lovejoy) and Facebook (working with another PhD student, I-hsuan Chiu). This collaborative work resulted in two publications that laid out the basic framework (Lovejoy & Saxton, 2012; Saxton, Guo, Chiu, & Feng, 2011).

Why was this framework innovative or important? Public relations theory had a “relational turn” in the late 1990s, when the focus shifted from an emphasis on strategic one-way communications to building relationships (Broom, Casey, & Ritchey, 1997; Hon & Grunig, 1999; Kent & Taylor, 1998, 2002; Ledingham, 2003; Ledingham & Bruning, 1998). These studies were highly influential and have helped re-shape the field of public relations to this day. Around the same time they were published, new media began to take off. The effect was that public relations and communication scholars began to focus on the ways organizations were employing relationship-building and dialogic strategies in their new media efforts, contrasting these co-creational and dialogic efforts with one-way “informational” communication. In brief, by the time I started this research there was a substantial body of work on the informational and community-building efforts of organizations on new media.

Yet two key things were missing. One, scholars had yet to examine and code the key tool used by organizations on social media — the actual messages, the tweets and Facebook statuses the organizations were sending. Prior social media studies had looked at static profiles and the like. Two, in focusing on informational vs. dialogic communication, scholars had not recognized the considerable mobilizational element of organizations’ social media messages. Our study helped build on prior research and fill both of these gaps. Our inductive study zeroed in on the messages themselves and revealed the substantial use of tweets as a “call to action” for the organizations’ constituents, whether this was a call for volunteers, for donations, for social action, for retweeting a message, for attending an event or, indeed, for anything where the organization asked its constituents to “do something” for the organization. We labeled these tweets “promotional and mobilizational” messages or, for short, action messages.

I think this “I-C-A” (information-community-action) framework is a useful way of examining organizations’ messages, and have continued to use it in my research on nonprofit organizations, including studies of advocacy organizations (Guo & Saxton, 2014), of the determinants of social media use (Nah & Saxton, 2013), and of the effectiveness of organizational messages (Waters & Saxton, 2014).

I am also honored that the framework is proving useful to scholars working in other fields, including the health field (Thackeray, Neiger, Burton, & Thackeray, 2013) and political communication (Xu, Sang, Blasiola, & Park, 2014).

If you’re a social media manager and are wondering about the practical significance of this research, it is important to understand the differences between these different messages, and to have an appropriate mix of each type. Informational, mobilizational, and community-building messages each have a different intended audience orientation that should be tailored to the needs of both the audience and the organization. Don’t rely only on the ‘megaphone’ (informational messages), and don’t ‘mobilize’ (action messages) too often. Most effective will be organizations that actively seek to build relationships with their target audience members. Ultimately, the appropriate mix will depend heavily on the organization’s social media strategy — and if you don’t have one, you should.

I’ve created an infographic that shows the differences:
