Does Twitter Matter?


Twitter is not the Gutenberg Press. The ‘Big Data’ revolution is over-hyped. Nevertheless, Twitter is significant in a number of ways:

  • For identifying trends.
  • For rapid, near real-time dissemination of news.
  • It has been used to track the progression of the flu and other infectious diseases.
  • It has played a mobilizational role in the Arab Spring and other social movement activities.
  • For on-the-ground reporting of news and events.
  • It allows you to decentralize research; what Nigel Cameron calls mutual curation and others call social curation.
  • For looking into the global cocktail party that is Twitter.
  • Allows one to take the pulse of the community on almost any given topic.
  • Twitter is the Big Data source.
  • It constitutes a coordination and communication tool for post-disaster mobilizations.
  • Twitter facilitates the rapid diffusion of ideas, rumors, opinion, sentiment, and news.
  • For professionals and organizations alike, it facilitates networkingrelationship-building, and exposure.
  • Twitter is proving to be a powerful dialogic tool — for initiating and engaging in conversations.
  • Unlike other social media (e.g., Facebook), Twitter has a largely open model, allowing anyone to follow anyone else.
  • Social chatter has become a powerful tool for
    • Hedge fund managers listen in on social media conversations in making their decisions.
    • Tracking and identifying terrorists and extremists.
  • Facilitates the leveraging of what Granovetter (1973) calls weak ties.
  • Can be a force for good (e.g., Twestivals).

In short, Twitter is not just for sharing pictures of your lunch. In addition to all the silliness, Twitter has come to be the world’s premier message network. It These messages are used in a wide variety of settings and for a broad range of purposes. And researchers are able to listen in — a boon to anyone interested in messages, conversations, networks, information, mobilization, diffusion, or any number of social science phenomena.

Establishing a Presence: Advice for PhD Students

Being Director of Graduate Studies gives me plenty of time to reflect on what I’d like students to get out of graduate education. For budding academics, you have all likely heard (countless times!) that the ultimate “deliverable” is high-quality journal articles. Of this there is little doubt — at least in the fields I’m familiar with (social sciences and business). Beyond that, it is important to establish a presence in the field. This can involve such traditional activities as reviewing journal articles, presenting and organizing at conferences, conducting guest seminars, and being involved in sub-field specialty groups. With the spread of new and social media, there is also a new way: establishing a digital presence. If you put in the work during your PhD studies, over the course of your graduate career you will become one of the world’s experts on some area of research, and I would encourage all PhD students to explore the ways that you could make this presence known to your relevant academic community. Increasingly, knowledge and ideas are being shared online — and if you are not actively involved in influencing these knowledge networks you are missing out.

Increasingly, knowledge and ideas are being shared online — and if you are not actively involved in influencing these knowledge networks you are missing out.

I am not talking about just having a LinkedIn or account. Your ultimate goal in establishing a digital presence will be to add value to the conversations that are already happening online. This can be done through microblogging on Twitter, Tumblr or LinkedIn, through a conventional long-story blogging platform, or via original video or slide content such as Slideshare. Here are several paradigms for you to consider as you mull over the digital presence that best fits your interests and talents:

  • The provocateur — push, prod, and provoke the academy in a direction you feel strongly about.
  • The curator — become the source others turn to for by aggregating and re-framing relevant content.
  • The teacher — teach others how to do what you know.
  • The advice-giver — advice is cheap, but you may have something useful to add.
  • The marketer — promote your work, but in a way that is not merely self-serving. Rather, show how your work builds on and enhances existing research. Contribute to the discussion.
  • The practice whisperer — translate the findings of your research in a way that practitioners will find useful. Similarly, you could seek to be a public intellectual, as called for by Nicholas Kristof.

These are just some of the ways you can make a presence in the field. Play around with it and find your identity. One of our graduate students, Wayne Xu, has done an excellent job in using a new blog along with Slideshare to take a teaching role. One of UB’s long-ago graduates, Han Woo Park, has similarly become one of the most successful posters on Slideshare. Personally, my blog mixes the role of provocateur, teacher, advice-giver, marketer, and translator (I leave the curating to others).

Establishing a digital presence is not a replacement for writing strong journal articles, but it is one of the ways you can make your ultimate impact more powerful. In the end, whether you decide to create a web presence in addition to the traditional route or not, be sure you infuse everything you do with quality. The academic world seems huge but it isn’t. Word gets around. Lastly, talk with your advisor — he or she is there to help you set a long-term strategy for not only publishing high-quality journal articles but also for making your presence in the field known.

Slideshare Tutorial

One of the PhD students I work with, Weiai Xu, has put together an informative slideshow for how to download Twitter data:

Curiosity Bits Tutorial: Mining Twitter User Profile on Python V2 from Weiai Wayne Xu

For more, check out his website.

Why I Use Python for Academic Research


Academics and other researchers have to choose from a variety of research skills. Most social scientists do not add computer programming into their skill set. As a strong proponent of the value of learning a programming language, I will lay out how this has proven to be useful for me. A budding programmer could choose from a number of good options — including perl, C++, Java, PHP, or others — but Python has a reputation as being one of the most accessible and intuitive. I obviously like it.

No matter your choice of language, there are variety of ways learning programming will be useful for social scientists and other data scientists. The most important areas are data gathering, data manipulation, and data visualization and analysis.

Data Gathering

When I started learning Python four years ago, I kept a catalogue of the various scripts I wrote. Going over these scripts, I have personally written Python code to gather the following data:

  • Download lender and borrower information for thousands of donation transactions on
  • Download tweets from a list of 100 large nonprofit organizations.
  • Download Twitter profile information from a 150 advocacy nonprofits.
  • Scrape the ‘Walls’ from 65 organizations’ Facebook accounts.
  • Download @messages sent to 38 community foundations.
  • Traverse and download html files for thousands of webpages on large accounting firms’ websites.
  • Scrape data from 1,000s of organizational profiles on a charity rating site.
  • Scrape data from several thousand organizations raising money on the crowdfunding site Indiegogo.
  • Download hundreds of YouTube videos used in Indiegogo fundraising campaigns.
  • Gather data available through the InfoChimps API.
  • Scrape pinning and re-pinning data from health care organizations’ Pinterest accounts.
  • Tap into the Facebook Graph API to download status updates and number of likes, comments and shares for 100 charities.

This is just a sample. The point is that you can use a programming language like Python to get just about any data from the Web. When the website or social media platform makes available an API (application programming interface), accessing the data is easy. Twitter is fantastic for this very reason. In other cases — including most websites — you will have to scrape the data through creative use of programming. Either way, you can gain access to valuable data.

There’s no need to be an expert to obtain real-world benefits from programming. I started learning Python four years ago (I now consider myself an intermediate-level programmer) and gained substantive benefits right from the start.

Data Manipulation

Budding researchers often seem to under-estimate how much time they will be spending on manipulating, reshaping, and processing their data. Python excels at data munging. I have recently used Python code to

  • Loop over hundreds of thousands of tweets and modify characters, convert date formats, etc.
  • Identify and delete duplicate entries in an SQL database.
  • Loop over 74 nonprofit organizations’ Twitter friend-follower lists to create a 74 x 74 friendship network.
  • Read in and write text and CSV data.
  • Countless grouping, merging, and aggregation functions.
  • Automatically count the number of “negative” words in thousands of online donation appeals.
  • Loop over hundreds of thousands of tweets to create an edge list for a retweet network.
  • Compute word counts for a word-document matrix from thousands of crowdfunding appeals.
  • Create text files combining all of an organizations’ tweets for use in creating word clouds.
  • Download images included in a set of tweets.
  • Merging text files.
  • Count number of Facebook statuses per organization.
  • Loop over hundreds of thousands of rows of tweets in an SQLite database and create additional variables for future analysis.
  • Dealing with missing data.
  • Creating dummy variables.
  • Find the oldest entry for each organization in a Twitter database.
  • Use pandas (Python Data Analysis Library) to aggregate Twitter data to the daily, weekly, and monthly level.
  • Create a text file of all hashtags in a Twitter database.

Data Visualization and Analysis

With the proliferation of scientific computing modules such as pandas and statsmodels and scikit-learn, Python’s data analysis capabilities have gotten much more powerful over the past few years. With such tools Python can now compete in many areas with devoted statistical programs such as or Stata, which I have traditionally used for most of my data analysis and visualization. Lately I’m doing more and more of this work directly in Python. Here are some of the analyses I have run recently using Python:

  • Implement a naive Bayesian classifier to classify the sentiment in hundreds of thousands of tweets.
  • Linguistic analysis of donation appeals and tweets using Python’s Natural Language Tool Kit.
  • Create plots of number of tweets, retweets, and public reply messages per day, week, and month.
  • Run descriptive statistics and multiple regressions.


Learning a programming language is a challenge. Of that there is little doubt. Yet the payoff in improved productivity alone can be substantial. Add to that the powerful analytical and data visualization capabilities that open up to the researcher who is skilled in a programming language. Lastly, leaving aside the buzzword “Big Data,” programming opens up a world of new data found on websites, social media platforms, and online data repositories. I would thus go so far as to say that any researcher interested in social media is doing themselves a great disservice by not learning some programming. For this very reason, one of my goals on this site is to provide guidance to those who are interested in getting up and running on Python for conducting academic and social media research.

How Organizations Use Social Media: Engaging the Public

The research I’ve done on organizations’ use of social media suggests there are three main types of messages that organizations send on social media: informational, community-building, and “action” (promotional & mobilizational) messages.

Each type constitutes a different way of engaging with the intended audience:

  • Informational messages serve to inform — about the organization’s activities or anything of interest to the organization’s audience. One-way communication from organization to public. The audience is in the role of learner.
  • Community-building messages serve to build a relationship with the audience through engaging in dialogue or making a network connection. Two-way communication. Audience is in the role of discussant or connector.
  • Promotional & mobilizational messages serve to ask the audience to do something for the organization — attend an event, make a donation, engage in a protest, volunteer, or serve as an advocate, etc. One-way mobilizational communication. Audience is in the role of actor.

[bibshow file=I-C-A.bib, format=apa template=av-bibtex-modified]

This framework originated in a small “Cybermetrics” graduate seminar I taught several years ago that involved inductive analyses of nonprofit organizations’ messages on Twitter (working with one PhD student, Kristen Lovejoy), and Facebook (working with another PhD student, I-hsuan Chiu). This collaborative work resulted in two publications that layed out the basic framework (Lovejoy & Saxton, 2012; Saxton, Guo, Chiu, & Feng, 2011).[bibcite key=Lovejoy2012][bibcite key=Saxton2011b]

Why was this framework innovative or important? Public relations theory had a “relational turn” in the late 1990s, where the focused shifted from an emphasis on strategic one-way communications to building relationships (Broom, Casey, & Ritchey, 1997; Hon & Grunig, 1999; Kent & Taylor, 1998, 2002; Ledingham, 2003; Ledingham & Bruning, 1998).[bibcite key=Broom1997][bibcite key=Hon1999][bibcite key=Kent1998][bibcite key=Kent2002][bibcite key=Ledingham2003][bibcite key=Ledingham1998] These studies were highly influential and helped re-shape the field of public relations to this date. Around the same time they were published, new media began to take off. The effect was that public relations and communication scholars began to focus on ways organizations were employing relationship-building and dialogic strategies in their new media efforts, contrasting these co-creational and dialogic efforts with one-way “informational” communication. In brief, by the time I started this research there was a substantial body of work on the informational and community-building efforts of organizations on new media.

Yet two key things were missing. One, scholars had yet to examine and code the key tool used by organizations on social media — the actual messages, the tweets and Facebook statuses they organizations were sending. Prior social media studies had looked at static profiles and the like. Two, in focusing on informational vs. dialogic communication, scholars had not recognized the considerable mobilizational element of organizations’ social media messages. Our study helped build on prior research and fill in both of these gaps. Our inductive study zeroed in on the messages and revealed the substantial use of tweets as a “call to action” for the organizations’ constituents, whether this was a call for volunteers, for donations, for social action, for retweeting a message, for attending an event or, indeed, for anything where the organization asked its constituents to “do something” for the organization. We labeled these tweets “promotional and mobilizational” messages or, for short, action messages.

I think this “I-C-A” (information-community-action) framework is a useful way of examining organizations’ messages, and have continued to use it in my research on nonprofit organizations, including studies of advocacy organizations (Guo & Saxton, 2014),[bibcite key=Guo2014] of the determinants of social media use (Nah & Saxton, 2013),[bibcite key=Nah2013] and of the effectiveness of organizational messages (Waters & Saxton, 2014).[bibcite key=Saxton2014b]

I am also honored that the framework is also finding itself useful by scholars working in other fields, including those working in the health field (Thackeray, Neiger, Burton, & Thackeray, 2013)[bibcite key=Thackeray2013] and political communication (Xu, Sang, Blasiola, & Park, 2014).[bibcite key=Xu2014]

If you’re a social media manager and are wondering about the practical significance of this research, it is important to understand the differences between these different messages, and to have an appropriate mix of each type. Informational, mobilizational, and community-building messages each have a different intended audience orientation that should be tailored to the needs of both the audience and the organization. Don’t rely only on the ‘megaphone’ (informational messages), and don’t ‘mobilize’ (action messages) too often. Most effective will be organizations that actively seek to build relationships with their target audience members. Ultimately, the appropriate mix will depend heavily on the organization’s social media strategy — and if you don’t have one, you should.

I’ve created an infographic that shows the differences:


Using Python to Grab Twitter User Data


I often get requests to explain how I obtained the data I used in a particular piece of academic research. I am always happy to share my code along with my data (and frankly, I think academics who are unwilling to share should be forced to take remedial Kindergarten). The problem is, many of those who would like to use the code don’t know where to start. There are too many new steps involved for the process to be accessible. So, I’ll try to walk you through the basic steps here through periodic tutorials.

To start, Python is a great tool for grabbing data from the Web. Generally speaking, you’ll get your data by either accessing an API (Application Programming Interface) or by ‘scraping’ the data off a webpage. The easiest scenario is when a site makes available an API. Twitter is such a site. Accordingly, as an introductory example I’ll walk you through the basic steps of using Python to access the Twitter API, read and manipulate the data returned, and save the output.

In any given project I will run a number of different scripts to grab all of the relevant data. We’ll start with a simple example. This script is designed to grab the  information on a set of Twitter users. First, as stated above, what we’re doing to get the data is tapping into the Twitter API. For our purposes, think of the Twitter API as a set of routines Twitter has set up for allowing us to access specific chunks of data. I use Python for this, given its many benefits, though any programming language will work. If you are really uninterested in programming and have more limited data needs, you can use NodeXL  (if you’re on a Windows machine) or other services for gathering the data. If you do go the Python route, I highly recommend you install Anaconda Python 2.7 — it’s free, it works on Mac and PC, and includes most of the add-on packages necessary for scientific computing. In short, you pick a programming language and learn some of it and then develop code that will extract and process the data for you. Even though you can start with my code as a base, it is still useful to understand the basics, so I highly recommend doing some of the many excellent tutorials now available online for learning how to use and run Python. A great place to start is Codeacademy.

Accessing the Twitter API

Almost all of my Twitter code grabs data from the Twitter API. The first step is to determine which part of the Twitter API you’ll need to access to get the type of data you want — there are different API methods for accessing information on tweets, retweets, users, following relationships, etc. The code we’re using here plugs into the users/lookup part of the Twitter API, which allows for the bulk downloading of Twitter user information. You can see a description of this part of the API here, along with definitions for the variables returned. Here is a list of the most useful of the variables returned by the API for each user (modified descriptions taken from the Twitter website):

created_atThe UTC datetime that the user account was created on Twitter.
descriptionThe user-defined UTF-8 string describing their account.
entitiesEntities which have been parsed out of the url or description fields defined by the user.
favourites_countThe number of tweets this user has favorited in the account's lifetime. British spelling used in the field name for historical reasons.
followers_countThe number of followers this account currently has. We can also get a list of these followers by using different parts of the API.
friends_countThe number of users this account is following (AKA their "followings"). We can also get a list of these friends using other API methods.
idThe integer representation of the unique identifier for this User. This number is greater than 53 bits and some programming languages may have difficulty/silent defects in interpreting it. Using a signed 64 bit integer for storing this identifier is safe. Use id_str for fetching the identifier to stay on the safe side. See Twitter IDs, JSON and Snowflake.
id_strThe string representation of the unique identifier for this User. Implementations should use this rather than the large, possibly un-consumable integer in id.
langThe BCP 47 code for the user's self-declared user interface language.
listed_countThe number of public lists that this user is a member of.
locationThe user-defined location for this account's profile. Not necessarily a location nor parseable.
nameThe name of the user, as they've defined it. Not necessarily a person's name.
screen_nameThe screen name, handle, or alias that this user identifies themselves with. screen_names are unique but subject to change. Use id_str as a user identifier whenever possible. Typically a maximum of 15 characters long, but some historical accounts may exist with longer names.
statuses_countThe number of tweets (including retweets) issued by the user to date.
time_zoneA string describing the Time Zone this user declares themselves within.
urlA URL provided by the user in association with their profile.
withheld_in_countriesWhen present, indicates a textual representation of the two-letter country codes this user is withheld from. See New Withheld Content Fields in API Responses.
withheld_scopeWhen present, indicates whether the content being withheld is the "status" or a "user." See New Withheld Content Fields in API Responses.


Second, beginning in 2013 Twitter made it more difficult to access the API. Now OAuth authentication is needed for almost everything. This means you need to go on Twitter and create an ‘app.’ You won’t actually use the app for anything — you just need the password and authentication code. You can create your app here. For more detailed instructions on creating the app take a look at this presentation.

Third, as a Python ‘wrapper’ around the Twitter API I use Twython. This is a package that is an add-on to Python. You will need to install this as well as simplejson (for parsing the JSON data that is returned by the API). Assuming you installed Anaconda Python, the simplest way is to use pip. On a Mac or Linux machine, you would simply open the Terminal and type pip install Twython and pip install simplejson. 

The above steps can be a bit of a pain depending on your familiarity with UNIX, but you’ll only have to do them once. It may take you a while. But once they’re all set up you won’t need to do it again.

Understanding the Code

At the end of this post I’ll show the entire script. For now, I’ll go over it in sections. The first line in the code is the shebang — you’ll find this in all Python code.


Lines 3 – 10 contain the docstring — also a Python convention. This is a multi-line comment that describes the code. For single-line comments, use the # symbol at the start of the line.


Next we’ll import several Python packages needed to run the code.


In lines 18-22 we will create day, month, and year variables to be used for naming the output file.


Modify the Code

There are two areas you’ll need to modify. First, you’ll need to add your OAuth tokens to lines 26-30.


Second, you’ll need to modify lines 32-35 with the ids from your set of Twitter users. If you don’t have user_ids for these, you can use screen_names and change line 39 to ‘screen_name = ids’


Line 39 is where we actually access the API and grab the data. If you’ve read over the description of users/lookup API, you know that this method allows you to grab user information on up to 100 Twitter IDs with each API call.


Understanding JSON

Now, a key step to this is understanding the data that are returned by the API. As is increasingly common with Web data, this API call returns data in JSON format. Behind the scenes, Python has grabbed this JSON file, which has data on the 32 Twitter users listed above in the variable ids. Each user is an object in the JSON file; objects are delimited by left and right curly braces, as shown here for one of the 32 users:


JSON output can get messy, so it’s useful to bookmark a JSON viewer for formatting JSON output. What you’re seeing above is 38 different variables returned by the API — one for each row — and arranged in key: value (or variable: value) pairs. For instance, the value for the screen_name variable for this user is GPforEducation. Now, we do not always want to use all of these variables, so what we’ll do is pick and label those that are most useful for us.

So, we first initialize the output file, putting in the day/month/year in the file name, which is useful if you’re regularly downloading this user information:


We then create a variable with the names for the variables (columns) we’d like to include in our output file, open the output file, and write the header row:


Recall that in line 39 we grabbed the user information on the 32 users and assigned these data to the variable users. The final block of code in lines 55-90 loops over each of these IDs (each one a different object in the JSON file), creates the relevant variables, and writes a new row of output. Here’s the first few rows:


If you compare this code to the raw JSON output shown earlier, what we’re doing here is creating an empty Python dictionary, which we’ll call ‘r’, to hold our data for each user, creating variables called id and screen_name, and assigning the values held in the entry[‘id’] and entry[‘screen_name’] elements of the JSON output to those two respective variables. This is all placed inside a Python for loop — we could have called ‘entry’ anything so long as we’re consistent.

Now let’s put the whole thing together. To recap, what this entire script does is to loop over each of the Twitter accounts in the ids variable — and for each one it will grab its profile information and add that to a row of the output file (a text file that can be imported into Excel, etc.). The filename given to the output file varies according to the date. Now you can download this script, modify the lines noted above, and be on your way to downloading your own Twitter data!