Academics and other researchers must choose which research skills to invest in, and most social scientists do not include computer programming in their skill set. As a strong proponent of the value of learning a programming language, I will lay out how doing so has proven useful for me. A budding programmer could choose from a number of good options, including Perl, C++, Java, PHP, and others, but Python has a reputation as one of the most accessible and intuitive. I obviously like it.
No matter your choice of language, there are a variety of ways that learning programming will be useful for social scientists and other data scientists. The most important areas are data gathering, data manipulation, and data visualization and analysis.
Data Gathering
When I started learning Python four years ago, I kept a catalogue of the various scripts I wrote. Looking back over that catalogue, I can see I have personally written Python code to gather the following data:
- Download lender and borrower information for thousands of donation transactions on kiva.org.
- Download tweets from a list of 100 large nonprofit organizations.
- Download Twitter profile information from 150 advocacy nonprofits.
- Scrape the ‘Walls’ from 65 organizations’ Facebook accounts.
- Download @messages sent to 38 community foundations.
- Traverse and download html files for thousands of webpages on large accounting firms’ websites.
- Scrape data from 1,000s of organizational profiles on a charity rating site.
- Scrape data from several thousand organizations raising money on the crowdfunding site Indiegogo.
- Download hundreds of YouTube videos used in Indiegogo fundraising campaigns.
- Gather data available through the InfoChimps API.
- Scrape pinning and re-pinning data from health care organizations’ Pinterest accounts.
- Tap into the Facebook Graph API to download status updates and number of likes, comments and shares for 100 charities.
This is just a sample. The point is that you can use a programming language like Python to gather just about any data from the Web. When a website or social media platform makes an API (application programming interface) available, accessing the data is easy; Twitter is fantastic for this very reason. In other cases, including most websites, you will have to scrape the data through creative programming. Either way, you can gain access to valuable data.
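To give a flavor of the scraping side, here is a minimal sketch that pulls organization names out of a profile listing using only Python’s standard library. The HTML snippet, the `org-name` class, and the organization names are all invented for illustration; in practice you would fetch live pages with `urllib` or `requests`, and most scrapers reach for a richer parser such as BeautifulSoup.

```python
# Minimal scraping sketch using only the standard library.
# The HTML below and the "org-name" class are hypothetical stand-ins
# for a charity-rating site's profile listing.
from html.parser import HTMLParser

class OrgProfileParser(HTMLParser):
    """Collect the text found inside <span class="org-name"> tags."""
    def __init__(self):
        super().__init__()
        self._in_name = False
        self.org_names = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples.
        if tag == "span" and ("class", "org-name") in attrs:
            self._in_name = True

    def handle_data(self, data):
        if self._in_name:
            self.org_names.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_name = False

sample_html = """
<div class="profile"><span class="org-name">Food Bank Alliance</span></div>
<div class="profile"><span class="org-name">Clean Water Fund</span></div>
"""

parser = OrgProfileParser()
parser.feed(sample_html)
print(parser.org_names)
```

The same pattern scales up: loop over a list of profile URLs, feed each page to the parser, and write the accumulated rows to a CSV file.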
There’s no need to be an expert to obtain real-world benefits from programming. Though I now consider myself only an intermediate-level programmer, I gained substantive benefits right from the start.
Data Manipulation
Budding researchers often seem to underestimate how much time they will spend manipulating, reshaping, and processing their data. Python excels at this kind of data munging. I have recently used Python code to:
- Loop over hundreds of thousands of tweets and modify characters, convert date formats, etc.
- Identify and delete duplicate entries in an SQL database.
- Loop over 74 nonprofit organizations’ Twitter friend-follower lists to create a 74 x 74 friendship network.
- Read in and write text and CSV data.
- Run countless grouping, merging, and aggregation functions.
- Automatically count the number of “negative” words in thousands of online donation appeals.
- Loop over hundreds of thousands of tweets to create an edge list for a retweet network.
- Compute word counts for a word-document matrix from thousands of crowdfunding appeals.
- Create text files combining all of an organization’s tweets for use in creating word clouds.
- Download images included in a set of tweets.
- Merge text files.
- Count the number of Facebook statuses per organization.
- Loop over hundreds of thousands of rows of tweets in an SQLite database and create additional variables for future analysis.
- Deal with missing data.
- Create dummy variables.
- Find the oldest entry for each organization in a Twitter database.
- Use pandas (Python Data Analysis Library) to aggregate Twitter data to the daily, weekly, and monthly level.
- Create a text file of all hashtags in a Twitter database.
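To illustrate one of these munging tasks, here is a toy version of aggregating tweets to the daily level using only the standard library. The tweet records and timestamp format are invented for illustration; in real work, pandas handles this far more conveniently with `groupby` or `resample` on a proper datetime index.

```python
# Toy sketch: count tweets per calendar day.
# The records below are invented; real tweet data would come from a
# database or API download, with pandas doing the heavy lifting.
from collections import Counter
from datetime import datetime

tweets = [
    {"text": "Thanks to our donors!", "created_at": "2013-05-01 09:15:00"},
    {"text": "New campaign launched", "created_at": "2013-05-01 17:40:00"},
    {"text": "Volunteer day recap",   "created_at": "2013-05-02 08:05:00"},
]

def tweets_per_day(records):
    """Parse each timestamp string and count tweets per calendar day."""
    days = (datetime.strptime(r["created_at"], "%Y-%m-%d %H:%M:%S").date()
            for r in records)
    return Counter(days)

daily = tweets_per_day(tweets)
print(daily)
```

The same approach extends to weekly or monthly aggregation by bucketing on ISO week numbers or on `(year, month)` pairs instead of dates.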
Data Visualization and Analysis
With the proliferation of scientific computing modules such as pandas, statsmodels, and scikit-learn, Python’s data analysis capabilities have become much more powerful over the past few years. With such tools Python can now compete in many areas with dedicated statistical programs such as R or Stata, which I have traditionally used for most of my data analysis and visualization. Lately I’m doing more and more of this work directly in Python. Here are some of the analyses I have run recently using Python:
- Implement a naive Bayesian classifier to classify the sentiment in hundreds of thousands of tweets.
- Run linguistic analyses of donation appeals and tweets using Python’s Natural Language Toolkit (NLTK).
- Create plots of number of tweets, retweets, and public reply messages per day, week, and month.
- Run descriptive statistics and multiple regressions.
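As a sketch of the classification idea, here is a from-scratch naive Bayes sentiment classifier with Laplace smoothing. The four training tweets are invented for illustration, and a real analysis would train NLTK’s `NaiveBayesClassifier` (or a scikit-learn model) on a properly labeled corpus of thousands of tweets.

```python
# Toy naive Bayes sentiment classifier built from scratch.
# The training tweets are invented; this only shows the mechanics.
import math
from collections import Counter, defaultdict

train = [
    ("thank you for the amazing support", "pos"),
    ("we love our generous donors", "pos"),
    ("terrible news about funding cuts", "neg"),
    ("sad loss for the community", "neg"),
]

# Tally word frequencies per class and class priors.
word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Return the class with the highest log-probability,
    using add-one (Laplace) smoothing for unseen words."""
    n_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)  # prior
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) /
                              (total + len(vocab)))     # likelihood
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("thank you donors"))
```

Log-probabilities are used instead of raw products so that long documents do not underflow to zero, which is the standard trick in any naive Bayes implementation.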
Summary
Learning a programming language is a challenge. Of that there is little doubt. Yet the payoff in improved productivity alone can be substantial. Add to that the powerful analytical and data visualization capabilities that open up to the researcher who is skilled in a programming language. Lastly, leaving aside the buzzword “Big Data,” programming opens up a world of new data found on websites, social media platforms, and online data repositories. I would thus go so far as to say that any researcher interested in social media is doing themselves a great disservice by not learning some programming. For this very reason, one of my goals on this site is to provide guidance to those who are interested in getting up and running on Python for conducting academic and social media research.