Republicans vs. Democrats: A Structural Topic Model of Primary Presidential Debates

If you had told me a year or two ago that I would one day learn how to estimate a structural topic model (STM) -- an unsupervised, machine-learning-based form of text analysis that incorporates “metadata” (i.e., a matrix of covariates) in the estimation of a generalized linear model of topical prevalence and content -- I would have either laughed until I cried or stared blankly at you, wondering whether one of us had just had a stroke, because I wouldn't have understood a word of it. And yet, here I am, taking my first steps into the wild world of text analysis.

Unlike more traditional quantitative methods, topic models allow researchers to conduct statistical analysis on textual data -- open-ended survey questions, newspaper articles, blogs, transcripts, etc. -- where "topics" (i.e., sets of highly associated words) are identified not by a person but by an algorithm. This makes it possible to analyze large bodies of text much more quickly than traditional hand coding allows.

Topic models have been growing in popularity among statisticians and computer scientists, but they are not a new development. As computing power has improved, estimating them has become much more efficient and less time consuming, which is probably why they've become more popular. The structural topic model (STM), however, is a recent development. Unlike most earlier approaches to text analysis, the STM allows users to estimate a topic model using document covariates. For example, if you wanted to analyze political blogs and estimate how the prevalence and/or content of topics in those blogs varies with whether the blog's author was conservative or liberal, an STM would allow you to do so.

A Structural Topic Model of Primary and General Debate Transcripts:  Why Not?

Earlier this summer I received a research assistantship, at the beginning of which I was given a fairly simple assignment: look at presidential debates dating back to 2000 and develop a research question.

I was given a lot of freedom to approach this project as I saw fit, and as I was researching methods for studying textual data I came across the STM. I decided to give it a try, because, why not?

If it didn't work, I could always use another technique. But if it did work, I realized I'd potentially be doing cutting-edge research. As far as I know, no one has used an STM to analyze presidential debates. And for that matter, few social scientists in general have made use of the STM, given that the technique is so new. So, after weeks of headaches and a lot of trial and error, I managed to teach myself how to use an STM with R (a free "language and environment for statistical computing"). The results so far have been pretty cool.


The Model

For the purposes of this post, I'm just going to display the results I obtained by analyzing transcripts of primary debates from 2000, 2004, 2008, 2012, and 2016.

While the STM does most of the work for the user, the model requires a user-specified number of topics to estimate. There really is no "right" or "wrong" number of topics to select, but it takes some trial and error to settle on a number that gives interpretable results. Thus, topic models require both machine and human being working in tandem to make sense of the text, which is kind of cool. You have a computer algorithm and a human's intuition working together...but I digress.

I finally settled on an STM with 10 user-selected topics where the prevalence of topics per debate was a function of the election year and whether the debate was affiliated with either the Republican or Democratic Party (party ID), and where the content of each topic was a function of party ID.

Results

The STM was estimated using 93 total primary debates with a combined 13,132-word vocabulary, where each word was "stemmed" (i.e., reduced to its root form, so that variants of the same word are counted together).
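To make the stemming and vocabulary step concrete, here is a minimal Python sketch. It is not the R code used for the analysis in this post: the suffix list, the function names, and the two sample sentences are all assumptions made up for illustration, and the toy stemmer is far cruder than the stemmers real text-analysis packages use.

```python
from collections import Counter
import re

# Toy suffix list: an assumption for illustration, much cruder than a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Return the sorted list of unique stems and the per-document stem counts."""
    counts = [Counter(toy_stem(tok) for tok in tokenize(doc)) for doc in documents]
    vocab = sorted(set().union(*counts))
    return vocab, counts

# Made-up snippets, not real debate transcripts.
docs = [
    "The debates were heated.",
    "Candidates debated jobs while debating trade.",
]
vocab, counts = build_vocabulary(docs)
```

In this toy example "debates", "debated", and "debating" all collapse to the same stem, so they count as one vocabulary entry. The per-document counts are the kind of input a topic model works from.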

The following figures display the top 20 word stems associated with each of the 10 topics identified by the STM:

And here is a figure of my interpretation of each of the topics:


As you can see based on my heuristic interpretation of each of the topics, each topic itself is clearly not a monolithic issue in the way most people would think. Rather, each topic encompasses a cluster of issues and individual debaters and moderators that are all strongly associated with one another. If you think about the average debate you've watched on TV, this way of understanding topics as issue clusters makes a lot of sense. Politicians love to string issues and policies together as they discuss them. And the STM allows us to see, I would argue, more clearly how these issues are correlated than merely watching debates or reading debate transcripts would allow.

What's also pretty cool about using an STM to analyze something like debate transcripts is that you can use the results from the STM to estimate regression models where the proportion of each debate associated with a given topic is a function of any or all of the covariates included in the estimation of the STM itself. For example, we can estimate the effect of party ID on the expected proportion of a debate devoted to a given topic (that is, how much more or less of a Republican primary debate, as opposed to a Democratic one, is associated with that topic) while controlling for the effect of the election year:
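The idea of regressing a topic's proportion on party ID while controlling for election year can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it uses plain least squares on made-up proportions, whereas the estimates in this post come from the STM itself and also account for uncertainty in the topic proportions. The function names and the data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x

def ols(rows, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Made-up debates: (Republican indicator, elections since 2000, topic proportion).
debates = [
    (0, 0, 0.05), (1, 0, 0.15),
    (0, 2, 0.07), (1, 2, 0.17),
    (0, 4, 0.09), (1, 4, 0.19),
]
X = [[1.0, rep, step] for rep, step, _ in debates]  # intercept, party, year
y = [prop for _, _, prop in debates]
intercept, party_effect, year_effect = ols(X, y)
```

Here `party_effect` is the difference in the topic's expected proportion between Republican and Democratic debates held in the same election year, which is what "controlling for the election year" means in practice.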


In the above figure, you can see the expected proportion of a debate associated with a given topic depending on whether the debate was a Republican or Democratic primary debate (95% confidence intervals shown). Topics 1, 4, and 7 are clearly a more prevalent feature of Democratic debates, while the rest are more likely to appear in Republican debates. Topic 9 is especially likely to appear in a Republican debate. And what is topic 9 about? Jobs, manufacturing, and making a deal with China.

I wanted to take some time to highlight this topic because, as most would agree, this cluster of issues has become especially prominent in recent years.

And the results from my analysis clearly bear this out:

As you can see in the above figure, not only has topic 9 been more associated with Republican debates (red = Republican; blue = Democratic), it has become more prevalent over time, which got me thinking: has the prevalence of this topic (jobs, manufacturing, and making a deal with China) varied with public perceptions about China's economic strength? Out of curiosity, I decided to find out.

Below is a figure showing the percentage of Americans who, according to Gallup polls, viewed China as the leading world economy for each of the election years included in my analysis (2004 values were unavailable so I took the mean of the 2000 and 2008 values as a proxy for public perception in 2004).
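Taking the mean of the 2000 and 2008 values is the same as linearly interpolating at the midpoint between those two years. A minimal Python sketch, with hypothetical percentages standing in for the actual Gallup figures (which are not reproduced here):

```python
def interpolate(x0, y0, x1, y1, x):
    """Linearly interpolate the value at x between points (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Hypothetical percentages, NOT the real Gallup values.
share_2000, share_2008 = 10.0, 40.0

proxy_2004 = interpolate(2000, share_2000, 2008, share_2008, 2004)
simple_mean = (share_2000 + share_2008) / 2
```

Because 2004 sits exactly halfway between 2000 and 2008, `proxy_2004` and `simple_mean` come out identical. The proxy therefore assumes that opinion changed at a steady rate over those eight years.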

There clearly has been an upward trend, and this shift, though it does not do so perfectly, coincides with the rise in topic 9's prevalence in primary debates:


While the high degree of overlap of the 95% confidence intervals suggests that the relationship between Gallup's public opinion poll and topic 9's prevalence is not statistically significant by conventionally accepted standards, there is a clear trend here. (However, I should warn that the % of Americans who view China as the world’s leading economy was not a covariate used in STM estimation, which means caution should be taken in interpreting these results because they may fail to meet the "assumptions of the method of composition." When a regression model uses the proportion of a document associated with an STM-estimated topic as its response variable, it's recommended that all regressors in the model be covariates that were included in the STM’s estimation. Following that recommendation makes a model more plausible, but failing to follow it does not mean that the above results are entirely implausible.)


Why Care?

At this point you might be saying to yourself, "this is pretty neat stuff, but why is this important?"

It's important for two reasons:
  1. Ease of use and replicability:  Analyzing debate transcripts by hand is time consuming and fraught with the possibility of human error, and it will only become more so as more debates take place. The STM does not completely remove the human element from textual analysis, but it does incorporate a level of consistency, and thus provides for a high degree of replicability, that is not usually possible with more traditional methods. In other words, anyone else could repeat my procedures exactly and obtain results comparable to my own. Granted, their heuristic interpretation of the results might be different from mine, but the results upon which that interpretation would be made would be the same as mine. This allows researchers to conduct research, test hypotheses, and build upon one another's work as a part of a potentially larger research program.
  2. It opens up a whole new world for statistical analysis and research:  You can apply an STM to a variety of text-based data: open-ended survey questions, tweets, online discussions, pamphlets, Facebook posts, blogs, you name it. Many such sources of text would otherwise be difficult to quantify and thus to analyze using traditional statistical methods. The STM allows us to measure what would not have been possible to measure previously.
In short, the STM makes hard work easier, and it makes what was previously hard to quantify quantifiable. It brings new meaning to the phrase, "work smarter, not harder."

'Till next time, cheers!

-------------

For more details about STMs and how to use them with R, see the following site:
