If you had told me a year or two ago that I would one day learn how to
estimate a structural topic model (STM) -- an unsupervised, machine-learning-based form of text analysis that incorporates "metadata" (i.e., a matrix of covariates) in the estimation of a generalized linear model of topical
prevalence and content -- I would have either laughed until I cried or stared blankly at you, wondering whether you or I had just had a stroke, because I wouldn't have understood a word of the gibberish that had just come out of your mouth. And yet, here I am, taking my first steps into the wild world of text analysis.
Unlike other, more traditional quantitative methods, topic models allow researchers to conduct statistical analysis on textual data -- open-ended survey questions, newspaper articles, blogs, transcripts, etc. -- where "topics" (i.e., sets of highly associated words) are identified not by a person but by an algorithm. This lets researchers analyze large bodies of text far more quickly than traditional hand coding of textual data allows.
While topic models have been growing in popularity among statisticians and computer scientists, they are not a new development. As computing power has improved, however, estimating topic models has become much more efficient and less time consuming, which is probably why they've become more popular. The structural topic model (STM), though, is a recent development. Unlike earlier approaches to topic modeling, the STM allows users to estimate a topic model using document-level covariates. For example, if you wanted to analyze political blogs and estimate the prevalence and/or content of topics in those blogs based on whether each blog's author was conservative or liberal, an STM would allow you to do so.
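To make that concrete, here is a minimal sketch of what such a model looks like using R's stm package. Everything here is hypothetical: the file "blogs.csv", its text and ideology columns, and the choice of 20 topics are illustrative stand-ins, not anything from my actual analysis.

```r
library(stm)

# Hypothetical data: one row per blog post, with the post's text and the
# author's ideology ("conservative" or "liberal").
blogs <- read.csv("blogs.csv", stringsAsFactors = FALSE)

# Standard stm preprocessing: lowercase, strip stopwords/punctuation, stem.
processed <- textProcessor(blogs$text, metadata = blogs)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# An STM where ideology shapes both how much each topic appears
# (prevalence) and which words each topic uses (content).
fit <- stm(out$documents, out$vocab, K = 20,
           prevalence = ~ ideology,
           content = ~ ideology,
           data = out$meta)
```

The prevalence and content formulas are the whole point of the STM: the same covariate can affect how often a topic shows up and how it is talked about.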
A Structural Topic Model of Primary and General Debate Transcripts: Why Not?
Earlier this summer I received a research assistantship, at the beginning of which I was given a fairly simple assignment: look at presidential debates dating back to 2000 and develop a research question. I was given a lot of freedom to approach this project as I saw fit, and as I was researching methods for studying textual data I came across the STM. I decided to give it a try, because, why not?
If it didn't work, I could always use another technique. But, if it did work, I realized I'd be doing, potentially, cutting edge research. No one has ever used an STM to analyze presidential debates. And for that matter, few social scientists in general have made use of the STM given that the technique is so new. So, after weeks of headaches and a lot of trial and error, I managed to teach myself how to use an STM with R (a free "language and environment for statistical computing"). The results so far have been pretty cool.
The Model
For the purposes of this post, I'm just going to display the results I obtained by analyzing transcripts of primary debates from 2000, 2004, 2008, 2012, and 2016.
While the STM does most of the work for the user, the model requires a user-specified number of topics to estimate. There really is no "right" or "wrong" number of topics to select, but it takes some trial and error to settle on a number of topics that gives interpretable results. Thus, topic models require a machine and a human working in tandem to make sense of the STM, which is kind of cool. You have a computer algorithm and a human's intuition working together...but I digress.
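For what it's worth, the stm package includes a helper for exactly this trial-and-error step. The sketch below is illustrative rather than my actual procedure: docs, vocab, and meta stand in for preprocessed debate transcripts, and the candidate values of K are arbitrary.

```r
library(stm)

# searchK() fits the model at several candidate numbers of topics and
# reports diagnostics (held-out likelihood, semantic coherence,
# exclusivity, residuals) to help narrow the choice of K.
k_results <- searchK(docs, vocab, K = c(5, 10, 15, 20),
                     prevalence = ~ party + year,
                     data = meta)

# Plot the diagnostics against K; picking the "best" K is still a
# judgment call, which is where the human intuition comes back in.
plot(k_results)
```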
I finally settled on an STM with 10 user-selected topics where the prevalence of topics per debate was a function of the election year and whether the debate was affiliated with either the Republican or Democratic Party (party ID), and where the content of each topic was a function of party ID.
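In stm terms, that specification looks roughly like the call below. This is a sketch of the model described above, not my exact code: docs, vocab, and meta are placeholders for the preprocessed transcripts and their metadata, and year enters as a simple linear term for illustration.

```r
library(stm)

# Prevalence: how much of each debate a topic occupies, as a function of
# party ID and election year.
# Content: the words each topic uses, as a function of party ID alone.
fit <- stm(docs, vocab, K = 10,
           prevalence = ~ party + year,
           content = ~ party,
           data = meta)
```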
Results
The STM was estimated using 93 primary debates in total, with a combined 13,132-word vocabulary in which each word was "stemmed" (i.e., reduced to its root form).
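Stemming and the rest of the preprocessing are handled by the stm package itself. A sketch, assuming a hypothetical "debates.csv" with one row per debate and columns text, party, and year:

```r
library(stm)

debates <- read.csv("debates.csv", stringsAsFactors = FALSE)

# textProcessor() lowercases the text, strips punctuation and stopwords,
# and stems every word (stem = TRUE is the default), producing the
# word-stem vocabulary the model is estimated over.
processed <- textProcessor(debates$text, metadata = debates)

# prepDocuments() drops very rare terms and aligns the documents,
# vocabulary, and metadata for estimation.
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
docs  <- out$documents
vocab <- out$vocab
meta  <- out$meta
```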
The following figures display the top 20 word stems associated with each of the 10 topics identified by the STM:
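Figures like these come from the package's labeling helper. Again a sketch, where fit stands in for the fitted 10-topic model:

```r
# Top 20 word stems per topic, ranked several ways (highest probability,
# FREX, lift, score); with a content covariate, word lists are also
# broken out by covariate level.
labelTopics(fit, topics = 1:10, n = 20)
```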
And here is a figure of my interpretation of each of the topics:

As you can see based on my heuristic interpretation of each of the topics, each topic is clearly not a monolithic issue in the way most people would think. Rather, each topic encompasses a cluster of issues, debaters, and moderators that are all strongly associated with one another. If you think about the average debate you've watched on TV, this way of understanding topics as issue clusters makes a lot of sense. Politicians love to string issues and policies together as they discuss them. And the STM, I would argue, lets us see how these issues are correlated more clearly than merely watching debates or reading debate transcripts would allow.
What's also pretty cool about using an STM to analyze something like debate transcripts is that you can use the results from the STM to estimate regression models where the proportion of each debate associated with a given topic is a function of any or all of the covariates included in the estimation of the STM itself. For example, we can estimate the effect of party ID on the expected proportion of each debate associated with a given topic appearing in a Republican, as opposed to a Democratic, primary debate while controlling for the effect of the election year:
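In the stm package this is done with estimateEffect(), which regresses topic proportions on covariates from the model. A sketch under the same placeholder names as above (fit, meta):

```r
library(stm)

# Regress the proportion of each debate devoted to each of the 10 topics
# on party ID and election year.
effects <- estimateEffect(1:10 ~ party + year, fit, metadata = meta,
                          uncertainty = "Global")

# Expected difference in each topic's proportion between Republican and
# Democratic debates, with 95% confidence intervals.
plot(effects, covariate = "party", topics = 1:10, model = fit,
     method = "difference",
     cov.value1 = "Republican", cov.value2 = "Democratic")
```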
In the above figure, you can see the expected proportion of a debate associated with each topic depending on whether the debate was a Republican or Democratic primary debate (95% confidence intervals shown). Topics 1, 4, and 7 are clearly a more prevalent feature of Democratic debates, while the rest are more likely to appear in Republican debates. Topic 9 is especially likely to appear in a Republican debate. And what is topic 9 about? Jobs, manufacturing, and making a deal with China.
I wanted to take some time to highlight this topic because, as most would agree, this cluster of issues has become especially prominent in recent years.
And the results from my analysis clearly bear this out:
As you can see via the above figure, not only has topic 9 been more associated with Republican debates (red = Republican; blue = Democrat), it has become more prevalent over time, which got me to thinking: has the prevalence of this topic (jobs, manufacturing, and making a deal with China) varied with public perceptions about China's economic strength? Out of curiosity, I decided to find out.
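The over-time view comes from the same estimateEffect() machinery, plotted against the year covariate. A self-contained sketch (fit and meta are the same placeholders as before):

```r
library(stm)

# Topic 9's proportion as a function of party and election year.
effects <- estimateEffect(9 ~ party + year, fit, metadata = meta)

# Expected prevalence of topic 9 across election years, with 95%
# confidence intervals.
plot(effects, covariate = "year", topics = 9, model = fit,
     method = "continuous", xlab = "Election year")
```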
Below is a figure showing the percentage of Americans who, according to Gallup polls, viewed China as the leading world economy for each of the election years included in my analysis (2004 values were unavailable so I took the mean of the 2000 and 2008 values as a proxy for public perception in 2004).
There clearly has been an upward trend, and this shift, though not perfectly, coincides with the rise in topic 9's prevalence in primary debates:
While the high degree of overlap in the 95% confidence intervals suggests that the relationship between Gallup's poll numbers and topic 9's prevalence is not statistically significant by conventionally accepted standards, there is a clear trend here. (A caveat: the percentage of Americans who view China as the world's leading economy was not a covariate used in estimating the STM, which means these results should be interpreted with caution because they may fail to meet the assumptions of the "method of composition." When the response variable in a regression model is the proportion of a document associated with an STM-estimated topic, it's recommended that all regressors in that model be covariates that were included in the STM's estimation. Following that recommendation improves the plausibility of the model, but departing from it does not make the above results entirely implausible.)
Why Care?
At this point you might be saying to yourself, "this is pretty neat stuff, but why is this important?" It's important for two reasons:
- Ease of use and replicability: Analyzing debate transcripts by hand, especially in the future as more and more debates will have taken place, is time-consuming and fraught with the possibility of human error. The STM does not completely remove the human element from textual analysis, but it does incorporate a level of consistency, and thus provides for a degree of replicability, that is not usually possible with more traditional methods. In other words, anyone else could repeat my procedures exactly and obtain results comparable to my own. Granted, their heuristic interpretation of the results might be different from mine, but the results upon which that interpretation rests would be the same as mine. This allows researchers to conduct research, test hypotheses, and build upon one another's work as part of a potentially larger research program.
- It opens up a whole new world for statistical analysis and research: You can apply an STM to a variety of text-based data: open-ended survey questions, tweets, online discussions, pamphlets, Facebook posts, blogs, you name it. Many such sources of text would otherwise be difficult to quantify and thus to analyze using traditional statistical methods. The STM allows us to measure what would not have been possible to measure previously.
In short, the STM makes hard work easy, and it makes what was previously hard to quantify quantifiable. It brings new meaning to the phrase "work smarter, not harder."
Till next time, cheers!
-------------
For more details about STMs and how to use them with R, see the following site: