Topic modeling is an inductively driven text analytic method commonly used to show the underlying, latent topics of large bodies of text (Blei, Ng, and Jordan 2003; McFarland et al. 2013; Mohr and Bogdanov 2013). While researchers often use topic modeling for descriptive purposes, recent extensions allow a researcher to predict the likelihood that a document’s metadata (e.g., publication year, author, funder, etc.) is correlated with the proportion of that document dedicated to a particular topic (Roberts, Stewart, and Tingley n.d.).
In the code below, I use the STM package in R to run topic models using the abstracts of a subset of ERIC articles. I then run a regression that estimates the effect of federal funding on the proportion of each article that is about each of the latent topics.
library(stm) library(tidyverse) library(furrr) library(plotly) library(htmlwidgets)
After getting all ERIC articles, I subset the data to get ones focused specifically on academic achievement in the U.S. before 1983. I then created a bunch of dummy variables (including, most importantly, whether the government funded the study) that are included in the files I import here.
# import subset of articles and different dummy datasets and merge together options(warn=-1) achieve_orig <- read_csv('achieve_articles_60_94.csv', col_types = cols()) ach_inst_dummies <- read_csv('achieve_60_94_inst_dummies.csv', col_types = cols()) achieve_pre_83_dummies = read_csv('ach_pre_83_dummies.csv', col_types = cols()) achieve_orig_inst <- merge(achieve_orig, ach_inst_dummies, by = 'accno') # just select last columns, since they're already in achieve_orig achieve_pre_83_dummies <- achieve_pre_83_dummies %>% select(16:24) achieve = merge(achieve_orig_inst, achieve_pre_83_dummies, by = 'accno')
# a few columns, sorted by federal government sponsorship achieve %>% select(accno, pub_year, title, sponsors, fed_gov_v2, foundations_v2) %>% arrange(desc(fed_gov_v2)) %>% head
|ED021313||1967||A Student Report on Student and Teacher Reactions to Modular Scheduling and Independent Study at South Hills High School during the 1966-67 School Year.||Office of Education (DHEW), Washington, DC.||1||0|
|ED021675||1968||Effects of an Enrichment Program on the School Achievement of Rural Welfare Recipient Children.||Office of Education (DHEW), Washington, DC. Bureau of Research.||1||0|
|ED021702||1968||Forecasting Academic Status.||Office of Education (DHEW), Washington, DC. Bureau of Research.||1||0|
|ED021944||1968||The Segregated Student in Desegregated Schools; Sources of Influence on Negro Secondary Students. Final Report.||Office of Education (DHEW), Washington, DC.||1||0|
|ED021948||1967||The Development and Evaluation of a Diagnostically Based Curriculum for Preschool Psychosocially Deprived Children. Final Report.||Office of Education (DHEW), Washington, DC. Bureau of Research.||1||0|
|ED021972||1967||Flexibility for Vocational Education through Computer Scheduling. Quarterly Report.||Office of Education (DHEW), Washington, DC.||1||0|
Seventeen percent were funded by an education agency and 1 percent by a foundation
Running Topic Models
A common critique of topic modeling is that researchers have to choose the number of latent topics for the body of texts, an often subjective and difficult to replicate process. To address this, I run models that include different numbers of topics (ranging from 8 to 20).
# list of the numbers of topics num_k <- c(8, 10, 12, 14, 16, 18, 20) # words to exclude stop_words <- c('education', 'educational', 'achievement', 'achieve', 'quot', 'author', 'authors', 'review', 'study', 'academic', 'studies', 'research', 'paper', 'document', 'report', 'use', 'can', 'may', 'documents', 'will', 'must')
The code below processes the abstracts of the abstracts, includes only characters, excludes the words in the stop_words list, and doesn’t stem the words. See Schofield et al. for more on stemming. Thanks to Neal for the tip. It also excludes words that are in fewer than 5 documents (out of ~10,000) and more than 70 percent.
achieve_process <- textProcessor(achieve$description, metadata = achieve, onlycharacter = TRUE, stem = FALSE, customstopwords = stop_words) # drop words that appear in fewer than 5 documents and more than 70 percent of documents achieve_out <- prepDocuments(achieve_process$documents, achieve_process$vocab, achieve_process$meta, lower.thresh = 4, upper.thresh = floor(nrow(achieve)*.7))
The next four code snippets are adapted from a great blog post by Julia Silge about how to run multiple topic models and evaluate them based on a number of measures, including semantic coherence (words in a topic that frequently co-occur together in documents) and exclusivity (how exclusive words are to a particular topic). See the STM documentation for more.
# start parallel processing plan(multiprocess) # run 7 different topic models, using publication year and whether a publication was sponsored at all as covariates in estimating the topic prevalence many_models <- tibble(K = num_k) %>% mutate(topic_model = future_map(K, ~stm(documents = achieve_out$documents, vocab = achieve_out$vocab, data = achieve_out$meta, K = ., prevalence = ~ s(pub_year) + sponsor_dummy, max.em.its = 75, seed = 1234, verbose = FALSE)))
# this creates a table that includes different measures (heldout likelihood, semantic coherence, exclusivity, and residuals) for each of the models heldout <- make.heldout(documents = achieve_out$documents, vocab = achieve_out$vocab) k_result <- many_models %>% mutate(exclusivity = map(topic_model, exclusivity), semantic_coherence = map(topic_model, semanticCoherence, achieve_out$documents), eval_heldout = map(topic_model, eval.heldout, heldout$missing), residual = map(topic_model, checkResiduals, achieve_out$documents), bound = map_dbl(topic_model, function(x) max(x$convergence$bound)), lfact = map_dbl(topic_model, function(x) lfactorial(x$settings$dim$K)), lbound = bound + lfact, iterations = map_dbl(topic_model, function(x) length(x$convergence$bound)))
# plot the model diagnostics, by number of topics k_result %>% transmute(K, `Lower bound` = lbound, Residuals = map_dbl(residual, "dispersion"), `Semantic coherence` = map_dbl(semantic_coherence, mean), `Held-out likelihood` = map_dbl(eval_heldout, "expected.heldout")) %>% gather(Metric, Value, -K) %>% ggplot(aes(K, Value, color = Metric)) + geom_line(size = 1.5, alpha = 0.7, show.legend = FALSE) + facet_wrap(~Metric, scales = "free_y") + labs(x = "K (number of topics)", y = NULL, title = "Diagnostics by number of topics")
From the above plots, it looks like 12 is a good number of topics, but let’s compare semantic coherence and exclusivity of 10 and 14 topics too.
k_result %>% select(K, exclusivity, semantic_coherence) %>% filter(K %in% c(10, 12, 14)) %>% unnest() %>% mutate(K = as.factor(K)) %>% ggplot(aes(semantic_coherence, exclusivity, color = K)) + geom_point(size = 2, alpha = 0.7) + labs(x = "Semantic coherence", y = "Exclusivity", title = "Exclusivity and semantic coherence")
From this plot, it looks like k=12 does a good job of balancing semantic coherence and exclusivity.
Print out most common words and most frequent words for our BEST model (K=12).
# Use 3 in double brackets because K=12 is the third model in many_models map(many_models$topic_model, labelTopics)[]
Topic 1 Top Words: Highest Prob: children, social, parents, child, class, school, home FREX: article, child, childs, influences, emotional, role, home Lift: conversation, dep, deviancy, jencks, jensen, malnutrition, nurture Score: children, child, article, parents, parent, home, social Topic 2 Top Words: Highest Prob: students, group, groups, experimental, control, classes, instruction FREX: experimental, week, treatment, reinforcement, cai, assigned, programed Lift: cai, guides, jigsaw, quizzes, shorthand, stad, tutees Score: experimental, guides, group, instruction, treatment, control, cai Topic 3 Top Words: Highest Prob: rural, family, age, income, population, years, data FREX: women, labor, men, employment, rural, income, residence Lift: coal, compositional, immigration, mexicans, migrating, owned, owners Score: women, income, rural, family, indian, labor, employment Topic 4 Top Words: Highest Prob: school, schools, students, high, public, district, student FREX: desegregation, schools, inner, desegregated, city, busing, private Lift: castle, desegregated, ghettos, hoffer, kilgore, bused, busing Score: schools, desegregation, school, district, districts, racial, students Topic 5 Top Words: Highest Prob: grade, test, reading, children, scores, year, tests FREX: kindergarten, arithmetic, grade, start, reading, readiness, vocabulary Lift: caldwell, delta, piat, ppvt, slosson, binet, mrt Score: grade, children, reading, test, scores, kindergarten, arithmetic Topic 6 Top Words: Highest Prob: students, science, performance, mathematics, course, student, college FREX: science, courses, course, checks, naep, chemistry, music Lift: booklets, bscs, oec, pssc, xerography, chem, commentsobservations Score: science, course, checks, naep, iscs, students, college Topic 7 Top Words: Highest Prob: system, evaluation, problems, learning, performance, programs, goals FREX: accountability, issues, standards, policy, decision, contracting, conference Lift: auditing, technologies, transaction, audiences, campbell, cbe, coherent Score: accountability, policy, issues, evaluation, discusses, contracting, objectives Topic 8 Top Words: Highest Prob: data, variables, analysis, used, test, measures, model FREX: regression, analyses, variable, variables, multiple, variance, validity Lift: squares, longstep, accessioned, equations, jacobson, reanalysis, slopes Score: variables, tell, regression, variance, data, test, analysis Topic 9 Top Words: Highest Prob: self, ability, concept, high, students, low, relationship FREX: locus, esteem, boys, anxiety, ses, self, girls Lift: internality, ach, locus, arousal, coopersmith, crandall, externality Score: self, boys, concept, girls, locus, sex, hall Topic 10 Top Words: Highest Prob: program, programs, evaluation, project, students, title, language FREX: title, bilingual, projects, program, esea, services, career Lift: amigos, bedford, bslc, clovis, connell, coordinator, creole Score: program, title, staff, migrant, evaluation, services, bilingual Topic 11 Top Words: Highest Prob: information, data, presented, section, assessment, results, procedures FREX: hearing, section, bibliography, chapter, appendix, volume, summary Lift: dissertations, entries, alphabetical, alphabetically, collections, debt, indexed Score: section, chapter, assessment, information, hearing, contains, summary Topic 12 Top Words: Highest Prob: teachers, teacher, student, classroom, teaching, behavior, learning FREX: teacher, behaviors, classroom, teachers, classrooms, observation, teaching Lift: engagement, authorsjd, brophy, btes, objections, pbte, personsinstructors Score: teacher, teachers, classroom, teaching, student, behavior, behaviors
Once we have our topics, one thing we may want to do is get the prevalence of each topic over time. You can do this in the STM package, but the code below creates a long dataframe, where each row gives the prevalence of each topic (X1-X12, in this case) for each document.
achieve_out$meta$ID <- seq.int(nrow(achieve_out$meta)) # replace list number with best model theta <- data.frame(many_models$topic_model[]$theta) theta$ID <- seq.int(nrow(theta)) achieve_all <- merge(achieve_out$meta, theta) achieve_cond <- achieve_all %>% select(ID, accno, pub_year, starts_with('X')) achieve_long <- gather(data = achieve_cond, key = topic, value = theta, starts_with('X'), factor_key=TRUE)
The table below gives the thetas (topic prevalence) of two articles from 1964. According to our topic models, the top 3 topics of the first article are topics 10, 3, and 4, and the top 3 topics of the second article are topics 10, 6, and 2. Based on a quick reading of the articles, that looks pretty accurate!
achieve_long %>% arrange(accno, desc(theta)) %>% head(24)
Plotting topic prevalence over time
# This code creates a table that gets average and median topic prevalence by year and makes the plot a little more readable by # adding the five top words of a topic as a column. achieve_cols <- as.list(levels(achieve_long$topic)) achieve_yr_topic <- achieve_long %>% group_by(pub_year, topic) %>% summarize(total = n(), med_theta = median(theta), avg_theta = mean(theta)) # change first part of list to best model achieve_words <- labelTopics(many_models$topic_model[], n = 5)[] achieve_topics_words <- data.frame(achieve_words, row.names = achieve_cols) # add column with words all together achieve_topics_words <- achieve_topics_words %>% unite(topic_name_words, sep = ', ') achieve_topics_words$topic <- rownames(achieve_topics_words) achieve_merged <- merge(achieve_yr_topic, achieve_topics_words, by = 'topic')
achieve_merged %>% arrange(pub_year, desc(avg_theta)) %>% head
|X2||1960||18||0.305325985||0.26460608||students, group, groups, experimental, control|
|X6||1960||18||0.072287775||0.12910775||students, science, performance, mathematics, course|
|X3||1960||18||0.005440901||0.10032768||rural, family, age, income, population|
|X8||1960||18||0.034242573||0.08077072||data, variables, analysis, used, test|
|X12||1960||18||0.036720780||0.07868751||teachers, teacher, student, classroom, teaching|
|X5||1960||18||0.035785285||0.07195502||grade, test, reading, children, scores|
# Not a great plot achieve_merged %>% # start with 1965. Not many articles in dataset before then. filter(pub_year > 1964) %>% ggplot(aes(pub_year, avg_theta, color = topic)) + geom_line() + labs(title = 'Achievement Articles, Twelve Topics, Using Average Thetas', x = 'Publication Year', y = 'Topic Prevalence') + theme_classic()
# If you want the plot to be interactive with a useful legend # Unfortunately, this is not rending correctly to my website plot_12k <- achieve_merged %>% # start with 1965. Not many articles in dataset before then. filter(pub_year > 1964) %>% ggplot(aes(pub_year, avg_theta, color = topic_name_words)) + geom_line() + labs(title = 'Achievement Articles, Twelve Topics, Using Average Thetas', x = 'Publication Year', y = 'Topic Prevalence') + theme_classic() as_widget(ggplotly(plot_12k))
# Save it as an interactive .html file saveWidget(as_widget(p_12k), "p_12k.html")
# Run a regression of topic prevalence on federal support, foundation support, and publication year reg_prep12 <- estimateEffect(1:12 ~ fed_gov_v2 + foundations_v2 + s(pub_year), many_models$topic_model[], meta = achieve_out$meta, uncertainty = "Global")
# From regression results, plot the effect of federal support on the 12 topics # This also labels the topics plot(reg_prep12, covariate = 'fed_gov_v2', topics = 1:12, model = many_models$topic_model[], method = 'difference', cov.value1 = 1, cov.value2 = 0, xlim = c(-.1, .1), xlab = 'Difference in Topic Proportion, Based on Federal Support', main = 'Effect of Federal Support on Topic Prevalence \nin Academic Achievement Publications', labeltype = 'custom', # label topics 1-12 with more common sense names custom.labels = c('Parental Influence', 'Experimental Research', 'Rural Issues', 'Race and Desegregation', 'Reading Development', 'Science and Math Education', 'Performance Evaluation and Accountability', 'Test Scores and Analysis', 'Self-Esteem and Gender', 'Federal Program Evaluation', 'Assessment', 'Classroom Instruction'))
- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3:993–1022.
- McFarland, Daniel A., Daniel Ramage, Jason Chuang, Jeffrey Heer, Christopher D. Manning, and Daniel Jurafsky. 2013. “Differentiating Language Usage through Topic Models.” Poetics 41(6):607–25.
- Mimno, David, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. “Optimizing Semantic Coherence in Topic Models.” Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. 262–72.
- Mohr, John W. and Petko Bogdanov. 2013. “Introduction—Topic Models: What They Are and Why They Matter.” Poetics 41(6):545–69.
- Schoﬁeld, Alexandra, Mans Magnusson, Laure Thompson, and David Mimno. n.d. “Understanding Text Pre-Processing for Latent Dirichlet Allocation.”