Word Embeddings
Session 3d

# install.packages("word2vec")
library(word2vec)
Training your own word embeddings is beyond the scope of this class. For a simple demonstration, I provide examples of using pre-trained word embeddings in R. Off-the-shelf word embeddings often perform on par with human-generated word lists (Rodriguez & Spirling, 2020), and pre-trained models are also preferable to training your own embeddings when your data set is relatively small.
1 Load the pre-trained model
In this session, I demonstrate the use of a pre-trained word2vec model. In the following example, I use one trained with the CBOW negative-sampling method: “CBOW, Negative Sampling, vector size 500, window 10”. Other pre-trained models are available online or by request. To work with them in R, you will most likely need to convert them with Python code first. For simplicity, the model used here can be imported into R directly.
To load this pre-trained word2vec model, we use the read.word2vec() function from the word2vec package. This may take a while, depending on your computer’s RAM and configuration.
w2v.pt <- read.word2vec("cb_ns_500_10.w2v", normalize = TRUE)
How many word vectors does this pre-trained model have?
# Convert the loaded word2vec model to a matrix
w2v.pt.matrix <- as.matrix(w2v.pt)
dim(w2v.pt.matrix)
[1] 437107    500

# Display the first 10 words in the model
head(rownames(w2v.pt.matrix), 10)
 [1] "tessin"       "kilmarnock]"  "spewers"      "sgwhite"      "sealberg"
 [6] "gabon"        "divergentes"  "apco"         "arulanantham" "viggle"
If you want to use a GloVe pre-trained model, you can find a tutorial provided by Emil Hvitfeldt and Julia Silge.
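If all you need is to get GloVe vectors into an R matrix, a rough base-R sketch looks like this (the file name and the 100-dimensional size are assumptions about which GloVe download you use; each line of those files holds a word followed by its vector):

# A rough sketch for reading a GloVe text file into a matrix (file name assumed)
glove.lines <- readLines("glove.6B.100d.txt")
glove.split <- strsplit(glove.lines, " ", fixed = TRUE)
glove.matrix <- t(vapply(glove.split, function(x) as.numeric(x[-1]), numeric(100)))
rownames(glove.matrix) <- vapply(glove.split, `[`, character(1), 1)
dim(glove.matrix)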
2 Similar words
As words are represented by vectors, we can find similar words by calculating the cosine similarity between pairs of word vectors.
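For instance, the cosine similarity between two word vectors can be computed by hand from the matrix we created above (the word pair is my own illustration):

# A minimal sketch: cosine similarity between two word vectors, computed by hand
v1 <- w2v.pt.matrix["immigration", ]
v2 <- w2v.pt.matrix["immigrants", ]
sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))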
Let’s find the most similar words to a set of political words that Rodriguez and Spirling (2020) used in their intrinsic validation test.
poli_similar.words <- predict(w2v.pt,
                              newdata = c("immigration", "taxes", "republican", "democracy"),
                              type = "nearest",
                              top_n = 5)
Display the most similar words:
poli_similar.words$immigration
  term1 term2 similarity rank
1 immigration immigrants 0.8366448 1
2 immigration undocumented 0.7768343 2
3 immigration deportations 0.7714913 3
4 immigration immigrant 0.7623906 4
5 immigration deportation 0.7597899 5

poli_similar.words$taxes
  term1 term2 similarity rank
1 taxes tax 0.9159983 1
2 taxes taxation 0.8370461 2
3 taxes levies 0.8146588 3
4 taxes taxing 0.8013099 4
5 taxes deductions 0.7893462 5

poli_similar.words$republican
  term1 term2 similarity rank
1 republican gop 0.9607631 1
2 republican democratic 0.9106967 2
3 republican republicans 0.8939405 3
4 republican democrats 0.8659310 4
5 republican democrat 0.8647416 5

poli_similar.words$democracy
  term1 term2 similarity rank
1 democracy democracies 0.8171181 1
2 democracy freedoms 0.7857101 2
3 democracy dictatorship 0.7847129 3
4 democracy autocracy 0.7838171 4
5 democracy pluralist 0.7821502 5
3 Word analogies
Do you recall the classic example of “king - man + woman = queen”? Let’s check it out.
# Retrieve the word vectors for "king", "man", and "woman"
word.vector <- predict(w2v.pt,
                       newdata = c("king", "man", "woman"),
                       type = "embedding")
dim(word.vector)
[1]   3 500
rownames(word.vector)
[1] "king"  "man"   "woman"
Then, we perform the word vector operation.
# Perform the word vector operation: king - man + woman
word.vector.op <- word.vector["king", ] - word.vector["man", ] + word.vector["woman", ]
Finally, we find the words whose vectors are nearest to the word.vector.op vector.
# Find the nearest words to the resulting vector
predict(w2v.pt,
        newdata = word.vector.op,
        type = "nearest",
        top_n = 3)
  term similarity rank
1 king 0.9479475 1
2 queen 0.7680065 2
3 princess 0.7155131 3
It seems that some “debiasing” work was done on this pre-trained model. (Note that “king” itself tops the list because we passed a raw numeric vector to predict(), so the input word is not excluded from the results; “queen” is the nearest distinct word.)
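If you want to experiment with other analogies, a small wrapper along these lines may be handy (this helper is my own sketch, not part of the word2vec package, and it assumes all three query words are in the model’s vocabulary):

# A hypothetical helper: solve "a is to b as c is to ?" analogies
analogy <- function(model, a, b, c, top_n = 3) {
  emb <- predict(model, newdata = c(a, b, c), type = "embedding")
  target <- emb[a, ] - emb[b, ] + emb[c, ]
  predict(model, newdata = target, type = "nearest", top_n = top_n)
}

# Example usage: an analogy in the same spirit as king - man + woman
analogy(w2v.pt, "paris", "france", "germany")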
4 Go beyond the basics
Train your own models:

- The tidyverse approach is demonstrated by Emil Hvitfeldt and Julia Silge in their book “Supervised Machine Learning for Text Analysis in R” (Ch. 5) and by Julia Silge in her blog.
- You can find an introductory-level tutorial showing how to use the word2vec package to train word2vec models here.
- A comprehensive case study series by Jurriaan Nagelkerke and Wouter van Gils shows how to use word embeddings to predict restaurant Michelin stars from customer reviews.
- If you want to aggregate word vectors at higher levels (e.g., sentence and document levels), the doc2vec and text2map packages may be helpful (a bare-bones sketch follows this list).
- Zhaowen Guo provides a quick example of working with a Chinese word embeddings model.
- If you prefer to use Python, you may want to check out the word embeddings 101 material that Connor Gilroy prepared for a social scientist audience.
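As a bare-bones illustration of the aggregation idea mentioned above, you can average word vectors with base R alone, using the matrix from Section 1 (the word list is my own toy example; the dedicated packages offer much more principled approaches):

# A minimal sketch: represent a short "document" by averaging its word vectors
doc.words <- c("taxes", "immigration", "democracy")
doc.words <- doc.words[doc.words %in% rownames(w2v.pt.matrix)]  # drop out-of-vocabulary words
doc.vector <- colMeans(w2v.pt.matrix[doc.words, , drop = FALSE])

# The averaged vector can be queried like any single word vector
predict(w2v.pt, newdata = doc.vector, type = "nearest", top_n = 5)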