Word Embeddings

Session 3d

Author: Zixi Chen, PhD

Affiliation: NYU-Shanghai

Published: November 13, 2025

Training your own word embeddings is beyond the scope of this class. For a simple demonstration, I provide examples of using pre-trained word embeddings in R. Off-the-shelf word embeddings often perform on par with human-generated benchmarks (Rodriguez & Spirling, 2020). Pre-trained models are also preferable to training your own word embeddings when your data set is relatively small.

# install.packages("word2vec")

library(word2vec)

1 Load the pre-trained model

In this session, I demonstrate using a pre-trained word2vec model. In the following example, I use one trained with the CBOW negative-sampling method: “CBOW, Negative Sampling, vector size 500, window 10”. Other pre-trained models are available online or by request. To use them in R, you will most likely need to run some Python code first. For simplicity, the example used here is ready to import directly into R.

To load this pre-trained word2vec model, we use the read.word2vec() function from the word2vec package. This may take a while depending on your computer's RAM and configuration.

# Load the pre-trained embeddings; normalize = TRUE scales each word vector to unit length
w2v.pt <- read.word2vec("cb_ns_500_10.w2v", normalize = TRUE)

How many word vectors does this pre-trained model have?

# Convert the loaded word2vec model to a matrix
w2v.pt.matrix <- as.matrix(w2v.pt)

dim(w2v.pt.matrix)
[1] 437107    500
# Display the first 10 words in the model
head(rownames(w2v.pt.matrix), 10)
 [1] "tessin"       "kilmarnock]"  "spewers"      "sgwhite"      "sealberg"    
 [6] "gabon"        "divergentes"  "apco"         "arulanantham" "viggle"      
Tip

If you want to use a GloVe pre-trained model, you can find a tutorial provided by Emil Hvitfeldt and Julia Silge.
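
GloVe vectors ship as plain text, so they can also be pulled into R without extra tooling. As a rough sketch (my own, not from that tutorial), assuming you have downloaded a file such as glove.6B.100d.txt from the Stanford NLP site:

# Read a GloVe text file into a word-by-dimension matrix
glove.raw <- read.table("glove.6B.100d.txt",
                        sep = " ", quote = "", comment.char = "",
                        colClasses = c("character", rep("numeric", 100)))
glove.matrix <- as.matrix(glove.raw[, -1])
rownames(glove.matrix) <- glove.raw[[1]]
dim(glove.matrix)   # one row per word, one column per dimension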

2 Similar words

As words are represented by vectors, we can find similar words by calculating the cosine similarity between pairs of word vectors.
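
To make the measure concrete, here is a hand-rolled check (my own sketch, not part of the session code): cosine similarity is the dot product of two vectors divided by the product of their lengths. Because the model was loaded with normalize = TRUE, the vectors already have unit length, but the general formula works either way.

# Cosine similarity computed directly from the embedding matrix
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cos_sim(w2v.pt.matrix["taxes", ], w2v.pt.matrix["tax", ])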

Let’s find the most similar words to some of the political words that Rodriguez and Spirling (2020) used in their intrinsic validation test.

poli_similar.words <- predict(w2v.pt, 
        newdata = c("immigration", "taxes", "republican", "democracy"), 
        type = "nearest", 
        top_n = 5)

Display the most similar words:

poli_similar.words$immigration
        term1        term2 similarity rank
1 immigration   immigrants  0.8366448    1
2 immigration undocumented  0.7768343    2
3 immigration deportations  0.7714913    3
4 immigration    immigrant  0.7623906    4
5 immigration  deportation  0.7597899    5
poli_similar.words$taxes
  term1      term2 similarity rank
1 taxes        tax  0.9159983    1
2 taxes   taxation  0.8370461    2
3 taxes     levies  0.8146588    3
4 taxes     taxing  0.8013099    4
5 taxes deductions  0.7893462    5
poli_similar.words$republican
       term1       term2 similarity rank
1 republican         gop  0.9607631    1
2 republican  democratic  0.9106967    2
3 republican republicans  0.8939405    3
4 republican   democrats  0.8659310    4
5 republican    democrat  0.8647416    5
poli_similar.words$democracy
      term1        term2 similarity rank
1 democracy  democracies  0.8171181    1
2 democracy     freedoms  0.7857101    2
3 democracy dictatorship  0.7847129    3
4 democracy    autocracy  0.7838171    4
5 democracy    pluralist  0.7821502    5
Activity

Do you have any words for which you would like to find their most similar words from the embeddings trained on Google News?

Are the similar words detected by the embeddings the same as your thoughts?

Did you find any interesting findings?
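
If you want to try it, a template along these lines should work (the words below are just placeholders; every query word must exist in the model's vocabulary, which appears to be lower-cased):

# Replace with your own (lower-case) words of interest
my.words <- c("shanghai", "climate")
predict(w2v.pt, newdata = my.words, type = "nearest", top_n = 5)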

3 Word analogies

Do you recall the classic example of “king − man + woman = queen”? Let’s check it out.

# Retrieve the word vectors for "king", "man", and "woman"
word.vector <- predict(w2v.pt, 
                       newdata = c("king", "man", "woman"), 
                       type = "embedding")
dim(word.vector)
[1]   3 500
rownames(word.vector)
[1] "king"  "man"   "woman"

Then, we conduct the word vector operations.

# Perform the word vector operation: king - man + woman
word.vector.op <- word.vector["king", ] - word.vector["man", ] + word.vector["woman", ]

Finally, we find the words that have the nearest locations to the word.vector.op vector.

# Find the nearest words to the resulting vector
predict(w2v.pt, 
        newdata = word.vector.op, 
        type = "nearest", 
        top_n = 3)
      term similarity rank
1     king  0.9479475    1
2    queen  0.7680065    2
3 princess  0.7155131    3

Note that the nearest neighbor is the query word “king” itself; among the remaining candidates, “queen” comes out on top, matching the classic analogy. It also seems like some “debiasing” work was done in this pre-trained model.

Activity

Can you find the analogies of A) countries and capital cities, and B) gender and occupations?

How about the accuracy of task A? Any gender stereotypes detected in task B?
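
As a starting point for part A, the same vector arithmetic as above can be reused (a sketch with my own picks of country and city names, assumed to be in the lower-cased vocabulary):

# Analogy sketch for task A: paris - france + germany = ?
cap.vectors <- predict(w2v.pt,
                       newdata = c("paris", "france", "germany"),
                       type = "embedding")
cap.op <- cap.vectors["paris", ] - cap.vectors["france", ] + cap.vectors["germany", ]
predict(w2v.pt, newdata = cap.op, type = "nearest", top_n = 3)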

4 Go beyond the basics

  • Train your own models:

    • The tidyverse approach demonstrated by Emil Hvitfeldt and Julia Silge in their book “Supervised Machine Learning for Text Analysis in R” (Ch5) and by Julia Silge in her blog.

    • You can find an introductory-level tutorial showing how to use the word2vec package to train word2vec models here.

    • A comprehensive case study series by Jurriaan Nagelkerke and Wouter van Gils shows how to use word embeddings to predict restaurant Michelin stars from customer reviews.

  • If you want to aggregate the word vectors at higher levels (e.g., sentence and document levels), you may find the doc2vec and text2map packages helpful; a bare-bones averaging example follows this list.

  • Zhaowen Guo provides a quick example of working with a Chinese word embedding model.

  • If you prefer to use Python, you may want to check out the word embeddings 101 material prepared for a social science audience by Connor Gilroy.
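
As a taste of the aggregation idea mentioned above (a minimal sketch using only the objects already loaded in this session, not the doc2vec or text2map APIs), a short “document” can be represented by the average of its word vectors:

# Toy document-level embedding: average the vectors of the words in a document
doc.words <- c("taxes", "immigration", "democracy")
doc.words <- doc.words[doc.words %in% rownames(w2v.pt.matrix)]   # keep in-vocabulary words
doc.vector <- colMeans(w2v.pt.matrix[doc.words, , drop = FALSE])
length(doc.vector)   # still a 500-dimensional vector, like the individual word vectors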
