Language is glorious chaos.
How do we represent the meaning of a word?
In traditional NLP, we regard words as discrete symbols, represented by one-hot vectors.
Any two distinct one-hot vectors are orthogonal, so there is no natural notion of similarity between them.
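For instance, a quick NumPy check with a toy vocabulary (the words here are made up for illustration) shows that two clearly related words get zero similarity under one-hot encoding:

```python
import numpy as np

# Toy vocabulary: each word is a one-hot vector of length |V|.
vocab = ["hotel", "motel", "cat"]
hotel = np.eye(len(vocab))[vocab.index("hotel")]   # [1., 0., 0.]
motel = np.eye(len(vocab))[vocab.index("motel")]   # [0., 1., 0.]

# The dot product (and hence cosine similarity) of any two distinct
# one-hot vectors is 0, even though "hotel" and "motel" are related.
print(hotel @ motel)   # 0.0
```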
The main idea: a word’s meaning is given by the words that “frequently” appear close by.
“You shall know a word by the company it keeps.” (John Rupert Firth) In other words, a word’s meaning can be understood by examining the contexts in which it is used and the words it commonly co-occurs with, rather than by treating the word in isolation.
This is one of the most successful ideas of modern statistical NLP.
We build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts.
Word vectors are also called word embeddings or (neural) word representations.
Idea: go through each position $t$ in a large corpus of text; at each position, use the center word $w_t$ to predict the context (outside) words $w_{t+j}$ within a window of size $m$, adjusting the word vectors to make this prediction as likely as possible.

Data likelihood:

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)$$

Here $\theta$ denotes the parameters of the model (all the word vectors), $w_t$ is the center word, $w_{t+j}$ is a context word, and $m$ is the size of the context window.
Because a product of many probabilities is hard to work with, we take the logarithm (turning the product into a sum) and negate it, giving a loss to minimize.
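In the notation above, this gives the average negative log-likelihood objective:

$$J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)$$

Minimizing $J(\theta)$ is equivalent to maximizing the data likelihood.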
In order to calculate $P(w_{t+j} \mid w_t; \theta)$, we use two vectors for each word $w$: $v_w$ when $w$ is a center word, and $u_w$ when $w$ is a context word.
Using a softmax function, we can calculate the probability of a context word $o$ given a center word $c$:

$$P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}$$

where $V$ is the vocabulary of the corpus.
A softmax function
The softmax function maps arbitrary values $x_i$ to a probability distribution $p_i$:

$$\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} = p_i$$
It is worth noting that the softmax function does not return a single value, but a distribution of probabilities over the entire vocabulary.
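As a minimal NumPy sketch (the vectors below are random placeholders, not trained embeddings), applying softmax to the scores $u_w^{\top} v_c$ yields a full distribution over the vocabulary:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability, then exponentiate and normalize.
    z = np.exp(x - np.max(x))
    return z / z.sum()

rng = np.random.default_rng(0)
d, vocab_size = 4, 6
U = rng.normal(size=(vocab_size, d))   # context ("outside") vectors u_w, placeholders
v_c = rng.normal(size=d)               # center vector v_c, placeholder

p = softmax(U @ v_c)                   # P(o | c) for every word o in the vocabulary
print(p, p.sum())                      # a probability distribution that sums to 1
```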
Since every word has two vectors, we can concatenate them to form a single vector. This is called a “word vector” or “word embedding”.
With $d$-dimensional word vectors and a vocabulary $V$, the model has $2d|V|$ parameters, i.e. $\theta \in \mathbb{R}^{2d|V|}$.
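A minimal sketch of the parameter count (the vocabulary size and dimension below are made-up numbers for illustration):

```python
import numpy as np

vocab_size, d = 10_000, 100            # illustrative |V| and embedding dimension
U = np.zeros((vocab_size, d))          # context ("outside") vectors u_w
V_center = np.zeros((vocab_size, d))   # center vectors v_w
print(U.size + V_center.size)          # 2 * d * |V| = 2,000,000 parameters
```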
This result makes perfect sense: the gradient of the loss with respect to the center word vector is the difference between the observed context vector and the context vector expected under the model’s current distribution.
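Concretely, differentiating the log-probability of an observed context word $o$ with respect to the center vector $v_c$ (in the notation above) gives:

$$\frac{\partial}{\partial v_c} \log P(o \mid c) = u_o - \sum_{w \in V} P(w \mid c)\, u_w$$

i.e. the observed context vector minus the expected context vector under the model’s distribution.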
It’s important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit in our word embeddings. Bias can be dangerous because it can reinforce stereotypes through applications that employ these models.
```python
pprint.pprint(wv_from_bin.most_similar(...))
```
Running the above code gives the following results:
```
[('reputation', 0.5250176787376404),
 ...]
```
The results show that the word vectors are biased. One explanation of how bias gets into the word vectors is that they are trained to predict a word from its context, so whatever biases appear in those contexts are absorbed into the vectors. A real-world example of this source of bias: “doctor” co-occurs more often with “he” than with “she” in the training corpus, so the learned vector for “doctor” ends up more strongly associated with “he” than with “she”.
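As a self-contained sketch of such a probe (the original call above is truncated; the pretrained model name and the query words here are assumptions for illustration), one could check which words the vectors place near “doctor” when shifted along the man/woman direction:

```python
import pprint
import gensim.downloader as api

# Pretrained GloVe vectors; the model name is an assumption for illustration.
wv_from_bin = api.load("glove-wiki-gigaword-200")

# Analogy-style probes in both directions; the query words are illustrative,
# not the truncated arguments of the original call.
pprint.pprint(wv_from_bin.most_similar(positive=["woman", "doctor"], negative=["man"]))
pprint.pprint(wv_from_bin.most_similar(positive=["man", "doctor"], negative=["woman"]))
```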
One way to debias word vectors is to pick a set of potentially biased (bias-neutral) words and a set of definitional pairs, then neutralize and equalize the vectors. Neutralizing removes the component of a neutral word’s vector along the bias direction; equalizing makes the two words in each definitional pair (e.g. “he”/“she”) equidistant from every neutralized word. A real-world example of this method: in the debiased vectors, “doctor” is equally associated with “he” and with “she”, because its component along the gender direction has been removed.
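A minimal NumPy sketch of the neutralize-and-equalize idea (the vectors and the bias direction below are random placeholders, and the equalize step is simplified; this is not a faithful reimplementation of the full method):

```python
import numpy as np

def neutralize(w, g):
    """Remove the component of word vector w along the bias direction g."""
    g = g / np.linalg.norm(g)
    return w - (w @ g) * g

def equalize(w1, w2, g):
    """Make a definitional pair (e.g. "he"/"she") differ only along the bias
    direction g, so both are equidistant from every neutralized word.
    (Simplified: omits the length renormalization of the full method.)"""
    g = g / np.linalg.norm(g)
    mu = (w1 + w2) / 2
    mu_orth = mu - (mu @ g) * g          # shared, bias-free component
    b = ((w1 - w2) @ g) / 2              # half the pair's gap along g
    return mu_orth + b * g, mu_orth - b * g

# Placeholder vectors for illustration only (not real embeddings).
rng = np.random.default_rng(0)
g = rng.normal(size=50)                  # e.g. the "he" - "she" direction
doctor, he, she = rng.normal(size=(3, 50))

doctor_db = neutralize(doctor, g)
he_eq, she_eq = equalize(he, she, g)
print(np.isclose(doctor_db @ he_eq, doctor_db @ she_eq))  # True: equally associated
```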