The model makes the same predictions at each position, since it uses the same parameters (word vectors) for every position in the text.
Idea: from the current value of $\theta$, calculate the gradient of $J(\theta)$, and take a small step in the opposite direction of the gradient. Repeat until convergence:

$$\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla_\theta J(\theta)$$

where $\alpha$ is the learning rate (step size).
Problem: $J(\theta)$ is the sum of the loss over all training examples, so its gradient $\nabla_\theta J(\theta)$ is very expensive to compute.
Solution: Stochastic Gradient Descent (SGD). At each step, sample a random training example and update $\theta$ based on the loss of that example.
The gradient of the loss of a single example is a noisy estimate of the true gradient. But it’s much cheaper to compute. And it’s often good enough.
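As a rough illustration (not the actual word2vec training code), here is a minimal numpy sketch of SGD for a generic objective; `theta`, `examples`, and `grad_single` are hypothetical placeholders for the parameters, training examples, and per-example gradient.

```python
import numpy as np

def sgd(theta, examples, grad_single, alpha=0.05, epochs=50, seed=0):
    """Stochastic gradient descent: update theta using one example at a time.

    grad_single(theta, example) returns the gradient of the loss on a single
    training example -- a noisy but cheap estimate of the true gradient.
    """
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(len(examples)):
            theta = theta - alpha * grad_single(theta, examples[i])  # step against the gradient
    return theta

# Toy usage: minimize the average of ||theta - x||^2 over a few points.
points = np.array([[1.0, 2.0], [3.0, 0.0], [2.0, 2.0]])
grad = lambda th, x: 2.0 * (th - x)      # gradient of ||theta - x||^2
print(sgd(np.zeros(2), points, grad))    # ends up near the mean of the points
```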
Why do we use two vectors for each word?
A word may occur more than once in a window, even as both the center word and a context word. If we used the same vector for every occurrence, the math would become complicated, so each word gets a separate center vector and context vector.
Two model variants:
- Skip-gram (SG): predict the context ("outside") words given the center word.
- Continuous Bag of Words (CBOW): predict the center word from a bag of context words.
Additional efficiency: hierarchical softmax and negative sampling.
Overall objective function (skip-gram with negative sampling, for one center/outside word pair):

$$J_t(\theta) = \log \sigma\left(u_o^\top v_c\right) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)}\left[\log \sigma\left(-u_j^\top v_c\right)\right]$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function and $\mathbb{E}_{j \sim P(w)}$ means the expectation over the noise distribution $P(w)$. Sample with $P(w) = U(w)^{3/4} / Z$, where $U(w)$ is the unigram distribution raised to the 3/4 power and $Z$ is the normalization constant. Taking the 3/4 power has the effect of dampening the differences between common and rare words.
Trick: the sigmoid function satisfies $\sigma(-x) = 1 - \sigma(x)$, so maximizing $\log \sigma(-u_j^\top v_c)$ for a sampled noise word is equivalent to maximizing $\log\big(1 - \sigma(u_j^\top v_c)\big)$, i.e. minimizing the probability that the noise word appears near the center word.
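A minimal sketch of the negative-sampling term (written as a loss to minimize) and the $P(w) = U(w)^{3/4}/Z$ noise distribution; the names `v_c` (center vector), `u_o` (true outside word vector), and `U_neg` (rows for the $k$ sampled noise words) follow the formulas above, and the random vectors in the usage example are purely illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def noise_distribution(unigram_counts):
    """P(w) = U(w)^{3/4} / Z: dampens the gap between common and rare words."""
    probs = np.asarray(unigram_counts, dtype=float) ** 0.75
    return probs / probs.sum()

def neg_sampling_loss(v_c, u_o, U_neg):
    """-log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c), i.e. minus J_t."""
    loss = -np.log(sigmoid(u_o @ v_c))
    loss -= np.sum(np.log(sigmoid(-U_neg @ v_c)))
    return loss

# Toy usage with random vectors and k = 5 noise words.
rng = np.random.default_rng(0)
d, k, vocab = 50, 5, 1000
counts = rng.integers(1, 10_000, size=vocab)      # fake unigram counts
P = noise_distribution(counts)
neg_ids = rng.choice(vocab, size=k, p=P)          # sample noise words from P(w)
U = rng.normal(scale=0.1, size=(vocab, d))        # "outside" vectors
v_c = rng.normal(scale=0.1, size=d)               # center vector
print(neg_sampling_loss(v_c, U[123], U[neg_ids]))
```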
Consider a large corpus of text. For each word $w$, count how often every other word appears in some window of words around it. This gives the co-occurrence matrix $X$.
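A small sketch of building such a window-based co-occurrence matrix; the tiny three-sentence corpus and the window size are just for illustration.

```python
import numpy as np

def cooccurrence_matrix(sentences, window=2):
    """Count, for each word, how often other words appear within `window` positions of it."""
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    X[idx[w], idx[s[j]]] += 1
    return X, vocab

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
X, vocab = cooccurrence_matrix(corpus, window=1)
print(vocab)   # row/column ordering of X
print(X)       # symmetric matrix of window co-occurrence counts
```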
The co-occurrence matrix contains two kinds of information:
- syntactic (part-of-speech) information, and
- semantic information.
Problem: The matrix is very large and very sparse. Thus, the results tend to be noisy and less robust.
Solution: Low-dimensional vectors.
Idea: store “most” of the important information in a fixed, small number of dimensions. (Usually 25-1000)
We could use SVD (Singular Value Decomposition) to reduce the dimensionality of the matrix:

$$X = U \Sigma V^\top$$

where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix. The columns of $U$ are the left singular vectors, the columns of $V$ are the right singular vectors, and the diagonal of $\Sigma$ contains the singular values. (Refer to linear algebra.)
We could then delete the smallest singular values and their corresponding columns in $U$ and $V$ to reduce the dimensionality of the matrix.
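A sketch of that truncation in numpy (numpy returns the singular values in descending order, so keeping the first $k$ columns keeps the largest singular values); the small matrix in the usage line is only a placeholder for a real co-occurrence matrix.

```python
import numpy as np

def svd_embeddings(X, k=2):
    """Keep only the k largest singular values/vectors of the co-occurrence matrix X."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(S) @ Vt
    return U[:, :k] * S[:k]                            # one k-dimensional vector per word (row)

# Usage: 2-D word vectors, e.g. for plotting.
X = np.array([[0., 2., 1.],
              [2., 0., 0.],
              [1., 0., 0.]])          # tiny placeholder co-occurrence matrix
print(svd_embeddings(X, k=2))
```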
To deal with overly frequent words, the raw counts can be scaled (e.g., log-scaled or capped) or replaced with correlations, as in the COALS model.
Count-based methods (e.g., LSA, COALS):

Pros:
- fast training;
- efficient usage of corpus statistics.

Cons:
- primarily capture word similarity;
- disproportionate importance given to large counts.

Direct-prediction methods (e.g., skip-gram, CBOW):

Pros:
- improved performance on downstream tasks;
- can capture complex patterns beyond word similarity.
Crucial insight: ratios of co-occurrence probabilities can encode meaning components.
Example:
| | x = solid | x = gas | x = water | x = fashion |
| --- | --- | --- | --- | --- |
| P(x \| ice) | large | small | large | small |
| P(x \| steam) | small | large | large | small |
| P(x \| ice) / P(x \| steam) | large | small | ≈ 1 | ≈ 1 |
We use a log-bilinear model to capture the ratios of co-occurrence probabilities:

$$w_i \cdot w_j = \log P(i \mid j)$$

with vector differences:

$$w_x \cdot (w_a - w_b) = \log \frac{P(x \mid a)}{P(x \mid b)}$$
GloVe is a model for distributed word representation. It is a log-bilinear model trained with a weighted least-squares objective:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \big( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \big)^2$$

The weighting function $f$ is used to dampen the influence of very frequent words; it is usually $f(x) = (x / x_{\max})^{3/4}$ for $x < x_{\max}$ and $f(x) = 1$ otherwise.
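A minimal numpy sketch that just evaluates the weighted least-squares objective $J$ above (not a full GloVe trainer); `W`, `W_tilde`, `b`, `b_tilde` and the random co-occurrence counts `X` are illustrative placeholders.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max)^alpha for x < x_max, else 1: dampens very frequent pairs."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """J = sum over pairs with X_ij > 0 of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    mask = X > 0                                           # only co-occurring pairs contribute
    scores = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    log_X = np.log(np.where(mask, X, 1.0))                 # avoid log(0) on masked entries
    return np.sum(glove_weight(X) * mask * (scores - log_X) ** 2)

# Toy usage with a random vocabulary of 8 words and 5-dimensional vectors.
rng = np.random.default_rng(0)
V, d = 8, 5
X = rng.poisson(2.0, size=(V, V)).astype(float)
W, W_t = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_t = np.zeros(V), np.zeros(V)
print(glove_loss(W, W_t, b, b_t, X))
```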
Pros:
- fast training;
- scalable to huge corpora;
- good performance even with a small corpus and small vectors.
Intrinsic evaluation: evaluate on a specific intermediate subtask (e.g., word analogies or word similarity). Fast to compute and helps to understand the system, but it is not clear whether it really helps unless its correlation with a real task is established.
Extrinsic evaluation: evaluate on a real downstream task. More convincing, but it can take a long time to compute, and if performance changes it can be unclear whether the word vectors or the rest of the system are responsible.
We could use a big collection of word analogy tasks: given a : b, find the d such that c : d holds the same relation. More specifically, man is to woman as king is to what? The answer is queen.
In mathematical language:

$$d = \arg\max_i \frac{(x_b - x_a + x_c)^\top x_i}{\lVert x_b - x_a + x_c \rVert \, \lVert x_i \rVert}$$

where the maximized quantity is the cosine similarity between $x_i$ and $x_b - x_a + x_c$.
One trick is to discard the input words $a$, $b$, $c$ from the search.
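A sketch of the analogy search: maximize cosine similarity to $x_b - x_a + x_c$ while discarding the three input words. Here `vectors` is a hypothetical {word: numpy array} dictionary; the random vectors in the usage line are placeholders, and only with real pretrained embeddings would the call return "queen".

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? by maximizing cosine similarity to x_b - x_a + x_c."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = (w for w in vectors if w not in {a, b, c})   # discard the input words
    return max(candidates, key=lambda w: cosine(vectors[w], target))

rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in ["man", "woman", "king", "queen", "apple"]}
print(analogy("man", "woman", "king", vectors))   # "queen" only with real embeddings
```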
More data helps, and Wikipedia is a good source.
300 dimensions is a good choice for the dimensionality of the word vectors.
Another intrinsic evaluation: word vector distances and their correlation with human judgements of similarity. Example dataset: WordSim353.
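A sketch of that evaluation: compute the cosine similarity for each word pair and report its Spearman correlation with the human scores. The word pairs, scores, and random vectors below are illustrative placeholders rather than the actual WordSim353 data.

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_eval(pairs, human_scores, vectors):
    """Correlate model cosine similarities with human similarity judgements."""
    cos = lambda u, v: (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    model_scores = [cos(vectors[a], vectors[b]) for a, b in pairs]
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

# Toy usage: random vectors and illustrative 0-10 human scores (real data: WordSim353).
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in ["tiger", "cat", "car", "bus"]}
pairs = [("tiger", "cat"), ("car", "bus"), ("tiger", "bus")]
human = [7.4, 6.5, 1.2]
print(similarity_eval(pairs, human, vectors))
```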
Most words have lots of meanings. For example, “bank” could mean a financial institution, or the side of a river.
Idea: Cluster word windows around words, retrain with each word assigned to multiple different clusters.
However, this method is complex: it first has to learn word senses and then learn word vectors in terms of those senses. Moreover, cutting the meaning of a word into discrete senses is crude; the senses overlap, and it is often not clear which ones to use.
Different senses of a word reside in a linear superposition (weighted sum) in standard word embeddings like word2vec.
$$v_w = \alpha_1 v_{w_1} + \alpha_2 v_{w_2} + \alpha_3 v_{w_3}, \qquad \alpha_i = \frac{f_i}{f_1 + f_2 + f_3}$$

where $f_i$ is the frequency of sense $w_i$.
Surprising result: because of ideas from sparse coding, we can actually separate out the senses (provided they are relatively common).
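A tiny numeric sketch of the superposition formula above; the sense vectors and frequencies are made-up placeholders for some word $w$ with three senses.

```python
import numpy as np

rng = np.random.default_rng(0)
sense_vectors = rng.normal(size=(3, 50))   # v_{w_1}, v_{w_2}, v_{w_3}: one vector per sense (made up)
freqs = np.array([120.0, 45.0, 10.0])      # f_1, f_2, f_3: how often each sense occurs (made up)

alphas = freqs / freqs.sum()               # alpha_i = f_i / (f_1 + f_2 + f_3)
v_w = alphas @ sense_vectors               # the single embedding is the frequency-weighted sum
print(alphas, v_w.shape)                   # mixture weights and the resulting 50-d vector
```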