The model makes the same predictions at each position, since it uses the same parameters (word vectors) for every position in the text.
Idea: from the current value of $\theta$, calculate the gradient of $J(\theta)$, and take a small step in the opposite direction of the gradient. Repeat until convergence:

$$\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla_\theta J(\theta)$$

where $\alpha$ is the learning rate (step size).
Problem: $J(\theta)$ is the sum of the loss over all training examples, so its gradient $\nabla_\theta J(\theta)$ is very expensive to compute.
Solution: Stochastic Gradient Descent (SGD). At each step, sample a random training example and update $\theta$ based on the loss of that example.
The gradient of the loss of a single example is a noisy estimate of the true gradient. But it’s much cheaper to compute. And it’s often good enough.
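As a rough illustration (not the actual word2vec training code), here is a minimal numpy sketch of SGD for a generic objective; `theta`, `examples`, and `grad_single` are hypothetical placeholders for the parameters, training examples, and per-example gradient.

```python
import numpy as np

def sgd(theta, examples, grad_single, alpha=0.05, epochs=50, seed=0):
    """Stochastic gradient descent: update theta using one example at a time.

    grad_single(theta, example) returns the gradient of the loss on a single
    training example -- a noisy but cheap estimate of the true gradient.
    """
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(len(examples)):
            theta = theta - alpha * grad_single(theta, examples[i])  # step against the gradient
    return theta

# Toy usage: minimize the average of ||theta - x||^2 over a few points.
points = np.array([[1.0, 2.0], [3.0, 0.0], [2.0, 2.0]])
grad = lambda th, x: 2.0 * (th - x)      # gradient of ||theta - x||^2
print(sgd(np.zeros(2), points, grad))    # ends up near the mean of the points
```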
Why do we use two vectors for each word?
A word may occur more than once in a window, even as both the center word and a context word. If we used the same vector for every occurrence, the math would become complicated, so each word gets a separate center vector and context vector.
Two model variants:
- Skip-gram (SG): predict the context ("outside") words given the center word.
- Continuous Bag of Words (CBOW): predict the center word from a bag of context words.
Additional efficiency: hierarchical softmax and negative sampling.
Overall objective function (skip-gram with negative sampling, for one center/outside word pair):

$$J_t(\theta) = \log \sigma\left(u_o^\top v_c\right) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)}\left[\log \sigma\left(-u_j^\top v_c\right)\right]$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function and $\mathbb{E}_{j \sim P(w)}$ means the expectation over the noise distribution $P(w)$. Sample with $P(w) = U(w)^{3/4} / Z$, where $U(w)$ is the unigram distribution raised to the 3/4 power and $Z$ is the normalization constant. Taking the 3/4 power has the effect of dampening the differences between common and rare words.
Trick: the sigmoid function satisfies $\sigma(-x) = 1 - \sigma(x)$, so maximizing $\log \sigma(-u_j^\top v_c)$ for a sampled noise word is equivalent to maximizing $\log\big(1 - \sigma(u_j^\top v_c)\big)$, i.e. minimizing the probability that the noise word appears near the center word.
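A minimal sketch of the negative-sampling term (written as a loss to minimize) and the $P(w) = U(w)^{3/4}/Z$ noise distribution; the names `v_c` (center vector), `u_o` (true outside word vector), and `U_neg` (rows for the $k$ sampled noise words) follow the formulas above, and the random vectors in the usage example are purely illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def noise_distribution(unigram_counts):
    """P(w) = U(w)^{3/4} / Z: dampens the gap between common and rare words."""
    probs = np.asarray(unigram_counts, dtype=float) ** 0.75
    return probs / probs.sum()

def neg_sampling_loss(v_c, u_o, U_neg):
    """-log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c), i.e. minus J_t."""
    loss = -np.log(sigmoid(u_o @ v_c))
    loss -= np.sum(np.log(sigmoid(-U_neg @ v_c)))
    return loss

# Toy usage with random vectors and k = 5 noise words.
rng = np.random.default_rng(0)
d, k, vocab = 50, 5, 1000
counts = rng.integers(1, 10_000, size=vocab)      # fake unigram counts
P = noise_distribution(counts)
neg_ids = rng.choice(vocab, size=k, p=P)          # sample noise words from P(w)
U = rng.normal(scale=0.1, size=(vocab, d))        # "outside" vectors
v_c = rng.normal(scale=0.1, size=d)               # center vector
print(neg_sampling_loss(v_c, U[123], U[neg_ids]))
```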
Consider a large corpus of text. For each word $w$, count how often every other word appears in some window of words around it. This gives the co-occurrence matrix $X$.
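A small sketch of building such a window-based co-occurrence matrix; the tiny three-sentence corpus and the window size are just for illustration.

```python
import numpy as np

def cooccurrence_matrix(sentences, window=2):
    """Count, for each word, how often other words appear within `window` positions of it."""
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    X[idx[w], idx[s[j]]] += 1
    return X, vocab

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
X, vocab = cooccurrence_matrix(corpus, window=1)
print(vocab)   # row/column ordering of X
print(X)       # symmetric matrix of window co-occurrence counts
```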
The co-occurrence matrix contains two kinds of information:
- syntactic (part-of-speech) information, and
- semantic information.
Problem: The matrix is very large and very sparse. Thus, the results tend to be noisy and less robust.
Solution: Low-dimensional vectors.
Idea: store “most” of the important information in a fixed, small number of dimensions. (Usually 25-1000)
We could use SVD (Singular Value Decomposition) to reduce the dimensionality of the matrix:

$$X = U \Sigma V^\top$$

where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix. The columns of $U$ are the left singular vectors, the columns of $V$ are the right singular vectors, and the diagonal of $\Sigma$ contains the singular values. (Refer to linear algebra.)
We could then delete the smallest singular values and their corresponding columns in $U$ and $V$ to reduce the dimensionality of the matrix.
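A sketch of that truncation in numpy (numpy returns the singular values in descending order, so keeping the first $k$ columns keeps the largest singular values); the small matrix in the usage line is only a placeholder for a real co-occurrence matrix.

```python
import numpy as np

def svd_embeddings(X, k=2):
    """Keep only the k largest singular values/vectors of the co-occurrence matrix X."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(S) @ Vt
    return U[:, :k] * S[:k]                            # one k-dimensional vector per word (row)

# Usage: 2-D word vectors, e.g. for plotting.
X = np.array([[0., 2., 1.],
              [2., 0., 0.],
              [1., 0., 0.]])          # tiny placeholder co-occurrence matrix
print(svd_embeddings(X, k=2))
```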
To deal with overly frequent words, the raw counts can be scaled (e.g., log-scaled or capped) or replaced with correlations, as in the COALS model.
Count-based methods (e.g., LSA, COALS):

Pros:
- fast training;
- efficient usage of corpus statistics.

Cons:
- primarily capture word similarity;
- disproportionate importance given to large counts.

Direct-prediction methods (e.g., skip-gram, CBOW):

Pros:
- improved performance on downstream tasks;
- can capture complex patterns beyond word similarity.
Crucial insight: ratios of co-occurrence probabilities can encode meaning components.
Example:
| | x = solid | x = gas | x = water | x = fashion |
| --- | --- | --- | --- | --- |
| P(x \| ice) | large | small | large | small |
| P(x \| steam) | small | large | large | small |
| P(x \| ice) / P(x \| steam) | large | small | ≈ 1 | ≈ 1 |
We use a log-bilinear model to capture the ratios of co-occurrence probabilities:

$$w_i \cdot w_j = \log P(i \mid j)$$

with vector differences:

$$w_x \cdot (w_a - w_b) = \log \frac{P(x \mid a)}{P(x \mid b)}$$
GloVe is a model for distributed word representation. It is a log-bilinear model trained with a weighted least-squares objective:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \big( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \big)^2$$

The weighting function $f$ is used to dampen the influence of very frequent words; it is usually $f(x) = (x / x_{\max})^{3/4}$ for $x < x_{\max}$ and $f(x) = 1$ otherwise.
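A minimal numpy sketch that just evaluates the weighted least-squares objective $J$ above (not a full GloVe trainer); `W`, `W_tilde`, `b`, `b_tilde` and the random co-occurrence counts `X` are illustrative placeholders.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max)^alpha for x < x_max, else 1: dampens very frequent pairs."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """J = sum over pairs with X_ij > 0 of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    mask = X > 0                                           # only co-occurring pairs contribute
    scores = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    log_X = np.log(np.where(mask, X, 1.0))                 # avoid log(0) on masked entries
    return np.sum(glove_weight(X) * mask * (scores - log_X) ** 2)

# Toy usage with a random vocabulary of 8 words and 5-dimensional vectors.
rng = np.random.default_rng(0)
V, d = 8, 5
X = rng.poisson(2.0, size=(V, V)).astype(float)
W, W_t = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_t = np.zeros(V), np.zeros(V)
print(glove_loss(W, W_t, b, b_t, X))
```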
Pros:
- fast training;
- scalable to huge corpora;
- good performance even with a small corpus and small vectors.
Intrinsic evaluation: evaluate on a specific intermediate subtask (e.g., word analogies or word similarity). Fast to compute and helps to understand the system, but it is not clear whether it really helps unless its correlation with a real task is established.
Extrinsic evaluation: evaluate on a real downstream task. More convincing, but it can take a long time to compute, and if performance changes it can be unclear whether the word vectors or the rest of the system are responsible.
We could use a big collection of word analogy tasks: given a : b, find the d such that c : d holds the same relation. More specifically, man is to woman as king is to what? The answer is queen.
In mathematical language:

$$d = \arg\max_i \frac{(x_b - x_a + x_c)^\top x_i}{\lVert x_b - x_a + x_c \rVert \, \lVert x_i \rVert}$$

where the maximized quantity is the cosine similarity between $x_i$ and $x_b - x_a + x_c$.
One trick is to discard the input words $a$, $b$, $c$ from the search.
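A sketch of the analogy search: maximize cosine similarity to $x_b - x_a + x_c$ while discarding the three input words. Here `vectors` is a hypothetical {word: numpy array} dictionary; the random vectors in the usage line are placeholders, and only with real pretrained embeddings would the call return "queen".

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? by maximizing cosine similarity to x_b - x_a + x_c."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = (w for w in vectors if w not in {a, b, c})   # discard the input words
    return max(candidates, key=lambda w: cosine(vectors[w], target))

rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in ["man", "woman", "king", "queen", "apple"]}
print(analogy("man", "woman", "king", vectors))   # "queen" only with real embeddings
```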
More data helps, and Wikipedia is a good source.
300 dimensions is a good choice for the dimensionality of the word vectors.
Another intrinsic evaluation: word vector distances and their correlation with human judgements of similarity. Example dataset: WordSim353.
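A sketch of that evaluation: compute the cosine similarity for each word pair and report its Spearman correlation with the human scores. The word pairs, scores, and random vectors below are illustrative placeholders rather than the actual WordSim353 data.

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_eval(pairs, human_scores, vectors):
    """Correlate model cosine similarities with human similarity judgements."""
    cos = lambda u, v: (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    model_scores = [cos(vectors[a], vectors[b]) for a, b in pairs]
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

# Toy usage: random vectors and illustrative 0-10 human scores (real data: WordSim353).
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in ["tiger", "cat", "car", "bus"]}
pairs = [("tiger", "cat"), ("car", "bus"), ("tiger", "bus")]
human = [7.4, 6.5, 1.2]
print(similarity_eval(pairs, human, vectors))
```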
Most words have lots of meanings. For example, “bank” could mean a financial institution, or the side of a river.
Idea: Cluster word windows around words, retrain with each word assigned to multiple different clusters.
However, this method is complex: it first has to learn word senses and then learn word vectors in terms of those senses. Moreover, cutting the meaning of a word into discrete senses is crude; the senses overlap, and it is often not clear which ones to use.
Different senses of a word reside in a linear superposition (weighted sum) in standard word embeddings like word2vec.
$$v_w = \alpha_1 v_{w_1} + \alpha_2 v_{w_2} + \alpha_3 v_{w_3}, \qquad \alpha_i = \frac{f_i}{f_1 + f_2 + f_3}$$

where $f_i$ is the frequency of sense $w_i$.
Surprising result: because of ideas from sparse coding, we can actually separate out the senses (provided they are relatively common).
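A tiny numeric sketch of the superposition formula above; the sense vectors and frequencies are made-up placeholders for some word $w$ with three senses.

```python
import numpy as np

rng = np.random.default_rng(0)
sense_vectors = rng.normal(size=(3, 50))   # v_{w_1}, v_{w_2}, v_{w_3}: one vector per sense (made up)
freqs = np.array([120.0, 45.0, 10.0])      # f_1, f_2, f_3: how often each sense occurs (made up)

alphas = freqs / freqs.sum()               # alpha_i = f_i / (f_1 + f_2 + f_3)
v_w = alphas @ sense_vectors               # the single embedding is the frequency-weighted sum
print(alphas, v_w.shape)                   # mixture weights and the resulting 50-d vector
```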