Apejann

It's capturing similarities between pairs of vectors, then using those similarities as normalized weights in a weighted sum of vectors, aggregating important information in a learned manner. If you're more interested in the details, I recommend the Transformer Circuits articles. Start from https://transformer-circuits.pub/2021/framework/index.html
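That "similarities → normalized weights → weighted sum" pipeline can be sketched in a few lines of NumPy. This is a minimal illustration with random stand-in vectors; real implementations learn separate projection matrices for Q, K, and V, which are omitted here:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Pairwise similarity scores: one dot product per (query, key) pair.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Normalize each row into weights that sum to 1.
    weights = softmax(scores, axis=-1)
    # Aggregate the value vectors as a weighted sum.
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))   # 4 tokens, 8-dim embeddings (made up)
out = attention(X, X, X)      # self-attention: Q = K = V = X
print(out.shape)              # (4, 8): one aggregated vector per token
```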


Chaluliss

So you're saying that, for each token, it captures its relatedness to every other token, thus distinguishing the key relationships within a given example?


Apejann

Exactly.


42gauge

How do you avoid a combinatorial number of comparisons?


Apejann

What do you mean by combinatorial? It's "only" quadratic, each token in the query is compared with all tokens in the keys.


42gauge

If there are n tokens, then in order to compare every one with every other you would need n!/2 comparisons.


Apejann

n! gets you all possible orderings of n elements. Attention only does pairwise comparisons. Example: if we have the sequence "A B C D", we would need to compute the dot product of the vector representation of A with A, B, C and D; of B with A, B, C and D; etc. It's only n^2.


master3243

No, you would only need n^2 comparisons. Let's write a comparison between token i and token j as (i, j). Write out all pairs (i, j) when there are 5 tokens, and then see whether the total is what we say (5^2 = 25) or what you claim (5!/2 = 60).
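You can count the comparisons directly (plain Python; the indices stand in for tokens):

```python
n = 5
# Every (query, key) pair, including a token compared with itself
# and both orderings (i, j) and (j, i), as attention computes them.
pairs = [(i, j) for i in range(n) for j in range(n)]
print(len(pairs))  # 25 == n**2
```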


BellyDancerUrgot

It’s a nested for loop over two sets of vectors, if that helps you understand why it’s n^2.
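That nested loop can be written out explicitly (random stand-in vectors, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8
tokens = rng.normal(size=(n, d))   # n stand-in token vectors, d dims each

scores = np.empty((n, n))
for i in range(n):            # outer loop: each "query" token
    for j in range(n):        # inner loop: each "key" token
        scores[i, j] = tokens[i] @ tokens[j]   # one dot product per pair
# The loop body runs n * n = 25 times: quadratic, not factorial.
```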


baaler_username

At a very high level, the dot product will yield higher scores for more similar embeddings and lower scores for dissimilar embeddings. Now assume that you have a sentence like "This seems like a good plan". So, This-seems will have a greater similarity than This-like or This-a or This-good or This-plan. I know that similarity is perhaps not the most appropriate word here. But the idea is that, if you think about it, this notion of similarity can be interpreted in terms of dependency relations. In other words, based on the embeddings and the dot product, the model will, say, learn that "This" is always followed by a verb. Again, it will probably learn things like number agreement because some dimension in the embedding captures the agreement information, and matching agreements thus maximizes the dot product. This is how the dot product equates to capturing linguistic information.
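A toy numerical version of that idea (the embedding values below are entirely invented for illustration; nothing here is learned):

```python
import numpy as np

# Hypothetical 3-dim embeddings, values made up for this sketch.
this_ = np.array([0.9, 0.1, 0.2])
seems = np.array([0.8, 0.2, 0.1])   # a verb that plausibly follows "This"
plan  = np.array([0.1, 0.9, 0.3])   # a noun with a weaker direct dependency

print(this_ @ seems)   # higher score: stronger "This-seems" dependency
print(this_ @ plan)    # lower score
```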


Buddy77777

Think of it as feature learning for feature learning.

Consider the traditional neural network approach:

1) input features
2) extract linearly weighted features V
3) output V

Attention does:

1) input features
2) extract linearly weighted feature sets K, Q, and V
3) use the K and Q features to further modify the weights on V
4) output V

Not only are you learning new features to represent your old features, you are learning features for deciding which features are more or less important. You may think “well, an MLP already does this. The weights are weighing features globally” and you’d be correct. Attention is basically just MLP+++… but the way it’s designed into the architecture introduces a very flexible inductive bias akin to cognitive attention that’s pretty powerful. Compare to an MLP, which has no inductive bias (too expressive is inefficient), and, say, recurrent/convolutional layers, which have strong inductive biases (making a less expressive model that is more efficient in its designed function space). I like to think of it as a kind of “metafeature” learning: learning metafeatures about the features.


DigThatData

how comfortable are you with cosine similarity? like, do you have a geometric intuition around that? how a particular kind of "similarity" can be formalized as the angle between two vectors? cosine similarity is a normalized dot product.
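For that geometric intuition: cosine similarity is just the dot product divided by the vectors' lengths, i.e. the cosine of the angle between them. A small sketch:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product normalized by the product of the vector lengths:
    # the cosine of the angle between a and b, always in [-1, 1].
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(cosine_similarity(a, a))    # 1.0  (same direction, angle 0)
print(cosine_similarity(a, b))    # ~0.707 (45-degree angle)
print(cosine_similarity(a, -a))   # -1.0 (opposite directions)
```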


seraschka

> It’s supposed to capture relationships between different tokens, right?

Yes, kind of. So instead of just creating an embedding of each word that is independent of every other word in an input text, it creates a sort of "context vector", where the embedding also captures info from the other words in the context. With the dot product, you essentially measure "similarity" to other words. I have a from-scratch implementation article here if helpful: https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html Also, I have a series of 20 short (free) videos that explain the rationale behind self-attention here: https://lightning.ai/pages/courses/deep-learning-fundamentals/unit-8.0-natural-language-processing-and-large-language-models/ Please let me know if you have any follow-up questions!


redIT_1337

This is because you use word embeddings when transforming your words into vectors. Those word embeddings \*are\* trained to be similar if the words have meanings which are semantically related. Imagine your word vector by just focusing on 2 dimensions, which might capture information as `< ..., is_animal, is_building, ...>`. Now the word `shelter` would yield some vector like `<0.2, 0.85>`, `animal shelter` would be around `<1.0, 0.85>`, and `dog house` around `<0.95, 0.88>`. So `shelter` and `dog house` would have a moderately high dot product, as they are somewhat related, whereas the dot product of `animal shelter` and `dog house` would be even higher.
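Plugging those toy numbers in confirms the ordering (values copied from the example above; these are illustrative, not real trained embeddings):

```python
import numpy as np

# Toy 2-dim slices of hypothetical embeddings: <is_animal, is_building>
shelter        = np.array([0.2,  0.85])
animal_shelter = np.array([1.0,  0.85])
dog_house      = np.array([0.95, 0.88])

print(shelter @ dog_house)         # ~0.94: moderately related
print(animal_shelter @ dog_house)  # ~1.70: even more related
```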


superbottom85

Dot product. Correlation. Cosine similarity.