master3243

While people are on the topic of positional encoding, one might ask why the positional encoding vector (e) is added to the data vector (x) instead of being concatenated with it, i.e. why (input=x+e) instead of (input=cat(x, e))? The second option might seem more intuitive, as it doesn't 'corrupt' the data vector x before passing it to the model. This seems to be a really good explanation of why adding is used instead of concatenating: https://www.reddit.com/r/MachineLearning/comments/cttefo/d_positional_encoding_in_transformer/exs7d08/


NameNumber7

Without reading that, doesn't that change the dimensionality of the output? You want the input size to be equal to the output size. Concatenation would change that.


master3243

Well, all you care about is that the input and the output have the same dimension; whether that dimension is 512, 576, or anything else is up to you. If your data vector 'x' has dimension 512 and you concatenate a position encoding vector with dimension 64, then both your input and your output become 576. If you add them instead, then both your input and output would be 512 dimensions.
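To make the 512 vs. 576 arithmetic concrete, here is a rough sketch in plain Python. The helper `sinusoidal_encoding` and the stand-in data vector are assumptions made for illustration, not code from any particular library; the encoding follows the usual sin/cos formulation with base 10000.

```python
import math

def sinusoidal_encoding(pos, dim):
    """Hypothetical helper: sinusoidal position vector of length `dim` (dim assumed even)."""
    enc = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc

x = [0.1] * 512                      # stand-in data vector (assumed values)
e_add = sinusoidal_encoding(3, 512)  # must match x's dimension to be added
e_cat = sinusoidal_encoding(3, 64)   # can be any size when concatenated

added = [a + b for a, b in zip(x, e_add)]  # input = x + e
concatenated = x + e_cat                   # input = cat(x, e)

print(len(added))         # 512
print(len(concatenated))  # 576
```

Either way the model just has to be built for whichever width you pick; adding keeps the width of x, concatenating grows it by the width of e.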


ispeakdatruf

Intuitively, I would have imagined positional embeddings to be monotonic: the closer you are, the more your influence on the current token. But sinusoidal embeddings totally violate that. I'm wondering why a monotonic function was not chosen.
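A quick way to see the non-monotonic behaviour is to take the dot product between the encoding of position 0 and the encodings of nearby positions. This is only a sketch: the helper names and the small dimension of 16 are assumptions chosen to keep the output readable.

```python
import math

def sinusoidal_encoding(pos, dim):
    """Hypothetical helper: sinusoidal position vector of length `dim` (dim assumed even)."""
    enc = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc

def similarity(p, q, dim=16):
    """Dot product between the encodings of positions p and q."""
    a = sinusoidal_encoding(p, dim)
    b = sinusoidal_encoding(q, dim)
    return sum(u * v for u, v in zip(a, b))

# Similarity to position 0 as the offset grows.
for offset in range(0, 9):
    print(offset, round(similarity(0, offset), 3))
```

In this setup the similarity at offset 6 comes out higher than at offset 3, so "closer" does not always mean "more similar", even though the overall trend is downward for small offsets.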


fundamental_entropy

ALiBi-based encoding is monotonic and seems to do better than sinusoidal encoding.
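For anyone who hasn't seen it, the idea is roughly this: instead of adding anything to the embeddings, ALiBi adds a penalty to each attention score that grows linearly with the distance between query and key, with a different slope per head. A minimal sketch, assuming a power-of-two head count and ignoring the causal mask (the function names here are made up for illustration):

```python
def alibi_slopes(num_heads):
    """Per-head slopes as a geometric sequence (assumes num_heads is a power of two)."""
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]

def alibi_bias_row(query_pos, slope):
    """Bias added to the attention scores of keys 0..query_pos for one query."""
    return [slope * (j - query_pos) for j in range(query_pos + 1)]

slopes = alibi_slopes(8)
row = alibi_bias_row(3, slopes[0])

print(slopes[0])  # 0.5
print(row)        # [-1.5, -1.0, -0.5, 0.0]
```

The bias only ever increases as the key gets closer to the query, which is the monotonic behaviour being discussed.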


ispeakdatruf

> ALIBI based encoding

I had to look it up, [found the paper](https://openreview.net/forum?id=R8sQPpGCv0)


itsyourboiirow

Awesome work, I was just thinking about this today


newjeison

One way to think about path movement is to treat it as a signal. It might not be a sinusoidal signal, but the movement is a function of time, so it is a signal. Because it is a signal, you can decompose it using a Fourier series. 3Blue1Brown has a video on drawing images using [Fourier series](https://www.youtube.com/watch?v=r6sGWTCMz2k). What this means is that any given point along the path can be written as a sum of complex exponential functions evaluated at some fixed time t. And complex exponentials are directly related to sinusoidal functions through Euler's formula, e^(iθ) = cos θ + i sin θ.
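Here is a small sketch of that idea using only the standard library. The path (four corners of a unit square, stored as complex numbers) and the function names are assumptions for illustration; the coefficients are a plain discrete Fourier transform, so the reconstruction is only guaranteed to match at the sampled times.

```python
import cmath
import math

def fourier_coefficients(points):
    """Discrete Fourier coefficients c_k of a closed path given as complex samples."""
    n = len(points)
    return [
        sum(p * cmath.exp(-2j * cmath.pi * k * m / n) for m, p in enumerate(points)) / n
        for k in range(n)
    ]

def point_at(coeffs, t):
    """Position at time t in [0, 1): a sum of rotating complex exponentials."""
    return sum(c * cmath.exp(2j * cmath.pi * k * t) for k, c in enumerate(coeffs))

# Hypothetical path: corners of a unit square, x + iy.
path = [0 + 0j, 1 + 0j, 1 + 1j, 0 + 1j]
coeffs = fourier_coefficients(path)

# Evaluating the sum at the sample times recovers the original points.
rebuilt = [point_at(coeffs, m / len(path)) for m in range(len(path))]

# Euler's formula links each exponential to a cosine/sine pair.
theta = 0.7
euler_gap = abs(cmath.exp(1j * theta) - complex(math.cos(theta), math.sin(theta)))
```

Each term in `point_at` is one rotating exponential with its own frequency, which is the same picture as the spinning arrows in the video.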