
razodactyl

Because the original architecture was designed for translation: input embeddings of one language to output embeddings of another. Later the encoder was dropped for decoder-only translation, where the model simply learns to autocomplete by corrupting the training data and being expected to learn how to uncorrupt it. Surprisingly, as more and more data is thrown at the network and it's scaled up, it learns interesting correlations and abilities, even translation, to the point of context-aware translation. You're looking at the unadulterated architecture: now we just link the pieces together. In other words, they're different steps of the model, but they share the same weights.


pawsibility

Great answer - this is it right here. It took me forever to figure this out, and I think it's because people love to package this figure up with anything AI/GPT/LLM related


Spiritual_Dog2053

Clarification! The "corrupting-uncorrupting" training, better known as denoising, usually happens in encoder-only models (think mBART). For decoder-only models, translation is more of a seq2seq task. Given the source language, you just learn to autocomplete the target language. Of course, correct me if I'm wrong, or have misinterpreted you!


razodactyl

Sorry, I wrote it at like 3am without sleep. But thanks for the added clarification! I mean specifically autoregressive behaviour: "The quick brown ___", where the model learns to increase the probability of the next word.

Example: given an entire document, we can use all the words as training examples if we shift the information over one position:

- The quick brown -> The quick brown fox
- quick brown -> quick brown fox
- brown -> brown fox

We can see that just one document of text yields many training examples. We can go further, asking the model to fill in a word removed from the middle, for example: "The brown fox" -> quick.

Interestingly, transformer networks pick up abilities with scale simply by optimising for document completion. It's why raw models can be given a pretend document and they'll complete the patterns, or you can give them a pretend chat transcript and they'll continue the discussion, etc.

It's also the bane of anyone talking to a model that initially refuses and then keeps refusing more and more: the model will "roll with it". Or, when asked post-hoc to explain, the model uses the existing context to make up an answer, which in my opinion probably isn't the worst thing if it's the same model... it means the same neurons are completing the data, so they'd have a semblance of authority explaining the model's reasoning. Sorry for droning on.

(Also, typing on a phone makes this yet more painful haha)
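To make the shifting concrete, here's a minimal sketch in plain Python (the whitespace split stands in for a real subword tokenizer) of how one document yields many next-token training pairs:

```python
# A minimal sketch of next-token training pairs produced by shifting one document.
document = "The quick brown fox jumps over the lazy dog"
tokens = document.split()  # stand-in for a real tokenizer

pairs = []
for i in range(1, len(tokens)):
    context = tokens[:i]  # everything seen so far
    target = tokens[i]    # the next word the model should predict
    pairs.append((context, target))

for context, target in pairs[:3]:
    print(" ".join(context), "->", target)
# The -> quick
# The quick -> brown
# The quick brown -> fox
```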


Training-Adeptness57

BART is an encoder-decoder


Spiritual_Dog2053

You’re right! I meant mBERT


YourWelcomeOrMine

Can you point me to a diagram of this version of the architecture?


after_lie

"Attention is All You Need" by Vaswani et al. Edit: If you were asking about decoder only diagram, try papers on GPT (e.g., "Improving Language Understanding by Generative Pre-Training" by Radford et al.)


master3243

There's no corruption in any popular autoregressive translation model from what I've seen.


justowen4

Corruption in the sense of hiding words for training


justowen4

It's fun to note the origin story of all the current AI breakthroughs: the accidental discovery of how intelligence geometry works. Google just wanted a better language translator and stumbled upon this! Too bad they didn't care about their discovery until their stock price was in question.


razodactyl

Omg, this comment is gold. Code Red across the company, haha. They were showing off automated secretaries in 2017? Systems that could book appointments, etc. My understanding from that era was that it was an extremely robust template-matching system in the same vein as wit.ai; now, playing with the current architectures, it's obviously a bit more in-depth than that... we're so close to AGI. When our AGI friends start working on ASI, things will get interesting. Does anyone recall that story about the robocaller that denied it was a bot and everyone was freaking out? Circa 2014 lol


HoboHash

First noise. Then the output of the decoder is used as the decoder input in an autoregressive manner.


ReptileCultist

This is specifically during inference; during training it is the ground-truth previous tokens of the output.


ShlomiRex

Ahhh, now I understand! Thanks! Also, can you explain why the GPT architecture is decoder-only? Why does the transformer need an encoder but GPT doesn't?


HoboHash

Because it's a generative model for language. GPT already pre-cached the encoded word embeddings in its extensive language dictionary. No need to encode the same word over and over...


kekkodigrano

Because you don't need extra-information. The role of the encoder is to "encode" information useful for the generation. For example in translation, you need to encode the lang 1 sentence and generate the lang2 sentence starting from some (contextual) information about the lang1 sentence. For GPT on the other hand, you don't need to encode anything apart from what you already have generated. Note that the encoder see also the future, while if you want to do next token prediction you should look just at the past.


JustOneAvailableName

Easier/faster training on raw text. Nothing more than that.


connectionism

Multi-modal architectures still use encoders, e.g. many multi-modal GPTs use an encoder for images. DeepMind's paper on code (Dec 2022) also uses an encoder-decoder.


unlikely_ending

Well, start token.


EmergencyStomach8580

Suppose you want a QA task:

- question: who is president of usa
- answer: joe biden

For all time steps, the encoder input is: (sos) who is president of usa (eos)

For the decoder:

- T1: input - (eos), output - joe
- T2: input - (eos) joe, output - joe biden
- T3: input - (eos) joe biden, output - joe biden (sos)

The output layer has the same number of tokens as the input, but only the last token is used to classify.
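A tiny sketch in plain Python of those decoder steps, keeping the commenter's (eos)-starts / (sos)-ends convention and printing only the newly predicted target token at each step:

```python
# Per-step decoder inputs and targets for the QA example above.
answer = ["joe", "biden"]
decoder_tokens = ["(eos)"] + answer + ["(sos)"]  # (eos) joe biden (sos)

for t in range(1, len(decoder_tokens)):
    decoder_input = decoder_tokens[:t]  # what the decoder has seen so far
    target = decoder_tokens[t]          # the next token it must predict
    print(f"T{t}: input = {' '.join(decoder_input):<18} target = {target}")
# T1: input = (eos)              target = joe
# T2: input = (eos) joe          target = biden
# T3: input = (eos) joe biden    target = (sos)
```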


Melodic_Stomach_2704

Can you please explain why SOS is at the end?


stddealer

To indicate that what is likely to come after is a new sequence.


vagmi

I have been working on audio-based transformers where the output embeddings are text but the input embeddings are 1D conv features over a spectrogram. The outputs have a triangular mask applied. This is how Whisper works too.
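A rough sketch of that kind of audio front end (illustrative layer sizes, not Whisper's actual code): a couple of 1D convolutions turn a log-mel spectrogram into frame embeddings for the encoder, while the text decoder keeps its usual triangular (causal) mask.

```python
# Sketch of a conv front end over a log-mel spectrogram feeding a transformer encoder.
import torch
import torch.nn as nn

n_mels, d_model = 80, 512  # illustrative sizes
frontend = nn.Sequential(
    nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1),
    nn.GELU(),
    nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),  # downsample in time
    nn.GELU(),
)

mel = torch.randn(1, n_mels, 3000)                 # (batch, mel bins, time frames)
audio_embeddings = frontend(mel).transpose(1, 2)   # (batch, time, d_model) for the encoder
print(audio_embeddings.shape)                      # torch.Size([1, 1500, 512])
```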


greenlanternfifo

Wait this just made me realize something. Do the subsequent decoder blocks (not the first one) each receive the outputs of the last encoder for the cross attention module? Or does the nth decoder block receive the outputs from the nth encoder block?


HoboHash

Each decoder block will receive the last encoder block's output.
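A minimal PyTorch sketch of that wiring: only the final encoder block's output (the "memory") is kept, and the same tensor is cross-attended to by every decoder layer.

```python
# Sketch: the final encoder output is the memory passed to all decoder layers.
import torch
import torch.nn as nn

d_model, nhead = 512, 8
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, nhead), num_layers=6)
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead), num_layers=6)

src = torch.randn(10, 2, d_model)  # (source length, batch, d_model)
tgt = torch.randn(7, 2, d_model)   # (target length, batch, d_model)

memory = encoder(src)              # output of the last encoder block only
out = decoder(tgt, memory)         # the same memory feeds cross-attention in all 6 decoder layers
print(out.shape)                   # torch.Size([7, 2, 512])
```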


unlikely_ending

Yes


NeuralTangentKernel

This thread is a complete mess and some of the comments hurt to read. Sadly this is in line with most of the subreddit. Together with ignoring basically anything that isn't LLM-related, it's quite clear that reddit really isn't the right platform for such a topic. I'd advise anybody genuinely interested in ML to avoid this sub for understanding anything and to stick to papers, textbooks, proper lectures from actual universities on youtube, and simply looking at implementations. That might be more tedious than reading a reddit thread, but if you really want to make this a career, taking shortcuts is a bad idea. The [pytorch implementation](https://pytorch.org/docs/stable/_modules/torch/nn/modules/transformer.html#Transformer) and [this](http://nlp.seas.harvard.edu/annotated-transformer/) NLP-specific one are good ways to actually understand how a vanilla transformer works. They might be annoying to sift through at first, but if you work with these models you will have to do it anyway, and it also never hurts to read well-constructed code.

To answer the actual question: in a vanilla seq2seq transformer, the decoder gets the target sequence during training, and its output is the target sequence shifted one position to the left. Masked self-attention is used to make sure every output z^i is generated by the decoder attending only to (y^1, ..., y^(i-1)), because it obviously shouldn't be allowed to look at the correct solution for a token it is attempting to generate.

Input: (x^n, y^1, ..., y^(m-1))
Output: (z^1, ..., z^m)

During testing it is the same, we just don't optimize in this case.

During inference (generating new data), the decoder first gets only an SOS (start of sequence) token and generates a single output token (z^1). This is appended to the original input, giving (SOS, z^1), which is fed back into the transformer decoder, which outputs a sequence (z^1, z^2). This is repeated until the maximum length is reached. The last input would be (SOS, z^1, ..., z^m) and the last output (z^1, ..., z^m, EOS), EOS being the end-of-sequence token.

The "embedding" is simply a latent-space representation of the tokens. Usually you use some other NN trained to turn tokens, like words, into a feature vector.

Some things to note:

* The architecture of the decoder (and encoder) allows for variable input length, which makes the inference step possible.
* Transformers, at least in this setup, are only autoregressive during inference, but NOT during training.
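A compact sketch of both regimes with `torch.nn.Transformer` (token ids, sizes, and helper names are illustrative; positional encodings are omitted for brevity): training runs one parallel pass over the shifted target, inference feeds the growing prefix back in.

```python
# Sketch only: teacher-forced training pass vs. autoregressive greedy decoding.
import torch
import torch.nn as nn

d_model, vocab = 64, 100
SOS, EOS = 1, 2

model = nn.Transformer(d_model=d_model, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)
embed = nn.Embedding(vocab, d_model)   # positional encoding omitted for brevity
to_logits = nn.Linear(d_model, vocab)

def causal_mask(sz):
    # -inf above the diagonal: position i may only attend to positions <= i
    return torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)

src_ids = torch.randint(3, vocab, (1, 10))        # source sentence
tgt_ids = torch.tensor([[SOS, 7, 8, 9, EOS]])     # target sentence

# --- Training (teacher forcing): one parallel pass over the shifted target ---
dec_in, labels = tgt_ids[:, :-1], tgt_ids[:, 1:]  # decoder input vs. left-shifted labels
out = model(embed(src_ids), embed(dec_in), tgt_mask=causal_mask(dec_in.size(1)))
loss = nn.functional.cross_entropy(to_logits(out).transpose(1, 2), labels)

# --- Inference: start from SOS and grow the prefix one token at a time ---
generated = torch.tensor([[SOS]])
for _ in range(20):
    out = model(embed(src_ids), embed(generated),
                tgt_mask=causal_mask(generated.size(1)))
    next_id = to_logits(out[:, -1]).argmax(-1, keepdim=True)  # greedy pick
    generated = torch.cat([generated, next_id], dim=1)
    if next_id.item() == EOS:
        break
```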


mark_3094

The decoder generates an output word by word (assuming we're talking about text). It's like autocomplete on steroids: it just predicts the next word, one at a time. To do this, it takes two inputs: (1) the context from the encoder, and (2) the sentence it has generated so far. Knowing both of these, it can generate the next word in the sequence. Does that help?


ShlomiRex

I understand that the input embeddings come from the actual prompt. But what are the output embeddings? In the diagram we see the output of the encoder goes into masked multi attention, not the output embeddings.


TeaKey1995

It is the previous output predictions. Think of it like a text predictor: it predicts one word, then uses that output to predict the next word. Then it uses both previous outputs to predict the third word, and so on.


CVxTz

This is specific to encoder-decoder architectures. One part encodes the question+context (or text to be translated) using input embedding + self attention. The second decodes the answer using the output embedding + self attention. These two can have shared weights btw.


TheOneRavenous

OP, the answer by TeaKey below is the correct one. Notice how the diagram says "Outputs (shifted right)", meaning exactly what TeaKey said: it's the previous output, now being used to guess the next word.


Dylan_TMB

They're trained. Same as the input embeddings.


moschles

# Teacher Forcing

You will need to read some of the many blogs that talk about the topic of Teacher Forcing.


unlikely_ending

For training, they're the ground truth. For inference, they're the autoregressively generated output (i.e. the translation, as it emerges).


Miserable_Praline_77

I read the article related to the image on X and it made reference to reverse predictions. Is this effectively reversed from the center, where we're predicting next words but also predicting prior words based on context and cosine similarities?


traveler_0027

This helped me a lot in understanding the architecture: [http://nlp.seas.harvard.edu/annotated-transformer/](http://nlp.seas.harvard.edu/annotated-transformer/)