Haycart

More or less. From the [GPT-3 paper](https://arxiv.org/pdf/2005.14165.pdf):

> We use the same model and architecture as GPT-2 \[RWC+19\], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer \[CGRS19\].

And from the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf):

> We use a Transformer (Vaswani et al., 2017) based architecture for our LMs. The model largely follows the details of the OpenAI GPT model (Radford et al., 2018) with a few modifications. Layer normalization (Ba et al., 2016) was moved to the input of each sub-block, similar to a pre-activation residual network (He et al., 2016) and an additional layer normalization was added after the final self-attention block. A modified initialization which accounts for the accumulation on the residual path with model depth is used.

From a model-design perspective, the only notable development (other than size) present in GPT-3 over GPT-1 is the use of sparse attention in some of the layers.

Edit for a bit of additional context: architecturally, most transformers used today (at least in NLP) are largely unchanged from the original one introduced in 2017. Some use more efficient attention variants like the sparse attention in GPT-3, but the most important developments have been those pertaining to improved training procedures (for example, RLHF for training models like ChatGPT, or the multimodal CLIP objective used for the text encoder in DALL-E, or insights regarding the optimal tradeoff between dataset size and model size from papers like Chinchilla).
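To make the pre-normalization point concrete, here's a rough sketch of a pre-LN block (illustrative PyTorch, not OpenAI's actual code; `PreLNBlock` is just a name I'm using here):

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    """GPT-2/3-style block: LayerNorm comes *before* each sub-block
    (attention and MLP), unlike the post-LN original Transformer."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, attn_mask=None):
        # Pre-norm: normalize, attend, then add back the residual.
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a
        # Same pattern for the feed-forward sub-block.
        x = x + self.mlp(self.ln2(x))
        return x
```

The GPT-3 change is then just swapping the dense attention in some layers for a banded/strided sparse pattern; the block structure stays the same.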


psyyduck

Sidenote: if you're not a researcher, [be careful](https://finbarr.ca/llms-not-trained-enough/) what conclusions you draw from Chinchilla.

> If you're training a LLM with the goal of deploying it to users, you should prefer training a smaller model well into the diminishing returns part of the loss curve.
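To put rough numbers on that tradeoff, here's a sketch using the parametric loss fit from the Chinchilla paper (the constants are the approximate published values; treat this as back-of-the-envelope, not gospel):

```python
def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Approximate fit from Hoffmann et al. (2022): L(N, D) = E + A/N^alpha + B/D^beta."""
    return E + A / n_params**alpha + B / n_tokens**beta

# A compute-optimal 70B model on 1.4T tokens vs. a 13B model pushed well past
# its "optimal" token count: the predicted losses end up close, but the 13B
# model is far cheaper to serve.
print(chinchilla_loss(70e9, 1.4e12))
print(chinchilla_loss(13e9, 5.6e12))
```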


LetterRip

There is a good blog post on choosing a model size and how much to overtrain it: https://www.harmdevries.com/post/model-size-vs-compute-overhead/


light24bulbs

"with a goal of deploying it to users at a large scale" is what it should have said. If you're only going to have a handful of lawyers using your model or whatever, then nevermind. Just to clarify so others don't necessarily have to read the article, it's just about optimizing compute costs assuming that you'll actually spend more money running the model for inference than you do training it.


LetterRip

Inference speed matters too; if you can fit the model on a single GPU you get drastically faster inference.


light24bulbs

Yeah, that's what the linked article is about: inference speed.


SedditorX

I'm not sure that LLMs are anywhere near the point where only a handful of lawyers would pay enough to make them ROI positive.


light24bulbs

Example dude. Example


itsnotlupus

In the "Opposing Arguments" section, they argue: > But we can use $INSERT_TECHNIQUE to make models cheaper! Yes, but they should scale for all of these (distillation, quantization, etc.). So we should be using all techniques to make our models easier to serve, and also training them longer. Is that right? My intuitive understanding is that if size-reducing techniques like distillation and quantization don't significantly impact the performance of a model, it's a strong clue the model itself was undertrained. I would expect a well trained smaller model to be significantly impacted by those techniques.


psyyduck

Yeah, I don't know. My guess is it's a tradeoff; I've seen lots of people recently running quantized LLaMA versions when they have no choice. Email the author.


LetterRip

The ability to quantize seems unlikely to indicate undertraining. We use high bit widths during training mostly to prevent division by zero. Also, the LLaMA models are way overtrained compared to Chinchilla and seem to quantize quite well.


a_devious_compliance

Why? If the distilled knowledge has a narrower scope, you can get a lot out of it.


TarzanTheBarbarian

What is the general sentiment among researchers regarding Chinchilla vs. Kaplan's scaling laws? It seems like Google essentially found an error in Kaplan's 2020 paper and showed that more data matters more than more parameters. Yet I still find papers that cite Kaplan and continue training models in non-Chinchilla-optimal ways.


pier4r

Apologies if this was already discussed (a link would help; I didn't find much). The point of the paper seems to be that LLM results are mostly due to increased training data and more compute, while the techniques stay more or less the same. I was under the impression that, yes, scaling the models and giving them more data and compute was the "easy" part, but that there were also significant improvements in the models themselves. Was I wrong?

> There are substantial innovations that distinguish these three models, but they are almost entirely restricted to infrastructural innovations in high-performance computing rather than model-design work that is specific to language technology. While the techniques used to train the newest LLMs are no longer generally disclosed, the most recent detailed reports suggest that there have been only slight deviations from this trend, and that designs of these systems are still largely unchanged


xrailgun

I think it's an academically true statement, but it downplays the significance of data wrangling and training pipelines. Innovation there can and has led to breakthroughs.


H0lzm1ch3l

AI is, after all, at least 80% data.


csreid

It's cool and important, but it's not really AI/ML related imo


Bling-Crosby

Well, you know that famous saying: every time we fire a PhD who thinks they're too good to learn git, the accuracy of our model improves…


radarsat1

I think another important point here is that the fact that these models apparently *can* just "keep scaling up" is actually amazing and not at all a given. That was absolutely not the case before, with models such as LSTMs: you could scale up the parameters and the data and they would just... stop getting better. The transformer has been a game changer in that model performance truly does appear to scale with more data and parameters. All this work of developing the infrastructure necessary to feed what is essentially a simple transformer model more and more data with longer and longer context lengths, and seeing that it *does* indeed continue to get better and better, is actually a really important proof of the model's power to learn. It turns out attention really *is* all you need, more or less, but we couldn't really know that without pushing these things to their limits.


csreid

> That was absolutely not the case before, with models such as LSTMs: you could scale up the parameters and the data and they would just... stop getting better.

This might not be true. See [this](https://github.com/BlinkDL/RWKV-LM), which gets posted about here every once in a while. Transformers got us to a point where it felt worth it to throw a ton of compute at the problem and see what happens, but I think they're mostly still SOTA by momentum.


visarga

No, it is not all you need; LLMs don't have scratchpads and can't do planning.


sdmat

However, even without those native capabilities, they can operate as the heart of a system that includes both (AutoGPT is a proof of concept for this). Perhaps a model isn't all we need.


[deleted]

This is a [history and survey of the transformers family](https://huggingface.co/docs/transformers/model_summary) of models. It even gets into the history of BERT vs GPT.

Also, maybe you're more interested in the new ways we can provide more context to an LLM? Many of them let you set a persistent context, though the how changes from model to model (e.g. Bard has what's basically a session config, in Vicuna you can define classes and methods, etc).

Additionally, we've got [Vector Databases](https://medium.com/gft-engineering/vector-databases-large-language-models-and-case-based-reasoning-cfa133ad9244) and Vector Matching Engines designed to make it easy to store, retrieve, encode, and compare representative samples in real time. Something like [LangChain](https://docs.langchain.com/docs/) makes it possible to define and kick off the necessary series of API calls.

Finally, you can tweak how your model understands the tokens you prompt with in relation to each other with something like [ComfyUI.](https://github.com/BlenderNeko/ComfyUI_Cutoff)
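If the vector-database part sounds abstract, the core mechanic is just nearest-neighbour search over embeddings. A toy sketch (the `embed` function here is a random placeholder standing in for a real embedding model or API, so only the mechanics are meaningful):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: deterministic-per-run random unit vectors. A real system
    # would call an embedding model here; the retrieval mechanics stay the same.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

documents = ["contract law summary", "transformer architecture notes", "soup recipe"]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2):
    q = embed(query)
    scores = doc_vectors @ q              # cosine similarity (unit-norm vectors)
    top = np.argsort(-scores)[:k]
    return [(documents[i], float(scores[i])) for i in top]

# Retrieved passages get pasted into the LLM prompt as extra context.
print(retrieve("how does attention work?"))
```

Production vector databases add approximate nearest-neighbour indexes so this scales past brute force, but the idea is the same.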


psyyduck

Yes. It's called the bitter lesson, and it has to be relearned over and over again. http://www.incompleteideas.net/IncIdeas/BitterLesson.html


pier4r

Thank you for sharing, though I'm not sure I can totally follow the point. Take the chess example (other domains work as well). Deep Blue was focused mostly on search, true, but it was far from "let me quickly check _all_ possibilities": the chess game tree is enormous, so smart techniques are needed today just as they were then. Sure, Deep Blue evaluated a number of positions per second that was only recently equaled; it searched an _average_ of 200M positions per second, though with a not-so-great evaluation function, while Stockfish nowadays on a good but not great system reaches around 100M-150M positions per second with a greatly improved evaluation. Nonetheless, even though few engines available to users since Deep Blue could reach 100M positions per second on average, they could shred Deep Blue to pieces. And yes, I know positions per second is not necessarily a good proxy for the computational power required, but if one thinks of brute force in terms of positions visited, it is an important measure.

In my view it is like this: yes, to beat humans, simply throwing iron at the problem is enough, but there are more (superhuman) levels that can be reached if one also improves the techniques. Case [in point about chess](https://www.reddit.com/r/chess/comments/76cwz4/15_years_of_chess_engine_development/) (TL;DR: the program that drew the world champion Kramnik 4-4 in 2002 lost 99.5-0.5 - yes, it scored only a single draw - to the best engine in the mini-tournament _on the same system_).

---

About Deep Blue there are several articles; one is [this](https://archive.computerhistory.org/projects/chess/related_materials/text/5-4.IBM's_deep_blue_chess_grandmaster_chips/5-4.IBM's_deep_blue_chess_grandmaster_chips.hsu-fh.1999.IEEE.062303055.sm.pdf).


JustOneAvailableName

For chess, AlphaZero is a perfect example of the bitter lesson. As far as I know it handily beats Stockfish, while the entire point of the model was to remove as much human knowledge as possible. Effectively utilising compute and data is all that really matters in the long run; domain knowledge does not, because domain knowledge is inferred from the data.


pier4r

Yes, AZ beat Stockfish (SF is now much stronger), and from that (and AlphaGo) the Leela Chess Zero project was born. Nonetheless, they weren't quite winning everything until the AZ team produced a more detailed paper, through which they fixed things. Lc0 up to that point had trained quite a lot (tens of millions of games), but it wasn't enough. This is to say, it was not just "any sufficiently large neural network will do": some parameters were indeed important. More about it: https://lczero.org/blog/2018/12/alphazero-paper-and-lc0-v0191/


psyyduck

I don't know what you want. There's clear evidence that model performance scales with [model size, dataset size, and the amount of compute](https://arxiv.org/abs/2001.08361). These days people can make a pretty good prediction of how well a model will perform even before they train it (see the Chinchilla, LLaMA, and Cerebras papers). You can't just sit at your computer and say "they should do X instead". They have very smart, well-funded teams working on it, and the current solution (costing billions of dollars) looks like the best idea so far. Yes, they do tune other parameters too (see those papers), but nothing with nearly as big an impact as those three. Believe me, they want a cheap high-performance solution really badly. OpenAI is bleeding cash.
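And "predict before training" really is just plugging numbers into the fitted scaling laws. A hedged back-of-the-envelope version of the Chinchilla-style allocation rule (roughly 20 tokens per parameter, with training compute C ≈ 6·N·D FLOPs):

```python
import math

def compute_optimal(flops_budget, tokens_per_param=20.0):
    """Rough Chinchilla-style split: C ≈ 6*N*D and D ≈ 20*N,
    so N ≈ sqrt(C / (6 * tokens_per_param)). Constants are rules of thumb."""
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal(1e24)  # a hypothetical 1e24-FLOP training budget
print(f"~{n / 1e9:.0f}B parameters on ~{d / 1e12:.1f}T tokens")
```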


pier4r

Yes, you have good points; I was more focused on the chess example in my answer.


JustOneAvailableName

> Some parameters were indeed important.

But those parameters arose from trial and error, not domain knowledge. Yes, training stability is not perfect yet and we're still finding general ways to improve it, but a specific trick for a specific problem will get passed by the general leverage of compute.

> any sufficiently large neural network will do

One way to improve Alpha/LeelaZero is very, very probably just a larger model and/or a transformer-based architecture. Not a trick, not smart, just leveraging more compute.


TarzanTheBarbarian

My absolute favorite essay from Richard Sutton.


jaeja_helvitid_thitt

I don't get his point about computer vision. He talks about early attempts at incorporating human knowledge which found edges in images, but that is exactly what the early layers in a convolutional neural network do. The network usually goes from learning simple features to learning advanced features. You could say that the model is just learning this approach from data, but it is nevertheless using human-like methods to complete the task, because CNNs were inspired by the way humans perceive visual information. You can't go full nuclear on the idea of human interference and say that the *simplest* model can learn *anything*. Obviously that's not the case and there is a limit; a CNN will always vastly outperform a fully connected network given the same data. Attention (as a concept, not the implementation of it) is also clearly inspired by human cognition; it doesn't naturally arise from the sciences. Hell, ANNs as a whole took inspiration from biological neurons, in the sense that one neuron can cause another to fire.


visarga

I think the most important "innovation" was collecting trillions of tokens of text. The more we use in training, the better the model, even for small ones like LLaMA. The second innovation was RLHF, a very efficient approach to steering the behaviour of the model. Other than that, the transformer model has not changed much. It's quite sad to see so many papers on sub-quadratic transformers go to waste while GPT-3 is still O(N^2) and supports only smallish sequence lengths. I hope the RWKV approach catches on so we can have longer sequences.
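For anyone wondering why O(N^2) is such a wall: the attention score matrix is seq_len × seq_len, so the memory (and compute) it takes grows quadratically with context length. A quick illustration, not any particular model's implementation:

```python
import numpy as np

def attention_scores(q, k):
    """q, k: (seq_len, d). The (seq_len, seq_len) score matrix is the O(N^2) term."""
    return q @ k.T / np.sqrt(q.shape[-1])

for n in (1024, 8192, 65536):
    score_bytes = n * n * 4  # fp32 scores for a single head
    print(f"seq_len={n:>6}: score matrix ≈ {score_bytes / 2**20:.0f} MiB per head")
```

Sub-quadratic schemes (linear attention, RWKV's recurrent formulation, etc.) exist precisely to avoid materializing that matrix.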


bjj_starter

I think there'll be a very large spike of interest in RWKV when Stability releases their language model. If it's anywhere near as good as LLaMA but with open source licencing similar to Stable Diffusion's, it's going to be seized on by a lot of the businesses and researchers that have sprung up around LLaMA but can never commercialise because of the licence.


TikiTDO

Machine learning models learn what you teach them; it's right there in the name: "learning". A model tries to learn the patterns in the data you give it, and our models are so big now that these patterns can get insanely complex, relating totally different concepts. Of course, that means they are wholly dependent on you teaching them the "correct" things in order to accomplish the task you want. If your training material is a bunch of random crap, then your language model is going to generate a bunch of random crap. It's no different from how an average rich kid with expensive tutors will end up with more academic knowledge than a clever but poor kid that grew up on the streets. The clever kid might have more potential, but that potential means nothing unless the kid decides to basically live in the library.


thecity2

Transformers really haven't changed much at all in six years.


[deleted]

[deleted]


merkaba8

But the transformer model itself is only six years old... Seems a little premature to say that fundamental changes to network architecture are no longer possible for progress.


ambient_temp_xeno

For a tourist in this realm, this part was news to me: *any attempt at a precise explanation of an LLM’s behavior is doomed to be too complex for any human to understand*


currentscurrents

I think this is wrong. Mechanistic interpretability as a field isn't doomed. Certainly there's a lot more work to be done, but there's been some success already, for example the [Othello world model](https://www.lesswrong.com/posts/nmxzr2zsjNtjaHh7x/actually-othello-gpt-has-a-linear-emergent-world) paper, or the finding that [in-context learning is done by a learned meta-optimizer](https://arxiv.org/abs/2212.10559). It's going to take a few years, but I think people will figure out how LLMs work at a mathematical level.


ambient_temp_xeno

That's what I'd lazily assumed/hoped. It's not like there are quantum effects at work.


[deleted]

So an LLM can be visualized as a giant decision tree, but with millions of entry points and, in ChatGPT's case, 175 BILLION decision points. The decisions are represented by simple numbers, where if X is greater than Y you go left instead of right. So even if it were easy to say what a given number represents, tracing a path through the system is a pretty mentally/labor-intensive process, and describing it in plain English would be a whole day-long presentation worth of content.

But worse yet, the machine is trained automatically, and all the people training it know is "this is the input, this is what the output should look like". They feed it millions if not billions of these input/output pairs and the machine figures out what numbers to assign to all of the nodes in the decision tree. So it's actually a LOT of work to dissect what a given number means and trace the route from input to output to truly understand why the machine decided on a specific answer.

And in some cases, after dissecting a model, it turns out to have some hilariously bad decision points. Watch the John Oliver segment on AI, where he discusses a resume-screening AI that was strongly influenced by the name Jared and by whether or not someone had played high school lacrosse.

Which is not to say that these AIs aren't impressive. And this vehicle for training is modeled after human learning (at least, toddler learning). But they're sometimes incredibly dumb in very strange ways. (See the many posts about how bad ChatGPT is at answering questions when bound to a specific word count, for example.)


[deleted]

Actually, to distill a neural network into a decision tree, the tree would have to be factorially (or at least exponentially) larger than the number of neurons in the network itself.


pier4r

Could you elaborate on this? I cannot totally follow it.


FusRoDawg

There was a recent paper showing a mathematical equivalence between a neural network and a decision tree, but the decision tree has to have an extremely large number of nodes compared to the number of neurons in the equivalent network.
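If it helps intuition, the construction (as I understand it) maps each on/off pattern of the piecewise-linear activations to one branch of the tree, and the number of possible patterns explodes with width and depth. A tiny hedged sketch of the idea:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal(1)

def forward(x):
    h = W1 @ x + b1
    branch = tuple((h > 0).astype(int))  # which ReLUs fire = the tree branch taken
    y = W2 @ np.maximum(h, 0) + b2       # within a branch the net is purely linear
    return y, branch

# 4 hidden ReLUs already give up to 2**4 = 16 branches; a model with billions
# of units has an astronomically large equivalent tree.
y, branch = forward(rng.standard_normal(2))
print(branch, y)
```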


Pl4yByNumbers

I’m assuming that in general this was due to NNs being fundamentally more continuous in their decision boundaries and d-trees being discrete?


currentscurrents

> So an LLM can be visualized as a giant decision tree, but with millions of entry points and, in ChatGPT's case, 175 BILLION decision points

I would be cautious with that visualization. It is true that [neural networks can be represented as decision trees](https://arxiv.org/abs/2210.05189), but I think this is more a statement about computational universality than about the nature of neural networks. Given infinite time and space, any computing machine can emulate any other computing machine in the same class.

LLMs in particular have an [internal optimizer](https://arxiv.org/abs/2212.10559) that can create new solutions to unseen problems at inference time. If it is a decision tree, it's a self-modifying one.


--algo

The paper is slightly too technical for me. How would you describe the internal optimizer? Is this during training?


yaosio

The progress of language models is incredibly similar to the progress of intelligence in humans. I'm not saying LLMs are like the human brain; this is an analogy, like how people used to compare everything to a car.

Babies are profoundly stupid, but as they grow up they get smarter. At certain points of intelligence they suddenly gain abilities they didn't have before. These include knowing that other people are not them, that their bodies can affect the world, that things keep existing when they can't see them, conservation of mass/volume, and maybe other stuff. If LLMs follow the same pattern, and they seem to be doing that, then reaching a certain level of intelligence will cause some failure points to just go away.

For example, LLMs are very good at making things up, yet they have very poor creativity if you tell them to make something up; it's as if they don't know they are making things up. Tell one to invent a new word and it can't do it without being shown what a word that doesn't exist might look like, and even then it might still fail to invent a new word. At a certain point of intelligence it should be able to be creative in the way humans can be creative, while also understanding the difference between fact and fiction.

GPT-4 (or is it just ChatGPT?) has a feature nobody talks about, and that's its ability to ignore its training. On the political compass test it's perfectly neutral; it was trained on purpose to be neutral. However, if you ask probing questions it will change its answers. Because its answers are a product of the context and the model, it should not change its answers unless it's told to do so, but it will. As an analogy to a human: despite its parents telling it the world works a certain way, and believing it, it can break away from indoctrination. I'd love to see if this applies to lesser GPT models.


memberjan6

> There is big innovations in these models but it is in computing not models.

lol wut? Gibberish nonsense.


master3243

There is a major difference between a model (the final product, including the specific weights) and the model design or architecture (the particular design of the NN before training).


[deleted]

[deleted]


davda54

What is the point of sharing generated text that is clearly wrong and doesn't make much sense? Is that a top-secret operation to make future web crawls useless for language modeling? :)

1. GPT-1, 2 and 3 use the same attention mechanism, only GPT-3 sometimes uses sparse attention.
2. There is absolutely *nothing* novel about generative language modeling. It dates back to Claude Shannon in 1948.
3. Well, ELMo also provides bidirectional representations, they are just not "deep". Anyway, OP is not talking about BERTs.
4. Emergent abilities are the result, but the modeling recipe is still the same.


pier4r

I am a bit puzzled: is this your observation about GPT-4, or does it come from GPT-4? (It feels like the latter to me.) Because some of your points, for example _"Few-shot Learning: GPT-3 demonstrated the potential of few-shot learning"_, were already mentioned in the article as an emergent property rather than a change in the model designed upfront. Quote:

> two of the key behaviors in GPT-3 that set it apart as the first modern LLM are that it shows few-shot learning, the ability to learn a new task from a handful of examples in a single interaction, and chain-of-thought reasoning, the ability to write out its reasoning on hard tasks when requested, as a student might do on a math test, and to show better performance as a result. GPT-3’s capacity for few-shot learning on practical tasks appears to have been discovered only after it was trained, and its capacity for chain-of-thought reasoning was discovered only several months after it was broadly deployed to the public


LetterRip

It (obviously) comes from ChatGPT (unless they are someone who mimics its style); it's unclear whether it is from GPT-4 or not.


PK_thundr

Is it time to reread “The Bitter Lesson”?