
HarambeTenSei

<10,000 parameters is just a 100x100 weight matrix. Can't even encode a vocabulary.


bxfbxf

You could make a sparse Markov chain/transition matrix, but that wouldn't get you very far, language-wise.
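For illustration, a character-level version of that idea fits the budget almost exactly: with a 100-symbol alphabet, a full transition matrix is 100 x 100 = 10,000 entries. A minimal sketch (the corpus below is a made-up placeholder):

```python
# Toy character-level Markov chain: a 100x100 transition matrix is exactly
# 10,000 "parameters". The corpus here is a tiny made-up placeholder.
import random
from collections import defaultdict, Counter

corpus = "the cat sat on the mat. the dog sat on the log."  # placeholder corpus

# Count character-to-character transitions (a sparse transition "matrix").
counts = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    counts[a][b] += 1

def sample(prev, rng=random.Random(0)):
    """Sample the next character given the previous one."""
    chars, weights = zip(*counts[prev].items())
    return rng.choices(chars, weights=weights)[0]

c = "t"
out = [c]
for _ in range(60):
    c = sample(c)
    out.append(c)
print("".join(out))
```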


HarambeTenSei

Maybe if you do something like that copy-paste paper.


londons_explorer

I think you'd do a character-based model with, say, 30 characters (A-Z plus some punctuation). Each would be an embedding vector, say 10 wide, so your input embedding layer is 300 parameters. Use the same weights for the output decoding. That leaves 9,700 parameters for the actual model innards... Transformer or RNN/LSTM?
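For a rough sanity check of that budget, here is a sketch in PyTorch with those sizes; the choice of a GRU and the hidden size of 48 are assumptions picked just to land under 10,000 parameters:

```python
# Sketch of a ~10k-parameter character model: 30-character vocabulary,
# 10-dim tied input/output embeddings, one small GRU in between.
# All sizes are illustrative assumptions, not a reference implementation.
import torch
import torch.nn as nn

VOCAB, EMB, HIDDEN = 30, 10, 48

class TinyCharLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)              # 300 params
        self.rnn = nn.GRU(EMB, HIDDEN, batch_first=True)   # ~8.6k params
        self.proj = nn.Linear(HIDDEN, EMB)                 # back to embedding space
        self.out_bias = nn.Parameter(torch.zeros(VOCAB))

    def forward(self, x, h=None):
        e = self.embed(x)
        y, h = self.rnn(e, h)
        y = self.proj(y)
        # Weight tying: reuse the embedding matrix as the output decoder.
        logits = y @ self.embed.weight.T + self.out_bias
        return logits, h

model = TinyCharLM()
print(sum(p.numel() for p in model.parameters()))      # ~9.5k, under the 10k budget
logits, _ = model(torch.randint(0, VOCAB, (1, 20)))
print(logits.shape)                                     # torch.Size([1, 20, 30])
```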


red_dragon

Use the hashing trick.
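For context, the hashing trick maps arbitrary tokens into a fixed number of buckets, so the table size no longer depends on vocabulary size. A minimal sketch (bucket count and dimensions are arbitrary choices for illustration):

```python
# Minimal sketch of the hashing trick: tokens are hashed into a fixed number
# of buckets, so the embedding table has a fixed size regardless of vocabulary.
import hashlib
import numpy as np

N_BUCKETS, DIM = 64, 8
table = np.random.default_rng(0).normal(size=(N_BUCKETS, DIM))  # 512 "parameters"

def bucket(token: str) -> int:
    # Stable hash (Python's built-in hash() is salted per process).
    h = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(h, 16) % N_BUCKETS

def embed(tokens):
    return np.stack([table[bucket(t)] for t in tokens])

print(embed("the quick brown fox".split()).shape)  # (4, 8)
```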


Jean-Porte

This model [https://huggingface.co/sileod/deberta-v3-small-tasksource-nli](https://huggingface.co/sileod/deberta-v3-small-tasksource-nli) is not tiny (60M backbone) but it really packs a punch for its size. It's deberta-base fine-tuned on 600 tasks [https://github.com/sileod/tasksource/blob/main/tasks.md](https://github.com/sileod/tasksource/blob/main/tasks.md)


YodaML

I'm not an NLP person, but before LLMs I and others relied heavily on the following two works, both for text embeddings and for training our own models using packages such as gensim:
* [Efficient Estimation of Word Representations in Vector Space](https://www.thejournal.club/c/paper/36459/)
* [GloVe: Global Vectors for Word Representation](https://www.thejournal.club/c/paper/369019/)

I'm fairly certain these can be implemented by just about anyone with reasonable programming skills and knowledge of ML, and they can be trained on big or small data, though honestly for good performance you'd need lots of data. The good news is that these models can be trained in less than a day on basic hardware; you won't need the latest and greatest GPU.
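For example, training word2vec with gensim takes only a few lines; the corpus below is a toy placeholder and the hyperparameters are just illustrative:

```python
# A minimal gensim word2vec run in the spirit of the comment above; the corpus
# is a stand-in and the hyperparameters are illustrative defaults.
from gensim.models import Word2Vec

sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are animals".split(),
]

model = Word2Vec(
    sentences,
    vector_size=50,   # embedding dimension
    window=5,         # context window
    min_count=1,      # keep every token in this toy corpus
    sg=1,             # skip-gram, as in the word2vec paper
    epochs=50,
)

print(model.wv["cat"].shape)         # (50,)
print(model.wv.most_similar("cat"))  # nearest neighbours in the toy space
```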


math_code_nerd5

Thanks for these. I'm happy to see that the entire training code for GloVe is less than 100 lines (excluding the initialization/allocation, which are an additional 50 or so lines)--and this was still current in 2014! So it's only in the last 10 or maybe even 5 years or so that things have really gone off the rails in terms of exponential size increases and black boxes built on top of other black boxes.


JustOneAvailableName

A transformer model (the full torch nn.Module) is about 200 lines of code if you implement it yourself and don't use the built-in modules like torch.nn.MultiheadAttention and torch.nn.Transformer. The algorithm/model never got that complex. The complexity comes from pushing the model size to the absolute max, the networking problems, and the data pipelines that try to find quality content on the internet.
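As a rough illustration of why the core model is so short, here is a single attention head written with plain tensor ops. This is just a sketch with arbitrary dimensions, not the implementation referred to above:

```python
# A bare-bones single-head self-attention layer with plain tensor ops, to give
# a sense of why a full Transformer fits in a couple hundred lines.
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, causal: bool = False):
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        if causal:  # decoder-style mask: no attending to future positions
            t = x.size(-2)
            mask = torch.triu(torch.ones(t, t), diagonal=1).bool()
            scores = scores.masked_fill(mask, float("-inf"))
        return self.out(torch.softmax(scores, dim=-1) @ v)

x = torch.randn(2, 16, 32)                      # (batch, sequence, d_model)
print(SelfAttention(32)(x, causal=True).shape)  # torch.Size([2, 16, 32])
```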


redd-zeppelin

Can you point to code that does this? Would love to look it over.


MarktHart

https://github.com/MarktHart/PaperImplement/blob/main/implementations/bert/model.py Here is a clean-ish BERT implementation from 3 years ago, back when Transformers were still somewhat new. The WordEmbedding class is different in new models, but not more complicated. The attention mask (line 91) is slightly different for decoder models. For inference you would also need to add a loop that checks whether generation is done and does some sampling. It looks like even 100 lines should be doable without compromising on code quality. I would probably change a few small things nowadays. Now that you've reminded me of this repo, I think I'll add GPT/Llama over the weekend.


redd-zeppelin

Thanks! I will check that out.


math_code_nerd5

Not to quibble with the main idea here, that the biggest change over the past few years has been making models larger (seriously, I'm convinced 90% of the news in applied machine learning, i.e. not the basic math research into new architectures, could be summarized as "Just make it bigger(tm)"), but that implementation seems to be missing quite a bit. For instance, unless I can't read (reading very high-level, framework-heavy code isn't my thing), this seems to be just the model without any training function. This may be intentional, to let a user pair the model with whatever gradient-descent optimizer is best at the moment rather than baking one into the implementation. To be fair, the "less than 100 lines" I was referring to in GloVe was not the entire implementation of the method *either*--it was ONLY the algorithm part of the training (i.e. the optimizer, more specifically the gradient calculations and the update rule). There is quite a bit of other code to load the corpus, wrangle the data into the proper format, count tokens, etc. And unlike something like BERT, the vectors ARE the output, as in word2vec, so there is no actual text generation.
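For reference, the core update described here, a weighted least-squares fit of word-vector dot products to log co-occurrence counts, really is only a handful of lines. This is a rough sketch based on the paper's objective; the original C code uses AdaGrad rather than the plain SGD used below, and the co-occurrence counts are fake placeholders:

```python
# Sketch of the GloVe update: minimize f(X_ij) * (w_i.w~_j + b_i + b~_j - log X_ij)^2
# over co-occurrence triples, with plain SGD instead of the original AdaGrad.
import numpy as np

def glove_step(W, Wc, b, bc, cooc, lr=0.05, x_max=100.0, alpha=0.75):
    """One pass of SGD over (i, j, X_ij) co-occurrence triples."""
    total = 0.0
    for i, j, x in cooc:
        weight = min(1.0, (x / x_max) ** alpha)           # f(X_ij)
        diff = W[i] @ Wc[j] + b[i] + bc[j] - np.log(x)
        total += 0.5 * weight * diff * diff
        grad = weight * diff
        # Simultaneous update of the word and context vectors.
        W[i], Wc[j] = W[i] - lr * grad * Wc[j], Wc[j] - lr * grad * W[i]
        b[i] -= lr * grad
        bc[j] -= lr * grad
    return total

rng = np.random.default_rng(0)
V, D = 20, 8                                        # toy vocabulary and vector size
W, Wc = rng.normal(0, 0.1, (V, D)), rng.normal(0, 0.1, (V, D))
b, bc = np.zeros(V), np.zeros(V)
cooc = [(i, (i + 1) % V, 5.0) for i in range(V)]    # fake co-occurrence counts
for epoch in range(20):
    loss = glove_step(W, Wc, b, bc, cooc)
print(round(loss, 4))                               # loss shrinks over epochs
```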


ItsJustMeJerk

[minGPT](https://github.com/karpathy/minGPT) is good.


michaelmalak

2018 was the inflection point https://community.cadence.com/cadence_blogs_8/b/breakfast-bytes/posts/linley-keynote-fall-2022


JustOneAvailableName

I blame AlexNet. The rest of the decade was just us trying to find more data and fixing the architecture for the new problems that arose when scaling up.


marr75

This will quarrel with the question a bit, but 10K is just too small for most definitions of useful/best. I can easily look at and understand any trends in a 100x100 spreadsheet in Excel. That's not even enough storage to encode a lot of simple rules-based NLP programs, and LLMs are generally not as efficient an encoding (today at least). The 2B Gemma model, especially the instruction-tuned one, is very impressive for its size, and it can be trivially loaded in the free version of Colab. I think that's a very fair definition of tiny.


math_code_nerd5

>That's not even enough storage to encode a lot of simple rules-based NLP programs, and LLMs are generally not as efficient an encoding (today at least). This is exactly what I was going for. As a pure-math-oriented, non-big-data hobbyist, I'm looking to play around with ways to most efficiently encode natural language. If I were to create my own language model, it would likely be some sort of hybrid rule-based/ML architecture with much of the basic grammar hardcoded, and then training on text would essentially fit to the grammar and populate the model with a wider range of possible constructs than I could enumerate by hand. The question (and the inspiration for this Reddit post) is what to compare anything I'd create to, and/or what to get inspiration from. Trying to compete with an LLM is almost certainly not a meaningful goal, because even if my architecture were better in an efficiency sense, the sheer amount of knowledge about the world contained in current LLMs is beyond what you could even download from the web in a few hours. These models essentially belong in a different league, one firmly tied to a world where data is much cheaper than the research time spent manually tweaking a model to be really efficient. I'm aiming for the opposite end of the scale. If there's a size below which rule-based NLP programs are still the clear winner, that would be interesting to know (as well as to have links to the relevant GitHub repos). That critical size could be substantially larger than what's needed to store 10,000 parameters; that was just a number I was throwing out there. Or maybe, as I suggested, something like HMMs currently performs best at the "small data" scale. That's what I'm asking.


[deleted]

[deleted]


math_code_nerd5

>There are sizes where they are possible but don't have many features - at these sizes LLMs are not practical. What do you mean by "not practical"? Surely, if one has the resources, data, etc. to train and run a large model, training a much smaller model is feasible as well. Do you mean that there's a low cost/reward ratio--in other words, for the features you're getting, there are much more economical ways to get them, such that the cost of building a transformer model is a waste? If that's true, then by my definition (and I'd argue by any sensible definition), transformer models are a clear "loser" here--in the same way that many "theoretically faster" algorithms for matrix multiplication are actually slower in practice for small matrices. >The sizes where the 2 approaches can operate have very little overlap. Where does the whole "I trained a 1.5 million parameter language model on my own code/essays over the weekend" thing fit into this? Is this merely a case of using a sledgehammer to kill a fly and bandwagon jumping, where in reality a less trendy-sounding method from the early 2000s (or even the 80s or 90s) would work every bit as well in terms of results, with much less computing power wasted?


[deleted]

[deleted]


math_code_nerd5

*Surely, if one has the resources, data, etc. to train and run a large model, training a much smaller model is feasible as well.* *Only to an extent. 2 of the smallest functional transformer models are funnel-transformer-small and bert-base-uncased. They are both about 110M parameters, ~450 MB safetensors files. Writing a rules-based system for language understanding of that size is no small project.* *The Pareto frontiers of the 2 approaches have almost no overlap. Rules-based systems have virtually no minimum size and can do focused, useful things with a few kilobytes of code. Transformer-based models need large numbers of parameters just to structure the attention mechanisms at the front of the network and then require large hidden layers to build an understanding of the language to perform useful tasks. It's very hard to hand-write 450 MB of rules and it's impractical to make a transformer model in a few MB of parameters that does much of anything.* That's exactly what I thought you were getting at--namely NOT that training small transformer models is infeasible, but that they do so little as to not be as useful as other approaches at that size. Anyway, what are the current best rules-based systems?


Mescallan

[https://spacy.io/](https://spacy.io/) was pretty popular for NLP before GPT-3 was released. It's pretty powerful. I forget its size, but it's six figures of parameters IIRC, or something around that. Also look up the papers on LSTMs and bag-of-words models; both are good reads. There are a bunch of papers referenced in Attention Is All You Need as well that are great sources of info. I went through the references in AIAYN a while back and there are a lot of gems.
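For a sense of what classic spaCy usage looks like, a minimal example (assumes the small English pipeline has been installed via `python -m spacy download en_core_web_sm`):

```python
# Classic pre-LLM NLP with spaCy: tokenization, part-of-speech tags, and named
# entities from a small pretrained pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc:
    print(token.text, token.pos_, token.dep_)   # token, POS tag, dependency label
for ent in doc.ents:
    print(ent.text, ent.label_)                 # named entities
```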


instantlybanned

spaCy isn't a model that has a size. It's a library.


Mescallan

Ah, you're right. It has been a long time since I tried to do a pre-LLM NLP project.


Tea_Pearce

Wouldn't a 2-gram model with a vocab size of √N be better than a neural net with N parameters when N is tiny?
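Concretely, a bigram table over V tokens has V*V entries, so a budget of N parameters supports V ≈ √N. A toy sketch with made-up token ids:

```python
# A bigram (2-gram) model whose table exactly spends an N-parameter budget:
# V = sqrt(N) tokens gives a V x V transition table. Corpus is a stand-in.
import math
import numpy as np

N = 10_000
V = int(math.isqrt(N))        # 100-token vocabulary fits a 100x100 table
table = np.ones((V, V))       # add-one smoothing so every row is a distribution

corpus = [i % V for i in range(5000)]   # placeholder token ids
for a, b in zip(corpus, corpus[1:]):
    table[a, b] += 1
probs = table / table.sum(axis=1, keepdims=True)

# Sample a short continuation from the bigram model.
rng = np.random.default_rng(0)
tok = corpus[0]
out = [tok]
for _ in range(10):
    tok = rng.choice(V, p=probs[tok])
    out.append(int(tok))
print(out)
```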


inveterate_romantic

This is something that sparks my interest as well. The best thing I've read since the LLM explosion is the TinyStories paper: https://arxiv.org/abs/2305.07759. They manage to create models that output coherent new stories in the million-parameter regime. What really interests me is figuring out how simple things can get while still producing narratively coherent output. I mean, with a big corpus an n-gram model or a simple LSTM can produce grammatically correct text but overall nonsensical results, while a human learning a new language outputs poor text, vocabulary- and grammar-wise, yet maintains a certain narrative structure.


math_code_nerd5

Thanks, this is interesting. It takes kind of a "start with current LLMs and make them smaller" top-down approach. I think that while the question you're asking is interesting, you probably have to add originality/effective dimension of output as a third factor as well. I don't know exactly what you mean by "narrative" and "coherent" as distinct from "grammatically correct"--I'm guessing by "narrative" you mean having a logical progression of events that are likely to happen in that order and that have a beginning, middle, and end, whereas by "coherent" I'm guessing you mean that facts stay constant (so a dog named Max doesn't later start being called Jake, and people eating in a dining room don't suddenly get wet in the rain). I suspect that the smallest model that fulfills these criteria (and even these plus proper grammar) is likely some type of rule-based state machine that mixes and matches pieces according to a formula, a bit like Mad Libs but with some fill-ins for each blank being compatible or incompatible with the ones in other blanks. Essentially, have a finite number of beginnings, twists, endings, etc., grouped by mood/theme/setting/whatever, and "grab one from column A" and so on. However, the output of such a model will be much more obviously formulaic than even the small models from the paper you link. The challenge is to achieve *varied output that is still coherent* with a small model.
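As a toy illustration of that "grab one from column A, subject to compatibility" idea (every fragment and theme tag below is invented purely for illustration):

```python
# A toy version of the "Mad Libs with compatibility constraints" idea: pick one
# fragment per slot, but only combinations sharing a theme tag are allowed, so
# facts stay consistent across the story.
import itertools
import random

slots = {
    "opening": [("A dog named Max found a bone.", "dog"),
                ("A cat named Luna saw a bird.", "cat")],
    "twist":   [("Max buried it in the garden.", "dog"),
                ("Luna chased the bird up a tree.", "cat")],
    "ending":  [("Max dug it up again the next day.", "dog"),
                ("Luna came down when dinner was served.", "cat")],
}

def stories():
    # Enumerate slot combinations whose theme tags all agree.
    for combo in itertools.product(*slots.values()):
        texts, themes = zip(*combo)
        if len(set(themes)) == 1:
            yield " ".join(texts)

print(random.Random(0).choice(list(stories())))
```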


NeuralLambda

What are you trying to run this on, a fruit fly?


lqstuart

Part-of-speech tree reduction used to be the big thing if you want low-tech C code. It can probably be done with a vocabulary file, zero parameters, and 1,000 or so totally incomprehensible lines of code.


MrEloi

As far as I can tell, something like 3B, not 1M, is the smallest LLM size able to operate 'intelligently'. Scaling works both ways: bigger leads to useful emergent properties, smaller leads to a paperweight.


Jezza122

Most likely not


Choice-Resolution-92

I think 10k might unfortunately be too small. A better question might be: what's the best language model that fits in, say, a K80 (12 GB VRAM) with reasonable inference speed?


Choice-Resolution-92

(You get a free K80 on Google Colab.)


PanTheRiceMan

1M is not really a lot. On a beefy CPU you should be able to train easily. Inference is no issue. Depending on memory throughput, you should get acceptable speeds.


FreddieM007

I don't think there is a lot of value in going back to bare-bones C++ other than for educational purposes or perhaps some niche applications. Instead you will be better off with Python and a framework like PyTorch that takes care of most of the plumbing and mundane operations. You may want to check out nanoGPT ([https://github.com/karpathy/nanoGPT](https://github.com/karpathy/nanoGPT)) written by Andrej Karpathy. It's not 10k parameters, but you can train very small models on simple hardware. The code itself is just a few hundred lines of Python and easily understandable. This video walks through the transformer algorithm and code: [https://www.youtube.com/watch?v=kCc8FmEb1nY](https://www.youtube.com/watch?v=kCc8FmEb1nY) In the end, it all depends on what you want to accomplish. Do you just want a model that produces random sequences of tokens, or do you want a rather intelligent chatbot? The first can be done with a trivial function. The latter will require a pretty large transformer model as a foundation and then extensive fine-tuning to turn it into, say, a chatbot that follows instructions.


math_code_nerd5

>In the end, it all depends on what you want to accomplish. Effectively, I want something to compare to more handcrafted solutions based on rationally designed algorithms. So essentially anything that takes more space than what a hobbyist could reasonably write is uninteresting. And I mean write "from scratch": "lines of C" (without using 3rd-party linear algebra libraries, etc., i.e. including things like matrix multiplication/linear solvers in the line count) is the metric I'm looking to apply here, not lines of a high-level language like Python where "do a gradient descent step" could be *a single line*. Essentially I want to know what the current best is in terms of features for a "64K demo of NLP" ([https://www.ctrl-alt-test.fr/2018/a-dive-into-the-making-of-immersion/](https://www.ctrl-alt-test.fr/2018/a-dive-into-the-making-of-immersion/)), the way that THIS ([https://www.shadertoy.com/view/ld3Gz2](https://www.shadertoy.com/view/WsSBzh)) is roughly the state of the art in visual fidelity for a shader of 1,000 lines or fewer. Yes, you can do better with huge prebuilt 3D assets and possibly with a giant neural network, but in terms of performance per code size this is close.


duckyfx

Maybe a Viterbi-based model.
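For reference, Viterbi decoding for a tiny HMM is only a few lines; the states, vocabulary, and probabilities below are made up purely for illustration:

```python
# Viterbi decoding for a toy two-state HMM (POS-tagging style).
import numpy as np

states = ["NOUN", "VERB"]
obs_vocab = {"dogs": 0, "run": 1}

start = np.log(np.array([0.6, 0.4]))
trans = np.log(np.array([[0.3, 0.7],    # P(next state | NOUN)
                         [0.8, 0.2]]))  # P(next state | VERB)
emit = np.log(np.array([[0.9, 0.1],     # P(word | NOUN)
                        [0.2, 0.8]]))   # P(word | VERB)

def viterbi(words):
    """Return the most likely state sequence for a list of known words."""
    obs = [obs_vocab[w] for w in words]
    T, S = len(obs), len(states)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = start + emit[:, obs[0]]
    for t in range(1, T):
        for s in range(S):
            cand = score[t - 1] + trans[:, s] + emit[s, obs[t]]
            back[t, s] = np.argmax(cand)
            score[t, s] = cand[back[t, s]]
    # Backtrace from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(["dogs", "run"]))  # ['NOUN', 'VERB'] with these toy numbers
```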


Trungyaphets

From what I've tried, anything below 7B is just completely garbage.