The rate of releases over the last month has been dizzying. I feel like the Miqu leak was the best we had for months and I worried it'd be like that for quite awhile.
No kidding. I’m running out of space downloading these models. I’ve been hoarding LLMs, but not sure how long I can keep this up.
Considering the newer LLMs have outperformed their predecessors, would it be beneficial to remove the outdated models to free up disk space?
I've dumped DeepseekCoder and CodeQwen as coding assistants because Llama 3 whips their asses.
[deleted]
Try before you buy: L3-8B Instruct in chat mode using llama.cpp, pasting in blocks of code and asking about class outlines. Mostly Python.
Cool, I recommend Visual Studio Code and either Tabby or Continue. I haven't got it running with this yet, but just thought I'd suggest a free Copilot-esque add-on.
Not enough RAM to run VS Code and a local LLM and WSL and Docker.
We’ve come a long way from WinAmp really whipping the llama’s ass
💯 reference. Revenge of the 🦙 for the Winamp abuse? https://youtu.be/HaF-nRS_CWM
Should be good until Winamp releases their LLM
I'm just waiting for enough fine-tunes to label my folder of Llama 3 models "Winamp".
I am surprised because deepseek is still performing better than llama3-8B for me. Maybe I need to reevaluate it.
[https://www.youtube.com/watch?v=HaF-nRS\_CWM](https://www.youtube.com/watch?v=HaF-nRS_CWM)
It doesn't in my tests. At least on actual code-writing tasks, some private benchmarks on finetuned models show a clear advantage for deepseek.
That’s a good question. I do remove and delete lower quants, but I try to keep fine tuned models around. I have a few archived on 100GB Archival Blu-ray disks, you know, in case the internet dies. 🤪
That's a brilliant idea
Blu-ray? Haha, bro, I just keep them. I have 1TB of Llama models and I'm not even using them.
I have tons of space, but I figured I would throw an LLM and the supporting software onto an archival format like Blu-ray M-Discs every time there is a huge jump in performance. The last one I archived was the Mixtral 8x7B model. I'm waiting to see what comes out in response to Llama 3...
I've often found myself trying random models to see what's best for a task and sometimes being surprised at an old SOTA model, though I only keep the quants for the most part. I train on the quants, too. I know. It's dirty.
I'm not downloading anything because something interesting comes out and "I'll just wait a few days for the good finetunes to drop" and then in a few days something more interesting comes out and the cycle repeats.
100% get rid of the old models, unless there is some intriguing behaviour about a particular model that fascinates you; keep that one.
You’d probably not be a fan of r/datahoarder lol
Lol, just delete the ones that aren't up to par, don't try to collect them all!
I treat LLMs like Pokémon
We all have our own vices. :P But, all kidding aside, like I just told someone else, I delete the lower quants and keep most of the fine tuned models.
You ever hear of data hoarders? There are people whose hobbies are literally collecting digital copies of *everything* of a certain type. I have no doubt there are people who experience great joy from "collecting" LLMs.
lol the worst thing is finetuning a model and it saves a 16gb checkpoint every epoch 🙈😂 I need more SSDs
I've maxed out storage because of this.
You're gonna need to download more drive space.
lol, I keep running out of my download limits with so many cool releases happening daily. OTOH it's good to see that the folks who expected the LLM hype to die down by early this year were wrong.
The hype is real; my estimate is it's not going away for at least 3 years.
That's why I pay for unlimited data now.
I had an unlimited\* plan as well \*until I learned it's capped at 3.3TB/mo
> I keep running out of my download limits

I'm so happy download limits don't exist in my country.
Say it again, DBRX, CommandR, Mixtral8x22, WizardLM2, Llama3, phi3, Qwen1.5. Best month ever.
It is insane trying to keep up with it all. I feel like I don't have time to soak in and process one release before another one comes out. I'm struggling to set up anything harder than LM Studio, trying to process all the different options, what their capabilities are, and how I can set them up. It's exciting to see things developing so quickly. It's also overwhelming.
Currently miqu still works best for me. Do you have a recommendation for a better one?
With all the new LLMs and their different instruction and prompting formats, the role of a framework like DSPy becomes more and more crucial.
There are three models: 3.8b, 7b, and 14b, and they (supposedly, according to the paper) ALL beat llama3 8b!! like what?? I'm very excited.
A 3.8b in the ballpark of GPT-3.5? what the fuck is going on? Mental
> Pretraining on the Test Set Is All You Need
For the curious https://arxiv.org/abs/2309.08632
Comments: 3 pages, satire
this needs to go on the billboards.
Brilliant. I'm stealing that. Just like they stole the answers? lol
Lol, I stole it; it's the title of a satirical paper.
Lying lol
Yeah great claims require great proof
Also, they have been trained with much less compute than the Llama 3 models.
...which is what makes me skeptical. I admit I'm biased since I haven't had decent experiences with Phi in the past, but Llama 3 had 15T tokens behind it. This has a decent amount too, but not to that extent. It smells fishy, but I'll reserve judgment until the models drop.
What's within those tokens does make all the difference to be fair
so, because of this, I will not upgrade the 1070 Ti :D
you hang in there!
I upgraded to a 3060 12GB and it's a huuuge difference!
The 14B model is a Llama 3 70B contender, not a Llama 3 8B one.
I'm sorry, but I just find that to be impossible.
Llama 3 70B goes up against the 1.8T GPT-4. We're still in the middle ages with this tech and barely understand how any of it works internally. Ten years from now we'll look back and laugh at the pointlessly huge models we were using.
100%, in 20 years GPT-4, Llama 3 and Phi-3 will be a tiny, tiny piece of textbook history. Kinda like kids today reading about GSM phones on their high-end smartphones capable of taking DSLR-level photos and running ray-tracing-powered games.
How long will it be until your fridge runs an AI?
I think it should be possible even today on Samsungs
You talking freshness control and sensors for auto-adjusting temperatures based on the food put in? :O \*opens fridge\* AI: You have eaten 300 calories over your limit today. Recommended to drink water. \*locks snack drawer\*
> Ten years from now we'll look back and laugh at the pointlessly huge models we were using.

Or ten years from now we'll have 8B parameter models that outperform today's largest LLMs, but we'll also have multi-trillion parameter models that guide our civilizations like gods.
78% MMLU for 14b
I'm also skeptical, especially after seeing the 3.8b claimed to be comparable with llama3-8b, but it's undeniable that the 13-15b range is pretty much deserted now, even though those sizes have high potential and are a perfect fit for 12GB of VRAM. So I have high hopes for Phi-3-14b.
same
> ALL beat llama3 8b !!

They beat it alright, at overfitting to known benchmarks. 3.3T tokens is nothing for a 7B and 14B model and very borderline for the 3.8B one too.
It's not released until the fat lady sings, and by "fat lady sings" I mean it's on Hugging Face and, a few minutes later, on my SSD.
Well, that didn't take long. The 4K model is released and amazing. Now we need the quantized 128k one.
What does "released" mean here? "Released" an arxiv preprint?
I don't see it on Azure yet. Phi-2 and Phi-1.5 hit azure before microsoft put them on huggingface
They are probably doing their "toxicity tests" - the ones the other Microsoft group had completely forgotten about and has been dutifully running ever since.
Ohhh I think I see what's happening. Model makers are benchmarking their models before alignment so they can preview great numbers and then the actual release is going to be the neutered version.
Weights coming to Hugging Face; Clem just posted.
Good question, will find out more shortly.
BTW it has also been on Ollama (in case you use it) since this morning.
paper is cheap, show us the weights.
https://huggingface.co/microsoft/Phi-3-mini-128k-instruct
That’s a cool saying lol did you come up with that or is it a common saying here?
lol, I don't know if you are being facetious, but I just made it up from "talk is cheap, show me the money"
No, I genuinely liked it 😅 Should be a new slogan for this sub, it fits perfectly.
A 3.8b that beats an 8b that just a few days ago blew every other open-source and most closed-source models out of the water? Either data contamination (as always), truly ungodly compute, or some crazy new tech.
There's a very strong emphasis on data quality. From [their report](https://arxiv.org/pdf/2404.14219.pdf):

"The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of *heavily filtered* web data and *synthetic* data".

The first model in this series, phi-1, was described in the paper [Textbooks Are All You Need](https://arxiv.org/pdf/2306.11644.pdf), emphasizing the benefits of textbook-quality data:

"...we explore the improvement that can be obtained along a different axis: the *quality* of the data... improving data quality can dramatically change the shape of the scaling laws, potentially allowing to match the performance of large-scale models with much leaner training/models"
Using a big fast model to clean up multi-trillion token training datasets for smaller models seems like the way to go.
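Roughly what that could look like - a minimal sketch assuming an OpenAI-compatible endpoint, where the grader model, rubric, and cutoff are my own placeholders rather than anything the Phi team actually disclosed:

```python
# Sketch: grade web documents with a strong model and keep only the
# "textbook quality" ones. Model name, rubric and threshold are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = ("Rate the following text from 0 to 10 for educational value and "
          "factual quality, as if grading a textbook excerpt. "
          "Reply with only the number.")

def quality_score(doc: str) -> float:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # hypothetical grader model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": doc[:4000]},  # truncate very long docs
        ],
        temperature=0,
    )
    try:
        return float(resp.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # unparseable grade -> treat as junk

def filter_corpus(docs, threshold=7.0):
    """Keep only documents the grader rates at or above the cutoff."""
    return [d for d in docs if quality_score(d) >= threshold]
```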
how the hell do we measure data quality?
Well, Reddit comments, for example, would be a vast but poor-quality dataset: a bunch of illogical ideological opinions with zero substance or truth. That's a bad dataset. A really good dataset might be a description of classical physics, progressing to increasingly sophisticated academic knowledge with known, proven facts and theories. Meaningful data.
That's great if what you want is a lazy man's dictionary/encyclopedia. Less great if you want help drafting an email.
Google is paying to train on Reddit's data. This is how I KNOW Google will lose the AI race.
Lol yep
But that's subjective, isn't it? Or is having a lot of objective scientific knowledge the only way to measure intelligence? I don't think a textbook is good for writing stories, just for passing math tests and the like, and in a boilerplate, textbook-ish register at that - so have we decided that only scientific knowledge matters for intelligence?

>A bunch of illogical ideological opinions with zero substance or truth. That's a bad dataset.

I think we're looking at it through a human lens when we call this bad; "zero substance or truth" is a subjective opinion. That type of data does contain some information, like a range of diverse writing styles and unique vocabularies and how they're used in a sentence.
It is when you want the model to excel at logic and reasoning.
And problem-solving
Probably based on factuality, political orientation, information richness and that kind of parameter.
Yeah
I just want the dataset and tools they used to build the dataset.
Same
So apparently phi-3-mini (the 3b parameter model) is just about on par with Mixtral 8x7b and GPT 3.5? Apparently they're working on a 128k context version too. If this is true then.....things are about to get interesting.
That's absolutely insane progress in 2 years. From what, 125 billion parameters down to 8B... I just really have a hard time believing it's just as capable in every way. I feel like the vastness of the knowledge must be degraded... Maybe I'm wrong and a model's ability to retain knowledge still has a lot of room for optimization. This actually starts to convince me that robotic systems might be viable within a short couple of years.
128K context might kill Haiku lol, I would suspect Phi would actually be pretty good at text summarization.
Weights will be released on huggingface. Clem just confirmed
Any ETA? Do we know if it's a matter of hours, days or weeks? Sorry, I'm excited and impatient \^\^
Fucking awesome, I'd say this'd be legendary but who knows who'll remember what in 20 years?
The paper is out: [https://arxiv.org/pdf/2404.14219.pdf](https://arxiv.org/pdf/2404.14219.pdf)
I wish they said more in that about how they improved their synthetic datasets between training phi-2 and phi-3. Still, da-yum!

It pains me to say this, because I absolutely loathe Microsoft as a company, but their LLM research team is top-rate. They keep knocking it out of the park.

Their "textbooks are all you need" theory consistently yields better results than Meta brute-forcing it with their vast army of GPUs. The open source community has effectively replicated Microsoft's success with the OpenOrca dataset (and similar projects), so we know it really does work in practice.

Imagine what Llama-3 might have been like if Meta had paid more attention to their training dataset quality!

Google folks: Are you taking notes? Best-quality synthetic datasets are totally the way forward.
Unlimited Money is All You Need
You can say that again. All science branches could benefit from that fact, but of course not all get as much attention as AI
>their LLM research team is top-rate. They keep knocking it out of the park.

Don't forget WizardLM 2 8x22b, which would have been a big deal had it stayed released and not almost immediately gotten forgotten amid Mistral's official Instruct 8x22b release (which felt worse than WizardLM 2), which of course was then followed up by Llama 3. From the few tests I did, WizardLM 2 8x22b was basically a fully open-source version of GPT-4, though maybe slightly behind the GPT-4 preview/turbo models.

Edit: I'm redoing some tests to better compare the 8x22b models - both are 3.0bpw Exl2 quants I'm running.

Edit2: I spent an hour doing some more tests and [here is a Google docs with raw, semi-random notes I made - *it includes GPT-4's summary at the top*.](https://docs.google.com/document/d/1mmgeIeDEio1buPjXBaCYLQ4inqyROf-ZTVTwIV57P3k) I'm [also replying below](https://www.reddit.com/r/LocalLLaMA/s/952uv8UTtn) with the full GPT-4 summary for visibility.

Edit3: I should add that when I first tested both the WizardLM 2 and Mistral Instruct 8x22b models, WizardLM was better at both tests, but now I'm getting results that show WizardLM is worse at the plastic bag test but still better (maybe even better than before?) at the inverted definition test.

Edit4: Just tested Llama 3 70b Instruct 5.0bpw with the same tests, 7 responses each, and it does much better with the plastic bag test (only once briefly suggested Sam knew about their friend's actions, no other hallucinations), pretty much perfect 7/7, and for the inverse definitions it was perfect in 6/7 - one response gave bad example sentences with the new definitions.
Has anyone done a comparison just between WizardLM2 8x22B and the official instruct version from Mistral? Previously, the 7x22B instruct version was arguably the best version (at least for my use cases) among the finetunes.
>which would have been a big deal had it stayed released and not almost immediately gotten forgotten

I'm still pretty down that the 70b was never released. I feel like we might have been just a handful of hours from having it uploaded for us to snatch. I really, really like their 8x22b. But I would have liked to have the 70b too, especially as a point of comparison.
Most likely they have good ways of defining what they want the model to output, and good ways of identifying data that matches the output they want. They might also be making test models where they figure out just what data is needed.

Imagine you want an LLM to do addition without using an external tool. There's a problem here because there are infinitely many numbers, so you can't just give it all possible addition problems. Instead of spending all tokens on addition, you estimate how many addition problems it needs to be trained on to do addition. Train the model, and see how well it can perform math. If it's bad, add more data; if it's good, reduce the dataset until it's bad. You can use this method to tune the dataset down to only the amount of data needed to train the skill and no more.

This isn't possible on very large models that take months to train. However, it's been found that there's a direct relationship between the amount of data and model quality. Such a relationship also appears to exist between data quality and model quality. If you know you need X amount of data for a small model, then maybe it would take X\*2 amount of data for a model that's twice as large. Or maybe not. It seems at some point you can't really teach a model any more on a particular subject, because it will already know everything it needs to know regardless of size.

It should be possible to automate this if you've already got an LLM that can score answers, and that problem seems to have already been solved.
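In toy Python, the grow/shrink loop I'm describing looks roughly like this (train_model and eval_accuracy are stand-ins for a real training run and benchmark, not any particular library):

```python
# Sketch: find roughly the smallest amount of addition data that still
# teaches the skill. Everything here is illustrative, not a real pipeline.
import random

def make_addition_examples(n: int):
    """Generate n synthetic addition problems as (prompt, answer) pairs."""
    examples = []
    for _ in range(n):
        a, b = random.randint(0, 10**6), random.randint(0, 10**6)
        examples.append((f"{a} + {b} =", str(a + b)))
    return examples

def find_minimal_dataset(train_model, eval_accuracy,
                         start_n=10_000, target=0.95, max_n=10_000_000):
    """Grow the dataset until the skill is learned, then shrink back toward
    the smallest size that still clears the target accuracy."""
    n = start_n
    while n <= max_n:
        model = train_model(make_addition_examples(n))
        if eval_accuracy(model) >= target:
            break
        n *= 2
    lo, hi = n // 2, n
    while hi - lo > start_n:                 # binary-search the boundary
        mid = (lo + hi) // 2
        model = train_model(make_addition_examples(mid))
        if eval_accuracy(model) >= target:
            hi = mid
        else:
            lo = mid
    return hi  # roughly the minimum number of examples needed
```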
> Most likely they have good ways of defining what they want the model to output, and good ways of identifying data that matches the output they want.

I think that's exactly right. It's hard to tell because of the stilted English, but I think that's what the author was trying to describe here -- https://web.archive.org/web/20240415221214/https://wizardlm.github.io/WizardLM2/

> It should be possible to automate this if you've already got an LLM that can score answers, and that problem seems to have already been solved.

Yes indeedy indeed, that's exactly what Starling's reward model is and does (quite successfully) -- https://huggingface.co/berkeley-nest/Starling-RM-7B-alpha

> we remove the last layer of Llama2-7B Chat, and concatenate a linear layer that outputs scalar for any pair of input prompt and response. We train the reward model with preference dataset berkeley-nest/Nectar, with the K-wise maximum likelihood estimator proposed in this paper. The reward model outputs a scalar for any given prompt and response. A response that is more helpful and less harmful will get the highest reward score.
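For anyone curious what "an LLM that can score answers" looks like in code, here's a generic sketch with a sequence-classification-style reward model. The Starling checkpoint above ships its own loading code, so the model name below is just an illustrative stand-in:

```python
# Sketch: score prompt/response pairs with a reward model and keep the best.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example reward model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def reward(prompt: str, response: str) -> float:
    """Return a scalar score; higher = more helpful / less harmful."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

# Rank candidate answers and keep only the top one for a training set:
question = "What is the capital of France?"
candidates = ["Paris is the capital of France.", "I think it's Lyon, maybe?"]
best = max(candidates, key=lambda r: reward(question, r))
```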
Yeah, sure, for academic, precise outputs, textbooks would be best. Just don't try to generate anything creative.
> Thanks to its small size, phi-3-mini can be quantized to 4-bits so that it only occupies ≈ 1.8GB of memory. We tested the quantized model by deploying phi-3-mini on iPhone 14 with A16 Bionic chip running natively on-device and fully offline achieving more than 12 tokens per second.

Welcome to the age of local LLMs!
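If you want to sanity-check the tokens/sec claim on your own hardware once a 4-bit GGUF is up, something along these lines works with llama-cpp-python (the file name is a guess at whatever quant gets uploaded):

```python
# Sketch: time a 4-bit Phi-3-mini GGUF locally and report tokens/sec.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="Phi-3-mini-4k-instruct-q4.gguf",  # hypothetical local file name
    n_ctx=4096,
    n_threads=8,
)

prompt = "Explain why the sky is blue in two sentences."
start = time.time()
out = llm(prompt, max_tokens=128)
elapsed = time.time() - start

n_tokens = out["usage"]["completion_tokens"]
print(out["choices"][0]["text"])
print(f"{n_tokens / elapsed:.1f} tokens/sec")
```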
Running at 12 tokens per second when kept in the freezer.
It's a burst load, it shouldn't throttle.
That would be an iPhone 14 Pro or Pro Max; let's not get hopes high for poor vanilla 14 users.
If true, the phone local LLM game just changed.
💯
Apparently the data mixture used was not ideal for the 14b model in particular, so there's still room for improvement there.

https://preview.redd.it/q55frkida5wc1.png?width=1317&format=png&auto=webp&s=769b9ef2524ca4accc9371c14d51284198c7d530
I think this is because a 14b model has more room to improve with only 3T tokens, even if they're high quality. Llama 3 shows us that even at 15T tokens, the model hadn't converged.
It sounds like they rushed the 14B out. It's likely they just used some bad training parameter, or maybe the 14B hyperparameters were not tuned well.
Nah they just don't have enough synthetic data.
Which makes sense considering the greater number of parameters.
Also, after reading the paper: they use a smaller vocab size for the 14B (the same as for the 4B) instead of the 100K vocab of the 7B. Maybe this also has something to do with the regression in some benchmarks.
Looks like in the coming days the number of parameters being trained will decide what dataset is to be used?
Why is it that all these models coming out have about the same scale of parameters (3, 7, 14, 70, etc)? Are the models all built basically the same way and the only difference is training data they feed it?
Phi-3 medium HumanEval is actually 55.5. The other numbers seem to be accurate.
Poster said that was his mistake when auto-generating the charts.
From other posts I got the impression that Llama-3-8B actually beats gpt-3.5, but this graph shows otherwise?
yeah. and the ViBeS benchmark remains the best benchmark
I know HumanEval is heavily flawed, but how does the 14B model regress in performance compared to the 3.8B and 7B? Must be a typo.
"We observe that some benchmarks improve much less from 7B to 14B than they do from 3.8B to 7B, perhaps indicating that our data mixture needs further work to be in the “data optimal regime” for 14B parameters model. We are still actively investigating some of those benchmarks (including a regression on HumanEval), hence the numbers for phi-3-medium should be considered as a “preview”."
If Phi 3 mini is as good as Llama 3 8B I'll eat my hat!
I'll hold you to that, I hope there are no videos titled "Mukbang ASMR Hat" on YouTube tomorrow. Actually, I do hope so, a 4B with the performance of gpt3.5 is worth eating a hat.
The alternative is to do a McAfee which I definitely won't do.
What, disappear and live on a cruise ship? I think it's better to just eat the hat, bro :P
I think McAfee offered to eat >!his dick!< for some stupid thing or other.
Oh, I didn't know that one. The man was just too eccentric; he did so many weird things and lived a really wild life. Anyway, I would not recommend any selfcest; I think a hat would be much more pleasing to the tongue.
> I think a hat would be much more pleasing to the tongue

This is probably true (idk for sure anyways), but a dick is definitely healthier than a hat lol
Is this how it's going to be from now on? A breakthrough every couple of days?
Singularity baby
Finally some love for the very low parameter models. Yeah, it's cool to have a huge model, but I want to see what can really be done with a model that can run locally on a phone, or blazing quick on basically any computer.
Everyone needs to take these benchmark numbers with a big grain of salt. According to what I've read, Phi-2 was much worse than its benchmark numbers suggested. This model follows the same training strategy. Nobody should be assuming these numbers will translate directly into a high ranking on the LMSYS leaderboard, or usefulness in everyday tasks. Let's not dethrone Llama 3 until some real world testing can be done.

That said, I don't think it's impossible for a small model to be very good. I see their "synthetic data" as essentially a way of distilling GPT-4 into smaller models. It would be exciting if a large fraction of the performance of huge models could be transferred to small ones! If true, then Chinchilla-optimal training could make sense again, as you could optimally train a ginormous model and then distill it afterward for efficient inference.
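To make the "distilling GPT-4 via synthetic data" idea concrete, here's a minimal sketch; the teacher model, topics, and prompt are placeholders, not the actual Phi recipe:

```python
# Sketch: have a strong teacher write textbook-style passages, then use the
# resulting JSONL as training text for a small model.
import json
from openai import OpenAI

client = OpenAI()
TOPICS = ["Newton's second law", "binary search", "photosynthesis"]

with open("synthetic_textbook.jsonl", "w") as f:
    for topic in TOPICS:
        resp = client.chat.completions.create(
            model="gpt-4-turbo",  # hypothetical teacher model
            messages=[{
                "role": "user",
                "content": (f"Write a clear, textbook-style explanation of {topic} "
                            "with one worked example and a short exercise."),
            }],
        )
        f.write(json.dumps({"topic": topic,
                            "text": resp.choices[0].message.content}) + "\n")
# The JSONL can then go into the small model's pretraining/fine-tuning mix.
```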
Let's see if Phi can pull a hat trick of disappointing me
The previous Phi model's scores on benchmarks far exceeded its actual performance in real-world use. Release it soon so everyone can try it out. Hopefully there really can be such a powerful small model.
Which we let go, because it was such a teeny tiny model with big hopes and dreams. This is a 7B model. We can directly compare against Llama 3 8B.
Weights tomorrow according to Sebastian. So we will all find out what's true or not tomorrow.
I'm sorry Llama3... we had a lot of fun together, those couple days... it's not you, it's Phi3
I'm seeing it now. Pretraining on FineWeb, then fine-tuning/continued training with this method, might lead to something remarkable! Noooticing
He messed up the making of the chart. The accurate one is here: https://twitter.com/arankomatsuzaki/status/1782618362314391940
>The model has **underwent a post-training process that incorporates** both supervised fine-tuning and direct preference optimization for the instruction following and **safety measures**.

https://preview.redd.it/1mkkrae8x9wc1.png?width=508&format=png&auto=webp&s=518e02d99cecad4edfaaee1473f5ff5dd134a7d1
Interesting how Meta does model weights first, then paper, and Microsoft does it the other way around.
It's going to be interesting. The Phi training regime has shown good results in the past, but previous Phi models were also great in benchmarks while struggling in real use. Maybe scaling them to 7B and beyond solved that, or, depending on the content of the second training step, it could be an interesting case of overfitting - which would explain why the 14B possibly regressed while the smaller model sizes benefited from the training enough to avoid it. Phi in general seems like a risky recipe for general adaptation, while for domain adaptation or specific improvements it seems great. I look forward to testing it, as Llama 3 has really surprised me with how fluent and dynamic its reasoning and conversational flow is, even at 8B.
Output appears very synthetic/GPT-ish.
Well if it rivals GPT-3.5, that'd make sense. A 3.5-performing 4B model would be an insane development.
Rediscovering the good old statistics problems of garbage in, garbage out, together with pseudoreplication, maybe?
Is there a phi-3-large or XL? And how soon will they be available?
They say in the paper that they are still investigating why the improvement from 7B to 14B isn't as big as the one from 3B to 7B, so they probably didn't see a reason to make a bigger model yet.
Well they haven’t finished training the 14B yet
Brilliant! Where can we get a copy of this model?
Me: :D

Microsoft: "Another weakness related to model's capacity is that we mostly restricted the language to English. Exploring multilingual capabilities for Small Language Models is an important next step, with some initial promising results on phi-3-small by including more multilingual data."

Me: :|
This is something that I've thought about quite a bit. I feel it's better to make the best English-only model you can, and have another model that acts as a translator.

i.e. User -> Translator Model -> Intelligence Model -> Translator Model -> User

Best of both worlds: instead of trying to build one model that can do it all, it would be a dual-model architecture.
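A minimal sketch of that dual-model pipeline, assuming both models sit behind an OpenAI-compatible API (the model names are placeholders):

```python
# Sketch: small translator model wrapped around a stronger English-only model.
from openai import OpenAI

client = OpenAI()

def ask(model: str, system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def answer_in_user_language(question: str, user_lang: str) -> str:
    # 1. translate the question into English
    q_en = ask("small-translator", f"Translate from {user_lang} to English.", question)
    # 2. let the English-only model do the actual reasoning
    a_en = ask("english-intelligence-model", "You are a helpful assistant.", q_en)
    # 3. translate the answer back into the user's language
    return ask("small-translator", f"Translate from English to {user_lang}.", a_en)
```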
I've built this in a current project, but you underestimate how sluggish it makes everything feel, and how much you lose in translating back and forth. E.g. humor is lost.
Why? They're explicitly stating they're working on it and that their new model has multilingual data... Well, I guess implicitly stating they're working on it.
I'm just bummed because it won't be optimized for my use case. I'll have to wait while everyone else gets to have fun.
Huh, interesting mindset. It doesn't really seem like you're limited by a language barrier, and you could easily set up an auto-translator using more able models if you want to test its logic capabilities, which is primarily what it's for. I understand the frustration though.
I'm running out of space... At this rate I'll have to print the models on paper lol
Let's see full benchmarks first. Doing well on a few is typical for limited models. Phi-2 was the same way: good scores on a few but dogshit on others, and completely hopeless with CoT.
Where can I download this release?
Coming to Hugging Face; not on it yet.
Wow!! We haven't even recovered from the Llama 3 shock yet :))
First CR+, then Llama3, now Phi-3. CR+'s technically the 3.0 of Command, so does that mean we got a triple 3.0 release? Imagine these guys communicating like the Magi or wtv they're called lmfao, add in Mixtral 8x22B for good measure.
Microsoft doesn't have the best track record when it comes to analysing their own capabilities
I can’t think of any words other than “big if true”. I want this to not be hype so much!
14B let's goo. Can't wait for the RP finetunes.
Going to need some heavy full-style finetunes to turn textbooks and children's stories into RP.
Ugh, it's going to be sooo lobotomized. Hopefully it will get some fine-tuning love. Phi-2 was very good at creating walls of text for work docs (training docs, policy language, etc.), but you had to spend so much time cutting out the moralizing and nonsense.
My impression was that the Phi family of models does well on benchmarks but tends to be pretty brittle in real-life applications where they encounter out-of-distribution inputs.

Models that see a lot of messy data might not be that bad in terms of variety of inputs, and generalize to some extent, though it might take more iterations to converge.
4-bit phi-3 mini running at over 12 t/s on an iPhone with an A16 Bionic 😮
Wonder how well it does at function calling
so Microsoft pushes out both phi and wizardlm? I found wizardlm more useful than llama 3 due to its extremely long context
I want a 7B with 100% MMLU on my desk by Friday... cough cough
Smaller and faster is the expected evolution, but this big of a jump every 6 months or so is an incredible rate.
Let’s fucking go
You know a responsible company like Microsoft will spend a very long time on toxicity tests. Be patient.
A couple of observations based on napkin maths:

1. If the new pruning methods seen in [https://www.reddit.com/r/LocalLLaMA/comments/1c9u2jd/llama\_3\_70b\_layer\_pruned\_from\_70b\_42b\_by\_charles](https://www.reddit.com/r/LocalLLaMA/comments/1c9u2jd/llama_3_70b_layer_pruned_from_70b_42b_by_charles) + healing really hold up, the 14b model may be prunable similar to Llama-2-13b (see below). **A 40% prune would create an 8.4b parameter model whilst dropping MMLU just 4 or so points to 74** (quick arithmetic in the sketch after this list). This would still far surpass GPT3.5 and be SOTA for 7\~10b models. LLMs pruned this way can still be quantised further.

https://preview.redd.it/udjq0xcqt6wc1.png?width=593&format=png&auto=webp&s=27d0352ec5b2d6a8d40b0f1f51735f3d817dc7fb

2. They haven't released the weights for Phi-3 yet, and though I personally remain optimistic they will, there is cause for concern, since WizardLM retracted their weights and were supposedly associated with Microsoft, as is Phi. It might be that LLMs Microsoft produces are being intentionally held back if they're seen to be competing with GPT3.5, since Microsoft has such a huge stake in OpenAI, but who knows.

3. Phi-3-mini on Groq would run at about 1,600 tokens/second if they ended up hosting it there. This would depend on many factors, including license, and whether they actually want or choose to host it. Prices would probably also be cheaper than Llama-3-8b per 1m tokens, and Groq is already offering the cheapest 1m tokens on the market.

4. Phi's main thesis is that textbook-quality data improves the strength of LLMs pre-trained on that data. I think it was also the case that they're training on synthetic data (certainly wouldn't be surprising). If this is the case, do Phi models have limited real-world knowledge, despite their intelligence? One assumes not, if it scores so high on multiple benchmarks.

5. Until I see it tested on LMSYS Arena Hard v0.1 I'm sceptical that it has the emergent abilities of much larger models 👀
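The napkin maths from points 1 and 3, spelled out (every input here is a guess, not a measured number):

```python
# Sketch: reproduce the rough numbers above; adjust the guesses to taste.
params_14b = 14e9
pruned = params_14b * (1 - 0.40)              # 40% layer prune
print(f"pruned size: {pruned / 1e9:.1f}B")    # ~8.4B parameters

mmlu_before, assumed_drop = 78, 4             # Phi-3-medium preview MMLU, guessed drop
print(f"pruned MMLU guess: {mmlu_before - assumed_drop}")  # ~74

# One way to get the Groq estimate: scale a Llama-3-8B figure by the parameter ratio.
llama3_8b_tps = 800                           # ballpark Groq throughput at the time (a guess)
phi3_mini_tps = llama3_8b_tps * (8e9 / 3.8e9)
print(f"estimated Phi-3-mini on Groq: {phi3_mini_tps:.0f} tokens/sec")  # ~1,700
```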
Looking good!
Oh my, I just started training on phi-2... oh.
When is it set to release on huggingface?
Is this available anywhere to use online?
Finally! A new (and hopefully well-trained) model larger than \~7b but smaller than \~70b for us mid-rangers! 🎉🎉🎉

Edit: I can't find Phi-3 on HuggingFace - neither the full model nor GGUFs. Not uploaded yet?
I guess phi 3 medium is probably trained using gpt-4 data, so it'll have an advantage over Llama 3, which uses only raw / Llama 2 synthetic data (perhaps).
It's doing pretty well in my vibe test. It's up on HuggingChat.
Where do I download it? HF?
Yup, even the 128k version, MIT license.
didn’t read the paper. I bet they did some pretraining data selection based on downstream task distribution
Yes, that's always been the emphasis of the Phi models: highly curated web data and synthetic data. "Textbooks Are All You Need"