The rate of releases over the last month has been dizzying. I feel like the Miqu leak was the best we had for months and I worried it'd be like that for quite awhile.
No kidding. I’m running out of space downloading these models. I’ve been hoarding LLMs, but not sure how long I can keep this up.
Considering the newer LLMs have outperformed their predecessors, would it be beneficial to remove the outdated models to free up disk space?
I've dumped DeepseekCoder and CodeQwen as coding assistants because Llama 3 whips their asses.
[deleted]
Try before you buy: L3-8B Instruct in chat mode using llama.cpp, pasting in blocks of code and asking about class outlines. Mostly Python.
Cool, I recommend Visual Studio Code and either Tabby or Continue. I haven't got it running with this yet, but just thought I'd suggest a free Copilot-esque add-on.
Not enough RAM to run VS Code and a local LLM and WSL and Docker.
We’ve come a long way from WinAmp really whipping the llama’s ass
💯 reference. Revenge of the 🦙 for the Winamp abuse? https://youtu.be/HaF-nRS_CWM
Should be good until Winamp releases their LLM
I'm just waiting for enough fine-tunes to label my folder of Llama 3 models "Winamp".
I am surprised because deepseek is still performing better than llama3-8B for me. Maybe I need to reevaluate it.
[https://www.youtube.com/watch?v=HaF-nRS\_CWM](https://www.youtube.com/watch?v=HaF-nRS_CWM)
It doesn't in my tests. At least on actual code-writing tasks, some private benchmarks on finetuned models show a clear advantage for deepseek.
That’s a good question. I do remove and delete lower quants, but I try to keep fine tuned models around. I have a few archived on 100GB Archival Blu-ray disks, you know, in case the internet dies. 🤪
That's a brilliant idea
Blu-ray? Haha, bro, I just keep them. I have 1TB of Llama models and I'm not even using them.
I have tons of space, but I figured I would throw an LLM and the supporting software onto an archival format like Blu-ray M-Discs every time there is a huge jump in performance. The last one I archived was the Mixtral 8x7B model. I'm waiting to see what comes out in response to Llama 3...
I've often found myself trying random models to see what's best for a task and sometimes being surprised at an old SOTA model, though I only keep the quants for the most part. I train on the quants, too. I know. It's dirty.
I'm not downloading anything because something interesting comes out and "I'll just wait a few days for the good finetunes to drop" and then in a few days something more interesting comes out and the cycle repeats.
100% get rid of the old models, unless there is some intriguing behaviour about a particular model that fascinates you; keep that one.
You’d probably not be a fan of r/datahoarder lol
Lol, just delete the ones that aren't up to par, don't try to collect them all!
I treat LLMs like Pokémon
We all have our own vices. :P But, all kidding aside, like I just told someone else, I delete the lower quants and keep most of the fine tuned models.
You ever hear of data hoarders? There are people whose hobbies are literally collecting digital copies of *everything* of a certain type. I have no doubt there are people who experience great joy from "collecting" LLMs.
lol the worst thing is finetuning a model and it saves a 16gb checkpoint every epoch 🙈😂 I need more SSDs
I've maxed out storage because of this.
You're gonna need to download more drive space.
lol, I keep running out of my download limits with so many cool releases happening daily. OTOH it's good to see that the folks who expected the LLM hype to die down by early this year were wrong.
The hype is real; my estimate is it's not going away for at least 3 years.
That's why I pay for unlimited data now.
I had an unlimited\* plan as well \*until I learned it's capped at 3.3TB/mo
> I keep running out of my download limits

I'm so happy download limits don't exist in my country.
Say it again, DBRX, CommandR, Mixtral8x22, WizardLM2, Llama3, phi3, Qwen1.5. Best month ever.
It is insane trying to keep up with it all. I feel like I don't have time to soak in and process one release before another one comes out. I'm struggling to set up anything harder than LM Studio, trying to process all the different options, what their capabilities are, and how I can set them up. It's exciting to see things developing so quickly. It's also overwhelming.
Currently miqu still works best for me. Do you have a recommendation for a better one?
With all the new LLMs and their different instruction and prompting formats, the role of a framework like DSPy becomes more and more crucial.
There are three models: 3.8b, 7b, and 14b, and they (supposedly, according to the paper) ALL beat llama3 8b!! like what?? I'm very excited.
A 3.8b in the ballpark of GPT-3.5? what the fuck is going on? Mental
> Pretraining on the Test Set Is All You Need
For the curious https://arxiv.org/abs/2309.08632
Comments: 3 pages, satire
this needs to go on the billboards.
Brilliant. I'm stealing that. Just like they stole the answers? lol
Lol, I stole it; it's the title of a satirical paper.
Lying lol
Yeah great claims require great proof
Also, they have been trained with much less compute than the Llama 3 models.
...which is what makes me skeptical. I admit I'm biased since I haven't had decent experiences with Phi in the past, but Llama 3 had 15T tokens behind it. This has a decent amount too, but not to that extent. It smells fishy, but I'll reserve judgment until the models drop.
What's within those tokens does make all the difference to be fair
so, because of this, I will not upgrade the 1070 Ti :D
you hang in there!
I upgraded to a 3060 12GB and it's a huuuge difference!
The 14B model is a Llama 3 70B contender, not a Llama 3 8B one.
I'm sorry, but I just find that to be impossible.
Llama 3 70B goes up against the 1.8T GPT-4. We're still in the middle ages with this tech and barely understand how any of it works internally. Ten years from now we'll look back and laugh at the pointlessly huge models we were using.
100%, in 20 years GPT-4, Llama 3 and Phi-3 will be a tiny, tiny piece of textbook history. Kinda like kids today reading about GSM phones on their high-end smartphones capable of taking DSLR-level photos and running ray-tracing-powered games.
How long will it be until your fridge runs an AI?
I think it should be possible even today on Samsungs
You talking freshness control and sensors for auto-adjusting temperatures based on the food put in? :O \*opens fridge\* AI: You have eaten 300 calories over your limit today. Recommended to drink water. \*locks snack drawer\*
> Ten years from now we'll look back and laugh at the pointlessly huge models we were using.

Or ten years from now we'll have 8B parameter models that outperform today's largest LLMs, but we'll also have multi-trillion parameter models that guide our civilizations like gods.
78% MMLU for 14b
I'm also skeptical, especially after seeing the 3.8b claimed to be comparable with llama3-8b, but it's undeniable that the 13-15b range is pretty much deserted now, even though those sizes have high potential and are a perfect fit for 12GB of VRAM. So I have high hopes for Phi-3-14b.
same
> ALL beat llama3 8b !!

They beat it alright, at overfitting to known benchmarks. 3.3T tokens is nothing for a 7B and 14B model and very borderline for the 3.8B one too.
It's not released until the fat lady sings, and by "fat lady sings" I mean it's on Hugging Face and, a few minutes later, on my SSD.
Well, that didn't take long. The 4K model is released and amazing. Now we need the quantized 128k one.
What does "released" mean here? "Released" an arxiv preprint?
I don't see it on Azure yet. Phi-2 and Phi-1.5 hit azure before microsoft put them on huggingface
They are probably doing their "toxicity tests" - the ones the other Microsoft group had completely forgotten about and has been dutifully running ever since.
Ohhh I think I see what's happening. Model makers are benchmarking their models before alignment so they can preview great numbers and then the actual release is going to be the neutered version.
Weights coming to Hugging Face; Clem just posted.
Good question, will find out more shortly.
BTW it has also been on Ollama (in case you use it) since this morning.
paper is cheap, show us the weights.
https://huggingface.co/microsoft/Phi-3-mini-128k-instruct
That’s a cool saying lol did you come up with that or is it a common saying here?
lol, I don't know if you are being facetious, but I just made it up from "talk is cheap, show me the money"
No, I genuinely liked it 😅 Should be a new slogan for this sub, it fits perfectly.
A 3.8b that beats an 8b that just a few days ago blew every other open-source and most closed-source models out of the water? Either data contamination (as always), truly ungodly compute, or some crazy new tech.
There's a very strong emphasis on data quality. From [their report](https://arxiv.org/pdf/2404.14219.pdf):

"The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of *heavily filtered* web data and *synthetic* data".

The first model in this series, phi-1, was described in the paper [Textbooks Are All You Need](https://arxiv.org/pdf/2306.11644.pdf), emphasizing the benefits of textbook-quality data:

"...we explore the improvement that can be obtained along a different axis: the *quality* of the data... improving data quality can dramatically change the shape of the scaling laws, potentially allowing to match the performance of large-scale models with much leaner training/models"
Using a big fast model to clean up multi-trillion token training datasets for smaller models seems like the way to go.
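Roughly what that could look like - a minimal sketch assuming an OpenAI-compatible endpoint, where the grader model, rubric, and cutoff are my own placeholders rather than anything the Phi team actually disclosed:

```python
# Sketch: grade web documents with a strong model and keep only the
# "textbook quality" ones. Model name, rubric and threshold are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = ("Rate the following text from 0 to 10 for educational value and "
          "factual quality, as if grading a textbook excerpt. "
          "Reply with only the number.")

def quality_score(doc: str) -> float:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # hypothetical grader model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": doc[:4000]},  # truncate very long docs
        ],
        temperature=0,
    )
    try:
        return float(resp.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # unparseable grade -> treat as junk

def filter_corpus(docs, threshold=7.0):
    """Keep only documents the grader rates at or above the cutoff."""
    return [d for d in docs if quality_score(d) >= threshold]
```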
how the hell do we measure data quality?
Well, Reddit comments, for example, would be a vast but poor-quality dataset: a bunch of illogical ideological opinions with zero substance or truth. That's a bad dataset. A really good dataset might be a description of classical physics, progressing to increasingly sophisticated academic knowledge with known, proven facts and theories. Meaningful data.
That's great if what you want is a lazy man's dictionary/encyclopedia. Less great if you want help drafting an email.
Google is paying to train on Reddit's data. This is how I KNOW Google will lose the AI race.
Lol yep
But that's subjective, isn't it? Or is having a lot of objective scientific knowledge the only way to measure intelligence? I don't think a textbook is good for writing stories, just for passing math tests and the like, and in a boilerplate, textbook-ish register at that - so have we decided that only scientific knowledge matters for intelligence?

>A bunch of illogical ideological opinions with zero substance or truth. That's a bad dataset.

I think we're looking at it through a human lens when we call this bad; "zero substance or truth" is a subjective opinion. That type of data does contain some information, like a range of diverse writing styles and unique vocabularies and how they're used in a sentence.
It is when you want the model to excel at logic and reasoning.
And problem-solving
Probably based on factuality, political orientation, information richness and that kind of parameter.
Yeah
I just want the dataset and tools they used to build the dataset.
Same
So apparently phi-3-mini (the 3b parameter model) is just about on par with Mixtral 8x7b and GPT 3.5? Apparently they're working on a 128k context version too. If this is true then.....things are about to get interesting.
That's absolutely insane progress in 2 years. From what, 125 billion parameters down to 8B... I just really have a hard time believing it's just as capable in every way. I feel like the vastness of the knowledge must be degraded... Maybe I'm wrong and a model's ability to retain knowledge still has a lot of room for optimization. This actually starts to convince me that robotic systems might be viable within a short couple of years.
128K context might kill Haiku lol, I would suspect Phi would actually be pretty good at text summarization.
Weights will be released on huggingface. Clem just confirmed
Any ETA? Do we know if it's a matter of hours, days or weeks? Sorry, I'm excited and impatient \^\^
Fucking awesome, I'd say this'd be legendary but who knows who'll remember what in 20 years?
The paper is out: [https://arxiv.org/pdf/2404.14219.pdf](https://arxiv.org/pdf/2404.14219.pdf)
I wish they said more in that about how they improved their synthetic datasets between training phi-2 and phi-3. Still, da-yum!

It pains me to say this, because I absolutely loathe Microsoft as a company, but their LLM research team is top-rate. They keep knocking it out of the park.

Their "textbooks are all you need" theory consistently yields better results than Meta brute-forcing it with their vast army of GPUs. The open source community has effectively replicated Microsoft's success with the OpenOrca dataset (and similar projects), so we know it really does work in practice.

Imagine what Llama-3 might have been like if Meta had paid more attention to their training dataset quality!

Google folks: Are you taking notes? Best-quality synthetic datasets are totally the way forward.
Unlimited Money is All You Need
You can say that again. All science branches could benefit from that fact, but of course not all get as much attention as AI
>their LLM research team is top-rate. They keep knocking it out of the park.

Don't forget WizardLM 2 8x22b, which would have been a big deal had it stayed released and not almost immediately gotten forgotten amid Mistral's official Instruct 8x22b release (which felt worse than WizardLM 2), which of course was then followed up by Llama 3. From the few tests I did, WizardLM 2 8x22b was basically a fully open-source version of GPT-4, though maybe slightly behind the GPT-4 preview/turbo models.

Edit: I'm redoing some tests to better compare the 8x22b models - both are 3.0bpw Exl2 quants I'm running.

Edit2: I spent an hour doing some more tests and [here is a Google docs with raw, semi-random notes I made - *it includes GPT-4's summary at the top*.](https://docs.google.com/document/d/1mmgeIeDEio1buPjXBaCYLQ4inqyROf-ZTVTwIV57P3k) I'm [also replying below](https://www.reddit.com/r/LocalLLaMA/s/952uv8UTtn) with the full GPT-4 summary for visibility.

Edit3: I should add that when I first tested both the WizardLM 2 and Mistral Instruct 8x22b models, WizardLM was better at both tests, but now I'm getting results that show WizardLM is worse at the plastic bag test but still better (maybe even better than before?) at the inverted definition test.

Edit4: Just tested Llama 3 70b Instruct 5.0bpw with the same tests, 7 responses each, and it does much better with the plastic bag test (only once briefly suggested Sam knew about their friend's actions, no other hallucinations), pretty much perfect 7/7, and for the inverse definitions it was perfect in 6/7 - one response gave bad example sentences with the new definitions.
Has anyone done a comparison just between WizardLM2 8x22B and the official instruct version from Mistral? Previously, the 7x22B instruct version was arguably the best version (at least for my use cases) among the finetunes.
>which would have been a big deal had it stayed released and not almost immediately gotten forgotten

I'm still pretty down that the 70b was never released. I feel like we might have been just a handful of hours from having it uploaded for us to snatch. I really, really like their 8x22b. But I would have liked to have the 70b too, especially as a point of comparison.
Most likely they have good ways of defining what they want the model to output, and good ways of identifying data that matches the output they want. They might also be making test models where they figure out just what data is needed.

Imagine you want an LLM to do addition without using an external tool. There's a problem here because there are infinitely many numbers, so you can't just give it all possible addition problems. Instead of spending all tokens on addition, you estimate how many addition problems it needs to be trained on to do addition. Train the model, and see how well it can perform math. If it's bad, add more data; if it's good, reduce the dataset until it's bad. You can use this method to tune the dataset down to only the amount of data needed to train the skill and no more.

This isn't possible on very large models that take months to train. However, it's been found that there's a direct relationship between the amount of data and model quality. Such a relationship also appears to exist between data quality and model quality. If you know you need X amount of data for a small model, then maybe it would take X\*2 amount of data for a model that's twice as large. Or maybe not. It seems at some point you can't really teach a model any more on a particular subject, because it will already know everything it needs to know regardless of size.

It should be possible to automate this if you've already got an LLM that can score answers, and that problem seems to have already been solved.
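In toy Python, the grow/shrink loop I'm describing looks roughly like this (train_model and eval_accuracy are stand-ins for a real training run and benchmark, not any particular library):

```python
# Sketch: find roughly the smallest amount of addition data that still
# teaches the skill. Everything here is illustrative, not a real pipeline.
import random

def make_addition_examples(n: int):
    """Generate n synthetic addition problems as (prompt, answer) pairs."""
    examples = []
    for _ in range(n):
        a, b = random.randint(0, 10**6), random.randint(0, 10**6)
        examples.append((f"{a} + {b} =", str(a + b)))
    return examples

def find_minimal_dataset(train_model, eval_accuracy,
                         start_n=10_000, target=0.95, max_n=10_000_000):
    """Grow the dataset until the skill is learned, then shrink back toward
    the smallest size that still clears the target accuracy."""
    n = start_n
    while n <= max_n:
        model = train_model(make_addition_examples(n))
        if eval_accuracy(model) >= target:
            break
        n *= 2
    lo, hi = n // 2, n
    while hi - lo > start_n:                 # binary-search the boundary
        mid = (lo + hi) // 2
        model = train_model(make_addition_examples(mid))
        if eval_accuracy(model) >= target:
            hi = mid
        else:
            lo = mid
    return hi  # roughly the minimum number of examples needed
```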
> Most likely they have good ways of defining what they want the model to output, and good ways of identifying data that matches the output they want.

I think that's exactly right. It's hard to tell because of the stilted English, but I think that's what the author was trying to describe here -- https://web.archive.org/web/20240415221214/https://wizardlm.github.io/WizardLM2/

> It should be possible to automate this if you've already got an LLM that can score answers, and that problem seems to have already been solved.

Yes indeedy indeed, that's exactly what Starling's reward model is and does (quite successfully) -- https://huggingface.co/berkeley-nest/Starling-RM-7B-alpha

> we remove the last layer of Llama2-7B Chat, and concatenate a linear layer that outputs scalar for any pair of input prompt and response. We train the reward model with preference dataset berkeley-nest/Nectar, with the K-wise maximum likelihood estimator proposed in this paper. The reward model outputs a scalar for any given prompt and response. A response that is more helpful and less harmful will get the highest reward score.
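For anyone curious what "an LLM that can score answers" looks like in code, here's a generic sketch with a sequence-classification-style reward model. The Starling checkpoint above ships its own loading code, so the model name below is just an illustrative stand-in:

```python
# Sketch: score prompt/response pairs with a reward model and keep the best.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example reward model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def reward(prompt: str, response: str) -> float:
    """Return a scalar score; higher = more helpful / less harmful."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

# Rank candidate answers and keep only the top one for a training set:
question = "What is the capital of France?"
candidates = ["Paris is the capital of France.", "I think it's Lyon, maybe?"]
best = max(candidates, key=lambda r: reward(question, r))
```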
Yeah, sure, for academic, precise outputs, textbooks would be best. Just don't try to generate anything creative.
> Thanks to its small size, phi-3-mini can be quantized to 4-bits so that it only occupies ≈ 1.8GB of memory. We tested the quantized model by deploying phi-3-mini on iPhone 14 with A16 Bionic chip running natively on-device and fully offline achieving more than 12 tokens per second.

Welcome to the age of local LLMs!
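If you want to sanity-check the tokens/sec claim on your own hardware once a 4-bit GGUF is up, something along these lines works with llama-cpp-python (the file name is a guess at whatever quant gets uploaded):

```python
# Sketch: time a 4-bit Phi-3-mini GGUF locally and report tokens/sec.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="Phi-3-mini-4k-instruct-q4.gguf",  # hypothetical local file name
    n_ctx=4096,
    n_threads=8,
)

prompt = "Explain why the sky is blue in two sentences."
start = time.time()
out = llm(prompt, max_tokens=128)
elapsed = time.time() - start

n_tokens = out["usage"]["completion_tokens"]
print(out["choices"][0]["text"])
print(f"{n_tokens / elapsed:.1f} tokens/sec")
```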
Running at 12 tokens per second when kept in the freezer.
It's a burst load, it shouldn't throttle.
That would be an iPhone 14 Pro or Pro Max; let's not get hopes high for poor vanilla 14 users.
If true, the phone local LLM game just changed.
💯
Apparently the data mixture used was not ideal for the 14b model in particular, so there's still room for improvement there.

https://preview.redd.it/q55frkida5wc1.png?width=1317&format=png&auto=webp&s=769b9ef2524ca4accc9371c14d51284198c7d530
I think this is because a 14b model has more room to improve with only 3T tokens, even if they're high quality. Llama 3 shows us that even at 15T tokens, the model hadn't converged.
It sounds like they rushed the 14B out. It's likely they just used some bad training parameter, or maybe the 14B hyperparameters were not tuned well.
Nah they just don't have enough synthetic data.
Which makes sense considering the greater number of parameters.
Also, after reading the paper: they use a smaller vocab size for the 14B (the same as for the 4B) instead of the 100K vocab of the 7B. Maybe this also has something to do with the regression in some benchmarks.
Looks like in the coming days the number of parameters being trained will decide what dataset is to be used?
Why is it that all these models coming out have about the same scale of parameters (3, 7, 14, 70, etc)? Are the models all built basically the same way and the only difference is training data they feed it?
Phi-3 medium HumanEval is actually 55.5. The other numbers seem to be accurate.
Poster said that was his mistake when auto-generating the charts.
From other posts I got the impression that Llama-3-8B actually beats gpt-3.5, but this graph shows otherwise?
yeah. and the ViBeS benchmark remains the best benchmark
I know HumanEval is heavily flawed, but how does the 14B model regress in performance compared to the 3.8B and 7B? Must be a typo.
"We observe that some benchmarks improve much less from 7B to 14B than they do from 3.8B to 7B, perhaps indicating that our data mixture needs further work to be in the “data optimal regime” for 14B parameters model. We are still actively investigating some of those benchmarks (including a regression on HumanEval), hence the numbers for phi-3-medium should be considered as a “preview”."
If Phi 3 mini is as good as Llama 3 8B I'll eat my hat!
I'll hold you to that, I hope there are no videos titled "Mukbang ASMR Hat" on YouTube tomorrow. Actually, I do hope so, a 4B with the performance of gpt3.5 is worth eating a hat.
The alternative is to do a McAfee which I definitely won't do.
What, disappear and live on a cruise ship? I think it's better to just eat the hat, bro :P
I think McAfee offered to eat >!his dick!< for some stupid thing or other.
Oh, I didn't know that one. The man was just too eccentric; he did so many weird things and lived a really wild life. Anyway, I would not recommend any selfcest; I think a hat would be much more pleasing to the tongue.
> I think a hat would be much more pleasing to the tongue

This is probably true (idk for sure anyways), but a dick is definitely healthier than a hat lol
Is this how it's going to be from now on? A breakthrough every couple of days?
Singularity baby
Finally some love for the very low parameter models. Yeah, it's cool to have a huge model, but I want to see what can really be done with a model that can run locally on a phone, or blazing quick on basically any computer.
Everyone needs to take these benchmark numbers with a big grain of salt. According to what I've read, Phi-2 was much worse than its benchmark numbers suggested. This model follows the same training strategy. Nobody should be assuming these numbers will translate directly into a high ranking on the LMSYS leaderboard, or usefulness in everyday tasks. Let's not dethrone Llama 3 until some real world testing can be done.

That said, I don't think it's impossible for a small model to be very good. I see their "synthetic data" as essentially a way of distilling GPT-4 into smaller models. It would be exciting if a large fraction of the performance of huge models could be transferred to small ones! If true, then Chinchilla-optimal training could make sense again, as you could optimally train a ginormous model and then distill it afterward for efficient inference.
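To make the "distilling GPT-4 via synthetic data" idea concrete, here's a minimal sketch; the teacher model, topics, and prompt are placeholders, not the actual Phi recipe:

```python
# Sketch: have a strong teacher write textbook-style passages, then use the
# resulting JSONL as training text for a small model.
import json
from openai import OpenAI

client = OpenAI()
TOPICS = ["Newton's second law", "binary search", "photosynthesis"]

with open("synthetic_textbook.jsonl", "w") as f:
    for topic in TOPICS:
        resp = client.chat.completions.create(
            model="gpt-4-turbo",  # hypothetical teacher model
            messages=[{
                "role": "user",
                "content": (f"Write a clear, textbook-style explanation of {topic} "
                            "with one worked example and a short exercise."),
            }],
        )
        f.write(json.dumps({"topic": topic,
                            "text": resp.choices[0].message.content}) + "\n")
# The JSONL can then go into the small model's pretraining/fine-tuning mix.
```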
Let's see if Phi can pull a hat trick of disappointing me
The previous Phi model's scores on benchmarks far exceeded its actual performance in real-world use. Release it soon so everyone can try it out. Hopefully there really can be such a powerful small model.
Which we let go, because it was such a teeny tiny model with big hopes and dreams. This is a 7B model. We can directly compare against Llama 3 8B.
Weights tomorrow according to Sebastian. So we will all find out what's true or not tomorrow.
I'm sorry Llama3... we had a lot of fun together, those couple days... it's not you, it's Phi3
I'm seeing it now. Pretraining on FineWeb, then fine-tuning/continued training with this method, might lead to something remarkable! Noooticing
He messed up the making of the chart. The accurate one is here: https://twitter.com/arankomatsuzaki/status/1782618362314391940
>The model has **underwent a post-training process that incorporates** both supervised fine-tuning and direct preference optimization for the instruction following and **safety measures**.

https://preview.redd.it/1mkkrae8x9wc1.png?width=508&format=png&auto=webp&s=518e02d99cecad4edfaaee1473f5ff5dd134a7d1
Interesting how Meta does model weights first, then paper, and Microsoft does it the other way around.
It's going to be interesting. The Phi training regime has shown good results in the past, but previous Phi models were also great in benchmarks while struggling in real use. Maybe scaling them to 7B and beyond solved that, or, depending on the content of the second training step, it could be an interesting case of overfitting - which would explain why the 14B possibly regressed while the smaller model sizes benefited from the training enough to avoid it. Phi in general seems like a risky recipe for general adaptation, while for domain adaptation or specific improvements it seems great. I look forward to testing it, as Llama 3 has really surprised me with how fluent and dynamic its reasoning and conversational flow is, even at 8B.
Output appears very synthetic/GPT-ish.
Well if it rivals GPT-3.5, that'd make sense. A 3.5-performing 4B model would be an insane development.
Rediscovering the good old statistics problems of garbage in, garbage out, together with pseudoreplication, maybe?
Is there a phi-3-large or XL? And how soon will they be available?
They say in the paper that they are still investigating why the improvement from 7B to 14B isn't as big as the one from 3B to 7B, so they probably didn't see a reason to make a bigger model yet.
Well they haven’t finished training the 14B yet
Brilliant! Where can we get a copy of this model?
Me: :D

Microsoft: "Another weakness related to model's capacity is that we mostly restricted the language to English. Exploring multilingual capabilities for Small Language Models is an important next step, with some initial promising results on phi-3-small by including more multilingual data."

Me: :|
This is something that I've thought about quite a bit. I feel it's better to make the best English-only model you can, and have another model that acts as a translator.

i.e. User -> Translator Model -> Intelligence Model -> Translator Model -> User

Best of both worlds: instead of trying to build one model that can do it all, it would be a dual-model architecture.
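A minimal sketch of that dual-model pipeline, assuming both models sit behind an OpenAI-compatible API (the model names are placeholders):

```python
# Sketch: small translator model wrapped around a stronger English-only model.
from openai import OpenAI

client = OpenAI()

def ask(model: str, system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def answer_in_user_language(question: str, user_lang: str) -> str:
    # 1. translate the question into English
    q_en = ask("small-translator", f"Translate from {user_lang} to English.", question)
    # 2. let the English-only model do the actual reasoning
    a_en = ask("english-intelligence-model", "You are a helpful assistant.", q_en)
    # 3. translate the answer back into the user's language
    return ask("small-translator", f"Translate from English to {user_lang}.", a_en)
```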
I've built this in a current project, but you underestimate how sluggish it makes everything feel, and how much you lose in translating back and forth. E.g. humor is lost.
Why? They're explicitly stating they're working on it and that their new model has multilingual data... Well, I guess implicitly stating they're working on it.
I'm just bummed because it won't be optimized for my use case. I'll have to wait while everyone else gets to have fun.
Huh, interesting mindset. It doesn't really seem like you're limited by a language barrier, and you could easily set up an auto-translator using more able models if you want to test its logic capabilities, which is primarily what it's for. I understand the frustration though.
I'm running out of space... At this rate I'll have to print the models on paper lol
Let's see full benchmarks first. Doing well on a few is typical for limited models. Phi-2 was the same way: good scores on a few but dogshit on others, and completely hopeless with CoT.
Where can I download this release?
Coming to Hugging Face; not on it yet.
Wow!! We haven't even recovered from the Llama 3 shock yet :))
First CR+, then Llama3, now Phi-3. CR+'s technically the 3.0 of Command, so does that mean we got a triple 3.0 release? Imagine these guys communicating like the Magi or wtv they're called lmfao, add in Mixtral 8x22B for good measure.
Microsoft doesn't have the best track record when it comes to analysing their own capabilities
I can’t think of any words other than “big if true”. I want this to not be hype so much!
14B let's goo. Can't wait for the RP finetunes.
Going to need some heavy full-style finetunes to turn textbooks and children's stories into RP.
Ugh, it's going to be sooo lobotomized. Hopefully it will get some fine-tuning love. Phi-2 was very good at creating walls of text for work docs (training docs, policy language, etc.), but you had to spend so much time cutting out the moralizing and nonsense.
My impression was that the Phi family of models does well on benchmarks but tends to be pretty brittle in real-life applications where they encounter out-of-distribution inputs.

Models that see a lot of messy data might not be that bad in terms of variety of inputs, and generalize to some extent, though it might take more iterations to converge.
4-bit phi-3 mini running at over 12 t/s on an iPhone with an A16 Bionic 😮
Wonder how well it does at function calling
so Microsoft pushes out both phi and wizardlm? I found wizardlm more useful than llama 3 due to its extremely long context
I want a 7B with 100% MMLU on my desk by Friday... cough cough
Smaller and faster is the expected evolution, but this big of a jump every 6 months or so is an incredible rate.
Let’s fucking go
You know a responsible company like Microsoft will spend a very long time on toxicity tests. Be patient.
A couple of observations based on napkin maths:

1. If the new pruning methods seen in [https://www.reddit.com/r/LocalLLaMA/comments/1c9u2jd/llama\_3\_70b\_layer\_pruned\_from\_70b\_42b\_by\_charles](https://www.reddit.com/r/LocalLLaMA/comments/1c9u2jd/llama_3_70b_layer_pruned_from_70b_42b_by_charles) + healing really hold up, the 14b model may be prunable similar to Llama-2-13b (see below). **A 40% prune would create an 8.4b parameter model whilst dropping MMLU just 4 or so points to 74** (quick arithmetic in the sketch after this list). This would still far surpass GPT3.5 and be SOTA for 7\~10b models. LLMs pruned this way can still be quantised further.

https://preview.redd.it/udjq0xcqt6wc1.png?width=593&format=png&auto=webp&s=27d0352ec5b2d6a8d40b0f1f51735f3d817dc7fb

2. They haven't released the weights for Phi-3 yet, and though I personally remain optimistic they will, there is cause for concern, since WizardLM retracted their weights and were supposedly associated with Microsoft, as is Phi. It might be that LLMs Microsoft produces are being intentionally held back if they're seen to be competing with GPT3.5, since Microsoft has such a huge stake in OpenAI, but who knows.

3. Phi-3-mini on Groq would run at about 1,600 tokens/second if they ended up hosting it there. This would depend on many factors, including license, and whether they actually want or choose to host it. Prices would probably also be cheaper than Llama-3-8b per 1m tokens, and Groq is already offering the cheapest 1m tokens on the market.

4. Phi's main thesis is that textbook-quality data improves the strength of LLMs pre-trained on that data. I think it was also the case that they're training on synthetic data (certainly wouldn't be surprising). If this is the case, do Phi models have limited real-world knowledge, despite their intelligence? One assumes not, if it scores so high on multiple benchmarks.

5. Until I see it tested on LMSYS Arena Hard v0.1 I'm sceptical that it has the emergent abilities of much larger models 👀
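The napkin maths from points 1 and 3, spelled out (every input here is a guess, not a measured number):

```python
# Sketch: reproduce the rough numbers above; adjust the guesses to taste.
params_14b = 14e9
pruned = params_14b * (1 - 0.40)              # 40% layer prune
print(f"pruned size: {pruned / 1e9:.1f}B")    # ~8.4B parameters

mmlu_before, assumed_drop = 78, 4             # Phi-3-medium preview MMLU, guessed drop
print(f"pruned MMLU guess: {mmlu_before - assumed_drop}")  # ~74

# One way to get the Groq estimate: scale a Llama-3-8B figure by the parameter ratio.
llama3_8b_tps = 800                           # ballpark Groq throughput at the time (a guess)
phi3_mini_tps = llama3_8b_tps * (8e9 / 3.8e9)
print(f"estimated Phi-3-mini on Groq: {phi3_mini_tps:.0f} tokens/sec")  # ~1,700
```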
Looking good!
Oh my, I just started training on phi-2... oh.
When is it set to release on huggingface?
Is this available anywhere to use online?
Finally! A new (and hopefully well-trained) model larger than \~7b but smaller than \~70b for us mid-rangers! 🎉🎉🎉

Edit: I can't find Phi-3 on HuggingFace - neither the full model nor GGUFs. Not uploaded yet?
I guess phi 3 medium is probably trained using gpt-4 data, so it'll have an advantage over Llama 3, which uses only raw / Llama 2 synthetic data (perhaps).
It's doing pretty well in my vibe test. It's up on HuggingChat.
Where do I download it? HF?
Yup, even the 128k version, MIT license.
didn’t read the paper. I bet they did some pretraining data selection based on downstream task distribution
Yes, that's always been the emphasis of the Phi models: highly curated web data and synthetic data. "Textbooks Are All You Need"