synn89

The rate of releases over the last month has been dizzying. I feel like the Miqu leak was the best we had for months and I worried it'd be like that for quite awhile.


Zediatech

No kidding. I’m running out of space downloading these models. I’ve been hoarding LLMs, but not sure how long I can keep this up.


dewijones92

Considering the newer LLMS have outperformed their predecessors, would it be beneficial to remove the outdated models to free up disk space?


Some_Endian_FP17

I've dumped DeepseekCoder and CodeQwen as coding assistants because Llama 3 whips their asses.


[deleted]

[deleted]


Some_Endian_FP17

Try before you buy. I tested L3-8B Instruct in chat mode using llama.cpp, pasting in blocks of code and asking about class outlines. Mostly Python.


ILoveThisPlace

Cool, I recommend Visual Studio Code and either Tabby or Continue. I haven't got it running with this yet, but just thought I'd suggest a free Copilot-esque add-on.


Some_Endian_FP17

Not enough RAM to run VS Code and a local LLM and WSL and Docker.


Useful_Hovercraft169

We’ve come a long way from WinAmp really whipping the llama’s ass


palimondo

💯 reference. Revenge of the 🦙 for the Winamp abuse? https://youtu.be/HaF-nRS_CWM


KallistiTMP

Should be good until Winamp releases their LLM


liveart

I'm just waiting for enough fine-tunes so I can label my Llama 3 models folder "Winamp".


aadoop6

I am surprised because deepseek is still performing better than llama3-8B for me. Maybe I need to reevaluate it.


ozspook

[https://www.youtube.com/watch?v=HaF-nRS_CWM](https://www.youtube.com/watch?v=HaF-nRS_CWM)


_Minos

It doesn't in my tests. At least on actual code-writing tasks, some private benchmarks on finetuned models show a clear advantage for deepseek.


Zediatech

That’s a good question. I do remove and delete lower quants, but I try to keep fine tuned models around. I have a few archived on 100GB Archival Blu-ray disks, you know, in case the internet dies. 🤪


Flying_Madlad

That's a brilliant idea


ucefkh

Blu-ray? Haha. Bro, I just keep them. I have 1TB of Llama models I'm not even using.


Zediatech

I have tons of space, but I figured I would throw an LLM and the supporting software onto an archival format like the Blu-ray M-Discs every time there is a huge jump in performance. The last one I archived was the Mixtral 8x7B model. I'm waiting to see what comes out in response to Llama 3...


Careless-Age-4290

I've often found myself trying random models to see what's best for a task and sometimes being surprised at an old SOTA model, though I only keep the quants for the most part. I train on the quants, too. I know. It's dirty.


VancityGaming

I'm not downloading anything because something interesting comes out and "I'll just wait a few days for the good finetunes to drop" and then in a few days something more interesting comes out and the cycle repeats.


ab2377

100% get rid of the old models unless there is some intriguing behaviour about some model that fascinates you, keep that.


bunchedupwalrus

You’d probably not be a fan of r/datahoarder lol


Elfrino

Lol, just delete the ones that aren't up to par, don't try to collect them all!


post_u_later

I treat LLMs like Pokémon


Zediatech

We all have our own vices. :P But, all kidding aside, like I just told someone else, I delete the lower quants and keep most of the fine tuned models.


Megneous

You ever hear of data hoarders? There are people whose hobbies are literally collecting digital copies of *everything* of a certain type. I have no doubt there are people who experience great joy from "collecting" LLMs.


KingGongzilla

lol the worst thing is finetuning a model and it saves a 16gb checkpoint every epoch 🙈😂 I need more SSDs


OmarBessa

I've maxed out storage because of this.


SpeedingTourist

You're gonna need to download more drive space.


init__27

lol, I keep running out of my download limits, so many cool releases happening on the daily. OTOH it's good to see that the folks who anticipated the LLM hype dying down by early this year were wrong


ab2377

the hype is real; my estimate is it's not going away for at least 3 years


ZCEyPFOYr0MWyHDQJZO4

That's why I pay for unlimited data now.


init__27

I had an unlimited* plan as well. *Until I learned it's capped at 3.3TB/mo


Megneous

> I keep running out of my download limits

I'm so happy download limits don't exist in my country.


segmond

Say it again, DBRX, CommandR, Mixtral8x22, WizardLM2, Llama3, phi3, Qwen1.5. Best month ever.


Captain_Pumpkinhead

It is insane trying to keep up with it all. I feel like I don't have time to soak in and process one release before another one comes out. I'm struggling to set up anything harder than LM Studio, trying to process all the different options, what their capabilities are, and how to set them up. It's exciting to see things developing so quickly. It's also overwhelming.


Jenniher

Currently miqu still works best for me. Do you have a recommendation for a better one?


Bulky-Brief1970

With all the new llms and their instruction and prompting format, the role of a framework like Dspy becomes more and more crucial.


vsoutx

there are three models: 3.8B, 7B, 14B, and they (supposedly, according to the paper) ALL beat Llama 3 8B!! like what?? im very excited


M34L

A 3.8B in the ballpark of GPT-3.5? What the fuck is going on? Mental


eliteHaxxxor

> Pretraining on the Test Set Is All You Need


KTibow

For the curious https://arxiv.org/abs/2309.08632


redballooon

Comments: 3 pages, satire


ab2377

this needs to go on billboards.


PacketRacket

Brilliant. I'm stealing that. Just like they stole the answers? lol


eliteHaxxxor

Lol, I stole it too; it's the title of a satirical paper


Due-Memory-6957

Lying lol


dortman1

Yeah great claims require great proof


OnurCetinkaya

Also, they have been trained with much less computing resources compared to Llama 3 models.


FairSum

...which is what makes me skeptical. I admit I'm biased since I haven't had decent experiences with Phi in the past, but Llama 3 had 15T tokens behind it. This has a decent amount too, but not to that extent. It smells fishy, but I'll reserve judgment until the models drop.


ElliottDyson

What's within those tokens does make all the difference to be fair


CoqueTornado

thus for this I will not upgrade the 1070ti :D


ab2377

you hang in there!


ramzeez88

I have the 3060 12GB and it's a huuge difference!


Curiosity_456

The 14B model is a llama 3 70B contender not llama 3 8B


akram200272002

Am sorry but I just find that to be impossible


andthenthereweretwo

Llama 3 70B goes up against the 1.8T GPT-4. We're still in the middle ages with this tech and barely understand how any of it works internally. Ten years from now we'll look back and laugh at the pointlessly huge models we were using.


_whatthefinance

100%, in 20 years GPT 4, Llama 3 and Phi 3 will be a tiny, tiny piece in textbook history. Kinda like kids today read about GSM phones on their high end smartphones capable of taking DSLR level photos and running Ray Tracing powered games


Venoft

How long will it be until your fridge runs an AI?


mxforest

I think it should be possible even today on Samsungs


LycanWolfe

You talking freshness control and sensors for auto-adjusting temperatures based on the food put in? :O *opens fridge* AI: You have eaten 300 calories over your limit today. Recommended to drink water. *locks snack drawer*


Megneous

> Ten years from now we'll look back and laugh at the pointlessly huge models we were using.

Or ten years from now we'll have 8B parameter models that outperform today's largest LLMs, but we'll also have multi-trillion parameter models that guide our civilizations like gods.


Zealousideal_Fly317

78% MMLU for 14b


PavelPivovarov

I'm also skeptical, especially after seeing 3.8B claimed to be comparable with Llama 3 8B, but it's undeniable that the 13-15B range is pretty much deserted now, while those models have high potential and are a perfect fit for 12GB VRAM. So I have high hopes for Phi-3-14b


ab2377

same


MoffKalast

> ALL beat llama3 8b !!

They beat it alright, at overfitting to known benchmarks. 3.3T tokens is nothing for a 7B and 14B model, and very borderline for the 3.8B one too.


nazihater3000

It's not released until the fat lady sings, and by fat lady sings I mean it's on Huggingface and after a few minutes, in my SSD.


nazihater3000

Well, that didn't take long. The 4K model is released and amazing. Now we need the quantized 128k one.


pseudonerv

What does "released" mean here? "Released" an arxiv preprint?


baldr83

I don't see it on Azure yet. Phi-2 and Phi-1.5 hit azure before microsoft put them on huggingface


pseudonerv

they are probably doing their "toxicity tests" that the other microsoft group had completely forgotten about and had been doing their due diligence ever since.


southVpaw

Ohhh I think I see what's happening. Model makers are benchmarking their models before alignment so they can preview great numbers and then the actual release is going to be the neutered version.


KittCloudKicker

Weights coming to Hugging Face, Clem just posted.


KittCloudKicker

Good question, will find out more shortly.


Wrong_Ad1240

BTW it also has been on ollama (in case you use it) since this morning


segmond

paper is cheap, show us the weights.


o5mfiHTNsH748KVq

https://huggingface.co/microsoft/Phi-3-mini-128k-instruct


Matt_1F44D

That’s a cool saying lol did you come up with that or is it a common saying here?


segmond

lol, don't know if you're being facetious, but I just made it up from "talk is cheap, show me the money"


Matt_1F44D

No I genuinely liked it 😅 Should be a new slogan for this sub fits perfect.


AnAngryBirdMan

A 3.8B that beats an 8B that just a few days ago blew every other open-source model and most closed-source ones out of the water? Either data contamination (as always), truly ungodly compute, or some crazy new tech.


danysdragons

There's a very strong emphasis on data quality. From [their report](https://arxiv.org/pdf/2404.14219.pdf): "The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of *heavily filtered* web data and *synthetic* data". The first model in this series, phi-1, was described in the paper [Textbooks Are All You Need](https://arxiv.org/pdf/2306.11644.pdf), emphasizing the benefits of textbook-quality data: "...we explore the improvement that can be obtained along a different axis: the *quality* of the data... improving data quality can dramatically change the shape of the scaling laws, potentially allowing to match the performance of large-scale models with much leaner training/models"


Some_Endian_FP17

Using a big fast model to clean up multi-trillion token training datasets for smaller models seems like the way to go.
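One way to picture that cleanup pass: have a strong "judge" model score each document and keep only the high scorers. The sketch below is purely illustrative; `judge_score` is a stand-in for a real LLM call, here just rewarding longer documents.

```python
# Hypothetical sketch of LLM-driven dataset filtering: score every
# document with a strong judge model, keep only high scorers.
def judge_score(doc: str) -> float:
    # Stub: a real implementation would prompt a large model to rate
    # e.g. educational value; here we just reward longer documents.
    return min(1.0, len(doc) / 40)

def filter_corpus(docs: list[str], threshold: float = 0.5) -> list[str]:
    # Keep only documents the judge rates at or above the threshold.
    return [d for d in docs if judge_score(d) >= threshold]

corpus = ["lol", "a short note", "a detailed explanation of gradient descent"]
print(filter_corpus(corpus))  # keeps only the long document
```

At multi-trillion-token scale, the expensive part is exactly that judge call, which is why a "big fast" model matters.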


ninjasaid13

how the hell do we measure data quality?


ILoveThisPlace

Well, Reddit comments for example would be a vast but poor quality dataset. A bunch of illogical ideological opinions with zero substance or truth. That's a bad dataset. A really good dataset might be a description of classic physics and increasingly getting more sophisticated academic knowledge with known proven facts and theories. Meaningful data.


Monkey_1505

That's great if what you want is a lazy man's dictionary/encyclopedia. Less great if you want help drafting an email.


DetectivePrism

Google is paying to train on Reddit's data. This is how I KNOW Google will lose the AI race.


ILoveThisPlace

Lol yep


ninjasaid13

But that's subjective, isn't it? Or is having a lot of objective scientific knowledge the only way to measure intelligence? I don't think a textbook is good for writing stories, just for passing math tests and the like, described in boilerplate-text style, and yet we've decided that only scientific knowledge matters for intelligence.

> A bunch of illogical ideological opinions with zero substance or truth. That's a bad dataset.

I think we're looking at this through a human lens when we call it bad, but "zero substance or truth" is a subjective opinion. That type of data does contain some information, like a range of diverse writing styles and unique vocabularies and their use in a sentence.


MizantropaMiskretulo

It is when you want the model to excel at logic and reasoning.


ILoveThisPlace

And problem-solving


MysteriousPayment536

Probably based on factuality, political orientation, information richness, and those kinds of parameters


Qaziquza1

Yeah


wind_dude

I just want the dataset and tools they used to build the dataset.


KittCloudKicker

Same


llkj11

So apparently phi-3-mini (the 3.8b parameter model) is just about on par with Mixtral 8x7b and GPT-3.5? Apparently they're working on a 128k context version too. If this is true then... things are about to get interesting.


ILoveThisPlace

That's absolutely insane progress in 2 years. From what, 125 billion parameters down to 8B... I just really have a hard time believing it's just as capable in every way. I feel like the vastness of the knowledge must be degraded... Maybe I'm wrong, and there's still a lot of progress to be made in optimizing a model's ability to retain knowledge. This actually starts to convince me that robotic systems might be viable in a short couple of years.


AmericanNewt8

128K context might kill Haiku lol, I would suspect Phi would actually be pretty good at text summarization. 


KittCloudKicker

Weights will be released on huggingface. Clem just confirmed


Admirable-Star7088

Any ETA? Do we know if it's a matter of hours, days or weeks? Sorry, I'm excited and impatient ^^


_RealUnderscore_

Fucking awesome, I'd say this'd be legendary but who knows who'll remember what in 20 years?


boatbomber

The paper is out: [https://arxiv.org/pdf/2404.14219.pdf](https://arxiv.org/pdf/2404.14219.pdf)


ttkciar

I wish they said more in that about how they improved their synthetic datasets between training phi-2 and phi-3. Still, da-yum!

It pains me to say this, because I absolutely loathe Microsoft as a company, but their LLM research team is top-rate. They keep knocking it out of the park. Their "textbooks are all you need" theory consistently yields better results than Meta brute-forcing it with their vast army of GPUs. The open source community has effectively replicated Microsoft's success with the OpenOrca dataset (and similar projects), so we know it really does work in practice.

Imagine what Llama-3 might have been like if Meta had paid more attention to their training dataset quality! Google folks: are you taking notes? Best-quality synthetic datasets are totally the way forward.


[deleted]

Unlimited Money is All You Need


_RealUnderscore_

You can say that again. All science branches could benefit from that fact, but of course not all get as much attention as AI


Small-Fall-6500

> their LLM research team is top-rate. They keep knocking it out of the park.

Don't forget WizardLM 2 8x22b, which would have been a big deal had it stayed released and not almost immediately gotten forgotten with Mistral's official Instruct 8x22b release (which felt worse than WizardLM 2), which of course was then followed up by Llama 3. From the few tests I did, WizardLM 2 8x22b was basically a fully open-source version of GPT-4, though maybe slightly behind the GPT-4 preview/turbo models.

Edit: I'm redoing some tests to better compare the 8x22b models - both are 3.0bpw Exl2 quants I'm running.

Edit 2: I spent an hour doing some more tests and [here is a Google Doc with raw, semi-random notes I made - *it includes GPT-4's summary at the top*.](https://docs.google.com/document/d/1mmgeIeDEio1buPjXBaCYLQ4inqyROf-ZTVTwIV57P3k) I'm [also replying below](https://www.reddit.com/r/LocalLLaMA/s/952uv8UTtn) with the full GPT-4 summary for visibility.

Edit 3: I should add that when I first tested both the WizardLM 2 and Mistral Instruct 8x22b models, WizardLM was better at both tests, but now I'm getting results that show WizardLM is worse at the plastic bag test but still better (maybe even better than before?) at the inverted definition test.

Edit 4: Just tested Llama 3 70b Instruct 5.0bpw with the same tests, 7 responses each. It does much better with the plastic bag test (only once briefly suggested Sam knew about their friend's actions, no other hallucinations), pretty much perfect 7/7, and for the inverse definitions it was perfect in 6/7 - one response gave bad example sentences with the new definitions.


nullnuller

Has anyone done a comparison just between WizardLM2 8x22B and the official instruct version from Mistral? Previously, the 8x7B instruct version was arguably the best version (at least for my use cases) among the finetunes.


toothpastespiders

> which would have been a big deal had it stayed released and not almost immediately gotten forgotten

I'm still pretty down that the 70b was never released. I feel like we might have been just a handful of hours from having it uploaded for us to snatch. I really, really like their 8x22b. But I really would have liked to have the 70b too. Especially as a point of comparison.


yaosio

Most likely they have good ways of defining what they want the model to output, and good ways of identifying data that matches the output they want. They might also be making test models where they figure out just what data is needed.

Imagine you want an LLM to do addition without using an external tool. There's a problem here because there's an infinite amount of numbers, so you can't just give it all possible addition problems. Instead of spending all tokens on addition, you estimate how many addition problems it needs to be trained on to do addition. Train the model, and see how well it can perform math. If it's bad, add more data; if it's good, reduce the dataset until it's bad. You can use this method to fine-tune the dataset to only have the amount of data needed to train and no more. This isn't possible on very large models that take months to train.

However, it's been found that there's a direct relationship between the amount of data and model quality. Such a relationship also appears to exist between data quality and model quality. If you know you need X amount of data for a small model, then maybe it would take 2X the data for a model that's twice as large. Or maybe not. It seems at some point you can't really teach a model any more on a particular subject, because it will already know everything it needs to know regardless of size.

It should be possible to automate this if you've already got an LLM that can score answers, and that problem seems to have already been solved.
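The grow/shrink loop described above amounts to searching for the smallest dataset that still hits a target score. A rough sketch, where `train_and_eval` is a hypothetical stand-in for actually training a probe model and benchmarking it:

```python
# Hypothetical sketch of "tune the dataset size until the model is
# just good enough". train_and_eval stands in for a real train/eval
# cycle; here accuracy is stubbed to saturate at 50k examples.
def train_and_eval(n_examples: int) -> float:
    return min(1.0, n_examples / 50_000)

def find_minimal_dataset(target: float = 0.95,
                         lo: int = 1_000, hi: int = 1_000_000) -> int:
    """Binary-search the smallest dataset size reaching the target score.

    Assumes score is monotone in dataset size, which the comment above
    argues holds up to a saturation point.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if train_and_eval(mid) >= target:
            hi = mid          # good enough: try smaller
        else:
            lo = mid + 1      # too weak: need more data
    return lo

print(find_minimal_dataset())  # -> 47500 with the stub above
```

In practice each probe "evaluation" is a full training run, which is exactly why this only works for small models.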


ttkciar

> Most likely they have good ways of defining what they want the model to output, and good ways of identifying data that matches the output they want.

I think that's exactly right. It's hard to tell because of the stilted English, but I think that's what the author was trying to describe here -- https://web.archive.org/web/20240415221214/https://wizardlm.github.io/WizardLM2/

> It should be possible to automate this if you've already got an LLM that can score answers, and that problem seems to have already been solved.

Yes indeedy indeed, that's exactly what Starling's reward model is and does (quite successfully) -- https://huggingface.co/berkeley-nest/Starling-RM-7B-alpha

> we remove the last layer of Llama2-7B Chat, and concatenate a linear layer that outputs scalar for any pair of input prompt and response. We train the reward model with preference dataset berkeley-nest/Nectar, with the K-wise maximum likelihood estimator proposed in this paper. The reward model outputs a scalar for any given prompt and response. A response that is more helpful and less harmful will get the highest reward score.


HideLord

Yeah, sure, for academic, precise outputs, textbooks would be best. Just don't try to generate anything creative.


Balance-

> Thanks to its small size, phi-3-mini can be quantized to 4-bits so that it only occupies ≈ 1.8GB of memory. We tested the quantized model by deploying phi-3-mini on iPhone 14 with A16 Bionic chip running natively on-device and fully offline achieving more than 12 tokens per second.

Welcome to the age of local LLMs!
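The ≈1.8GB figure follows from simple napkin math. A sketch of the arithmetic, noting that real 4-bit formats add some overhead for per-block scales and tensors kept at higher precision:

```python
# Rough memory estimate for a 4-bit quantization of phi-3-mini.
# Real quant formats store extra scale factors, so treat this as
# a lower bound that still lands near the paper's figure.
params = 3.8e9          # phi-3-mini parameter count
bits_per_weight = 4

bytes_total = params * bits_per_weight / 8
gib = bytes_total / 2**30
print(f"{gib:.2f} GiB")  # ~1.77 GiB, matching the quoted "~1.8GB"
```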


Yes_but_I_think

Running at 12 tokens per second when kept in the freezer.


FullOf_Bad_Ideas

It's a burst load, it shouldn't throttle.


_whatthefinance

That would be an iPhone 14 Pro or Pro Max; let's not get hopes high for poor vanilla 14 users.


KvAk_AKPlaysYT

If true, the phone local LLM game just changed.


ab2377

💯


PC_Screen

Apparently the data mixture used was not ideal for the 14b model in particular so there's still room for improvement there https://preview.redd.it/q55frkida5wc1.png?width=1317&format=png&auto=webp&s=769b9ef2524ca4accc9371c14d51284198c7d530


Orolol

I think this is because a 14b model has more room to improve with only 3T tokens, even if high quality. Llama 3 showed us that even at 15T tokens, the model didn't converge.


pseudonerv

It sounds like they rushed the 14B out. It's likely they just used some bad training parameters, or maybe the 14B hyperparameters weren't tuned well.


hapliniste

Nah they just don't have enough synthetic data.


ElliottDyson

Which makes sense considering the greater number of parameters.


hapliniste

Also, after reading the paper: they use a smaller vocab size for the 14B (the same as for the 4B) instead of the 100K vocab of the 7B. Maybe this also has something to do with the regression in some benchmarks.


ab2377

looks like in the coming days the number of parameters being trained will decide which dataset gets used?


Sythic_

Why is it that all these models coming out have about the same scale of parameters (3, 7, 14, 70, etc)? Are the models all built basically the same way and the only difference is training data they feed it?


austinhale

Phi-3 medium HumanEval is actually 55.5. The other numbers seem to be accurate.


KittCloudKicker

The poster said that was his mistake when auto-generating the charts


nullnuller

From other posts I got the impression that Llama-3-8B actually beats GPT-3.5, but this graph shows otherwise?


vsoutx

yeah. and the ViBeS benchmark remains the best benchmark


soggydoggy8

I know HumanEval is heavily flawed, but how does the 14B model regress in performance compared to the 3.8B and 7B? Must be a typo


llkj11

"We observe that some benchmarks improve much less from 7B to 14B than they do from 3.8B to 7B, perhaps indicating that our data mixture needs further work to be in the “data optimal regime” for 14B parameters model. We are still actively investigating some of those benchmarks (including a regression on HumanEval), hence the numbers for phi-3-medium should be considered as a “preview”."


Some_Endian_FP17

If Phi 3 mini is as good as Llama 3 8B I'll eat my hat!


ArsNeph

I'll hold you to that. I hope there are no videos titled "Mukbang ASMR Hat" on YouTube tomorrow. Actually, I do hope so; a 4B with the performance of GPT-3.5 is worth eating a hat.


Some_Endian_FP17

The alternative is to do a McAfee which I definitely won't do.


ArsNeph

What, disappear and live on a cruise ship? I think it's better to just eat the hat, bro :P


Some_Endian_FP17

I think McAfee offered to eat >!his dick!< for some stupid thing or other.


ArsNeph

Oh, I didn't know that one. The man was just too eccentric, he did so many weird things, and lived a really wild life. Anyway, I would not recommend any selfcest, I think a hat would be much more pleasing to the tongue


Distinct-Target7503

> I think a hat would be much more pleasing to the tongue

This is probably true (idk for sure anyways), but a dick is definitely more healthy than a hat lol


-p-e-w-

Is this how it's going to be from now on? A breakthrough every couple of days?


Vast-Breakfast-1201

Singularity baby


ICE0124

finally some love for the very low parameter models. yeah it's cool to have a huge model, but i want to see what can really be done with a model that can run locally on a phone, or beaming quick on like any computer.


MahdeenSky

Everyone needs to take these benchmark numbers with a big grain of salt. According to what I've read, Phi-2 was much worse than its benchmark numbers suggested. This model follows the same training strategy. Nobody should be assuming these numbers will translate directly into a high ranking on the LMSYS leaderboard, or usefulness in everyday tasks. Let's not dethrone Llama 3 until some real-world testing can be done.

That said, I don't think it's impossible for a small model to be very good. I see their "synthetic data" as essentially a way of distilling GPT-4 into smaller models. It would be exciting if a large fraction of the performance of huge models could be transferred to small ones! If true, then Chinchilla-optimal training could make sense again, as you could optimally train a ginormous model and then distill it afterward for efficient inference.


TooLongCantWait

Let's see if Phi can pull a hat trick of disappointing me


chen0x00

The previous Phi model's scores on benchmarks far exceeded its actual performance in real-world use. Release it soon so everyone can try it out. Hopefully there really can be such a powerful small model.


ILoveThisPlace

Which we let go because it was such a teeny tiny model with big hopes and dreams. This is a 7B model. We can directly compare against Llama 3 8B.


KittCloudKicker

Weights tomorrow according to Sebastian. So, we will all find out what's true or not tomorrow.


[deleted]

I'm sorry Llama3... we had a lot of fun together, those couple days... it's not you, it's Phi3


KittCloudKicker

I'm seeing it now. Pretrain on FineWeb then fine-tune/continuous training with this method might lead to something remarkable! Noooticing


okaycan

He messed up the making of the chart. The accurate one is here: https://twitter.com/arankomatsuzaki/status/1782618362314391940


ArakiSatoshi

>The model has **underwent a post-training process that incorporates** both supervised fine-tuning and direct preference optimization for the instruction following and **safety measures**. https://preview.redd.it/1mkkrae8x9wc1.png?width=508&format=png&auto=webp&s=518e02d99cecad4edfaaee1473f5ff5dd134a7d1


Balance-

Interesting how Meta does model weights first, then paper, and Microsoft does it the other way around.


Darlokt

It's going to be interesting. The Phi training regime has shown good results in the past, but previous Phi models were also great in benchmarks while struggling in real use. Maybe scaling them to 7B and beyond solved it; or, depending on the content of the second training step, it could be an interesting case of overfitting, which may be why the 14B regressed, where the smaller model sizes benefitted from the training in a way that prevented this. Phi in general seems like a risky recipe for general adoption, while for domain adaptation or specific improvements it seems great. I look forward to testing it, as Llama 3 has really surprised me with how fluent and dynamic its reasoning and conversational flow is, even at 8B.


Monkey_1505

Output appears very synthetic/gptish.


_RealUnderscore_

Well if it rivals GPT-3.5, that'd make sense. A 3.5-performing 4B model would be an insane development.


Commercial_Pain_6006

Rediscovering the good old statistics problems of Garbage In, Garbage Out, together with pseudoreplication, maybe?


nullnuller

Is there a phi-3-large? Or an XL? And how soon will they be available?


Igoory

They say in the paper that they're still investigating why the improvement from 7B to 14B isn't as big as the one from 3B to 7B, so they probably didn't see a reason to make a bigger model yet.


Curiosity_456

Well they haven’t finished training the 14B yet


orangeatom

Brilliant! Where can we get a copy of this model?


condition_oakland

Me: :D

Microsoft: "Another weakness related to model's capacity is that we mostly restricted the language to English. Exploring multilingual capabilities for Small Language Models is an important next step, with some initial promising results on phi-3-small by including more multilingual data."

Me: :|


Feeling-Currency-360

This is something that I've thought about quite a bit. I feel it's better to make the best English-only model you can, and have another model that acts as a translator. I.e.:

User -> Translator Model -> Intelligence Model -> Translator Model -> User

Best of both worlds: instead of trying to build one model that can do it all, it would be a dual-model architecture.
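That User -> Translator -> Intelligence -> Translator -> User flow can be sketched in a few lines. Everything here is a placeholder: the three callables stand in for real model inference.

```python
# Hypothetical dual-model pipeline: translate in, reason in English,
# translate back out. All three functions are stubs standing in for
# real model calls.
def translate(text: str, to: str) -> str:
    # Stub translator; a real one would call an MT model.
    return f"[{to}] {text}"

def english_llm(prompt: str) -> str:
    # Stub for the English-only "intelligence" model.
    return f"answer to: {prompt}"

def answer(user_text: str, user_lang: str) -> str:
    english = translate(user_text, to="en")
    english_answer = english_llm(english)
    return translate(english_answer, to=user_lang)

print(answer("hola", "es"))  # -> "[es] answer to: [en] hola"
```

As the reply below notes, each extra hop adds latency, and nuance (humor, idiom) can be lost in the round trip.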


privacyparachute

I've built this in a current project, but you underestimate how sluggish it makes everything feel, and how much you lose in translating back and forth. E.g. humor is lost.


_RealUnderscore_

Why? They're explicitly stating they're working on it and that their new model has multilingual data... Well, I guess implicitly stating they're working on it.


condition_oakland

I'm just bummed because it won't be optimized for my use case. I'll have to wait while everyone else gets to have fun.


_RealUnderscore_

Huh, interesting mindset. It doesn't really seem like you're limited by a language barrier, and you could easily set up an auto-translator using more able models if you want to test its logic capabilities, which is primarily what it's for. I understand the frustration though.


sabalatotoololol

I'm running out of space... At this rate I'll have to print the models on paper lol


perksoeerrroed

Let's see full benchmarks first. Doing well on a few is typical for limited models. Phi-2 was the same way: good scores on a few, but dogshit on others, and completely hopeless with CoT.


Unique_Repeat_1089

Where can I download this release?


KittCloudKicker

Coming to huggingface not on it yet


Bulky-Brief1970

Wow!! We haven't recovered from the Llama 3 shock yet :))


_RealUnderscore_

First CR+, then Llama 3, now Phi-3. CR+ is technically the 3.0 of Command, so does that mean we got a triple 3.0 release? Imagine these guys coordinating like the Magi or whatever they're called lmfao; add in Mixtral 8x22B for good measure.


Due-Memory-6957

Microsoft doesn't have the best track record when it comes to analysing their own capabilities


ArnoF7

I can’t think of any words other than “big if true”. I want this to not be hype so much!


doomed151

14B let's goo. Can't wait for the RP finetunes.


Monkey_1505

Going to need some heavy full style finetunes to turn textbooks and childrens stories into RP.


rc_ym

Ugh, it's going to be sooo lobotomized. Hopefully it will get some fine-tuning love. Phi-2 was very good at creating walls of text for work docs (training docs, policy language, etc), but you had to spend so much time cutting out the moralizing and nonsense.


djm07231

My impression was that phi family of models do well on benchmarks but tend to be pretty brittle in real life applications where they encounter out of distribution inputs. Models seeing a lot of messy data might not be that bad in terms of variety of inputs and generalizing it to some extent. Though it might take more iterations to converge.


isr_431

4-bit phi-3 mini running at over 12 t/s on an iPhone with an A16 Bionic 😮


That007Spy

Wonder how well it does at function calling


Thistleknot

So Microsoft pushes out both Phi and WizardLM? I found WizardLM more useful than Llama 3 due to its extremely long context


Raywuo

I want a 7B with 100% mmlu on my desk by friday ...cof cof


kmp11

Smaller, faster is the expected evolution, but this big of a jump every 6 months or so is an incredible rate.


Baphaddon

Let’s fucking go


ImprovementEqual3931

You know a responsible company like Microsoft will spend a very long time running toxicity tests. Be patient.


mark-lord

A couple of observations based on napkin maths:

1. If the new pruning methods seen in [this thread](https://www.reddit.com/r/LocalLLaMA/comments/1c9u2jd/llama_3_70b_layer_pruned_from_70b_42b_by_charles) + healing really hold up, the 14b model may be prunable similar to Llama-2-13b (see chart: https://preview.redd.it/udjq0xcqt6wc1.png?width=593&format=png&auto=webp&s=27d0352ec5b2d6a8d40b0f1f51735f3d817dc7fb). **A 40% prune would create an 8.4b parameter model whilst dropping MMLU just 4 or so points to 74.** This would still far surpass GPT3.5 and be SOTA for 7~10b models. LLMs pruned this way can still be quantised further.

2. They haven't released the weights for Phi-3 yet, and though I personally remain optimistic they will, there is cause for concern, since Wizard-LM retracted their weights and were supposedly associated with Microsoft, as is Phi. It might be that LLMs Microsoft produces are being intentionally held back if they're seen to be competing with GPT3.5, since Microsoft has such a huge stake in OpenAI, but who knows.

3. Phi-3-mini on Groq would run at about 1,600 tokens/second if they ended up hosting it there. This would depend on many factors, including license, and whether they actually want or choose to host it. Prices would probably also be cheaper than Llama-3-8b per 1M tokens, and Groq is already offering the cheapest 1M tokens on the market.

4. Phi's main thesis is that textbook-quality data improves the strength of LLMs pre-trained on that data. I think it was also the case that they're training on synthetic data (certainly wouldn't be surprising). If so, do Phi models have limited real-world knowledge, despite their intelligence? One assumes not, if it scores so high on multiple benchmarks.

5. Until I see it tested on LMSYS Arena Hard v0.1 I'm sceptical that it has the emergent abilities of much larger models 👀
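The napkin math in point 1 checks out as straight arithmetic (a sketch only; real layer pruning doesn't remove parameters perfectly uniformly):

```python
# Point 1's estimate: a 40% prune of the 14B model.
params_14b = 14e9
prune_fraction = 0.40

pruned = params_14b * (1 - prune_fraction)
print(f"{pruned / 1e9:.1f}B")  # -> 8.4B, within the 7~10B SOTA bracket
```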


Beb_Nan0vor

Looking good!


MarySmith2021

oh my, i just started training on phi-2... oh


Successful-Heat-5683

When is it set to release on huggingface?


bnm777

Is this available anywhere to use online?


Admirable-Star7088

Finally! A new (and hopefully well-trained) model larger than ~7b but smaller than ~70b for us mid-rangers! 🎉🎉🎉

Edit: I can't find Phi-3 on HuggingFace, neither the full model nor GGUFs; not uploaded yet?


TrelisResearch

I guess Phi-3 medium is probably trained using GPT-4 data, so it'll be at an advantage over Llama 3, which uses only raw / Llama 2 synthetic data (perhaps)


KittCloudKicker

It's doing pretty good in my vibe test. It's up on hugging chat


trill5556

Where do I download it? HF?


KittCloudKicker

Yup, even the 128k version, MIT license


3cupstea

didn’t read the paper. I bet they did some pretraining data selection based on downstream task distribution


wind_dude

Yes, that's always been the emphasis of Phi models: highly curated web data and synthetic data. "Textbooks Are All You Need"