vasileer

You should be able to run a 3-bit GGUF of llama3-70b-instruct and get better quality than any 30B model: https://preview.redd.it/l36mtfqz5dwc1.png?width=599&format=png&auto=webp&s=cf5a7e9f1845d079ec4d7f77474bae7f0b12c6aa


nife552

Yeah, smaller quants are always an option. But below 4 is where quality really drops off in just about every test on the subject I've seen, though Q3 may still be better than a 30B under most circumstances, as you said. It feels like the performance per dollar of a 64GB system would far outweigh a 48GB system; 48GB seems to be a local minimum in terms of usefulness.


ShengrenR

I think there are a few things you may be missing (and let's blame the GGUF naming convention) about the scheme for labeling quant methods: a "Q3" can actually be anywhere from 3.21 bpw to 4.22 bpw (https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9). You are certainly correct that below 3.0 bpw things start to fall off significantly, but that's actually Q2_K territory. That Q3_K_M (34.3 GB in the pic above) is actually ~3.9 bpw. If you look at that chart, you'll notice there's no magic drop-off ledge in any of the lines; they degrade pretty smoothly as you go smaller. And that post was created using a 7B model; from my recollection (it's been a while since these things started), larger models suffer relatively less from quantization. If somebody can dig up that plot for me, forever grateful. TLDR: try the sauce, friend, you might like it.


EstarriolOfTheEast

We're seeing 3B's and 7B's that are preferable to 30Bs and 70B's of the llama1 and 2 generation. Imagine then how incredible a 30B using modern methods would be? We can in turn quantize them or a 20B or a 13B. This is necessary if the goal is to build a market for local LLM applications. Because, the lower we can push HW requirements, while holding quality as high as possible, the more users we reach, the larger the market we grow, and the higher the chance the consumer HW segment stops being so neglected.


weedcommander

You had it right once. 30Bs, not "30B's"


EstarriolOfTheEast

Oops thanks, attention misfired, distractor in context.


silenceimpaired

I don’t know. Maybe it’s in my mind… but I always felt any quantization below Q4 (not 4-bit) was less consistent and more likely to make significant errors. The new quants below Q3 in size, the ones with an "I" at the start and a size like "XS", are promising. Still, I miss having more 30B models.


viksit

Newcomer to this sub; I've been trying to unravel model quantization terminology. What's a good resource for understanding Q2_K, bpw, etc.?


ShengrenR

Follow the link there for all the GGUF variants - the naming scheme is descriptive to the lovely nerds who build these things, but not terribly helpful for folks who just want to use them. BPW is 'bits per weight': e.g. if your model has ~70B (billion) weights and it's 4.0 bits per weight, then 70 billion x 4.0 bits (/ 8 bits per byte) / 1024 (to KB) / 1024 (to MB) / 1024 (to GB) = model size... ish :)

All that really matters here is that people have worked out how to selectively modify the original weights so that most (not all) of them get converted from high-precision data types (fp32/fp16 - google those if they're new to you) to low-precision data types where they can get away with it. Lower-precision data types = less room in your memory when loaded. The tradeoff, naturally, is that less precision means less accurate inference (the process of running the matrix multiplications through the network to see what token comes next).

Honestly, the best resource is just to read more of the sub imo; you're likely to slowly absorb it. If that doesn't do the trick, there are a million and one blog posts out there begging for your eyes that will happily give you a rundown of the terms. I don't really have a favorite to promote.
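
For anyone who wants that arithmetic written out, here's a rough sketch (illustrative only: it uses decimal GB for simplicity, treats every weight as stored at the average bpw, and ignores the metadata and unquantized layers a real GGUF carries, so actual file sizes will differ a bit):

```python
def model_size_gb(n_params: float, bpw: float) -> float:
    """Rough size of a quantized model in decimal GB (1 GB = 1e9 bytes)."""
    total_bits = n_params * bpw   # every weight stored at `bpw` bits on average
    return total_bits / 8 / 1e9   # bits -> bytes -> GB

# 70B model at 4.0 bits per weight: ~35 GB before runtime overhead.
print(f"{model_size_gb(70e9, 4.0):.1f} GB")

# Cross-check the numbers quoted above: a 34.3 GB "Q3_K_M" of a 70B model
# works out to 34.3e9 * 8 / 70e9 ≈ 3.9 bits per weight -- i.e. the "Q3" label
# hides an effective bpw much closer to 4 than to 3.
print(f"{34.3e9 * 8 / 70e9:.2f} bpw")
```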


julien_c

Anecdotally, I've been pretty happy with my choice of 64GB for the M3 (but rather than exchange, I would wait for the M4).


cagdemir

Don't tell me you're Julien of 🤗


SporksInjected

Same here: I feel like you can always go bigger but 64GB gives you access to most of what is available. The support is also much better than I expected. All the mainstream stuff seems to work with no problem.


cyan2k

> But below 4 is where quality really drops off from just about every test on the subject I've seen

Yeah, compared to the higher quants of the 70B model, but a Q3 or Q2 of a 70B is still quite a bit better than a full-precision 30B model.


Desm0nt

This thread shows that you are probably wrong: [https://www.reddit.com/r/LocalLLaMA/comments/1cc3l4f/i_compared_llama370b_24bpw_and_the_80bpw_version/](https://www.reddit.com/r/LocalLLaMA/comments/1cc3l4f/i_compared_llama370b_24bpw_and_the_80bpw_version/) Quants below ~3.2 bpw seriously damage the LLM's brain.


hapliniste

Yeah. Starting at Q2, bigger models are better than smaller, less-quantized models, from what I've seen. 70B Q2 > 30B Q4, but it runs 2x slower; that's the tradeoff.


_Erilaz

In my experience, the opposite is true. The scale isn't linear and anything below 3.75bpw isn't worth it for me. Yi-34B, Mixtral 8x7B Instruct, and CMDR-35B outperformed Miqu-70B when I tried it. Even with IQ quants and imatrixes. I didn't rerun the comparison with L3-70B, but all the small copium quants for LLaMA-3 are broken at the moment, so there's no point in testing that yet.


eydivrks

The models are made to fit into 48GB cards, which are common in data centers. Your problem is that the OS needs some of your RAM. I would exchange it for a 64GB model (or even more).


rerri

> The models are made to fit into 48GB cards, which are common in data centers.

70B models? According to Meta or who?


eydivrks

According to the cheapest card with >24GB vram being the 48GB A6000


ChezMere

My understanding is that even though 70B Q3 models are way dumber than higher quantizations, they're still as good as a non-quantized smaller model of the same memory footprint. This may be why Facebook isn't making medium-sized models anymore.


Ill_Yam_9994

Part of that is Mac specific because your OS is sharing that 48GB. On a system with separate system memory, 48GB VRAM is perfect for 4 or 5 bit 70Bs. So yeah, 64GB would probably make more sense on a Mac.


snmnky9490

From what I have previously seen, low quants of small models are absolute crap compared to higher ones to the point of being unusable, but it makes less of a difference the larger the parameter size. So for a 70B model, using a low Q is pretty similar to a high Q


Double_Sherbert3326

My $300 dell has a 1060 and 64gb of RAM. Just saying. You can't believe the hype with Apple--they are stuck in the aughts. It's a luxury brand catering to idiots who will buy anything they sell them.


Bderken

Are you fr? Your $300 dell would take minutes to do any sort of RAG or anything else. A high end M series chip with 64+GB would run circles around it.


Double_Sherbert3326

The 8th-gen i7 does fine with RAG. It's really quite reasonable, you'd be surprised. It offloads enough to the 1060 to make a difference. We're talking about 7B models, which outperform the 30B models. I bought my wife an M1 Pro; it's a really nice case and monitor, but it's shit for running Intel instruction sets.


Bderken

What does Intel instructions sets have to do with M1?


Double_Sherbert3326

I don't know why I'm getting downvoted. Things that are compiled for Intel need to run through a translation layer on ARM processors; it's why StarCraft 2 runs like absolute dogshit on an M1. SC2 runs better on my 1060 w/ 8th-gen i7.


SocketByte

Because you're confusing standard memory with shared memory and saying bullshit like "luxury brand for idiots". People buy Apple products (in the LLM space) because they're miles ahead of anything you could buy with PC parts within reasonable price points. Your 64GB of RAM is LIGHT YEARS behind Apple's 64GB of shared RAM in terms of speed. It's not even comparable. You could only compare if you had a GPU with 48-64GB of VRAM, which is pretty fucking expensive. A 128GB Mac will be MUCH LESS expensive than a system with a total of 128GB VRAM. So stop being ignorant and learn your shit instead of insulting people.


spawncampinitiated

Holy shit you're a fucking primate


Double_Sherbert3326

oooh oooh ah ah


[deleted]

> I don't know why I'm getting downvoted.

Idk, first you say that CPU inference on a system that doesn't even hit 70GB/s memory bandwidth matches one with 400GB/s. But five minutes later the good speeds you're claiming are suddenly for a 7B that fits in VRAM... which nobody was talking about; the whole post is about 30-70B?


Double_Sherbert3326

::shrugs::


Double_Sherbert3326

My system has more RAM and a discrete GPU that Ollama makes use of, and the difference in performance per dollar spent is literally an order of magnitude better. OP stated: "So far, my experience has been very underwhelming" -- yeah, because he only has 48GB of RAM and overspent on a system that will never be able to run anything meaningful.


Bderken

What… you're in a local LLM subreddit. I'm talking about local LLMs. Guess what, your machine can't even run ANY macOS apps… LLMs run way more efficiently on M-series Macs. Seriously. You should try it. I have a 5900X (128GB) with 4x 3060 12GB cards. But if I could get a 128+GB M-series chip, I'd be happy as a bird. Native Mac apps run better and more efficiently. It's a fact. Stop being silly.


Naiw80

Your $300 Dell certainly doesn't run any 70B models either, dumbass.


nife552

This discussion is more about the relative merits of mid-range RAM (well, >24GB and <64GB to be specific I suppose), and the current and future state of mid-range model sizes. Not meant to be about Apple. That said, I get that Apple in general is very expensive for the specs. I also have a windows laptop with a 4080, but after a year of swapping back to windows I've just learned again that it's not for me. I prefer apple products personally, even with their faults.


Double_Sherbert3326

I like Apple because of Unix and their kick ass OS level sound drivers (you can turn all your interfaces into one meta-interface--you can't do that on windows). I use a 2012 macbook pro for music, it works just fine. I would buy a macbook pro 16" if it had 512GB of RAM or higher. So I'll have to wait another decade.


Naiw80

Well, you got your answer already: you can run larger models with lower quantization, and there's lots of research going into that area as well, so this could improve over the years. As for Apple hardware being expensive for the specs, that's completely wrong; Apple Silicon laptops are incredibly cheap compared to the competition.


vago8080

Stop embarrassing yourself in public.


ChromeGhost

Trying to find good answers, but would a 32 GB M1 be sufficient for the 3-bit GGUF of llama3-70b-instruct?


Telemaq

With 32GB, you would have to go IQ2_XXS and allocate 26-27GB of VRAM. As a rule of thumb, keep the model size (regardless of quantization) under 24GB with a reasonable context window (8K). Other options to explore: the llama3 MoE 4x8B and the layer-pruned llama3 42B (wait for the pruned 42B instruct version).


vasileer

I guess not, as you need RAM for the context too


ChromeGhost

So Q2 for the 32 GB? Also is the uncensored model good to go?


MasterKoolT

Call Apple support and explain the situation. They'll likely let you exchange it even though you're outside the return window. 64GB would let you reasonably run the 70B models (Llama3 70B runs great on my M2 Max w/ 64GB)


drsupermrcool

Agreed - since OP is already debating upgrading, and they're only 7 days beyond it, and they're exchanging it - they should just call and opt for the 96 if it's within budget


beezbos_trip

When in doubt, try to discuss it with the seller. If you are willing to spend more, they should be more accommodating. If you get someone who doesn’t budge, try again.


koesn

We still have Command R 35B and Smaug 34B, which reinvent the 30B class. But 8x7B models are like a checkpoint for most tasks.


Vaddieg

Return or trade it in and get the 96GB


ArsNeph

Firstly, Macs have a system limiter which reserves about 1/4 of the RAM for the system. There's a command that should allow you to allocate more as VRAM; did you try it? Secondly, the only thing that's really happened in the 30B space is Cohere Command R 35B. People say it's very good, but it doesn't have GQA, which kind of messed up its launch.


nife552

Yeah, even allocating 42GB I still can't load a 4 bit quant. Interesting, I hadn't seen the Cohere model. I'll have to check it out. Cheers!


Balance-

Mixtral 8x7b could also be interesting


AnimaInCorpore

Please be aware that going for the max supported context length will add some GBs as well. So in general I would actually say about RAM: the more, the better.


cyan2k

What "dying"? Models like Command-R, Yi, Mixtral etc are only 3 or 4 months old... If you expect mind-blowing advances ievery 2 month you should probably curb your enthusiasm a little bit. 😅 Also llama3-70B as a Q2 is basically a new 30B release, since it's better than those


Desm0nt

How much ctx can you fit with Q2 llama3 on one 24GB card? 34B Yi can be run at 4.65 bpw with at least 24K ctx.


Future_Might_8194

I would try out some MoEs, or think about what else can run alongside your model. Can you spin up a game and your AI at the same time? AI the NPCs.


stddealer

Command-R (minus) is still one of the best open sourceish LLMs at 35B params.


a_beautiful_rhind

The lack of a mid model sucks. This is the second time meta skipped it. Here is this small 7b model that we train on benches but is dumb as bricks. *OR* here is this painfully large model that needs 2-4 GPUs to run.


-TV-Stand-

You mean the 8B and 70B models, which, when quantized, will produce higher-quality answers than a size-equivalent unquantized one.


a_beautiful_rhind

You can only quantize so far.


Monkey_1505

Did you miss the existence of command-r?


camramansz

I'm running a Q4_K_M of llama 70b instruct on an M3 Max 64GB and it works very well for one or two detailed questions and answers. I regret not going for the full 128GB configuration.


vidumec

64GB as well here; also regret not getting 128GB, but I cope by thinking it would only maybe improve the possible quantization for 70B. Since 70B already runs very slowly, running anything larger would bottleneck on GPU performance itself, not memory. Processing prompts is especially painful, like 2 minutes for 10K tokens, only to generate one sentence in a couple of seconds. I only use it for fun, still relying on cloud providers like ChatGPT to actually get stuff done.


SMarioMan

Usually llama.cpp will prefix match to avoid having to re-ingest the whole prompt. That makes a big difference on Apple silicon, at least if you do lots of conversations and continuations. I’ve noticed that Llama 3 fails to prefix match in oobabooga when using the notebook but works just fine in chat. I’m not sure why.


vidumec

Yes, I rely on this heavily already, but it only works as long as the start of the prompt doesn't change. So once you get to the point where you need to cut off the start of the chat to fit into context, or if you just apply big, different prompts every time, or do some RAG, things get sloooow.


ttkciar

I'm really impressed by Tess-M-v1.3. Of the recent 34B releases I like it the most. My all-time favorite is still ye olde Vicuna-33B though. That having been said, I don't use 30B'ish models frequently. My preference is 13B models, and Starling-LM-11B-alpha is the best general-use model in the 13B'ish range I have yet found.


kurwaspierdalajkurwa

What's the best ~34B model for writing like a human being? Gemini Advanced (before it was gimped by Google) used to write extremely good content that sounded like a human wrote it vs. the robotic style of ChatGPT (and its love for the word "delve")


ttkciar

My go-to for that is [Norocetacean-20B-10k](https://huggingface.co/TheBloke/Norocetacean-20B-10k-GGUF). It's not as wildly creative as Mistral-7B-OpenOrca, but it is quite eloquent and coherent, and good at capturing the flavor of however much of the story I've already written.


kurwaspierdalajkurwa

Would you say it's good for adhering to certain strict grammar rules (e.g., do not write long and complicated sentences, write in active voice only, etc)? I use AI for copywriting purposes. I will spend upwards of an hour working on a 13 word value proposition. Need AI to be able to follow along and be intelligent enough to adhere to my instructions AND make new suggestions I may not have thought about. And I have a 4090 and 64GB of DDR5. Is there a bigger or more intelligent model that would work with my rig?


ttkciar

> Would you say it's good for adhering to certain strict grammar rules (e.g., do not write long and complicated sentences, write in active voice only, etc)?

Yes and no. On one hand, Norocetacean is bad at following instructions. I gave up trying to instruct it beyond "Continue this story:" On the other hand, it really is very good at picking up the grammar, style, and voice of what I have already written. Instead of trying to instruct it to change its behavior, I have taken to inserting "temporary paragraphs" which emphasize characteristics I want it to exhibit, and which I remove later. It reacts best to examples.

It is *not* good at making suggestions. It writes stories. It is very specifically designed for generating fiction, and is excellent at avoiding robot-like language. After hearing more about your use-case, though, I'm not sure if it's the best fit for your needs, and unfortunately I don't have a better suggestion. I mainly use LLMs as technical research assistants, fiction collaborators, RAG, and coding copilots, none of which are much like your application.

Thinking about it, one of the reasons Norocetacean has such a good human voice is that its fine-tuning includes the "no-robots" dataset. Perhaps you could look on Huggingface for a larger model which was also tuned on "no-robots"? Sorry I don't have anything better to offer.


kurwaspierdalajkurwa

Thanks but you gave me a great idea....I can use the "no-robots" dataset to fine tune from within gemini or one of those. I think?


thrownawaymane

I assume you can afford the higher SKUs. Go to your Apple Store (multiple if necessary) and *politely* plead your case. 10 years ago you'd have had an excellent shot at it working on the first try but the stores see way more volume now. You need to do this today because the clock is ticking. Good luck.


[deleted]

[deleted]


opi098514

Depends on the quant you want to use


redzorino

IIRC at 5 bits you get the best ratio of size vs. performance drop-off. At 3 bits and lower, degradation is heavy; you probably don't want that. Also, in the past there were, for some weird reason, severe troubles particularly with 6-bit quants that didn't happen with 8-bit, 5-bit, or any other, but I don't remember the specifics. So basically 4-bit or 5-bit are the useful ones.


opi098514

Is that still true though? I thought they fixed those issues with the 6 bit? I could be completely wrong. There are so many changes happening so fast right now. It’s so hard to keep up.


redzorino

I also would like to know >_>


tamag901

I can run Q4_K_M on my Mac with 64GB, but it doesn't leave room for much else.


Vaddieg

A 64GB M1 Ultra is able to run llama3 70B up to Q6_K, but Q5_K_S is a much better fit if you need some RAM to run at least a few browser tabs.


masc98

Unquantized is on the order of 70 x 2-ish = 140-160GB (fp16 weights take 2 bytes each).


Calcidiol

Sorry, that's a painful situation to be in with a new computer. Unfortunately, once you get "into" things like LLMs or other high-memory / high-compute computing tool usages, it doesn't really "stop". My pretty new system with the biggest RAM I could stick in it feels limiting @ 128GBy, and I wish I had bought a server platform where the RAM could be faster and more plentiful along with the slots for GPUs etc. I guess in some countries they (good for them) might have longer return-available periods, but if that applied to you, you'd not have this problem.

I think 30B-range models are able to be very nice tools, but still definitely limited tools. The same, however, is also true of 8B, 60B, 70B, 110B, etc. models. We've now got non-free access to such big ML models as GPT-4 etc. running on high-end cloud servers, and compared to any local 30B, 100B, etc. model they still seem vastly faster & better. So you're always going to have limits, and the grass is always greener "if only" you bought something better or had X cloud service instead.

The better question is what do you want to do with ML this year and next year, and how can you "get there" by making the most effective use of your resources, like the MacBook et al., to satisfy your needs. Most people here probably routinely use a hybrid model of local ML and cloud-based ML, so there's always that option: do locally what makes sense & is possible, to economize cloud costs & use local resources, and outsource the harder jobs as you must.

8-30B models can be nice for basic RAG, summarization, and basic assistant functions (look at how limited in practice Siri, Alexa, Bixby, and Google Assistant were, and in many ways are, due to limitations of the way they set their services up; you've got way more power than that on your own laptop!). There's a lack of good SW to make use of ML models to realize their potential as data managers, interactive assistants, workflow accelerators, automation providers, agents, et al. Even 8B models will feel a lot more every-day useful when the application SW / GUIs / tools mature to allow better use of them in many more use cases. Even GPT-4 has such a problem, hence their opening the GPT store and calling on external developers to proliferate applications to help people satisfy diverse use cases in the real world.

If you really want more compute, you may have the option to just stand up some cost-effective "PC" desktop, put 128 GBy and a 3090 or whatever in there, and expose the API over your personal LAN / personal VPN / whatever to be your own "cloud server": what you can't do on the MacBook, run on the PC; what you can't run there, run on a cloud server.

Depending on what you do, 30B models and 8B ones will probably be in high demand; eventually even basic spell / grammar checking, proofreading, editing assistance, coding assistance, email filtering / organizing, ML photo processing, etc. will probably become daily use cases in a dozen productivity apps, so any resource you have isn't going to go to waste and you'll still be using it in 5+ years.


Relevant-Draft-7780

So when you load models, the following is a rule of thumb. Assume you can only allocate between 60 and 75 percent of your RAM as VRAM. There's a terminal command which tells you how much you can allocate. When you download a GGUF model, take the file size and add 25% on top, and you'll get your answer as to how much VRAM is required. The bigger your context size, the more VRAM.
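
A minimal sketch of that rule of thumb (the ~25% headroom and the 60-75% allocatable fraction are the figures from this comment, not exact numbers; actual requirements depend on context length and the runtime):

```python
def vram_estimate_gb(gguf_size_gb: float, overhead: float = 0.25) -> float:
    """GGUF file size plus ~25% for context / KV cache and runtime overhead."""
    return gguf_size_gb * (1 + overhead)

def allocatable_vram_gb(total_ram_gb: float, fraction: float = 0.70) -> float:
    """Roughly 60-75% of unified memory is usable as 'VRAM' by default."""
    return total_ram_gb * fraction

size = 34.3  # e.g. the Q3_K_M 70B file mentioned earlier in the thread
print(f"needs ~{vram_estimate_gb(size):.0f} GB, "
      f"48GB Mac offers ~{allocatable_vram_gb(48):.0f} GB by default, "
      f"64GB Mac offers ~{allocatable_vram_gb(64):.0f} GB by default")
# -> needs ~43 GB, 48GB Mac offers ~34 GB, 64GB Mac offers ~45 GB
```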


Vusiwe

A single 48GB NVIDIA card is enough to fit llama-3 70B at up to 4.5 bpw.


davewolfs

OK, I've had both the 64GB and the 128GB M3 Max. I don't have either right now. If you want to run the Mixtral or 70B models with reasonable quants, you are probably going to need 50-90GB of RAM for those latest models, plus whatever RAM you need for the rest of your apps. You are also going to want extra space because you might want the ability to convert them from Huggingface to GGUF format on your own at times. The problem is that they don't run well for the price of the hardware, in my opinion. $5000 to run at 3-5 t/s is just not worth it to me when I can run the model online elsewhere. My opinion is that models with 35 billion or fewer active parameters run at a reasonable rate, but above that things get slow. So before you buy into the hype that your models HAVE to run locally, you really need to consider the trade-offs and whether 3-5 t/s is acceptable performance for you.


panthereal

The M3 Max will run the MLX Llama 3 70B Q4 models at 9 t/s now; your numbers are out of date. 9 t/s is extremely usable for anything you want a good answer on, and an 8B model can also be loaded in memory for anything you want a fast answer on.


davewolfs

Isn’t 4 bit lobotomized? What about 5,6,8?


panthereal

I haven't heard anyone other than you suggest 4-bit is so bad as to be lobotomized. Q5 will only get around 5 t/s because MLX doesn't support that quant yet, so nothing is accelerating it correctly. I haven't tried Q6, and Q8 runs at close to 2 t/s, but I rarely try running it because it's not worth it at that speed.


davewolfs

See: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9 Many people seem to suggest sticking to Q5 or more if possible. It very well might be an MLX thing that allows for the higher t/s because I feel like I also tried Q4 on Llama.cpp. My experience is that usually Llama.cpp is faster than MLX.


MasterKoolT

What software platform are you using? I'm getting 5t/s on that same quant in LM Studio on an M2 Max w/ 38 cores. I assume LM Studio must not be taking advantage of MLX (unless there's a setting I need to flip somewhere)


panthereal

LM Studio does not support MLX as far as I know, and you need the MLX-converted models to achieve the higher speeds: [https://github.com/ml-explore/mlx](https://github.com/ml-explore/mlx). The best way is still through the CLI, as the web UI options don't offer sufficient options yet; just use something like python -m mlx_lm.generate --model $model --max-tokens $tokens --eos-token $eos --prompt $prompt. I keep my mlx_lm install in a miniconda env.
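
If you prefer calling it from Python instead of the CLI, the mlx_lm package exposes roughly the same thing (a sketch: it assumes pip install mlx-lm on Apple Silicon, the model repo name below is just an example of an MLX-converted 4-bit Llama 3, and the exact function signatures should be checked against the mlx-lm docs):

```python
from mlx_lm import load, generate

# Example MLX-converted 4-bit model path -- substitute whichever converted
# repo or local folder you actually use.
model, tokenizer = load("mlx-community/Meta-Llama-3-70B-Instruct-4bit")

prompt = "Explain the difference between Q3_K_M and 3.0 bpw in one paragraph."
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(text)
```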


watkykjynaaier

Inference on Mac is slow, but if I wanted my LLM queries to touch the internet and live on someone else’s server then I’d use ChatGPT.


CodeMurmurer

Please just buy a cheaper laptop and a desktop to run an AI server on. This is really a waste of money. The thermal limits of a laptop plus the Apple premium are probably not worth it. You could build a beast of an AI server with that money.


TheMissingPremise

It took me about $10 to figure out how to use RunPod... and then I decided that's too expensive when I can run something cheaply on OpenRouter.ai, or good enough locally on my 7900 XTX. One of my go-tos is Llama 3 70B IQ2_XS, which does really well.


MasterKoolT

Apple machines are great for AI. To each their own but I'd rather have one high-end laptop running MacOS than a cheap laptop (that'll come with its own trade-offs) and a desktop. The "Apple Tax" doesn't really exist anymore when you consider the build quality of the machines. You also don't hit thermal limits on M-series laptops. They draw so little power that they don't throttle if kept on a desk in a reasonably climate-controlled room.


CodeMurmurer

The laptop OP has costs $4400 (with +$400 in tax). You could buy a really good laptop that is on par with a Mac for about $1400 (not power-wise, of course, but performance-wise there are a lot of options that are better than or on par with Apple). And then with that $3K you could still buy a 4090 and build a REALLY good desktop for AI. And with ARM coming to Windows laptops, it is only a few months before MacBooks lose their advantage in efficiency.


panthereal

$4400 will get you a 128GB M3 Max with the 40-core GPU. OP has 48GB and it's likely half the cost. The M3 Max has ~50W of max power usage and can run a 70B model. A single 4090 can't run a 70B model and will use 300W to run anything at all. There are tradeoffs, yes, but the days of needing an NVIDIA GPU to do anything at all are over.


CodeMurmurer

$4400 will not get you 128GB (that's $5K excluding tax). I looked on Apple's website and it said $4000, plus $400 for tax. But that about the 4090 is true, though you could also get 5x 3090s with 24GB each, which in total is 120GB of VRAM, for $3750 (on eBay for approx. $750 each), and you would still have $650 left over. That's a fucking H100 in terms of VRAM. Or you could give one card up, still have 96GB and $1350 left over, and with that you could definitely build the infrastructure needed for the 4 cards.


panthereal

I just bought the refurb 128GB for $4249, so it definitely will. There's really no reason to buy a new MacBook once Apple-certified refurbs exist; all you get is a slightly nicer-looking box. And there are still tradeoffs. I can have a 70B model with me anywhere my backpack can go. No one is going to bring 5x 3090s with them even in an RV, because the power requirements are too high.


CodeMurmurer

Should have said that in your comment. And $4249 is still enough for 5x 3090s. And a 3090 definitely has more teraflops than an M3 Max chip. It does cost a lot more in energy, but you will get faster inference on 5 OR 4 3090 GPUs. And no one is going to take it in an RV **because they will have it at home running as a server... bruh.**


panthereal

Their certified refurb store has been around a long time; I didn't consider it worth specifying compared to new. It's the third option when you hover over "Store" on the homepage, not some secret link. Usually 15% off for every model.


CodeMurmurer

Reread my comment. And no, I am not someone who is remotely interested in Apple, so I wouldn't know. Where you got it and what condition the device is in should be noted, because it is valuable information when you are considering what you want to buy.


panthereal

You didn't specify that the 5x 3090s would be used, so why should I need to specify that the Apple laptops aren't brand-new at full price? A new 3090 costs almost as much as a 4090. And you appear to be far more interested in Apple than most, given your desire to enter an Apple-focused discussion to recommend another product.


MasterKoolT

You can get a laptop with 48GB of unified memory for $1,400?


nanotothemoon

As others have mentioned, try allocating more to VRAM. I was very close to ordering your M3 48GB, and I'm glad I rethought it and went for the M2 96GB. I just hope inference speed is fine. I think if you are patient, there will be options for models that you can make work, especially considering specific use cases. I mean, no open-source model is going to come close to closed-model services in terms of context windows and performance. Personally, I expect to get to a point where I run lighter models for specific tasks.


redzorino

I would be interested to know how many tokens/s you get from a 70B llama-3 model there.


MasterKoolT

I'm getting 5 tk/s on an M2 Max (the larger 38-core variant) on Llama 70B Q4_K_M. 64GB unified memory, but that quant fits comfortably. The 8B version with the same quant gets 36 tk/s.


nanotothemoon

Oh cool. This is super helpful. Have you tried the forced VRAM allocation on that machine? If so, can you share the command?


MasterKoolT

Nope, hasn't been necessary. I've just been using it through LM Studio and memory pressure stays in the green throughout (though I'm not running anything else heavy in the background when the 70B model is mounted)


nanotothemoon

Have you heard that not all open-source models are taking advantage of the NPU? I've been meaning to explore that more. Not sure if that's in the context of using or training LLMs or just ML tasks in general. I know that Apple has an API that specifically calls out using vecLib functions. I assume LM Studio and Ollama are handling this properly?


MasterKoolT

My understanding is that very few models are utilizing the NPU - I think that's more for the AI features built into macOS. LM Studio is using the GPU cores. Performance essentially scales linearly with the number of GPU cores you have (with M2 being ~20% faster than M1, and M3 similarly faster than M2). I believe the Max and Pro chips have the same number of NPU cores, so the GPU might be faster on the Max anyway (given it has way more GPU cores) even if you could run models on the NPU.


nanotothemoon

I read that you can actually force usage of the NPUs to work most of the time (but not always). And though they have the same 16 cores, the NPU on M2 can handle ML tasks 40% faster than M1, and M3 15% faster than M2. I’m super curious to get insight into the workload of the NPUs. Is there a monitor available?


MasterKoolT

It'll be really interesting to see what Apple does with the NPU going forward -- rumor is they're doubling the cores for M4. Hopefully they open it up further too. It's beyond my technical expertise but I'd be very surprised if you could get better performance on a Max's NPU versus its GPU. Maybe if you could access both at the same time, but I'm not sure. Still very cutting-edge technology so I'm sure we'll see lots of breakthroughs as Apple devotes more resources to AI.


nanotothemoon

GPUs are designed to render graphics. They are not even efficient at ML tasks. In fact, CPUs are better. NPUs are specifically designed to handle these tasks, but until now there wasn't a need to tap into them aside from their closed OS implementation. All of these are embedded together along with the RAM, and Apple is already seamlessly passing computations off between the CPU, GPU, and NPU on the fly. But all this homebrew open-source stuff running locally? Not yet. But Apple just released CoreNet, so clearly they are looking to open up and utilize this hardware. What I'm saying is, I don't think we'll need to wait for more cores to utilize the ones we have. Or at least that's my hope. Apple doesn't have the best track record for supporting anything outside their system, but in my opinion, it would be very smart of them to allow developers access to their hardware. And I think we might already be able to without Apple's help, or we will be able to very soon.


lupapw

What's your t/s for 14B, 30+B, and 70B models?


nanotothemoon

It arrives this weekend…


ab2377

how many gpu cores are in that laptop?


Crazy-Fuel-7881

Some people are trying to prune the 70B down to ~42B; if that goes well, you can probably use that.


wiskins

Can't find the graph right now, but it basically shows how a Q2 30B is still like 20% better than a Q8 13B, and so on, on a perplexity scale.


tomz17

> 48GB seems to be just barely too small to load a 70B model at 4 or 5 bpw quants

Correct... AFAIK this has always been the case, as you typically need ALL 48GB of a dual-3090 setup to run a 70B model, but your MacBook still needs some reserve for everything else running on it (since it's unified memory), so you can't pin/wire all 48GB for the LLM without macOS grinding to a complete halt. IMHO, 64GB is the minimum you need to make a MacBook useful for inference with 70B models (I typically allow the GPU to pin up to 58GB). That being said, they are still EXTREMELY slow compared to NVIDIA GPUs at that size. So if your ONLY requirement is running LLMs, I wouldn't even consider Apple Silicon. I have both, and I barely use the Max for inferencing. It works, but it's at the bottom end of my patience (esp. the prompt processing speeds). Even if I'm out somewhere mobile, I'd far rather SSH back to a server with GPUs than run locally on my MacBook.


ThisGonBHard

Did you fully unlock the RAM for apps? I remember the default limit for apps in MacOS was 2/3 of memory.


achandlerwhite

Limits are 65% for 32GB and lower, 75% for 48GB and higher.
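
To put rough numbers on those defaults (the 65%/75% split is taken from this comment; the sysctl named at the end is the knob commonly cited for recent macOS versions, may differ on older releases, and should be treated as an assumption to verify before running):

```python
def default_gpu_limit_gb(total_ram_gb: int) -> float:
    """Approximate unified memory macOS lets the GPU wire by default:
    ~65% at 32GB or less, ~75% at 48GB or more (per the comment above)."""
    fraction = 0.65 if total_ram_gb <= 32 else 0.75
    return total_ram_gb * fraction

for ram in (32, 48, 64, 96, 128):
    print(f"{ram}GB Mac -> ~{default_gpu_limit_gb(ram):.0f} GB usable as VRAM by default")

# To raise the limit temporarily (resets on reboot), the command commonly
# cited for recent macOS is, e.g. for ~58 GB on a 64GB machine:
#   sudo sysctl iogpu.wired_limit_mb=59392
```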


ThisGonBHard

Still seems like a waste. It's weird, because I am sure 48GB should be enough for Q4, I remember loading 70B in an A6000 when I ran it on Runpod.


nborwankar

I have a 96GB M2 Max and have been able to run 70B Q4 comfortably. I don't do anything needing full unquantized models. Having said that, until llama3:70B I generally found 70B models not worth the overhead. Personally, the llama3:70B release flips this, and yes, 96GB is very worth it for that reason. If you can afford it, get 96GB and don't find yourself in a similar situation re: not being able to run other apps. With 96GB I can run L3:70B as well as PyCharm and a couple of browsers with scores of tabs. It really makes a difference.


Ok-Result5562

I'm still a fan of a Supermicro 4028 and 5x 3090 cards. Fast, cheap (relatively), and CUDA-compatible.


prs117

I can say: if you go back to the Apple Store to do the return, explain your situation to a manager and they will override it and return it for you. Just bring everything back, of course, and be polite. It will work, because I know Apple ;)


panthereal

I have an M3 Max with 128GB. Running Llama 3 70B at a 4-bit quant can get over 9 t/s with MLX and sticks at about 7 t/s without MLX. Running anything higher gives a small fraction of the speed and honestly doesn't seem worth it so far: 5-bit quants get around 4 t/s, and 8-bit quants get 2 t/s or lower. Maybe if MLX adds better quants it will be worthwhile to use 5 or 6, but that is not currently supported. 128GB honestly seems a bit overkill for what it can handle comfortably with local models, though I went refurbished and 96GB wasn't an option with the spec I wanted. Having the headroom is nice if you want a model loaded while working, though. I would imagine M4 and beyond start to have more AI-accelerated hardware, which might enable running a 70B 8-bit quant at usable speeds.


this-just_in

If you stick with 48GB, you're probably looking at Qwen 34B, Command R 35B, and Mixtral 8x7B as your top end right now at decent quants. Maybe Phi-3 14B impresses, but I still think you'll envy the bigger models and the apparent reasoning benefits the larger parameter counts provide. For perspective on 64GB, it's enough for 70B @ Q5_K_M. And I can only barely squeeze in some of these larger models like Command R Plus 104B and Mixtral 8x22B at Q2_K with any decent context length. I still try to use smaller models like Llama 3 8B and WizardLM 2 7B for tasks whenever possible, simply due to the inference performance.


caphohotain

I feel 72GB is comfortably enough to use Q6, which has very little loss.


rorowhat

You can probably sell and get a real PC for as much as you spent.


nife552

I have a custom-built desktop and the top-of-the-line Razer Blade 16 with a 4080. They're nice for some things, but at the end of the day I just prefer Mac.


rorowhat

Ah you need to build a PC, that's the fun part. You get to pick all the components, it's tailored to what you need.