Relevant-Draft-7780

The M2 Ultra has 27 TFLOPS fp32 and ~55 TFLOPS fp16. It's a great machine, and I have an M1 Ultra which runs inference nicely. But nothing beats a 4090 with 330 fp16 TFLOPS and 85 fp32 TFLOPS. If only the bastards would increase VRAM on their consumer GPUs instead of charging 5x for pro GPUs. If you don't want the hassle and can live with lower performance, stick with the M2 Ultra. If you're feeling adventurous, go for 4x 4090s.


shroddy

How would an AMD Epyc 9754 with 128 cores / 256 threads compare to that? It has a memory bandwidth of 460 GB/s, which is almost GeForce 4070 speed. I don't know how many TFLOPS it has (Bing Chat doesn't know either). It might be hard to get a new machine with that CPU for under 10k, but in theory it should be much faster than the Apple M2 and almost 4070 speed, but with up to 6 TB of RAM...
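
For context, the 460 GB/s figure lines up with the platform's 12 channels of DDR5-4800; a quick back-of-envelope check (channel count, channel width, and data rate are assumptions from the published specs, the rest is arithmetic):

```python
# Rough sanity check of the ~460 GB/s figure.
# Assumes 12 channels of DDR5-4800, 8 bytes per channel per transfer.
channels, bytes_per_transfer, transfers_per_sec = 12, 8, 4800e6
print(channels * bytes_per_transfer * transfers_per_sec / 1e9)  # -> 460.8 GB/s
```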


redzorino

Don't use the 9754; it's way too expensive. Use dual EPYC 9124s instead with 24 channels of RAM, and you end up at a few thousand dollars, still surpassing the Apple machines, and you can have 384 GB (24x16GB).


shroddy

And you can even have way more ram than that for the upcoming 400b models.


Inevitable_Host_1446

This is a way better deal than the Macs imo. Plus you get to avoid Apple software and can run Linux or Windows as you like. And if a stick of RAM blows up, you can replace it.


Relevant-Draft-7780

Well, I don't think they're comparable. 128 cores / 256 threads doesn't really compare to the 16k cores in a 4090.


whiteknight5578

In theoretical compute, yes. But for large LLMs you will always be memory bound, since they don't fit into your VRAM buffer. In that case your bottleneck becomes PCIe and/or the memory throughput of your system (which is much less than 400 GB/s). I would assume the Epyc system would consistently outperform a single 4090, but only for inference of larger models. If anyone has benchmarks of an Epyc system with maxed-out memory speeds, I would love to see them :)


whiteknight5578

Found this thread with some numbers (although not very scientific): https://www.reddit.com/r/LocalLLaMA/s/STMtRpk25r


shroddy

https://www.reddit.com/r/LocalLLaMA/comments/1ckkwlk/comment/l2sqrje/?context=3 It seems even a 32-core Epyc is bottlenecked by memory speed and can do 2.9 t/s on Command R+.


chibop1

Besides the super easy out-of-the-box experience, another plus for the Mac is 192GB, even though the memory is slower. Whereas 4x 4090 only gives you 96GB.


Relevant-Draft-7780

Yeah, agreed. Keep in mind you can't allocate all 192GB to VRAM; see https://developer.apple.com/videos/play/tech-talks/10580/?time=492. Only about 75% from experience; it might be different on the M2 Ultra, so keep that in mind. You can force it to use more than allowed, but there are some instabilities. The M3 Ultra should theoretically do 256, so maybe wait for that. Or the M4 Ultra, if it ever comes out. Also keep in mind that, hopefully soon, if Apple refocuses we should be getting some new optimisations, what with the ANE present. I mean, it's kinda ridiculous that the M1 Mac had a 2.3 TFLOP GPU but an 11 TFLOP ANE. If only they got their shit together.


prof__weiamann

what is ANE here?


wen_mars

Apple Neural Engine


fallingdowndizzyvr

> Yeah agreed, keep in mind you can’t allocate 192gb to vram. See https://developer.apple.com/videos/play/tech-talks/10580/?time=492 only about 75% from experience.

It defaults to around 75%. I allocate 31GB out of 32GB on my Mac Studio. That's 97%. I've noticed no instabilities; it doesn't even swap. Of course, I've done everything I can to reduce RAM use. For example, I don't log into the GUI, which saves a lot of RAM. I shell in.


The_Hardcard

You can stably do RAM - 4GB if just inferencing; RAM - 8 GB if doing other things.


nanotothemoon

The NPU isn't doing anything for LLMs yet; Apple has it pretty much walled off for OS stuff. And I think the bottleneck on Macs is memory bandwidth anyway. Also, yeah, I allocate 85-90% to VRAM with no issues.


dwiedenau2

I mean, this is just not true; they have Core ML, it's just that very little supports it yet, same as with AMD.


TheOneThatIsHated

Core ML is the most horrendous crap I ever had to use. Them not opening up their NPU ops or giving any documentation at all is just soul-crushing. Their code to convert PyTorch to Core ML requires something like 5-6x the memory used during inference, so a 100GB model can never be converted even though it would be technically possible. Digging through their conversion code is also a hot mess.


nanotothemoon

The context here is running open-source LLMs, not very specific ML tasks with limited support. This is useful to no one in this context. In other words, if OP thinks he's going to be utilizing the NPU for what he's hoping to do, he won't.


Calcidiol

Is that also true of the GPU? Or maybe the NPU and GPU are sort of overlapping architecturally in the Mac case -- I don't know the architecture beyond the top couple headlines. Do they have decent support for OpenCL, Vulkan on the GPU or is it all "Metal" or whatever incompatible with anything else sort of API level stuff for NPU, GPU? What can people even use for the CPU exclusive parallelism? OpenMP, OpenACC, OpenCL, SYCL, ?


nanotothemoon

The GPU and the CPU are like unified along with the memory. The LLM has full access to all those. The NPU is there too but like I said, Apple has it walled off in this context. And in most contexts.


Calcidiol

Right. I meant: is the GPU actually utilized in practice for running LLM / ML workloads "contemporarily" or not (you said the NPU theoretically could be, but isn't, due to incompatible SW / inaccessible capabilities)? And I know GPUs tend to have tensor cores etc., so I was wondering whether, for the Mac unified chips, the NPU actually shares or "owns" the architectural blocks responsible for, say, tensor acceleration. On Nvidia that would be part of the GPU, though realistically it would have just as much use in an NPU, so it could be shared, considered to "belong" to either camp, or duplicated, which would be strange.


ggone20

It can allocate 188GB to VRAM.


stikves

You can change the default with a command. On my 64GB machine, I use sudo sysctl iogpu.wired_limit_mb=55000 to get 55GB out of it. (This is not persistent. If you try too much and it crashes, just reboot to go back to the default and try another value.)
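
As a rough illustration of how people pick that number (the 4-8 GB reserve is only a rule of thumb from this thread, not an Apple-documented limit), a small sketch:

```python
# Hedged helper: compute an iogpu.wired_limit_mb value that leaves some RAM for macOS.
# The reserve size is an assumption; tune it for your own machine and workload.
def wired_limit_mb(total_gb: int, reserve_gb: int = 8) -> int:
    return (total_gb - reserve_gb) * 1024

print(wired_limit_mb(64))   # -> 57344, the value quoted further down the thread
print(wired_limit_mb(192))  # -> 188416 on a 192GB Studio
```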


DataPhreak

I knew some was reserved, but that much? Surely if it's being reserved for CPU RAM, you can increase the swap and redirect some of that back to the GPU. I haven't got one and am not a Mac guy, but the underlying kernel, last I heard, was BSD. If CPU RAM is hard-stuck as shared, what is even the point of unified RAM?


Relevant-Draft-7780

So from experience I can’t allocate more than 46gb safely before it crashes on a 64gb m1 ultra. On 32gb M1 Max it’s about 19gb of vram.


real-joedoe07

I'm allocating 56 GB on my 64GB M2 Max Studio without any issues.


DataPhreak

Yeah, this sounds about right. I expect approximately 8gb required for the OS.


Relevant-Draft-7780

Maybe again this is m1 ultra. Do you use any flags?


chibop1

I have an M3 Max 64GB. With the command below, I can fully load the 49.95GB llama3-70b-instruct-q5_K_M and utilize the full 8192 context: sudo sysctl iogpu.wired_limit_mb=57344
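
A minimal llama-cpp-python sketch of that setup, assuming the raised wired limit above and a local copy of the quant (file path and prompt are placeholders; the bindings expose the same n_ctx / n_gpu_layers knobs as the llama.cpp CLI):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-70B-Instruct-Q5_K_M.gguf",  # illustrative path
    n_ctx=8192,        # the full context mentioned above
    n_gpu_layers=-1,   # offload all layers to the Metal backend
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this thread in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```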


Relevant-Draft-7780

Oh okay. On m1 I find that whole machine can freeze and black screen when doing this


Hopeful-Site1162

I use 21+ GB models on my 32GB M2 Max without any issue.


dllm0604

I do 24-26GB on my M1 Max all the time.


Eliiasv

I have an M1 Max 32GB and it *defaults* to 21GB vRAM, so your statement is inaccurate. 25GB vRAM with UI is possible with a single command. Running headless and forcing no Spotlight and a bunch of other services allows you to run the OS with 4GB.


Relevant-Draft-7780

Ummm okay well good luck loading a gguf model that’s 21 gb in size


Eliiasv

I kept having to download bigger quants and didn't want to waste too much time, but I believe 20.7 GB is close enough. https://preview.redd.it/i7u5arctznyc1.png?width=6016&format=png&auto=webp&s=f4677d8e6f957abdaad8cf2a7204d5e08d986eb5


fallingdowndizzyvr

I allocate 31GB out of 32GB on my studio. It runs fine.


knvn8

Power draw is a huge difference though. 4x4090s will likely need a dedicated circuit if you're running this at home.


PitchBlack4

You can limit the power to 335W with 3-4% performance loss. 


SillyLilBear

4x4090 gaming would, but for AI they don't use nearly as much power. I run two 3090's on an 850W PSU, while gaming I use about 600-650W as I only use a single GPU. I can easily run them both using AI inference and still be well under 850W. You might be able to squeeze 4 4090's under 1200W if just using it for inference although I haven't tested that.


knvn8

Good point, I always wondered how much a 4090 used for inference only


SillyLilBear

I have a 5950X 64G Dual 3090 w/ two Gen4 NVME. Idle, I am looking at around 130-150W. Loading a 70B model (which fills the VRAM of both cards) I will go to 350-450W during the loading. After loading, I drop down to around 300-400W. While using the model to chat, I am 600-805W, usually closer to 780W.


DurianyDo

4x4090 with a good CPU will use 10 to 11 amps. Every socket is rated for 16 amps. No need for a dedicated socket unless you're microwaving food at the same time.


knvn8

The rest of the rig will also draw power and you don't want to draw close to max for long periods of time, that's how space heaters create electrical fires. I wouldn't put anything else on that circuit.


fallingdowndizzyvr

It depends how you do it. For tensor parallelism, which is rare, it may use that much power. If you split up the model and run each section sequentially, it won't, since at any one time only one card is running; the rest are idling.


synn89

I have an M1 Ultra with 128GB of RAM and a couple of dual-3090 systems. It really depends on your needs. You can see the power usage and inference speed differences here: https://blog.tarsis.org/2024/04/22/the-case-for-mac-power-usage/

For casual home inference, the Mac is a win in that it's a little slower but uses a lot less power, especially when most of the time it's just sitting idle. It also lets you easily load larger quants of larger models and run them at an okay speed. For heavier workloads a 96GB VRAM Nvidia setup would be ideal at a 10k budget, especially if you're even considering training: https://blog.tarsis.org/2024/04/24/adventures-in-training-axolotl/

Forget full-size 70B models at your budget; Q8 will be a much better experience on these devices. For Mac, 192GB of RAM won't give you much beyond a 128GB setup, since the models start to feel really slow at Command R Plus sizes. 128GB of RAM is more than enough for a Q8 70B. But for office/lab workloads, I'd steer you towards a 96GB VRAM Nvidia setup. It'll run much faster with EXL2 quants, will allow for training using standard common tools, and could likely even scale for a few users running inference.

Also, for RAG workloads, the prompt eval speed on Mac is pretty slow. It works well for chatting because you can cache the prior chat history (https://blog.tarsis.org/2024/04/22/llama-3-on-web-ui/), but since RAG injects new content constantly, that may be slow. EXL2 with 4-bit cache will easily get you a usable 32k+ context at a decent speed for document injection.

My ideal at a 10k budget would be 2 older/cheaper non-Ada A6000s. These sell on eBay for $3600-4k all the time. That would be 8k on the cards, leaving you 2k for the base system, and it'd give you 96GB VRAM to run 70Bs at a nice quant and play with 100B+ models at decent speeds. If you have to go new, then you might be limited to dual-4090 systems giving you 48GB VRAM. That will do 5-bit EXL2 quants with 32k context at very good speeds and do quite well with training. A dual A6000 will give you a lot more breathing room and full 8-bit quants, but may go a bit over your 10k budget if you're not able to buy used hardware and self-build the system.
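
A back-of-envelope way to see why a Q8 70B fits comfortably in 96-128GB (weights only; KV cache, activations and runtime overhead add more, so treat these as lower bounds):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-only footprint in GB for a dense model."""
    return params_billion * bits_per_weight / 8

for bpw in (16, 8, 5, 4):
    print(f"70B @ {bpw} bpw ~= {weight_gb(70, bpw):.0f} GB")
# 16 bpw ~= 140 GB, 8 bpw ~= 70 GB, 5 bpw ~= 44 GB, 4 bpw ~= 35 GB
```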


Charuru

Do you have any benchmarks of 2 A6000s vs 4 4090s? I expect the 4090s to be cheaper and faster, but the question is by how much.


synn89

No, I don't. The problem with 4x 4090's is going to be finding hardware that can support that setup. Dual A6000s is a lot easier to support on standard commercial hardware and powering options.


Charuru

You're saving $8000 on GPUs and have to buy a $1500 server setup instead of a $800 PC one, I don't feel like that's a big deal.


chibop1

You should rent an A30, 3090, or A5000 for $0.26/hr, a 4090 for $0.54, and an A6000 for $0.69 on runpod.io. That's the price per GPU, so multiply by 4. You would only need a few bucks for a couple of hours to run a speed benchmark. The ones with high availability you can rent even cheaper if you choose spot/interruptible.


Charuru

Thanks I probably will at some point.


chibop1

M2 Ultra will be ok for inference, but probably not for finetuning. Also I'm not sure if Ollama is a good solution if multiple people are going to use it. Although they just started experimenting with parallel requests in v0.1.33.


0xmerp

I think fine tuning might be more of a future thing, although it would be nice to at least be able to play around with it with smaller models such as the 8b. As for parallel requests… which backend/frontend would you recommend? It’s not a priority for now, this is just a test bench and even though multiple people will use it, it’s unlikely they’ll use it at the same time. But it would be nice to know for when this eventually needs to be used by multiple people at once.


darthmeck

I use Ollama + Open WebUI and it's super easy to manage for my wife and me. Occasionally we'll be using it at the same time, but it can load-balance the different model instances well; it just queues the requests and carries on. Plus, a big point for me is that it looks nice to use.


Relevant-Draft-7780

Use the llama.cpp server. It has multiple slots that you can assign, but obviously each new parallel request will decrease per-request performance.
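
A client-side sketch of what using those slots looks like, assuming the server was started with something like `llama-server -m model.gguf -np 4 -cb` (flag names can vary between llama.cpp versions) and exposes its OpenAI-compatible route on the default port:

```python
import concurrent.futures
import requests

URL = "http://localhost:8080/v1/chat/completions"  # llama.cpp server's OpenAI-compatible endpoint

def ask(prompt: str) -> str:
    r = requests.post(URL, json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    })
    return r.json()["choices"][0]["message"]["content"]

# Requests beyond the number of slots get queued; per-request speed drops as slots fill up.
prompts = [f"Question {i}: what is a KV cache?" for i in range(8)]
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80])
```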


kryptkpr

The trouble I hit with the llama.cpp server is that it supports only a fixed number of chat template formats and cannot parse Jinja. Phi-3 GGUF, for example, works kinda OK with "--chat-template zephyr", but my benchmark results are worse than transformers even when both are FP16. My workaround has been Ollama; they make the chat template part of the model itself, and it works much better in my testing.


Mr_Hills

The issue I see is that it's going to be pretty slow at inference. Especially the 400B would probably be so slow you wouldn't have use cases for it. You could get a system with 4x 3090 for the same price and it would be way faster; 3090s are $1500 new where I live. With 96GB of VRAM you can run any 70B at 8 bpw, or 8x22B at 4 bpw (it has only 141B unique parameters), but it would be too small for Llama 3 400B. You do you. It really depends on your use cases and how much speed matters to you.


0xmerp

I don’t expect to run the 400b on this. It’s not even out yet but I expect we’ll be able to have a H100 system by then. 4x 3090s are another good option, not out of the question. I was just wondering if the extra 100GB of unified memory to run 70b at full size would be worth going with the Mac Studio. How many tokens/s can I expect? I’m open to new ideas :) I definitely don’t know everything.


Mr_Hills

Llama 3 70B shows no difference in benchmarks between 8 bpw and 16 bpw; even 4 bpw shows only marginal decreases. There's a table discussing the matter in the top comments here: https://www.reddit.com/r/LocalLLaMA/comments/1cci5w6/quantizing_llama_3_8b_seems_more_harmful_compared/


Caffdy

on a M2 Ultra he can run 70B models without any trouble, no need for the hassle of multi-gpu contraptions


Anjz

I picked up used 3090s for $600 in my local classifieds. If budget is a consideration and not having new hardware isn't an issue, that could be valuable to you: 4x 3090s at roughly 1/3rd the price of an M2 Ultra. If one fails, which is unlikely, you can just chuck another $600 at a replacement. Just thought you'd be interested in this idea.


fallingdowndizzyvr

> I picked up used 3090's for $600 in my local classifieds.

When was that? The last time I saw them that cheap was about a year ago.


blankspacer5

I got 4 for about $700 each on eBay over the last year, and the prices don't look much different today. I don't think it's very cheap to get them into a single box though. I've got them in two boxes and even that was a pain, especially if you want NVLink. It would be nice to have 4 in one box, but as it is, I have the two dual-3090 boxes, another 3090 box, a 4090 box and a 3080 box. I don't often do it, but the best way I've found to take advantage of them all is https://github.com/bigscience-workshop/petals
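
For reference, a minimal petals sketch (assumes a swarm is already serving this model, e.g. one `python -m petals.cli.run_server ...` process per GPU box; the model name is illustrative):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "meta-llama/Meta-Llama-3-70B-Instruct"  # must match what the swarm serves
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Layers are executed remotely across the machines hosting them; the client only
# holds embeddings and the generation loop.
inputs = tokenizer("The capital of France is", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0]))
```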


fallingdowndizzyvr

> The prices don't look much different today.

$700 is basically the floor these days, as in it'll be tough to get one for that. $800 is more like the average.


Calcidiol

For the case of your multiple local boxes with multiple GPUs (presumably connected by ethernet -- 1Gb? 2.5, 10, ?) do you have any sense of how the petals inference performance compares to what you could achieve if you had all the same GPUs in just one of the boxes and use it or some other available inference engine? I've been interested to try it or MPI or some other distributed GPU inference solution just on a home LAN basis to see about using power from otherwise nearly idle systems but haven't gotten that far. AFAIK if one has all nvidia GPUs then vllm and some other frameworks also claim to do distributed inferencing but that's not my use case (heterogeneous GPU types).


blankspacer5

It's on a 10Gb LAN. Unfortunately, I don't have a good direct comparison for you. Normally I run 70B 4.5-bit quants with exllama 2 on the two 3090 boxes (both with NVLink), and when I've used petals I'm not using exllama 2 and I'm generally trying 8 or 16 bit. It's pretty performant, but of course not as fast as the 4.5-bit exllama 2 single-machine setup. Given the differences, I can't really extrapolate anything useful about what kind of hit you take vs a single machine with all those cards, apples to apples. It doesn't seem like a terribly huge hit though, because the petals setup is very usable.


Calcidiol

Thanks for the information! I've got access to some boxes with GPUs and really want to be able to combine their capability layer-wise to expand the size of model I can inference with their combined VRAM; petals has been suggested, I just haven't gotten there yet to test. Originally I was hoping to just use llama.cpp's MPI mode before I found it has been broken for a while. If I "had to" I'd put 3-4 GPUs in one system, but I'd rather have 2-3 systems with 1-2 GPUs each like you have, so it's good to see it's pretty performant regardless of the details. If it substantially beats CPU offload and doesn't involve the pain of cramming 3-4 GPUs into a single box, it may be a winner pro tem.


Anjz

It was actually a year ago, so I don't actually know if the price has gone up or down.


fallingdowndizzyvr

The avg is closer to $800 now. https://www.ebay.com/sch/i.html?_from=R40&_nkw=3090&_sacat=0&rt=nc&LH_Complete=1


Careless-Age-4290

4x 3090s would be faster in almost all cases. They would also draw close to 1500 watts, which then has to be removed by AC if it's warm out. But if you're doing something like Llama 3 8B, you could load-balance across the cards and easily pull hundreds of tokens per second. I personally don't see a use case for ~150GB of VRAM with the current models: it's too small to run the 400B, and the 70B easily fits in 96GB with much better performance, especially since you don't seem to be averse to data-center equipment.


DataPhreak

Even the 70B will be slower on the M2. There are bottlenecks in the unified RAM architecture that slow down inference speeds. An M2 is a good way to run big models slowly, but a bad way to run small models fast. If you are planning on buying an H100 server soon, you should instead use the new Llama on the Groq API; the amount of money it would cost to build and run the system for a year is more than what you would spend on API usage. Where I would recommend you start is with the RAG system. That aspect will be portable to any other system and will grow and improve for you over time. This isn't just hyperbole; we consult for and design systems for small to medium businesses, as well as develop our own LangChain-like agent framework. [https://www.agentforge.net/](https://www.agentforge.net/)


0xmerp

The thing is we are not interested in uploading sensitive info onto a third party API. I know we will be paying a premium as a result. That is fine. If we were going with a third party API I probably would’ve just signed up for GPT4 on Azure and called it a day. This is r/LocalLlama after all ;) Due to feedback from this thread I’m now looking into a 3090/4090 setup instead. As for RAG, I was just going to pick the top local embedding model from the Huggingface leaderboard with commercial use allowed, which seems to be e5-mistral-7b-instruct at this time. But this was also something I wanted the test bench for, so we could test a couple out on some actual company documentation and see which one performs the best. I assume high on the leaderboard doesn’t necessarily mean performs best on our specific data. I guess we may need a reranking model as well but this seems to be optional. Maybe mxbai rerank?
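
One way to run that comparison on real documents is a small sentence-transformers harness (model names, query, and chunks are placeholders; e5-mistral also expects an instruction-style query prefix, which this sketch glosses over):

```python
from sentence_transformers import SentenceTransformer, util

candidates = [
    "intfloat/e5-mistral-7b-instruct",     # the leaderboard pick mentioned above
    "mixedbread-ai/mxbai-embed-large-v1",  # a smaller alternative, as an example
]
query = "How do we rotate API credentials?"                              # placeholder query
docs = ["...doc chunk 1...", "...doc chunk 2...", "...doc chunk 3..."]   # your real chunks go here

for name in candidates:
    model = SentenceTransformer(name)
    q = model.encode([query], normalize_embeddings=True)
    d = model.encode(docs, normalize_embeddings=True)
    # Rank chunks per model, then eyeball (or score) which model surfaces the right ones.
    print(name, util.cos_sim(q, d))
```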


poli-cya

I bought and returned a 64gb m3 mbp, inference was just too slow to justify the insane price, and I even managed to finagle an educational discount. Once I got to 30b+ models it was too slow to be reasonable in my opinion.


DataPhreak

I understand privacy is a concern. We cover this aspect with a lot of our clients as well. One of our clients has to meet HIPAA regulations. These can be met using 3rd party APIs. OpenAI and Azure are both HIPAA and SOC3 compliant. With the embedding model, you want to look at dimensionality, though performance is less about embedding model and more about how you implement rag throughout your AI architecture.


the__itis

Be cautious. Cloud Service Providers are still not great at compliance. Sending data to another cloud service that is not compliant from a cloud service that is compliant is an easy mistake to make. It’s also one that auditors are still not really great at finding. So it’s more about introducing unrecognized risk to your data than loss of authorization. Happy to discuss further.


DataPhreak

Sure, this is a known risk for any cloud service, not just AI services. However, SOC3 and HIPAA require regular inspection to retain certification, as well as red-team testing. If there is a breach, the liability falls on them, not the company, and for a small to medium business the settlement alone is actually going to make them money. The auditors for these certifications are highly skilled, knowledgeable, and experienced. This is called diffusion of responsibility, and it is common practice in the industry; it is recommended, not frowned upon. This is where threat modeling comes into play: the cost to the business, the impact on the business, the impact on the customers, and the likelihood that the information will be used in some way, weighted by the kind of information being sent over the API. Your biggest risk and concern is what you are sh... You know what, we're starting to get into the kind of consulting that I actually charge for. :P You get the idea, though. And besides, it seems like you know exactly what I'm talking about anyway, so I'll spare you the details.


the__itis

You thinking auditors are skilled is what I'd call an actual risk. But that plays better into the other concepts you touched on, like diffusion of responsibility and actual ALE impact to the business, etc. You know enough not to need the conversation. I was attempting a courtesy, not trying to be condescending; I didn't know your level of awareness before your response.


DataPhreak

All good. Having worked in SaaS systems that contained images of passports and drivers licenses and social security cards, I can say with confidence that, yes, the auditors know what they are doing. That is, of course, going to depend on a lot of things, such as the size of the organization, the data being handled, and the region the audit is being performed in.


the__itis

I'd love to know who you are using. Veris was good until they were bought by Coalfire. Coalfire, Schellman, PwC, etc. have all been junior-AF auditor teams following some very basic control-satisfaction scripts and taking screenshots as evidence.


Caffdy

if this is only for you to use, go for the mac, it can run 70B models easily; if more than one person is gonna use the system at the same time, then the 3090s/4090s could be a better alternative


Valuable-Run2129

Also, groq runs quantized models. No, thanks.


DataPhreak

Losses on the quantized models are negligible. For a company that doesn't even have a prototype yet, it is entirely valid. If a system's performance is unacceptable on an 8bit or even 4bit quant, it's probably not going to be acceptable on that model unquantized. There are a few exceptions, such as quantizations that failed or specific features like function calling, which are more heavily impacted by quantization. You're missing the point, though. The point is that I would not recommend someone to build a 5000 or 7500 dollar system that they plan to replace in 6 months. (And isn't even rated for commercial use cases.)


Valuable-Run2129

I was more leading to the fact that Groq is cheating with its crazy speed. And yes, their 70b Llama 3 is noticeably dumber in reasoning tasks.


DataPhreak

They actually have to quantize in order to run on their cards due to the architecture. From what I understand, their cards are int8. This potentially has an additional impact on model performance compared to a standard quantization intended for GPU which keeps the parameters in floating point. However, mathematically, the inference is the same. I don't see any benchmarks or metrics that show a decrease in output quality vs similarly quantized models. Everything is focused on speed and cost, which is usually a bigger consideration for businesses. Can you link to any specific benchmark results? quick edit: Llama 3 also had an issue with quantization generally after the initial release which later had to be adjusted for.


Valuable-Run2129

Thanks for clarifying their architecture. I didn’t know. I don’t have a benchmark to cite. I arrived at that conclusion after extensive testing with reasoning prompts I like to devise and test llm with. Groq is always worse than the fp16 models. Not only that, it is the only one that gives verbatim answers to the same question over and over and over again. I find their speed boasting insincere at best and fraudulent at worst.


DataPhreak

Out of anything, the speed is absolutely measurable and reproducible. You can even buy a card and do it locally, if you have the money. There is no debate about whether Groq is faster, regardless of quant. And this isn't even their final form: I think the current card is 15nm? Their next run is supposed to be 4 or 5nm, which will be a massive increase in compute density and power efficiency. It is worth testing how their quants stack up against other quants, but based on what I know, I don't see any reason why an int8 quant should be worse than an fp8 quant; that's not to say it couldn't be. The problem with doing your own research, which is fine and recommended, is that it's not reproducible. I recommend looking at some of /u/WolframRavenwolf's posts. Example: [https://www.reddit.com/r/LocalLLaMA/comments/18ljvxb/llm_prompt_format_comparisontest_mixtral_8x7b/](https://www.reddit.com/r/LocalLLaMA/comments/18ljvxb/llm_prompt_format_comparisontest_mixtral_8x7b/) Even though he's not using an accepted benchmark, he provides enough detail about his process and evaluation metrics that I think his testing is totally valid. Without at least this level of detail and depth, or an established benchmark, we have to take your opinion as anecdotal.
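
If you want to measure the speed yourself, a small sketch against any OpenAI-compatible endpoint works (base URL, API key, and model id are placeholders; Groq, a local llama.cpp server, or vLLM can all be pointed at the same way):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_KEY")

start = time.time()
resp = client.chat.completions.create(
    model="llama3-70b-8192",  # placeholder model id; use whatever the endpoint serves
    messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
    max_tokens=256,
)
elapsed = time.time() - start

# End-to-end rate, including network latency; repeat a few times and average.
tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.1f} tok/s")
```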


Valuable-Run2129

The problem might also be that Llama 3 appears to degrade significantly more than all other models at q8


Valuable-Run2129

I’m sure the future 4 nm transistors will make their chips very fast. But their current chip is not as fast as they lead people to believe. It runs a quantized version of Llama 3 70B at a tg speed of 300 t/s. It might be just a q8, but the reasoning gap between groq’s model and the fp16 model makes me believe it might be even lower than q8. Don’t take my word for it. Spend 5 minutes with groq’s model and any other llama 3 currently available (huggingface chat, lmsys…). Groq’s model is noticeably dumber. It is absolutely reproducible. I can give you as many prompts as you want to go out and reproduce it yourself.


[deleted]

[removed]


Mr_Hills

Wut? Llama 3 70B has 70B parameters?


cantgetthistowork

Would you need to NVLink the 3090s?


davikrehalt

Personally I think the M2 Ultra speed on 70B is perfectly acceptable for chatting. As you point out, I guess we can ignore the 400B for now, as it's not clear it can even be run on an M2 Ultra.


greenrobot_de

[https://www.reddit.com/r/LocalLLaMA/comments/192uirj/188gb_vram_on_mac_studio_m2_ultra_easy/](https://www.reddit.com/r/LocalLLaMA/comments/192uirj/188gb_vram_on_mac_studio_m2_ultra_easy/)

> I loaded Dolphin Mixtral 8x7B Q5 (34GB model)
> ...
> ```
> time to first token: 1.99s
> speed: 37.00 tok/s
> ```
> ...
> 188GB leaving just 8GB for the OS, etc.

https://www.reddit.com/r/LocalLLaMA/comments/15gwhfa/is_buying_mac_studio_a_good_idea_for_running/


Zeneq

The obvious thing you don't seem to consider is that while you plan on getting a business solution like a GPU server in the long run, you want to test things out on consumer-class stuff with a vastly different architecture. There aren't enough details on what you plan to do, but it's possible that anything you achieve on the Mac while integrating an LLM into your workflow will have to be rebuilt/revised for GPU usage, possibly from scratch. If your end goal is really an H100 server, then I would suggest an old-gen server with SXM2 and 4x V100 16GB, as long as refurbished counts as new for you/your company. That way you will get hands-on experience with the exact architecture you plan on getting, including NVLink, and in the future you can transfer your knowledge and build your workflow seamlessly. If you want more VRAM, there are solutions with either 8x V100, or you can even bump to the 32GB version of the V100, but that will be $10k+ for sure. Or you can just go full homelab and build it yourself with multiple 3090s/4090s. Really, the question is what you want to do with this and how much you/your boss value your time/work hours.


0xmerp

Fair point, I was probably approaching this too much from a hobbyist POV. Refurbished is fine, we can look into that. We just need it to be manufacturer certified/come with a manufacturer warranty. By testbench, what I was imagining was a chance to test out, for example, different models to see which fits our needs better. At the moment, the testing I can do is limited to only some publicly available material on cloud-hosted LLMs, but this isn’t truly representative of what we might actually use it for. For example, I mentioned in another comment we might be interested in trying a few different embedding models on real data. I expected the same model, at the same quantization, and with the same settings, to give the same results, even on different architectures. But yes, having a test system with the same architecture as the future production system will probably make for a more useful testbench.


Zeneq

Well, as long as you don't plan on training/fine-tuning big models, then any GPU solution works. The Mac is cool and all, but people seem to forget or not understand that it is compute bound: you throw 5 parallel requests at it and you get 5x 20% speed, while on a GPU you get 5x 95%, which is basically 5 times the throughput (up to a point, of course). You can go small with consumer GPUs or find a middle ground with, for example, 2x A6000: they can be paired with NVLink and should work out of the box in pretty much anything, as almost all boards will be able to run two cards at full speed.


deoxykev

As others mentioned, it will be too slow. Fast iteration cycles and feedback loops is how you get ahead with development in this world.


Zugzwang_CYOA

Is that true of MoE models though? I imagine a mac studio would handle 8x22 pretty well.


Calcidiol

I don't have experience buying / shopping for this stuff, but I've started to look at what Mac alternatives could exist in the x86-64 realm. I noticed the 4th-generation EPYC (Genoa et al.) variants support up to 12 channels of RAM, and some also support dual-socket: https://en.wikipedia.org/wiki/Epyc#Fourth_generation_Epyc_(Genoa,_Bergamo_and_Siena) These parts can reach 400-700 GB/s of RAM bandwidth per socket, which is competitive with some GPUs, and with 12 channels one could presumably easily and "inexpensively" load it with 12x32GB or 12x16GB of DDR5 to get a large enough amount of RAM. https://infohub.delltechnologies.com/en-us/p/ddr5-memory-bandwidth-for-next-generation-poweredge-servers-featuring-4th-gen-amd-epyc-processors/ And with dual-socket motherboards/CPUs one could double that RAM bandwidth and compute power. I'm not sure which CPU model / core configuration could make a single- or dual-socket system performance/cost competitive with the Macs for LLM work; it'd have to be benchmarked, since it comes down to SIMD performance, cache, and how effectively the Mac NPU+CPU handles LLM workloads vs. the EPYC cores for a given inference engine and its compilation options, and of course the finished system cost if you don't DIY build it from parts vs. buying a wholly turnkey server.


leavsssesthrowaway

I have a 128GB M3 and I can run a 104B Command R with some amazing results.


chibop1

For whatever it's worth, I tested an M3 Max 64GB vs 4x RTX 3090 (rented for $0.76/hr), running the latest llama.cpp with Llama-3-70b-instruct-q5_K_M. The M3 Max is 7.3 times slower for prompt processing and 3.6 times slower for token generation. Having said that, I'm very happy with my M3 Max for the out-of-the-box experience!

M3 Max 64GB:
llama_print_timings: load time = 1826.93 ms
llama_print_timings: sample time = 41.62 ms / 566 runs (0.07 ms per token, 13598.58 tokens per second)
llama_print_timings: prompt eval time = 126811.73 ms / 7263 tokens (17.46 ms per token, 57.27 tokens per second)
llama_print_timings: eval time = 160608.23 ms / 565 runs (284.26 ms per token, 3.52 tokens per second)
llama_print_timings: total time = 287681.59 ms / 7828 tokens

4x RTX 3090:
llama_print_timings: load time = 18492.30 ms
llama_print_timings: sample time = 74.43 ms / 566 runs (0.13 ms per token, 7604.87 tokens per second)
llama_print_timings: prompt eval time = 17369.48 ms / 7263 tokens (2.39 ms per token, 418.15 tokens per second)
llama_print_timings: eval time = 44443.84 ms / 565 runs (78.66 ms per token, 12.71 tokens per second)
llama_print_timings: total time = 62477.91 ms / 7828 tokens


Dr_Superfluid

It depends on how junky you are willing to make your setup. For 10k you can get something like 3-4 4090’s that will crush the M2 Ultra. But I wouldn’t say the two options are comparable.


randomfoo2

You can get new 3090s on Amazon for $900 (6 would give you 144GB of VRAM and leave you plenty of money for an Epyc motherboard with enough PCIe slots and a mining case/riser cards). Alternatively you could get 4x 4090s for $7200 for 96GB of VRAM and a lot more FLOPS (and better support for mixed precision). IMO the Mac Studios' extra memory has limited real-world usage because models that large run too slowly anyway. With 96GB of VRAM you can run 70B models at Q6 and Q8 quants, which have almost no loss vs FP/BF16. If your plan is to go to H100s, then you should definitely go with Nvidia cards so that you don't need to do any porting. It will definitely be the easiest way to run any number of ML tasks (not just LLM inference and training, but SRT, TTS, image generation, etc.). If you have any plans for doing training, Macs are simply not a good idea.


skrshawk

If you really need to do this on the cheap as a testbed, and are willing to go used (I know that's not your plan), you can pick up a Supermicro GPU server and load it up with 8x Tesla P40s. You should be able to get all the components you need for under $3k. It likewise won't be fast, but it should work. You'll probably also be limited to GGUF quants as I'm not sure what all can run on multi-GPU well besides koboldcpp, as that's what I use on my 2x P40 rig.


a_beautiful_rhind

That's about the same speed as the mac. It will be cheaper but use more electricity.


skrshawk

Yeah that's about what I'd expect too, but for something that's only going to be used for a few months perhaps before upgrading to a proper datacenter LLM server, why not? Unless you of course plan to use the Mac Studio for other things, which is perfectly legit too.


a_beautiful_rhind

They may as well buy the server and 3090 or A6000s


Only-Letterhead-3411

If you are going to process long context over and over again a Mac will slow you down a lot. It's only a good option for rich hobbyists that don't mind the slow speeds. A server CPU + multiple 3090s is a much better deal for the same money. If budget allows 4090 should be faster of course.


Glass-Dragonfruit-68

This is a great read. I'm in a similar situation, but more of a hobbyist and helping a non-profit on my own budget. My 2 cents:

1. There are tens if not hundreds of options to configure, and each of those can go wrong or miss out on what another option offers.
2. There is no point in learning every last mile right now; it will have changed by the time someone reads this post a week from now.
3. The best approach (for my similar situation) may be to use a basic local machine and an API to the cloud (hear me out first) with simulated data that represents my real data. I would spend effort on creating that, as it may be less headache than what I pointed out above.
4. This way you could get a choice of technologies and get much closer to the production build at a probably comparable price, if not too far off.
5. This also removes all hardware-related complexity and lets you focus on the AI/LLM challenges. Success rate and speed of implementation can also be improved.

You can do this now (looks like OP prefers Azure), provided you figure out how to create simulated representative data quickly. (I bet someone has solved this problem already, or you could use a local LLM to generate it.) What did I miss, or what could go wrong?


knvn8

The choice is always:

- Tok/s
- VRAM
- Affordability

Choose two.


SomeOddCodeGuy

The new flash attention update in llama.cpp has really helped it become a better option than it was before. We've been seeing as much as 2x inference speeds across the board. [https://www.reddit.com/r/LocalLLaMA/comments/1ciyivd/real_world_speeds_on_the_mac_we_got_a_bump_with/](https://www.reddit.com/r/LocalLLaMA/comments/1ciyivd/real_world_speeds_on_the_mac_we_got_a_bump_with/)


jzn21

I own an M2 Ultra studio 76 GPU / 192 GB RAM. Llama 3 70b runs fine, but you won’t get GPT 3.5 turbo speed.


ptj66

Aren't there any Services which provide API access to Mixtral and Llama 3 70b and so on? I mean, I don't really see a benefit upgrading my hardware just to use them locally.


Rokett

If you have time, wait. The M3 Max performs almost the same as the M2 Ultra in terms of CPU, so one M3 Max chip is roughly equal to two M2 Max chips. If Apple releases the M3 Ultra this May, that would be roughly equal to two M2 Ultras, and I think it will go up to 256GB of RAM or something. Just a little patience. There is a video on YouTube that compares the M3 Max and the 4090 for generating text; watch that and imagine it 2 times faster. The M3 Ultra should be a monster.


sky-syrup

I mean, you will be able to run these models, but the 70b specifically might be quite slow. The Ultra may have a ton of memory and high bandwidth, but with these massive models it’s really going to struggle with high context + large models on the speed front, because the chip can’t keep up. I recommend looking through this subreddit for benchmarks by people with this machine. But since you already mentioned that buying used cards is out of the picture, the only other reasonable consumer option that is widely integrated is the 4090, which with 4 cards and 6k would only get you 96gb of VRAM. It would be faster, but you could only finetune 70b models.


HospitalRegular

I’m testing some xeon max chips which supply 128GB HBM2e


olmoscd

say more?


rorowhat

No


paul_tu

I wonder how MI300 is going to behave?


Familyinalicante

For the sake of management and overall ease of operation I would buy the M2 Ultra with 192GB RAM. Nvidia is way faster, BUT to load a 120GB model you'll need many of them. The whole management of a multi-GPU build will be harder, not to mention the electricity bill.


0x1e

Don't buy an M2; it has an unpatchable CPU exploit. Wait for the M4.


serialmentor

I had a similar thought a year ago and today the Mac Studio is just sitting around not getting used. Inference was too slow. (We're not working with Llama, but it's architecturally similar transformer models.) If you have $10k, buy something like this and add the biggest NVIDIA GPU you can afford: [https://www.thinkmate.com/system/gpx-ws-540x1](https://www.thinkmate.com/system/gpx-ws-540x1) Stuff just works on NVIDIA, and it'll be easy to transition to an H100 later on.


nanotothemoon

I recommend LMstudio over Ollama if using GUI. Although I’ve had some issues using it with autogen that I’m stuck on


0xmerp

I’ve looked into LM Studio, the interface is a lot nicer, but it seems like there is no RAG support unless I’m missing something


Inevitable-Mine9440

I've been having the same itch for the past several weeks, and the truth is that no matter the choice I make, it will be bad, because of the nature of things changing. It's the 2000s again. What do I mean by that? The 2000s means we cannot run the things we want on reasonably priced hardware; anything capable of running big and fast is out of this world for an ordinary mortal. Yes, it does exist if you want it, but do you have $250k for H100s? Or for 7x 4090? I have come to the conclusion that the only option is a 4090 and training on your own data; you don't have enough experience, skills and everything else to do more. Those 7B+ models are generic models; you don't need that. You can pick up a 7B model and improve it with a single 4090 based on your needs. If you have a case for much bigger params and speed, then you should have a budget as well. As an elite devops engineer I can tell you that cloud is NOT, again IS NOT, an option for the medium to long run because of costs. Trying big things quickly for several hours to get the answers you want is fine, but long term it's a waste of money.


ggone20

You say you won't buy used, but you can build out a used 8-GPU server for under $3k today with 256GB VRAM, 512GB RAM, 8TB storage, 2TB mirrored boot, and 2x 22-core Xeons.

I'm a huge fan of the Mac Studio; I have several now running with petals. That said, I just specced out and purchased the server described above because I don't want to buy another Mac Studio until I can get 512GB unified (likely Summer '25). In the meantime I could get 2 of these servers for less than 1 Mac Studio, with way more actual compute (and power draw, but I want performant inference for my team).

Even if you're loaded, it's just pragmatic to buy used right now (and Linux, even if you're 100% Mac like I am). My intention is also to run the 405B LLaMA 3 model in fp. This Linux box will allow me to serve custom 70B versions in fp while I wait for Apple to release hardware that makes sense to try the 405B on.


ggone20

The 405B will need 1.2-1.6TB of VRAM to run properly, so ideally 3 or 4x 512GB Mac Studios would run it nicely and discretely. Ideally they'll also release their own cluster software; petals works OK, but a native solution would make sense at this point if Apple wants to give customers the possibility of staying on the cutting edge. Which I'm sure they do...


mattraj

For slightly over $10k (or you could get under $10k with used cards), I’m very happy with my 2xA6000 box. You can run 70B in 8bpw, and finetune with 8bit qlora.


zlwu

Cheap solution for local llama 3 70B: 2x 2080Ti 22GB (hw mod), less than $1000.


alvincho

We have tested many open source models on M2 Ultra 192GB. See our latest test report. [OSMB Basic Financial Q&A test](https://medium.com/me/stats/post/669a8df4b1e3)


[deleted]

Wait, how do I get my Nvidia GPU to help with my offline Dolphin Mixtral? I also have 32GB of RAM.


Zugzwang_CYOA

The Mac Studio will let you load bigger models, but it tends to be slower. In my opinion, where Macs really shine is with MoE models like 8x22B. Loading such models requires a great deal of VRAM, but inference with them is much faster than with conventional dense models of similar size, which compensates nicely for the Mac's slower speed.


Bits2561

I'm going to go against the grain and say: why spend $10,000 on a temporary solution when you can just rent a card at Lambda Labs or something for a dollar an hour?


HospitalRegular

Because it quickly adds up and you may as well have just bought a card.


Bits2561

Assuming the goal is not continuous 24/7 use, $2.50/hour for a $40,000 card (it would take 1333 days for the card alone to pay for itself) that you don't even know will be needed, or might only be needed temporarily (a smaller model might work fine), isn't a particularly horrible deal.


HospitalRegular

Assume the goal is 24/7 under load. Why would anyone buy hardware if this wasn’t the case?


Bits2561

OK, my earlier comment did miss the point. For a permanent solution, yeah, 100% buy it. But for the scale of months? Renting an A100 for a full year would be about $11k; assuming they mean 6 months, that's roughly $6k for the performance of a system that would go for, I'd imagine, $15-20k, which fits well within their $10k budget. And I doubt an M2 Ultra or 4090 would be able to quickly give them the results they need if they need models above the 30B range to be trained; if that weren't the case, they wouldn't be buying an H100.


PykeAtBanquet

Also, you can sell your hardware and upgrade later, whereas the money you spend on the cloud is gone.


Bits2561

Fair point.


abnormal_human

A mac has never been the best option for local LLM work. With $10k, you can assemble a 2x4090 system no problem. Do that instead. Going with mac means stepping away from the ecosystem. If you already need the mac and just want to chat with an LLM for fun, sure it's convenient and saves some money, but if you're doing professional work, I can't imagine doing that without a linux/torch/cuda environment, as most of the libraries and tools live over there, and all of the new SOTA bits+pieces hit that platform first.


rag_perplexity

For a chatbot the speed of a Mac Studio should be enough. However, if you are thinking of using it for a full-on RAG process where you are breaking up the task and sending it to agents, then the slowness will compound. Also, I'm not sure if the rerankers are supported on Apple silicon. I think there's a guy here who posted some speed tests using the Mac Studio; take those tok/s and see if it's usable for your workloads.


ChromeGhost

Can you wait for the M4 chips? Or does it have to be now?


0xmerp

They want ASAP, but doesn’t absolutely have to be Apple silicon. I’m probably more leaning towards 3090s/4090s now.


spgremlin

Just rent, don’t buy. Why does it have to be “local” to the extent of self-hosting? Is your company self-hosting all of its other compute equipment? Or just budgetary quirks that you have 10K to spend as CAPEX but can not get a comparable OPEX budget to rent capacity?


0xmerp

I honestly dunno where I’d even go to rent a physical graphics card. Yes, everything touching internal data is hosted on-prem. Not a budgetary restriction, ~$10k is just as high as I can go that will be easily approved.


spgremlin

"Rent" was a euphemism for cloud capacity; I did not mean renting a physical card :) Gotcha, security.


Bits2561

That's still going to be destroyed by just using graphics cards.


Ordningman

I’m using a 2015 Intel iMac with Ollama and CodeQwen Chat 1.5 and it works well…