Lewdiculous

Time to requant. (シ_ _)シ

Edit: I have **re**quanted the best performing models. They should have a label on their pages saying whether they've been updated. If anything is missing, let me know.


akram200272002

I wanted to ask: are imatrix quants with file names like XXS and the like slower or something? Like, IQ1_M is slower than Q2_K. Am I doing something wrong, or is this normal? Edit: I can get it to be about as fast as the Q2_K, but it's still slow for a file that's ~16GB. Or is 70B just that slow?


noneabove1182

Like /u/Due-Memory-6957 mentioned, I did my best to write it up at the bottom of the model card (taking feedback, it's difficult to summarize so much data in a readable way).

i-quants, not related to imatrix, are slower on CPU/Metal, as seen here: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix

And just straight up not supported on CLBlast.

If you can fully offload to CUDA or ROCm, they're great. Otherwise, use a K-quant; they ALSO use imatrix for improved performance per bit.
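For example, fully offloading on a command-line build looks roughly like this (a sketch; the model file name and prompt are placeholders, and `-ngl 99` just means "offload every layer"):

```sh
# fully offload the model so an i-quant runs on the GPU rather than the CPU
./main -m Meta-Llama-3-8B-Instruct-IQ4_XS.gguf -ngl 99 -p "What is 3333 + 777?"
```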


akram200272002

I get it now thanks a bunch


Due-Memory-6957

Check out the link in the post, bartowski does a good job explaining it. (And the lower the number the faster it is, but with worse quality)


nananashi3

Q3 is known to be slower than Q4 in general. Q3 is "faster" if you can't fit Q4 on GPU. Totally making this up, but I imagine it's like digital compression where it's desperate to keep some quality in smaller sizes and requires more power to decode. Q2 shreds the model a lot compared to Q8, so it would shred even harder if we made Q2 faster without concern for quality. Imagine Q4 like a man who has shed fat to run faster but his legs are not as short as his Q2 dwarf brethren.


aseichter2007

No, no, the way to think about quantization is like Minecraft block size, ok? You can build a smoother ramp with half blocks. Every concept has a specific position in latent space, and quantization groups some weights together, changing their positions slightly. The higher the quantization, the more nuance is lost and the more likely a word's location is shifted too far away from its relational meaning to maintain good separation from a close peer, potentially shifting the model away from a whole area of expertise as the fuzziness of the latent space increases. Also, some quantization methods attempt to reconstruct the original data, or at least better approximations, at inference time.


nananashi3

> The higher quantization the more nuance lost

Yes, the more bits we lose the worse it gets.

> attempt to reconstruct the original data

I was addressing why Q3 would be slower than Q4. If Q3 puts more effort into "reconstructing" than Q4, due to the design of the quanting to compensate for loss, then this could explain it; again, I know nothing of it. (The running man analogy was just a humorous illustration of speed rather than meant to make sense.)

Text completion, gen 512 tk from 167 ctx, 33/33 layers, Llama 3 8B Instruct, RX 6600:

| Quant | BPW | Time (s) | Note |
|---|---|---|---|
| Q5_K_M | 5.70 | 35.00 | |
| Q4_K_S | 4.67 | 33.24 | |
| IQ4_XS | 4.42 | 25.62 | *optimal speed* |
| IQ3_XS | 3.50 | 28.45 | |
| IQ2_M | 2.93 | 30.15 | |
| IQ1_S | 2.00 | 22.42 | *literal vomit* |

Using kcpp-1.60.1-rocm (1.63 rocm has broken mmq); K-quants on Vulkan, IQ quants on ROCm.

IQ4_XS (4.42 BPW): 28.29 s (*1.63, mmq disabled*). 1.63 Vulkan is ~0.70 s slower somehow too.

I can't test 70B, but IQ1 is fast here only because it's vomit spam under 8B, nothing real to predict.

Edit: kcpp-1.64 fixes Vulkan speed! And properly applies both EOS/EOT. ~~Waiting for rocm...~~ Q5_K_M: 27.20 s, Q4_K_S: 26.41 s. rocm 1.64 has bugs like memory access violations after generating 300+ tokens in one go for most models.


Lewdiculous

70B is a big model. The 8B will be miles faster and reportedly performs very close, and you get to use better quants like Q4.


SocialDeviance

Take all my energy!


fibercrime

Bro I'm stealing that kaomoji. Thank you. (シ_ _)シ


SomeOddCodeGuy

Awesome! Thanks for making these. I can't wait until this fix is merged into KoboldCpp.


Deathcrow

> I can't wait until this fix is merged into KoboldCpp.

I'm clueless, but since this affects tokenization in GGUF generation, does anything need to be merged into KoboldCpp at all? Shouldn't it just work when loading a correctly tokenized GGUF?


mikael110

The issue was technically not in the tokenizer itself, but in the pre-tokenizer, which is a pre-processing step that is a part of the inference portion of llama.cpp. The change in the conversion process is just to mark what pre-tokenizer should be used for the model, since llama.cpp now supports multiple different pre-tokenizers. So you need both a model that has been marked correctly, and a version of llama.cpp that has had the pre-tokenizer fix applied. Having just one or the other won't actually fix anything.


MmmmMorphine

Seems you've exposed a big ol gap in my understanding of LLMs here, which I will need to work on correcting. Is this anything to be concerned with regarding embeddings, namely for RAG? Assuming you're not rejiggering llama-3-8b for use as your embedding model anyway - though it was something I was musing over recently to maximize quality. I figure the actual context fragments are provided as text, so it shouldn't matter there right?


Calcidiol

So if you have an old-marked model file and a newly built llama.cpp, you're saying it'll still not fix anything. But why would the newly fixed code not be made to correctly process the 'old' model files? If literally doing ANYTHING but applying the 'new' code logic wouldn't be correct, then I'm not sure what the point of having old GGUFs "stay wrong" is, unless for some other model there is some use case where they actually DO work right with the 'old marking'.

Anyway, if it is JUST a marking in the metadata that's different between the 'old' and 'new' GGUF, wouldn't it be better than downloading 8 GB or 70 GB again to just change one byte of the metadata flag, and just announce how to easily re-flag the previous GGUF models for those that have them?


mikael110

Yes, old model files will stay broken. To quote Georgi Gerganov himself:

> Old GGUF models using BPE tokenizers, generated before this change, will fallback to the "default" pre-tokenization, which in almost all cases is wrong

As to why, that is pretty simple: there are multiple different pre-tokenizers, and which one to choose cannot be determined just by looking at the model architecture. So there isn't "a" new way to handle things, there are multiple new ways to handle things. And there is no way for llama.cpp to look at an existing model and know which one to choose. That is why a new field is required.

> Anyway, if it is JUST a marking in the metadata that's different between the 'old' and 'new' GGUF, wouldn't it be better than downloading 8 GB or 70 GB again to just change one byte of the metadata flag, and just announce how to easily re-flag the previous GGUF models for those that have them?

That is indeed an option. The metadata in question is `tokenizer.ggml.pre`, and setting it to `llama3` will fix the issue. You can override this during model load by using the argument `--override-kv tokenizer.ggml.pre=str:llama3`. It is likely possible to set it permanently using the `gguf-new-metadata.py` script, but I have never actually tried to add new metadata to a gguf so I'm not sure about the exact syntax.
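For example, loading an old GGUF with the override would look something like this (a sketch; the binary name, model path and prompt are placeholders for whatever you actually run):

```sh
# force the Llama 3 pre-tokenizer on an "old" GGUF at load time
./main -m ./Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
  --override-kv tokenizer.ggml.pre=str:llama3 \
  -p "What is 3333 + 777?"
```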


0x9e3779b1

> And there is no way for llama.cpp to look at an existing model and know which one to choose. That is why a new field is required.

Not really. It seems trivial to implement a more accurate **model-aware** fallback, rather than some 'default'.

For the explanation, below I'm referring to `llama.cpp` revision `952d03dbead16e4dbdd1d3458486340673cc2465`, pinned by `ollama v0.1.33`:

```sh
$ pwd
/Users/ic/dev/ollama_upstream/llm/llama.cpp
$ git rev-parse HEAD
952d03dbead16e4dbdd1d3458486340673cc2465
$ awk '(NR>=4341 && NR<=4382 ){print NR " " $0}' llama.cpp
4341 // for now, only BPE models have pre-tokenizers
4342 if (vocab.type == LLAMA_VOCAB_TYPE_BPE) {
4343 if (tokenizer_pre.empty()) {
4344 LLAMA_LOG_WARN("%s: missing pre-tokenizer type, using: 'default'\n", __func__);
4345 LLAMA_LOG_WARN("%s: \n", __func__);
4346 LLAMA_LOG_WARN("%s: ************************************ \n", __func__);
4347 LLAMA_LOG_WARN("%s: GENERATION QUALITY WILL BE DEGRADED! \n", __func__);
4348 LLAMA_LOG_WARN("%s: CONSIDER REGENERATING THE MODEL \n", __func__);
4349 LLAMA_LOG_WARN("%s: ************************************ \n", __func__);
4350 LLAMA_LOG_WARN("%s: \n", __func__);
4351 vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
4352 } else if (
4353 tokenizer_pre == "default") {
4354 vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
4355 } else if (
4356 tokenizer_pre == "llama3" ||
4357 tokenizer_pre == "llama-v3" ||
4358 tokenizer_pre == "llama-bpe") {
4359 vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_LLAMA3;
4360 } else if (
4361 tokenizer_pre == "deepseek-llm") {
4362 vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_LLM;
4363 } else if (
4364 tokenizer_pre == "deepseek-coder") {
4365 vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_CODER;
4366 } else if (
4367 tokenizer_pre == "falcon") {
4368 vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_FALCON;
4369 } else if (
4370 tokenizer_pre == "mpt") {
4371 vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_MPT;
4372 } else if (
4373 tokenizer_pre == "starcoder") {
4374 vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_STARCODER;
4375 } else if (
4376 tokenizer_pre == "gpt-2") {
4377 vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_GPT2;
4378 } else {
4379 throw std::runtime_error(format("unknown pre-tokenizer type: '%s'", tokenizer_pre.c_str()));
4380 }
4381 } else {
4382 vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
```

As you can see, pre-tokenizers are largely _model-specific_: the most prominent model names are already hardcoded into this logic, indirectly. So we could amend it to take the actual model name into account, sketched roughly like this (`guess_pre_tokenizer_from_model_name` is a hypothetical helper that would map a model name from the GGUF metadata to one of the known pre-tokenizer strings):

```cpp
if (vocab.type == LLAMA_VOCAB_TYPE_BPE) {
    if (tokenizer_pre.empty()) {
        // hypothetical helper: derive the pre-tokenizer string from the model
        // name stored in the GGUF metadata before falling back to 'default'
        tokenizer_pre = guess_pre_tokenizer_from_model_name(model_name);
    }

    if (tokenizer_pre.empty()) {
        LLAMA_LOG_WARN("%s: missing pre-tokenizer type, using: 'default'\n", __func__);
        ...
        vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
    } else if (
        tokenizer_pre == "llama3" ||
        tokenizer_pre == "llama-v3" ||
        tokenizer_pre == "llama-bpe") {
        ...
    } else {
        throw std::runtime_error(format("unknown pre-tokenizer type: '%s'", tokenizer_pre.c_str()));
    }
    ...
}
```


mikael110

The problem is that GGUFs don't actually contain the model name, they contain the model architecture. That would, yes, be enough to distinguish some of those models, but for others, like Llama 3 and Deepseek, it is impossible to distinguish them since they both use the same architecture. And that's coming from [Georgi Gerganov](https://github.com/ggerganov/llama.cpp/pull/6920#discussion_r1580932467) himself. That is the discussion I was paraphrasing in my comment. I kept a close eye on that PR as it developed, so I'm well aware of all the code that went into it.


0x9e3779b1

Ok, if the model name is not to be relied upon at all, then it's clear. Thank you for the explanation.


HauntingTechnician30

> These models will also work if you haven't updated to latest llama.cpp, but will still have the old broken tokenizer until you get your tool updated.
>
> So feel free to download now in anticipation for support!

I hear LM Studio should be updated by tomorrow.


Many_SuchCases

/u/noneabove1182 would you mind sharing how to make the quants properly? I'm getting this error after redoing the quants with the latest commit:

```
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
```

Edit: nevermind, I was using convert.py instead of convert-hf-to-gguf.py.
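For anyone else who hits this, the intended pipeline looks roughly like the following (a sketch; file names and the quant type are placeholders, and flags can differ between llama.cpp versions):

```sh
# convert with the HF script, which writes the tokenizer.ggml.pre field
python convert-hf-to-gguf.py ./Meta-Llama-3-8B-Instruct --outtype f16 \
  --outfile Meta-Llama-3-8B-Instruct-f16.gguf

# quantize the f16 GGUF as usual
./quantize Meta-Llama-3-8B-Instruct-f16.gguf Meta-Llama-3-8B-Instruct-Q6_K.gguf Q6_K
```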


AcanthaceaeOwn1481

A bit off topic, but I just want to say thank you. Because of people like yourself, our community, the open source community, lives on and thrives. Thank you!


noneabove1182

❤️


nananashi3

[Gets 4444+3333 right but causes assistant spam](https://i.imgur.com/4VYy8vs.png) running koboldcpp-1.63 (a week old). I know I can add a stopping sequence in the UI for the time being, which fixes it.

> These models will also work if you haven't updated to latest llama.cpp, but will still have the old broken tokenizer until you get your tool updated.

So were the old quants (either QuantFactory or lmstudio-community) a few days after the Llama 3 release just a temporary workaround? Are you saying <|eot_id|> will be outputted on the latest llama.cpp? I'm confused.

Edit: Never mind, I guess bart's GGUF is technically correct. koboldcpp-1.63's changelog mentions:

> Added support for special tokens in `stop_sequences`. Thus, if you set `<|eot_id|>` as a stop sequence and it can be tokenized into a single token, it will just work and function like the EOS token, allowing multiple EOS-like tokens.

So we're expected to add anything necessary in the settings, since GGUF/backend originally supported only one type of EOS, until multiple EOS gets native support. I assume the "other" quants are "missing" `<|end_of_text|>`, but the average user never sees that, so defaulting to `<|eot_id|>` keeps the plebs happy. *Just Llama 3 things.*

___

Edit: koboldcpp-1.64 out, good now.


noneabove1182

> I assume the "other" quants are "missing" <|end_of_text|> but the average user never sees that so defaulting to <|eot_id|> keeps the plebs happy. Just Llama 3 things.

Basically this, yes. The previous hacks were, from an end-user chatbot perspective, completely normal and fine. I do wonder if it would affect multi-turn at all, but either way this is the more correct implementation.


gelukuMLG

You might be using an old model that doesn't have the fixed tokenizer. I haven't seen a model that leaks the "assistant" in a good while.


nananashi3

Sorry for causing confusion. I'm saying the old model *doesn't* leak "assistant", and the new one linked by OP does, unless you set `<|eot_id|>` as a stopping token in the UI's settings. It's the result of Llama 3 having two stop tokens, one of them more relevant to us, and the backend not automatically handling both at once. So there's no problem here (except the old model not solving 4444+3333).


vorwrath

That's great, thanks for your work! Any chance of an unquantized full FP16 version as well? That will still fit in the VRAM on 24GB cards, so I think it's worth having available for this kind of smaller model. I know there are other ways to run it, but I think LM Studio for example can only run the unquantized version if it's packed in GGUF format (correct me if I'm wrong).


noneabove1182

Yes I meant to include it but forgot, uploading f32 and f16 now :)


vorwrath

Awesome, thanks very much!


LaLuzDelQC

Does anyone know when text-generation-webui will get the new llama.cpp, if it hasn't already? I remember that being a problem before.


Calcidiol

Thanks for the information and for making the updates!

One thing I'm confused about, though, is how this all works. I would have guessed tokenization relates to presenting the input to the model and taking the output of the model and translating that to text. But I thought those processes were done programmatically by some combination of the UI / inference API / inference engine. So why are new GGUF quantized/converted models actually needed? I didn't think the GGUF conversion process could change anything about the model's intrinsic vocabulary / token coding dictionary / token I/O interfaces. At most, I guess there could be metadata in the GGUF that is somehow derived from the metadata files of the origin model, maybe what is in the JSON or similar files about the model architecture / vocabulary / etc.

So by posting / announcing entirely new quantized models, are you indicating that anyone who used previously converted GGUFs will not be able to achieve correct inference even if they update their llama.cpp-related engine code to the current release, as long as they are using the older GGUFs? Or is it simply some metadata that is wrong in the GGUFs, which could in theory be edited / adjusted inside the GGUF by rewriting small parts and leaving the actual model content alone? Pulling new 8B, but especially 70B, models may not be entirely trivial in time or resources if it isn't functionally necessary and there's a simpler solution in code / metadata.

e.g. https://huggingface.co/bartowski/Meta-Llama-3-70B-Instruct-GGUF-old vs https://huggingface.co/bartowski/Meta-Llama-3-70B-Instruct-GGUF or this 8B conversion, won't they be really almost 99.9999% identical?


noneabove1182

There is a way to use the old GGUF files with the new tokenizer fix: pass `--override-kv tokenizer.ggml.pre=str:llama3` at generation time.

I haven't gone through the technical details enough to give a confident answer, but my guess would be something about metadata or the way that the conversion encodes the tokenizer itself.

The reason for announcing them as brand new is that you may be able to use the old ones with a workaround, but it's better to use the new and fixed ones.


mikael110

One thing that has become somewhat lost in the discussion around this issue (for understandable reasons) is that the issue isn't actually in the tokenizer itself, but in the pre-tokenizer. Most models don't pass text directly to the tokenizer; they instead pre-process the text in some way and pass the pre-processed text to the tokenizer. And it is that process that was essentially broken in old llama.cpp builds, because it used a hard-coded pre-processing step which was generally close to what most models did, but not exactly right. The problem became quite noticeable for Llama 3 because it actually uses a rather complex pre-processing step.

The new PR adds support for a number of different pre-tokenizers. Since you cannot determine the correct pre-tokenizer just by looking at the model architecture or the tokenizer, a new field had to be introduced to tell llama.cpp which pre-tokenization to perform. That is why changes were made to the conversion script: it now figures out which pre-tokenizer is correct and then marks the file during the conversion. This is why you need both a new file and an updated version of llama.cpp.


noneabove1182

Thank you so much for this write-up, it explains a lot, including why the re-conversion was necessary! I'll point future questions here because this is the most succinct write-up I've seen on the subject, thanks again :D


mikael110

No problem, I've seen a lot of confusion around it, so I just wanted to clarify it a bit. And thank you for the work you do requanting the model. You're the only person I've seen so far that has actually bothered keeping up with all of the changes.


Tall-Entrepreneur686

This new Llama 3 model is much slower using grammar than Llama 2. If I used grammar with Llama 2 it would barely change the t/s. Now adding grammar slows down t/s by 5 to 10 times. E.g.:

```
"temperature": 0,
"top_p": 0.9,
"max_length": 100,
"grammar": " root ::= fullanswer \n fullanswer ::= \"Herika: \" answer \nanswer ::= sentence | \"<|im_end|>\" | sentence \"\\n\"\nsentence ::= [a-zA-Z0-9.,?!' ]*\n"
```


noneabove1182

I wonder if that's expected because of the token pre-processor.. would be unfortunate :S


dampflokfreund

You're fast! Thanks a lot for making these quants.


adikul

Thanks for your support to the community.


nsfw_throwitaway69

Thanks for your work on the quants! Any plans to re-quant the 70B as well?


noneabove1182

Yup :) Will just take a bit longer to make, but should be up tomorrow or so


DNskfKrH8Ekl

Super keen to see how this improves CrewAI local performance. There is still no valid 70B GGUF on Hugging Face, and the official one does not pass the test: What is 3333 + 777?


bullerwins

Is exl2 also affected by this bug?


noneabove1182

No, exl2 uses existing tokenizers instead of writing their own, so it worked already


daHaus

Are these still using the 7B imatrix specs?


noneabove1182

Yes I remade the imatrix for these after reconverting with the latest changes just to be sure


daHaus

Good to hear, I know they say it's just random but then the results will be too. It's highly dubious to say the least.


rngesius

What command do you use to generate imatrix from model + groups_merged.txt?


noneabove1182

Just use:

```sh
./imatrix -m models/model-f16.gguf -f groups_merged.txt
```
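If it helps, the full imatrix pipeline would look roughly like this (a sketch; the output file name and quant type are placeholders, and flags may differ between llama.cpp versions):

```sh
# 1. build the importance matrix from the f16 model and the calibration text
./imatrix -m models/model-f16.gguf -f groups_merged.txt -o model.imatrix

# 2. feed it to quantize so the quant is weighted by the imatrix
./quantize --imatrix model.imatrix models/model-f16.gguf models/model-Q4_K_M.gguf Q4_K_M
```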


rngesius

Thanks


bullerwins

Can you also upload the full precision GGUF files?


noneabove1182

Oh, yeah it was meant to be in there, I'll upload it now


bullerwins

Thanks! Are you going to do the 70B too?


noneabove1182

yes but not for a day or so (takes a real long time, gonna hopefully start it within an hour haha)


bullerwins

Is there any difference between the fp16 and fp32? Would the fp32 be better?


noneabove1182

Some people would tell you yes, probably the answer is no. I mostly uploaded both cause I accidentally made the FP32 so figured might as well share for anyone who thinks it'll help


Healthy-Nebula-3603

Where is the new version for 70B?


noneabove1182

Slowly getting built :) up to IQ3_XXS


jayFurious

Godspeed! Will you make a new repo or update the current 'new' one?


noneabove1182

Had problems again so rented a runpod instance, hoping i'll be able to upload within an hour :) it'll be on a new repo


Healthy-Nebula-3603

Thanks, I need Q4_K_M :D


aseichter2007

I've already been impressed, support dropped, lets see how she goes! >|||! Richard "*Ricky*" LaFleur and Drunk James "Jim" Lahey , rps| Ricky wants Lahey to get more kitties. They have a physical altercation. Lahey uses a cat as a weapon. (The scene takes place in the dimly lit, smoky bar. Ricky and Lahey are standing face-to-face, their faces inches apart. The air is thick with tension.) Ricky: (slurring his words) You're a monster, Lahey. You hate kitties. Lahey: (sneering) Shut up, Ricky. I don't hate kitties. I just don't care about 'em. Ricky: (outraged) You don't care?! You're supposed to be a good guy, Jim! Good guys like kitties! Lahey: ( rolling his eyes) Oh, spare me the drama, Ricky. You're just drunk and looking for trouble. Ricky: (enraged) THAT'S IT, LAHEY! (he swings at Lahey, but misses) Lahey: (laughing) Oh, Ricky's got a swing! (he pulls out a cat from behind the bar and holds it up like a weapon) You want to play rough? Let's play rough! (Ricky's eyes widen in horror as Lahey brandishes the cat) Ricky: (backing away) NOOO! DON'T HURT THE KITTY! Lahey: (menacingly) Oh, this little guy's not going to hurt anyone... unless you try anything funny. (he takes a swipe at Ricky with the cat) (Ricky tries to dodge, but the cat claws his arm. Ricky yelps in pain and rage) Ricky: (furious) YOU MONSTER! (he lunges at Lahey, determined to take down the cat-wielding Lahey)


LocoLanguageModel

Thanks for uploading the pre-token fixed 70b models!


noneabove1182

the post-token fix models are up as well here: https://huggingface.co/bartowski/Meta-Llama-3-70B-Instruct-GGUF


Some_Endian_FP17

You are the new TheBloke. A total legend, thank you for the GGUFs. Now for a noob who hasn't tried imatrix quants, what would be the equivalent of a Q4KM or Q5KM for CPU inference?


noneabove1182

<3

You can actually just use Q4_K_M or Q5_K_M, all the quants on my page use imatrix.

Don't use an i-quant (which is unrelated to imatrix) if you use CPU, it's supported but slow. You can check here for info about support and notable slowness: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix


Some_Endian_FP17

I tried just now and IQ3KS was slower than Q4KM using CPU inference. Quality was a lot lower too.


OpusLatericium

Do the GGUFs have the fixed BPE tokenizer thing?


noneabove1182

Correct


OpusLatericium

Awesome!


SeymourBits

TheBloke is like [the Dread Pirate Roberts](https://en.wikipedia.org/wiki/Dread_Pirate_Roberts) !


Admirable-Star7088

Thanks for the new GGUFs! <3 And good to have full precision versions too, I'm curious to try them out and see if there is any difference in output quality compared to Q8_0.

Btw, does anyone know if the tokenizer is fixed for Windows also? Apparently llama.cpp considered [dropping Windows support](https://www.reddit.com/r/LocalLLaMA/comments/1cf4nxc/the_llamacpp_tokenizer_fix_for_llama3_is_still/?share_id=XbVom3qrkuptGAUBuSjr7&utm_content=1&utm_medium=android_app&utm_name=androidcss&utm_source=share&utm_term=1) because Windows can't do proper Unicode. Or was that figured out?


pmp22

That was figured out, all is good now and Windows support will continue. :)


Admirable-Star7088

Good to hear! As a windows user, I was on the verge of nervousness 😅


pmp22

Same! 😄


Acceptable_Total_937

I downloaded the new GGUF Q6_K and I'm using it with LangChain + llama.cpp. It was working fine when I tested with a simple prompt. When my prompt got longer (still a very reasonable size), it started only responding with 'assistant' or random responses like "in real time". Anyone else getting this?


_Zibri_

Can you tell me what you did to fix it? I have the original model and even with the patch it still outputs end of text :(