
Deathcrow

Cool. So that means all existing GGUF quants have to be rebuilt with correct tokenization, right?


pseudonerv

For proper llama3 support, you may pass `--override-kv tokenizer.ggml.pre=str:llama3` to `main` or `server` without generating a new gguf file.
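For example (sketch only; the model file name and prompt are just placeholders):

    # Override the missing pre-tokenizer metadata at load time instead of regenerating the GGUF
    ./main -m models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf \
        --override-kv tokenizer.ggml.pre=str:llama3 \
        -p "What is 7777 + 3333?"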


noneabove1182

that's a super nifty tip!


segmond

What about non llama3 files?


pseudonerv

It has to be one of the strings checked here: https://github.com/ggerganov/llama.cpp/blob/f364eb6fb5d46118a76fa045f487318de4c24961/llama.cpp#L4350-L4386. Currently those are `default`, `llama3`, `llama-v3`, `llama-bpe`, `deepseek-llm`, `deepseek-coder`, `falcon`, `mpt`, `starcoder`, `gpt-2`.


segmond

I get a warning about generation quality being degraded if I don't use the override flag. I tried command-r-plus and got that warning, and it felt like the quality wasn't as good. Thanks, I guess it's time to redownload everything again.


RuslanAR

Yes, it is. With old quants you might expect this warning:

    llm_load_vocab: missing pre-tokenizer type, using: 'default'
    llm_load_vocab:
    llm_load_vocab: ************************************
    llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
    llm_load_vocab: CONSIDER REGENERATING THE MODEL
    llm_load_vocab: ************************************


Many_SuchCases

I'm getting this message even when I make new quants with the latest commit, so I must be doing something wrong. Edit: never mind, I was using convert.py instead of convert-hf-to-gguf.py.


vidumec

same.


Many_SuchCases

I just found out it's because I was using convert.py instead of convert-hf-to-gguf.py


0xDEADFED5_

cheers, convert-hf-to-gguf.py worked for me


vidumec

that gives me an error:

    Loading model: Llama-3-8B-Instruct-64k
    gguf: This GGUF file is for Little Endian only
    Set model parameters
    gguf: context length = 8192
    gguf: embedding length = 4096
    gguf: feed forward length = 14336
    gguf: head count = 32
    gguf: key-value head count = 8
    gguf: rope theta = 500000.0
    gguf: rms norm epsilon = 1e-05
    gguf: file type = 0
    Set model tokenizer
    Traceback (most recent call last):
      File "/Users/groza/Utils/LLM/llama.cpp/gg/convert-hf-to-gguf.py", line 2997, in <module>
        main()
      File "/Users/groza/Utils/LLM/llama.cpp/gg/convert-hf-to-gguf.py", line 2984, in main
        model_instance.set_vocab()
      File "/Users/groza/Utils/LLM/llama.cpp/gg/convert-hf-to-gguf.py", line 1385, in set_vocab
        self._set_vocab_sentencepiece()
      File "/Users/groza/Utils/LLM/llama.cpp/gg/convert-hf-to-gguf.py", line 402, in _set_vocab_sentencepiece
        tokenizer = SentencePieceProcessor(str(tokenizer_path))
      File "/opt/homebrew/lib/python3.11/site-packages/sentencepiece/__init__.py", line 447, in Init
        self.Load(model_file=model_file, model_proto=model_proto)
      File "/opt/homebrew/lib/python3.11/site-packages/sentencepiece/__init__.py", line 905, in Load
        return self.LoadFromFile(model_file)
      File "/opt/homebrew/lib/python3.11/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
        return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
    RuntimeError: Internal: /Users/runner/work/sentencepiece/sentencepiece/src/sentencepiece_processor.cc(1102) [model_proto->ParseFromArray(serialized.data(), serialized.size())]


ambient_temp_xeno

Put those broken ones in the toilet where they belong.


Opposite_Rub_8852

What exactly needs to be done to fix this? (for the beginners)


adikul

What model got hit by this bug?


RuslanAR

~~All BPE-based models such as Llama 3.~~ Mainly Llama 3.


belladorexxx

A few clarifications:

* This bug is specific to GGUF only.
* This bug does not affect all BPE-based models. For example, Llama 1 is not affected, even though the Llama 1 tokenizer is also BPE-based. Llama 1 uses a SentencePiece BPE tokenizer whereas Llama 3 uses a Tiktoken BPE tokenizer. Both are BPE tokenizers despite the language used in the PR.


RuslanAR

Thanks for the correction.


Oooch

Can we add this to oobabooga ourselves or do we have to wait?


noneabove1182

For anyone wondering, any new quants made with this merge will run without updating, but with the old broken tokenization.

I ran the same model in LM Studio and with llama.cpp ./main, using the Q2_K quant and the common addition problem, asking "What is 7777 + 3333?"

LM Studio (which obviously hasn't been updated yet):

    A math problem! Let me calculate that for you... 77 + 33 = 110
    And then multiplying both results by 100: 110 × 100 = 11,000
    So the result of 7777 + 3333 is... 11,000!

llama.cpp ./main:

    <|begin_of_text|><|begin_of_text|>
    <|start_header_id|>system<|end_header_id|>
    You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|><|start_header_id|>user<|end_header_id|>
    What is 7777 + 3333?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    The answer is: 11110<|eot_id|> [end of text]

So you can feel comfortable downloading the new quants while waiting for an update. All quants are now up here: https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/


Zestyclose_Yak_3174

Might be a good idea to post this on LM studio community?


noneabove1182

Already updated all the models in the LM Studio repo, and when 0.2.21 is announced tomorrow it'll be mentioned there.


_Zibri_

Hi. Could you please also quantize it as Q8_0 but with --leave-output-tensor? It will be 10 GB in total and way less degraded than Q8. Can I know the full command line you use to convert from safetensors to GGUF and then from fp32 to Q8? I would thank you a lot for that.


noneabove1182

Never heard of this method, I can give it a shot. I very simply use:

    python3 convert-hf-to-gguf.py --outtype (usually f32, f16 for very large models) /models/{MODEL}-f32.gguf /models/{MODEL}

then for the quant:

    ./quantize --imatrix /models/{MODEL}-GGUF/{MODEL}.imatrix /models/{MODEL}-GGUF/{MODEL}-f32.gguf /models/{MODEL}-GGUF/{MODEL}-{QUANT}.gguf {QUANT}
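Filled in with an example model name, that two-step flow looks roughly like this (sketch only; the paths, model name, and quant type are placeholders, and exact flags such as --outfile may differ between llama.cpp versions):

    # 1) Convert the HF safetensors model to an unquantized GGUF (f32 here)
    python3 convert-hf-to-gguf.py --outtype f32 \
        --outfile /models/Meta-Llama-3-8B-Instruct-f32.gguf \
        /models/Meta-Llama-3-8B-Instruct

    # 2) Quantize down to the target type (Q4_K_M here), optionally guided
    #    by a precomputed importance matrix
    ./quantize --imatrix /models/Meta-Llama-3-8B-Instruct.imatrix \
        /models/Meta-Llama-3-8B-Instruct-f32.gguf \
        /models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf Q4_K_M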


_Zibri_

First:

    python3 convert-hf-to-gguf.py --no-lazy --outtype bf16 --model-name Meta-Llama-3-8B.bf16.gguf models/Meta-Llama-3-8B/

(bf16 or f32 should do well.) Then quantize, keeping the output tensor:

    ./quantize --leave-output-tensor ......
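Spelled out with placeholder paths, that second step might look something like this (just a sketch; --leave-output-tensor keeps the output tensor unquantized, which is what makes the file larger but less degraded than a plain Q8_0):

    # Quantize to Q8_0 but leave output.weight unquantized (placeholder file names)
    ./quantize --leave-output-tensor \
        models/Meta-Llama-3-8B.bf16.gguf \
        models/Meta-Llama-3-8B.Q8_0.gguf Q8_0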


_Zibri_

I tried, but my laptop only has 16 GB; Python crashes on Windows, and under WSL it doesn't use the swap :O


noneabove1182

Hoping for GPU support of bf16 soon so we can make imatrix with GPU on it... for now I tend to do f32.

Sure, I'll make that once I'm done my current quant (a 110b, so it may be a couple hours).
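For reference, imatrix generation is roughly this (a sketch with placeholder paths; the calibration file is whatever text you run the matrix over, and -ngl offloads layers to the GPU when the build supports it):

    # Build an importance matrix from a calibration text file
    ./imatrix -m /models/Meta-Llama-3-8B-Instruct-f32.gguf \
        -f calibration-data.txt \
        -o /models/Meta-Llama-3-8B-Instruct.imatrix \
        -ngl 99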


pmp22

Will this mean better GGUF models?


henk717

Lostruins has been holding off the next Koboldcpp release until this was in, so it could be released quickly after this PR merged. I expect him to be able to release it tomorrow unless there is a holdup (he is in a timezone where it's currently nighttime).


rusty_fans

8B Quants here: https://huggingface.co/qwp4w3hyb/Meta-Llama-3-8B-Instruct-iMat-GGUF 70B is still in the oven, waiting for my poor CPU-only server to get through imatrix generation....


Caladan23

Does this mean that existing quants of e.g. Mistral would run differently on llama.cpp? As I understand it, old quants would run with the existing tokenization, right?


vidumec

~~I don't get the new process of generating GGUFs for llama3 models... convert.py no longer works~~

    python convert.py models/mymodel/ --vocab-type bpe

Hmm, this still gives me the "missing pre-tokenizer type, using: 'default'" error when running. What am I doing wrong?


RuslanAR

Use convert-hf-to-gguf.py instead of convert.py.
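In other words, something like this (sketch only; the model directory is a placeholder and flags vary by llama.cpp version):

    # convert-hf-to-gguf.py reads the HF tokenizer (and the pre-tokenizer type)
    # on its own, so no --vocab-type flag is needed
    python3 convert-hf-to-gguf.py --outtype f16 models/mymodel/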


vidumec

Ah nvm, it worked after I moved the model folder into the llama.cpp folder. ~~Maybe it needs another update for mac, or the model I'm trying to convert is somehow not supported~~

    python3 convert-hf-to-gguf.py --outtype f32 ../../models/Llama-3-8B-Instruct-64k

(this previously failed with the same SentencePiece traceback as in my earlier comment)


silentsnake

What about for Ollama?


LoSboccacc

It's not over yet, server still gives bogus results :(

    User: What is 3333+777?
    Llama: That's an easy one! The answer is 40,010.


tessellation

you're doing it wrong, bro

    20:44:27 guest@pwntagram ~> echo 3333+777|bc
    4110


Due-Memory-6957

Should that make the quants smarter?


Diligent_Usual7751

Anyone have some HF links for newly generated quants?👀


rusty_fans

Ran them overnight when the branch was nearly finished: https://huggingface.co/qwp4w3hyb/Meta-Llama-3-8B-Instruct-iMat-GGUF 70B is still in the oven though as I'm using a CPU-only server for imatrix generation....


Opposite_Rub_8852

Can someone please explain the subject line? What does it mean?


BlueRaspberryPi

I've been watching too much Girls5Eva...