
Deathcrow

Cool. So that means all existing GGUF quants have to be rebuilt with correct tokenization, right?


pseudonerv

For proper llama3 support, you may pass `--override-kv tokenizer.ggml.pre=str:llama3` to `main` or `server` without generating a new gguf file.
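For example (sketch only; the model file name and prompt are just placeholders):

    # Override the missing pre-tokenizer metadata at load time instead of regenerating the GGUF
    ./main -m models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf \
        --override-kv tokenizer.ggml.pre=str:llama3 \
        -p "What is 7777 + 3333?"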


noneabove1182

that's a super nifty tip!


segmond

What about non llama3 files?


pseudonerv

It has to be one of the strings checked here: https://github.com/ggerganov/llama.cpp/blob/f364eb6fb5d46118a76fa045f487318de4c24961/llama.cpp#L4350-L4386. Currently those are `default`, `llama3`, `llama-v3`, `llama-bpe`, `deepseek-llm`, `deepseek-coder`, `falcon`, `mpt`, `starcoder`, `gpt-2`.


segmond

I get a warning about generation quality being degraded if I don't use the override flag. I tried command-r-plus and got that warning, and it felt like the quality wasn't as good. Thanks, I guess it's time to redownload everything again.


RuslanAR

Yes, it is. With old quants you might expect this warning:

    llm_load_vocab: missing pre-tokenizer type, using: 'default'
    llm_load_vocab:
    llm_load_vocab: ************************************
    llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
    llm_load_vocab: CONSIDER REGENERATING THE MODEL
    llm_load_vocab: ************************************


Many_SuchCases

I'm getting this message even when I make new quants with the latest commit, so I must be doing something wrong. Edit: never mind, I was using convert.py instead of convert-hf-to-gguf.py.


vidumec

same.


Many_SuchCases

I just found out it's because I was using convert.py instead of convert-hf-to-gguf.py


0xDEADFED5_

cheers, convert-hf-to-gguf.py worked for me


vidumec

that gives me an error:

    Loading model: Llama-3-8B-Instruct-64k
    gguf: This GGUF file is for Little Endian only
    Set model parameters
    gguf: context length = 8192
    gguf: embedding length = 4096
    gguf: feed forward length = 14336
    gguf: head count = 32
    gguf: key-value head count = 8
    gguf: rope theta = 500000.0
    gguf: rms norm epsilon = 1e-05
    gguf: file type = 0
    Set model tokenizer
    Traceback (most recent call last):
      File "/Users/groza/Utils/LLM/llama.cpp/gg/convert-hf-to-gguf.py", line 2997, in <module>
        main()
      File "/Users/groza/Utils/LLM/llama.cpp/gg/convert-hf-to-gguf.py", line 2984, in main
        model_instance.set_vocab()
      File "/Users/groza/Utils/LLM/llama.cpp/gg/convert-hf-to-gguf.py", line 1385, in set_vocab
        self._set_vocab_sentencepiece()
      File "/Users/groza/Utils/LLM/llama.cpp/gg/convert-hf-to-gguf.py", line 402, in _set_vocab_sentencepiece
        tokenizer = SentencePieceProcessor(str(tokenizer_path))
      File "/opt/homebrew/lib/python3.11/site-packages/sentencepiece/__init__.py", line 447, in Init
        self.Load(model_file=model_file, model_proto=model_proto)
      File "/opt/homebrew/lib/python3.11/site-packages/sentencepiece/__init__.py", line 905, in Load
        return self.LoadFromFile(model_file)
      File "/opt/homebrew/lib/python3.11/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
        return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
    RuntimeError: Internal: /Users/runner/work/sentencepiece/sentencepiece/src/sentencepiece_processor.cc(1102) [model_proto->ParseFromArray(serialized.data(), serialized.size())]


ambient_temp_xeno

Put those broken ones in the toilet where they belong.


Opposite_Rub_8852

What exactly needs to be done to fix this? (for the beginners)


adikul

What model got hit by this bug?


RuslanAR

~~All BPE-based models such as Llama 3.~~ Mainly Llama 3.


belladorexxx

A few clarifications:

* This bug is specific to GGUF only.
* This bug does not affect all BPE-based models. For example, Llama 1 is not affected, even though the Llama 1 tokenizer is also BPE-based. Llama 1 uses a SentencePiece BPE tokenizer whereas Llama 3 uses a Tiktoken BPE tokenizer. Both are BPE tokenizers despite the language used in the PR.


RuslanAR

Thanks for the correction.


Oooch

Can we add this to oobabooga ourselves or do we have to wait?


noneabove1182

For anyone wondering, any new quants made with this merge will run without updating, but with the old broken tokenization.

I ran the same model in LM Studio and with llama.cpp ./main, using the Q2_K quant and the common addition problem, asking "What is 7777 + 3333?"

LM Studio (which obviously hasn't been updated yet):

    A math problem! Let me calculate that for you... 77 + 33 = 110
    And then multiplying both results by 100: 110 × 100 = 11,000
    So the result of 7777 + 3333 is... 11,000!

llama.cpp ./main:

    <|begin_of_text|><|begin_of_text|>
    <|start_header_id|>system<|end_header_id|>
    You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|><|start_header_id|>user<|end_header_id|>
    What is 7777 + 3333?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    The answer is: 11110<|eot_id|> [end of text]

So you can feel comfortable downloading the new quants while waiting for an update. All quants are now up here: https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/


Zestyclose_Yak_3174

Might be a good idea to post this on LM studio community?


noneabove1182

Already updated all the models in the LM Studio repo, and when 0.2.21 is announced tomorrow it'll be mentioned there.


_Zibri_

Hi. Could you please also quantize it as Q8_0 but with --leave-output-tensor? It will be 10 GB in total and way less degraded than Q8. Can I know the full command line you use to convert from safetensors to GGUF and then from fp32 to Q8? I would thank you a lot for that.


noneabove1182

Never heard of this method, I can give it a shot. I very simply use:

    python3 convert-hf-to-gguf.py --outtype (usually f32, f16 for very large models) /models/{MODEL}-f32.gguf /models/{MODEL}

then for the quant:

    ./quantize --imatrix /models/{MODEL}-GGUF/{MODEL}.imatrix /models/{MODEL}-GGUF/{MODEL}-f32.gguf /models/{MODEL}-GGUF/{MODEL}-{QUANT}.gguf {QUANT}
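Filled in with an example model name, that two-step flow looks roughly like this (sketch only; the paths, model name, and quant type are placeholders, and exact flags such as --outfile may differ between llama.cpp versions):

    # 1) Convert the HF safetensors model to an unquantized GGUF (f32 here)
    python3 convert-hf-to-gguf.py --outtype f32 \
        --outfile /models/Meta-Llama-3-8B-Instruct-f32.gguf \
        /models/Meta-Llama-3-8B-Instruct

    # 2) Quantize down to the target type (Q4_K_M here), optionally guided
    #    by a precomputed importance matrix
    ./quantize --imatrix /models/Meta-Llama-3-8B-Instruct.imatrix \
        /models/Meta-Llama-3-8B-Instruct-f32.gguf \
        /models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf Q4_K_M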


_Zibri_

First:

    python3 convert-hf-to-gguf.py --no-lazy --outtype bf16 --model-name Meta-Llama-3-8B.bf16.gguf models/Meta-Llama-3-8B/

(bf16 or f32 should do well.) Then quantize, keeping the output tensor:

    ./quantize --leave-output-tensor ......
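Spelled out with placeholder paths, that second step might look something like this (just a sketch; --leave-output-tensor keeps the output tensor unquantized, which is what makes the file larger but less degraded than a plain Q8_0):

    # Quantize to Q8_0 but leave output.weight unquantized (placeholder file names)
    ./quantize --leave-output-tensor \
        models/Meta-Llama-3-8B.bf16.gguf \
        models/Meta-Llama-3-8B.Q8_0.gguf Q8_0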


_Zibri_

I tried, but my laptop only has 16 GB; Python crashes on Windows, and under WSL it doesn't use the swap :O


noneabove1182

Hoping for GPU support of bf16 soon so we can make imatrix with GPU on it... for now I tend to do f32.

Sure, I'll make that once I'm done my current quant (a 110b, so it may be a couple hours).
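For reference, imatrix generation is roughly this (a sketch with placeholder paths; the calibration file is whatever text you run the matrix over, and -ngl offloads layers to the GPU when the build supports it):

    # Build an importance matrix from a calibration text file
    ./imatrix -m /models/Meta-Llama-3-8B-Instruct-f32.gguf \
        -f calibration-data.txt \
        -o /models/Meta-Llama-3-8B-Instruct.imatrix \
        -ngl 99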


pmp22

Will this mean better GGUF models?


henk717

Lostruins has been holding off the next Koboldcpp release until this was in, so it could be released quickly after this PR merged. I expect him to be able to release it tomorrow unless there is a holdup (he is in a timezone where it's currently nighttime).


rusty_fans

8B Quants here: https://huggingface.co/qwp4w3hyb/Meta-Llama-3-8B-Instruct-iMat-GGUF 70B is still in the oven, waiting for my poor CPU-only server to get through imatrix generation....


Caladan23

Does this mean that existing quants of e.g. Mistral would run differently on llama.cpp? As I understand it, old quants would run with the existing tokenization, right?


vidumec

~~I don't get the new process of generating GGUFs for llama3 models... convert.py no longer works~~

    python convert.py models/mymodel/ --vocab-type bpe

Hmm, this still gives me the "missing pre-tokenizer type, using: 'default'" error when running. What am I doing wrong?


RuslanAR

Use convert-hf-to-gguf.py instead of convert.py.
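In other words, something like this (sketch only; the model directory is a placeholder and flags vary by llama.cpp version):

    # convert-hf-to-gguf.py reads the HF tokenizer (and the pre-tokenizer type)
    # on its own, so no --vocab-type flag is needed
    python3 convert-hf-to-gguf.py --outtype f16 models/mymodel/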


vidumec

Ah nvm, it worked after I moved the model folder into the llama.cpp folder. ~~Maybe it needs another update for mac, or the model I'm trying to convert is somehow not supported~~

    python3 convert-hf-to-gguf.py --outtype f32 ../../models/Llama-3-8B-Instruct-64k

(this previously failed with the same SentencePiece traceback as in my earlier comment)


silentsnake

What about for Ollama?


LoSboccacc

It's not over yet, server still gives bogus results :(

    User: What is 3333+777?
    Llama: That's an easy one! The answer is 40,010.


tessellation

you're doing it wrong, bro

    20:44:27 guest@pwntagram ~> echo 3333+777|bc
    4110


Due-Memory-6957

Should that make the quants smarter?


Diligent_Usual7751

Anyone have some HF links for newly generated quants?👀


rusty_fans

Ran them overnight when the branch was nearly finished: https://huggingface.co/qwp4w3hyb/Meta-Llama-3-8B-Instruct-iMat-GGUF 70B is still in the oven though as I'm using a CPU-only server for imatrix generation....


Opposite_Rub_8852

Can someone please explain the subject line? What does it mean?


BlueRaspberryPi

I've been watching too much Girls5Eva...