llama : fix session saving/loading #3400
(termux) confirmed
This fixes the crash for me, but it does not seem to use or update the cache file properly when the prompt changes. It is as if the previous prompt is still there, influencing the generation. For example, if I generate a kanji mnemonic by running

main: build = 1294 (b0670db)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed = 1696154328
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from llama-2-70b.Q5_K_M.gguf (version GGUF V2 (latest))
[…]
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model params = 68.98 B
llm_load_print_meta: model size = 45.40 GiB (5.65 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
llm_load_tensors: mem required = 46494.72 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1280.00 MB
llama_new_context_with_model: compute buffer total size = 573.88 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
main: attempting to load saved session from 'mnemonics.bin'
main: session file does not exist, will create
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -2, n_keep = 0

For each kanji character, write a Markdown‐formatted mnemonic that uses its keyword and the keyword of all its components.
[…]

Kanji: 謝 (apologize)
Components: 言 (say), 射 (shoot)
Mnemonic: **Shot** first, ***apologize*** (**say** you are sorry) later.

Kanji: 提 (propose)
Components: 扌 (left hand), 是 (go with)
Mnemonic: When you **propose** to someone, put a ring on the **left hand** and say “I ***go with*** you.” It’s how it works in some countries.

Kanji:
llama_print_timings: load time = 3490.10 ms
llama_print_timings: sample time = 31.98 ms / 44 runs ( 0.73 ms per token, 1375.77 tokens per second)
llama_print_timings: prompt eval time = 678353.51 ms / 1782 tokens ( 380.67 ms per token, 2.63 tokens per second)
llama_print_timings: eval time = 58432.71 ms / 43 runs ( 1358.90 ms per token, 0.74 tokens per second)
llama_print_timings: total time = 737161.46 ms

Generating another one for the same kanji (same prompt) works fine:

main: build = 1294 (b0670db)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed = 1696150253
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from llama-2-70b.Q5_K_M.gguf (version GGUF V2 (latest))
[…]
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model params = 68.98 B
llm_load_print_meta: model size = 45.40 GiB (5.65 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
llm_load_tensors: mem required = 46494.72 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1280.00 MB
llama_new_context_with_model: compute buffer total size = 573.88 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
main: attempting to load saved session from 'mnemonics.bin'
main: loaded a session with prompt size of 1782 tokens
main: session file has exact match for prompt!
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -2, n_keep = 0

For each kanji character, write a Markdown‐formatted mnemonic that uses its keyword and the keyword of all its components.
[…]

Kanji: 謝 (apologize)
Components: 言 (say), 射 (shoot)
Mnemonic: **Shot** first, ***apologize*** (**say** you are sorry) later.

Kanji: 提 (propose)
Components: 扌 (left hand), 是 (go with)
Mnemonic: When someone ***proposes*** something to you, you will either **go with it** or not. It’s like your left hand is saying: “this way!” and your right hand saying: “that way!”. You need to pick one.

Kanji:
llama_print_timings: load time = 3586.75 ms
llama_print_timings: sample time = 40.55 ms / 58 runs ( 0.70 ms per token, 1430.44 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_print_timings: eval time = 72142.00 ms / 57 runs ( 1265.65 ms per token, 0.79 tokens per second)
llama_print_timings: total time = 83733.50 ms

But if I change the last paragraph of the prompt (the kanji for which I want a mnemonic), this happens:

main: build = 1294 (b0670db)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed = 1696152733
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from llama-2-70b.Q5_K_M.gguf (version GGUF V2 (latest))
[…]
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model params = 68.98 B
llm_load_print_meta: model size = 45.40 GiB (5.65 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
llm_load_tensors: mem required = 46494.72 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1280.00 MB
llama_new_context_with_model: compute buffer total size = 573.88 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
main: attempting to load saved session from 'mnemonics.bin'
main: loaded a session with prompt size of 1782 tokens
main: session file matches 1755 / 1786 tokens of prompt
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -2, n_keep = 0

For each kanji character, write a Markdown‐formatted mnemonic that uses its keyword and the keyword of all its components.
[…]

Kanji: 謝 (apologize)
Components: 言 (say), 射 (shoot)
Mnemonic: **Shot** first, ***apologize*** (**say** you are sorry) later.

Kanji: 配 (hand out)
Components: 酉 (sign of the bird), 己 (oneself)
Mnemonic: When one needs to **propose something**, he should make sure that this proposal is really his (**oneself**) before bringing up the subject in front of others. The ***sign of the bird*** is a sign of peace, so it’s best if the matter can be settled amicably.

Kanji:
llama_print_timings: load time = 3509.82 ms
llama_print_timings: sample time = 48.35 ms / 69 runs ( 0.70 ms per token, 1427.12 tokens per second)
llama_print_timings: prompt eval time = 18194.90 ms / 31 tokens ( 586.93 ms per token, 1.70 tokens per second)
llama_print_timings: eval time = 90051.99 ms / 68 runs ( 1324.29 ms per token, 0.76 tokens per second)
llama_print_timings: total time = 109755.13 ms

The output references the kanji of the previous generation (“propose”), even though it is nowhere to be found in the new prompt! Subsequent runs with the same prompt would say that the session file matches the prompt exactly, but “propose” and its keywords would keep reappearing.
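The symptom reported above is what one might expect if cached entries past the matching prefix survive a partial session match, which would fit the later commit messages about "future" tokens. The following is a minimal, self-contained sketch of that idea only. It does not use the llama.cpp API: `token`, `toy_cache`, `count_matching_prefix`, and `reuse_session` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using token = std::int32_t; // hypothetical stand-in for a token id

// Count how many leading tokens of the saved session agree with the new prompt.
std::size_t count_matching_prefix(const std::vector<token> & session,
                                  const std::vector<token> & prompt) {
    std::size_t n = 0;
    while (n < session.size() && n < prompt.size() && session[n] == prompt[n]) {
        ++n;
    }
    return n;
}

// Hypothetical cache model: one cell per evaluated position.
struct toy_cache {
    std::vector<token> cells;

    // Drop every cached position >= n_keep so that stale "future" entries
    // from the previous prompt cannot influence later evaluation.
    void truncate(std::size_t n_keep) {
        if (n_keep < cells.size()) {
            cells.resize(n_keep);
        }
    }
};

// Keep only the shared prefix of the cache and return the number of
// prompt tokens that still have to be evaluated.
std::size_t reuse_session(toy_cache & cache, const std::vector<token> & prompt) {
    const std::size_t n_match = count_matching_prefix(cache.cells, prompt);
    assert(n_match <= prompt.size());
    cache.truncate(n_match);
    return prompt.size() - n_match;
}
```

With the numbers from the log above, a 1782-token session that matches 1755 of 1786 prompt tokens leaves 1786 − 1755 = 31 tokens to evaluate, which agrees with the reported prompt eval count, and 1782 − 1755 = 27 stale session entries to discard.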
@Senemu Could you please try your test with the latest version of this branch and see if the issue is resolved?
The issue is resolved in the current version of this branch! 👏
llama.h
Outdated
// c0 < -1 : [0, c1]
// c1 < -1 : [c0, inf)
Shouldn't this be `c0 < 0`?
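To make the question concrete, here is a small sketch of how a negative bound could be resolved under the `< 0` reading suggested in this review comment. It is an illustration, not the code in `llama.h`: `pos`, `resolve_range`, and `in_range` are hypothetical names, and the half-open interval `[lo, hi)` is an assumption, since the quoted comment writes one range as closed and the other as half-open.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <utility>

using pos = std::int32_t; // hypothetical position type

// Resolve a (c0, c1) pair into a concrete half-open range [lo, hi).
// Assumption: any negative bound means "unbounded on that side",
// i.e. the `< 0` reading from the review, not the `< -1` in the comment.
std::pair<pos, pos> resolve_range(pos c0, pos c1) {
    // Precondition for this sketch: explicit bounds are given in order.
    assert(c0 < 0 || c1 < 0 || c0 <= c1);
    const pos lo = c0 < 0 ? 0 : c0;
    const pos hi = c1 < 0 ? std::numeric_limits<pos>::max() : c1;
    return std::make_pair(lo, hi);
}

// True if position p falls inside the resolved range.
bool in_range(pos p, pos c0, pos c1) {
    const std::pair<pos, pos> r = resolve_range(c0, c1);
    return p >= r.first && p < r.second;
}
```

Read literally, a `< -1` test would not treat `-1` itself as "unbounded", which is presumably what prompted the question.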
@Senemu I made some more changes, hoping I didn't break it again. Will merge it now without testing, but if you spot any issues again - let us know.
ac2219f breaks the session cache even when using exactly the same prompt. The first run (without a cache file) works as expected, but a rerun outputs garbage:

main: build = 1315 (ac2219f)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed = 1696150253
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from llama-2-70b.Q5_K_M.gguf (version GGUF V2 (latest))
[…]
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0,0e+00
llm_load_print_meta: f_norm_rms_eps = 1,0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base_train = 10000,0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model params = 68,98 B
llm_load_print_meta: model size = 45,40 GiB (5,65 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0,23 MB
llm_load_tensors: mem required = 46494,72 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1280,00 MB
llama_new_context_with_model: compute buffer total size = 573,88 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
main: attempting to load saved session from 'mnemonics.bin'
main: loaded a session with prompt size of 1782 tokens
main: session file has exact match for prompt!
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -2, n_keep = 0

For each kanji character, write a Markdown‐formatted mnemonic that uses its keyword and the keyword of all its components.
[…]

Kanji: 謝 (apologize)
Components: 言 (say), 射 (shoot)
Mnemonic: **Shot** first, ***apologize*** (**say** you are sorry) later.

Kanji: 提 (propose)
Components: 扌 (left hand), 是 (go with)
Mnemonic: When What Where Why ## Markdown [end of text]

llama_print_timings: load time = 3831,54 ms
llama_print_timings: sample time = 10,80 ms / 14 runs ( 0,77 ms per token, 1295,94 tokens per second)
llama_print_timings: prompt eval time = 0,00 ms / 1 tokens ( 0,00 ms per token, inf tokens per second)
llama_print_timings: eval time = 17569,35 ms / 13 runs ( 1351,49 ms per token, 0,74 tokens per second)
llama_print_timings: total time = 18710,15 ms
If this doesn't get resolved soon, open a new issue (or reopen an old one, if there is one that applies) so this doesn't get missed.
…example
* 'master' of github.com:ggerganov/llama.cpp: (24 commits)
  convert : fix Baichuan2 models by using vocab size in config.json (ggml-org#3299)
  readme : add project status link
  ggml : fix build after ggml-org#3329
  llm : add Refact model (ggml-org#3329)
  sync : ggml (conv 1d + 2d updates, UB fixes) (ggml-org#3468)
  finetune : readme fix typo (ggml-org#3465)
  ggml : add RISC-V Vector Support for K-Quants and improved the existing intrinsics (ggml-org#3453)
  main : consistent prefix/suffix coloring (ggml-org#3425)
  llama : fix session saving/loading (ggml-org#3400)
  llama : expose model's rope_freq_scale in the API (ggml-org#3418)
  metal : alibi for arbitrary number of heads (ggml-org#3426)
  cmake : make LLAMA_NATIVE flag actually use the instructions supported by the processor (ggml-org#3273)
  Work on the BPE tokenizer (ggml-org#3252)
  convert : fix vocab size when not defined in hparams (ggml-org#3421)
  cmake : increase minimum version for add_link_options (ggml-org#3444)
  CLBlast: Add broadcast support for matrix multiplication (ggml-org#3402)
  gguf : add BERT, MPT, and GPT-J arch info (ggml-org#3408)
  gguf : general usability improvements (ggml-org#3409)
  cmake : make CUDA flags more similar to the Makefile (ggml-org#3420)
  finetune : fix ggml-org#3404 (ggml-org#3437)
  ...
* llama : fix session saving/loading
* llama : temp fix for clearing "future" tokens from the KV cache
* llama : fix handling of "future" tokens when loading sessions
* llama : fix comments for llama_kv_cache API
@Senemu The issue should be fixed on latest
It is fixed in b8fe4b5. Thank you very much!
…example
* 'master' of github.com:ggerganov/llama.cpp: (34 commits)
  examples: support LLaVA v1.5 (multimodal model) (ggml-org#3436)
  docs : fix typo GOMP_CPU_AFFINITY (ggml-org#3597)
  cmake : fix add_compile_options on macOS
  typo : it is `--n-gpu-layers` not `--gpu-layers` (ggml-org#3592)
  ci : check if there is enough VRAM (ggml-org#3596)
  server : add completion mode (no chat) (ggml-org#3582)
  prompts : add mnemonics.txt
  server : fix kv cache management (ggml-org#3588)
  main : fix session loading bug (ggml-org#3400)
  server : add parameter -tb N, --threads-batch N (ggml-org#3584)
  common : fix mirostat state when using multiple sequences (ggml-org#3543)
  batched : add bench tool (ggml-org#3545)
  examples : add batched.swift + improve CI for swift (ggml-org#3562)
  Add MPT model to supported models in README.md (ggml-org#3574)
  Minor improvements in GPT2 tokenizer (ggml-org#3567)
  readme : add bloom (ggml-org#3570)
  llm : add bloom models (ggml-org#3553)
  swift : improvements and fixes (ggml-org#3564)
  llm : add MPT support (ggml-org#3417)
  infill. : fix tokenization (ggml-org#3508)
  ...
ref #3397
I think this should fix the issue with saving/loading session data after #3228.
Make sure to delete any old chat data
@jluisreymejias Can you give this branch a try?