AI
New Shit
koboldcpp.exe 13B-HyperMantis.ggmlv3.q4_K_M.bin --debug --usecublas --usemlock --contextsize 8192 --blasbatchsize 512 --psutil_set_threads --threads 6 --blasthreads 10 --gpulayers 5 --highpriority --stream --unbantokens --smartcontext
If you're running 1.35 and a superHOT model, you should also add --linearrope which should make them perform better.
RP recommended models: "superhot, airoboros, wizard-vicuna, guanaco, chronos are a few commonly discussed models off the top of my head. For me, it's superhot or guanaco (one or the other, not the merge though)"
LLaMA
llama.cpp pull requests
cuBLAS: use host pinned memory and dequantize while copying
Add NVIDIA cuBLAS support (Merged now)
TextGeneration-webui pull requests
I think I made the Triton-GPTQ model work with the GPTQ-for-LLaMa (Triton branch) repository. (slower than non-triton)
Tried more Vicuna + superCOT lora and I am fairly impressed at how well it understands me for a 13B model in terms of followup instructions.
Models
https://huggingface.co/Melbourne/Alpacino-30b-ggml/tree/main
Misc
GPT4-x-Alpaca-30B is particularly good at writing stories
"There was a tokenizer caching error, some people said. Redownload the hf_output files from the repo or just change the use_cache line in the config.json to say: "use_cache": true," for the Vicuna13B-free https://github.com/stochasticai/xturing/tree/main/examples/int4_finetuning
https://wiki.installgentoo.com/wiki/Home_server#Expanding_Your_Storage
https://rentry.org/llama-tard-v2
https://hackmd.io/@reneil1337/alpaca
https://find.4chan.org/?q=AI+Dynamic+Storytelling+General
https://find.4chan.org/?q=AI+Chatbot+General
https://find.4chan.org/?q=%2Flmg%2F (local models general)
https://boards.4channel.org/g/thread/92400764#p92400764
https://files.catbox.moe/lvefgy.json
https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/
python server.py --model llama-7b-4bit --wbits 4
python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128
https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/59 for installing with out of space error
https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode
https://github.com/pybind/pybind11/discussions/4566
https://lmsysvicuna.miraheze.org/wiki/How_to_use_Vicuna#Use_with_llama.cpp%3A
https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g
Vicuna generating its own prompts
Should be worse than Q4_1 (which uses QK=32), but there are several PRs in the works that should improve quantization accuracy in general; a rough sketch of the block-quantization idea follows the links below.
https://github.com/ggerganov/llama.cpp/pull/729
https://github.com/ggerganov/llama.cpp/pull/835
https://github.com/ggerganov/llama.cpp/pull/896
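For context on what QK means above, here is a rough NumPy sketch of the Q4_1-style block scheme (QK=32 values per block, per-block scale and minimum, 4-bit codes). It only illustrates the idea, not the actual ggml bit-packing or the new formats those PRs add:

```python
# Rough sketch of Q4_1-style block quantization: QK=32 values per block,
# per-block scale d and minimum m, 4-bit integer codes. Illustration only;
# real ggml packs two 4-bit codes per byte and is written in C.
import numpy as np

QK = 32

def quantize_q4_1(x: np.ndarray):
    blocks = x.reshape(-1, QK)
    m = blocks.min(axis=1, keepdims=True)
    d = (blocks.max(axis=1, keepdims=True) - m) / 15.0
    d = np.where(d == 0, 1.0, d)  # avoid divide-by-zero on constant blocks
    q = np.clip(np.round((blocks - m) / d), 0, 15).astype(np.uint8)
    return q, d, m

def dequantize_q4_1(q, d, m):
    return q.astype(np.float32) * d + m

x = np.random.randn(4 * QK).astype(np.float32)
q, d, m = quantize_q4_1(x)
print("max abs error:", np.abs(dequantize_q4_1(q, d, m).ravel() - x).max())
```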
https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g - python3 llama.py vicuna-AlekseyKorshuk-7B c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors vicuna-AlekseyKorshuk-7B-GPTQ-4bit-128g.safetensors
≈65% speedup of the AVX-512 implementation of ggml_vec_dot_q4_0() #933
"Speaking of which, for any 30b anons struggling with context size, I figured something out. If you use the Triton branch on WSL, go into GPTQ_loader.py and comment out make_quant_attn like so" from here
GGML Quantization
Papers
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
Benchmarks
TODO: Try different cublas batch sizes
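For collecting these numbers (and for trying different --blasbatchsize values), a rough timing helper: it sends one fixed prompt to a locally running koboldcpp instance and reports ms/T and T/s like the Speed column below. The URL and payload fields are assumptions (koboldcpp's KoboldAI-compatible endpoint on its default port 5001); adjust them to your setup.

```python
# Rough benchmark helper: time a single generation against a local koboldcpp
# instance and report ms/T and T/s. Assumption: KoboldAI-compatible endpoint
# at localhost:5001 (koboldcpp default); assumes all MAX_LENGTH tokens are generated.
import time
import requests

URL = "http://localhost:5001/api/v1/generate"  # assumed default koboldcpp endpoint
PROMPT = "Write a short story about a home server."
MAX_LENGTH = 80  # tokens to generate

payload = {"prompt": PROMPT, "max_length": MAX_LENGTH, "temperature": 0.7}
start = time.time()
resp = requests.post(URL, json=payload, timeout=600)
elapsed = time.time() - start
resp.raise_for_status()

print(f"{1000.0 * elapsed / MAX_LENGTH:.1f}ms/T, {MAX_LENGTH / elapsed:.2f}T/s")
```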
| Date | Interface | Model | GPTQ / Backend | Xformers? | HW / flags | Load time | Speed |
|---|---|---|---|---|---|---|---|
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 240gb SSD, 16gb, desktop off | 10.53 | 7.97 tokens/s |
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | no | 240gb SSD, 16gb, desktop off | 10.22s | 7.55 tokens/s |
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-cuda | no | 240gb SSD, 16gb, desktop off | 16.68s | 4.03 tokens/s |
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-cuda | yes | 240gb SSD, 16gb, desktop off | 9.34s | 4.01 tokens/s |
| | text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb | ? | 0.67 tokens/s |
| | text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb, --threads 8 | maybe 30s? | 0.51 tokens/s |
| | text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb, --threads 7 | | 0.68 tokens/s |
| | text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb, --threads 6 | | 0.61 tokens/s |
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g-ggml | no | no | 2TB SSD, 64gb | | 1.17 tokens/s |
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64gb, --pre_layer 25 | 45.69 | 0.25 tokens/s |
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64gb | 36.47 | 9.63 tokens/s |
| | llama.cpp | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | | | 2TB SSD, 64gb | 10317.90 ms | 1096.21 ms per token |
| | llama.cpp-modern-avx512 | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | | | 2TB SSD, 64gb | 9288.69 ms | 1049.03 ms per token |
| | llama.cpp-avx512-pr833 | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | | | 2TB SSD, 64gb | 13864.06 ms | 0.89 tokens/s, 820.68 ms per token |
| | text-gen | TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0 | | | 2TB SSD, 64gb | | 0.78 tokens/s |
| | text-gen+avx512-pr833 | TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0 | | | 2TB SSD, 64gb | | 1.04 tokens/s |
| 2023-04-24 | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64gb, also running llama.cpp with another model | 16.36 | 5.07 tokens/s |
| 2023-04-26 | koboldcpp | gozfarb-llama-30b-supercot-ggml/ggml-model-q4_0.bin | clblast | n/a | 2TB SSD, 64gb, --threads 8 | | 1073ms/T |
| 2023-04-29 | koboldcpp | Alpacino-30b-q4_0.bin | clblast | n/a | 2TB SSD, 64gb | | 700ms/T |
| 2023-07-13 | koboldcpp | llama-33b-supercot-ggml-q5_1 (complains about old format) | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 18 | | 643ms/T, 1.4T/s |
| 2023-07-13 | koboldcpp | llama-33b-supercot-ggml-q5_1 (complains about old format) | clblast | n/a | 2TB SSD, 64gb, --nommap --smartcontext --useclblast 0 0 --gpulayers 18 | | 685ms/T, 1.2T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 18 (probably space for more) | | 652ms/T, 1.5T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 26 (I note 3 threads are set by default) | | 593ms/T, 1.6T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 26 --psutil_set_threads (4 threads) | | 514ms/T, 1.8T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --psutil_set_threads (removed --nommap) | | 508ms/T, 1.9T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 5 | | 454ms/T, 2.1T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 6 | | 422ms/T, 2.2T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 7 | | 509ms/T, 1.8T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 8 | | 494ms/T, 1.7T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 6 --linearrope (no difference, needs supercot?) | | 425ms/T, 2.2T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.4.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 6 | | 400ms/T, 2.3T/s |
| 2023-07-13 | koboldcpp | airoboros-65b-gpt4-1.4.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 13 --threads 6 | | 1366ms/T, 0.7T/s |
| 2023-07-14 | koboldcpp | airoboros-65b-gpt4-1.4.ggmlv3.q2_K.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 13 --threads 6 | | 765ms/T, 1.2T/s |
| 2023-09-06 | koboldcpp | guanaco-33B.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --stream --smartcontext --usecublas --gpulayers 29 --threads 6 | | 562ms/T, 1.3T/s |
| 2023-09-06 | koboldcpp | guanaco-33B.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --stream --smartcontext --usecublas --gpulayers 29 --threads 6 | | 567ms/T, Total: 70.7s, 1.4T/s |
| 2023-09-06 | koboldcpp | guanaco-33B.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --stream --smartcontext --usecublas --gpulayers 25 --threads 6 | | 563ms/T, Total: 70.2s, 1.4T/s |
| 2023-12-03 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 6 --gpulayers 27 | | 330.7ms/T, Total: 40.79s, 2.94T/s |
| 2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 6 --gpulayers 27 | | 202.1ms/T, 4.78T/s |
| 2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 32 --gpulayers 27 | | 360.8ms/T, 2.68T/s |
| 2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 16 --gpulayers 27 | | 202.6ms/T, 4.82T/s |
| 2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 15 --gpulayers 27 | | 195.0ms/T, 5.03T/s |