AI
LLaMA
llama.cpp pull requests
cuBLAS: use host pinned memory and dequantize while copying
Add NVIDIA cuBLAS support (now merged)
text-generation-webui pull requests
I think I got the Triton GPTQ model working with the GPTQ-for-LLaMa (Triton branch) repository (slower than non-Triton).
Tried more of Vicuna + SuperCOT LoRA and, for a 13B model, I am fairly impressed at how well it understands follow-up instructions.
Models
https://huggingface.co/Melbourne/Alpacino-30b-ggml/tree/main
Misc
https://github.com/stochasticai/xturing/tree/main/examples/int4_finetuning
https://wiki.installgentoo.com/wiki/Home_server#Expanding_Your_Storage
https://rentry.org/llama-tard-v2
https://hackmd.io/@reneil1337/alpaca
https://find.4chan.org/?q=AI+Dynamic+Storytelling+General
https://find.4chan.org/?q=AI+Chatbot+General
https://find.4chan.org/?q=%2Flmg%2F (local models general)
https://boards.4channel.org/g/thread/92400764#p92400764
https://files.catbox.moe/lvefgy.json
https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/
python server.py --model llama-7b-4bit --wbits 4
python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128
https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/59 (fix for the out-of-space error during install)
https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode
https://github.com/pybind/pybind11/discussions/4566
https://lmsysvicuna.miraheze.org/wiki/How_to_use_Vicuna#Use_with_llama.cpp%3A
https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g
Vicuna generating its own prompts
Should be worse than Q4_1 (which uses QK=32), but there are several PRs in the works that should improve quantization accuracy in general:
https://github.com/ggerganov/llama.cpp/pull/729
https://github.com/ggerganov/llama.cpp/pull/835
https://github.com/ggerganov/llama.cpp/pull/896
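Rough numpy sketch of what Q4_0 vs Q4_1 means here (assuming the early-2023 ggml layout: 32 weights per block, Q4_0 stores one scale, Q4_1 stores a scale plus a minimum; the real code packs two nibbles per byte and the rounding/offset details have changed across PRs):

```python
import numpy as np

QK = 32  # ggml block size for Q4_0 / Q4_1

def quantize_q4_0(block):
    """Q4_0: one float scale per 32-value block, symmetric 4-bit ints."""
    amax = float(np.abs(block).max())
    d = amax / 7.0 if amax > 0 else 1.0
    q = np.clip(np.round(block / d), -8, 7).astype(np.int8)
    return d, q

def dequantize_q4_0(d, q):
    return d * q.astype(np.float32)

def quantize_q4_1(block):
    """Q4_1: float scale plus float minimum per block, unsigned 4-bit ints."""
    lo = float(block.min())
    span = float(block.max()) - lo
    d = span / 15.0 if span > 0 else 1.0
    q = np.clip(np.round((block - lo) / d), 0, 15).astype(np.uint8)
    return d, lo, q

def dequantize_q4_1(d, lo, q):
    return d * q.astype(np.float32) + lo

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(QK).astype(np.float32)
    err0 = np.abs(dequantize_q4_0(*quantize_q4_0(x)) - x).mean()
    err1 = np.abs(dequantize_q4_1(*quantize_q4_1(x)) - x).mean()
    print(f"Q4_0 mean abs error: {err0:.4f}")
    print(f"Q4_1 mean abs error: {err1:.4f}")
```

Q4_1's extra per-block minimum is why it is a bit more accurate (and a bit bigger) than Q4_0.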
https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g - python3 llama.py vicuna-AlekseyKorshuk-7B c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors vicuna-AlekseyKorshuk-7B-GPTQ-4bit-128g.safetensors
≈65% speedup of the AVX-512 implementation of ggml_vec_dot_q4_0() #933
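That kernel is the blockwise quantized dot product. A scalar sketch of the computation (ignoring the nibble packing and offset storage the real kernel has to deal with; blocks are (scale, int4_array) pairs like quantize_q4_0 above returns):

```python
import numpy as np

def vec_dot_q4_0(blocks_a, blocks_b):
    """Scalar reference of a Q4_0 dot product over paired blocks."""
    total = 0.0
    for (da, qa), (db, qb) in zip(blocks_a, blocks_b):
        # integer multiply-accumulate inside the block, both scales applied
        # once per block; this inner sum is what the AVX-512 code vectorizes
        total += da * db * int(qa.astype(np.int32) @ qb.astype(np.int32))
    return total
```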
"Speaking of which, for any 30b anons struggling with context size, I figured something out. If you use the Triton branch on WSL, go into GPTQ_loader.py and comment out make_quant_attn like so" from here
Benchmarks
| Date | Interface | Model | GPTQ / backend | Xformers? | HW | Load time | Speed |
|---|---|---|---|---|---|---|---|
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 240GB SSD, 16GB, desktop off | 10.53s | 7.97 tokens/s |
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | no | 240GB SSD, 16GB, desktop off | 10.22s | 7.55 tokens/s |
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-cuda | no | 240GB SSD, 16GB, desktop off | 16.68s | 4.03 tokens/s |
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-cuda | yes | 240GB SSD, 16GB, desktop off | 9.34s | 4.01 tokens/s |
| | text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64GB | ? | 0.67 tokens/s |
| | text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64GB, --threads 8 | maybe 30s? | 0.51 tokens/s |
| | text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64GB, --threads 7 | | 0.68 tokens/s |
| | text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64GB, --threads 6 | | 0.61 tokens/s |
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g-ggml | no | no | 2TB SSD, 64GB | | 1.17 tokens/s |
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64GB, --pre_layer 25 | 45.69s | 0.25 tokens/s |
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64GB | 36.47s | 9.63 tokens/s |
| | llama.cpp | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | | | 2TB SSD, 64GB | 10317.90 ms | 1096.21 ms per token |
| | llama.cpp-modern-avx512 | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | | | 2TB SSD, 64GB | 9288.69 ms | 1049.03 ms per token |
| | llama.cpp-avx512-pr833 | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | | | 2TB SSD, 64GB | 13864.06 ms | 0.89 tokens/s (820.68 ms per token) |
| | text-gen | TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0 | | | 2TB SSD, 64GB | | 0.78 tokens/s |
| | text-gen+avx512-pr833 | TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0 | | | 2TB SSD, 64GB | | 1.04 tokens/s |
| 2023-04-24 | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64GB, also running llama.cpp with another model | 16.36s | 5.07 tokens/s |
| 2023-04-26 | koboldcpp | gozfarb-llama-30b-supercot-ggml/ggml-model-q4_0.bin | clblast | n/a | 2TB SSD, 64GB, --threads 8 | | 1073 ms/T |
| 2023-04-29 | koboldcpp | Alpacino-30b-q4_0.bin | clblast | n/a | 2TB SSD, 64GB | | 700 ms/T |
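The table above mixes tokens/s with ms per token (koboldcpp's ms/T). They are reciprocals, so a small helper makes the rows directly comparable:

```python
def ms_per_token_to_tps(ms_per_token: float) -> float:
    """Convert milliseconds per token (koboldcpp's ms/T) to tokens per second."""
    return 1000.0 / ms_per_token

def tps_to_ms_per_token(tokens_per_second: float) -> float:
    """Convert tokens per second back to milliseconds per token."""
    return 1000.0 / tokens_per_second

# 2023-04-29 koboldcpp row: 700 ms/T ~= 1.43 tokens/s
print(f"{ms_per_token_to_tps(700):.2f} tokens/s")
# llama.cpp row: 1096.21 ms per token ~= 0.91 tokens/s
print(f"{ms_per_token_to_tps(1096.21):.2f} tokens/s")
```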