AI
LLaMA
https://wiki.installgentoo.com/wiki/Home_server#Expanding_Your_Storage
https://rentry.org/llama-tard-v2
https://hackmd.io/@reneil1337/alpaca
https://boards.4channel.org/g/catalog#s=lmg%2F
https://find.4chan.org/?q=AI+Dynamic+Storytelling+General
https://find.4chan.org/?q=AI+Chatbot+General
https://find.4chan.org/?q=%2Flmg%2F (local models general)
https://boards.4channel.org/g/thread/92400764#p92400764
https://files.catbox.moe/lvefgy.json
https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/
python server.py --model llama-7b-4bit --wbits 4
python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128
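If the model doesn't fit in VRAM, text-generation-webui's --pre_layer flag keeps only that many layers on the GPU and offloads the rest to CPU (see the --pre_layer 25 row in the benchmarks below for the speed cost). A rough sketch, reusing the Vicuna model from the benchmarks:
python server.py --model anon8231489123-vicuna-13b-GPTQ-4bit-128g --wbits 4 --groupsize 128 --pre_layer 25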
https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/59 for fixing the "out of space" error during install
https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode
https://github.com/pybind/pybind11/discussions/4566
https://lmsysvicuna.miraheze.org/wiki/How_to_use_Vicuna#Use_with_llama.cpp%3A
https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g
Vicuna generating its own prompts
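A minimal llama.cpp invocation for a Vicuna ggml model (sketch only; the model filename is a placeholder, and see the Vicuna wiki link above for the exact prompt format). The -r/--reverse-prompt flag hands control back to you when the model starts writing the next "### Human:" turn, which helps with Vicuna generating its own prompts:
./main -m ./models/ggml-vicuna-13b-4bit.bin -n 256 --color -i -r "### Human:" -p "### Human: Hello, who are you?"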
should be worse than Q4_1 (which is QK=32), but there are several PRs in the works that should improve quantization accuracy in general:
https://github.com/ggerganov/llama.cpp/pull/729
https://github.com/ggerganov/llama.cpp/pull/835
https://github.com/ggerganov/llama.cpp/pull/896
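For reference, a hedged sketch of producing a Q4_1 ggml model with llama.cpp's quantize tool (paths are placeholders; in builds from this period the last argument is a numeric quantization type, 2 for Q4_0 and 3 for Q4_1):
./quantize ./models/13B/ggml-model-f16.bin ./models/13B/ggml-model-q4_1.bin 3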
Benchmarks
| Interface | Model | GPTQ | Xformers? | HW | Load time | Speed |
|---|---|---|---|---|---|---|
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 240gb SSD, 16gb, desktop off | 10.53s | 7.97 tokens/s |
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | no | 240gb SSD, 16gb, desktop off | 10.22s | 7.55 tokens/s |
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-cuda | no | 240gb SSD, 16gb, desktop off | 16.68s | 4.03 tokens/s |
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-cuda | yes | 240gb SSD, 16gb, desktop off | 9.34s | 4.01 tokens/s |
| text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb | ? | 0.67 tokens/s |
| text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb, --threads 8 | maybe 30s? | 0.51 tokens/s |
| text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb, --threads 7 | ? | 0.68 tokens/s |
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g-ggml | no | no | 2TB SSD, 64gb | ? | 1.17 tokens/s |
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64gb, --pre_layer 25 | 45.69s | 0.25 tokens/s |
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64gb | 36.47s | 9.63 tokens/s |
| llama.cpp | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb | 10317.90 ms | 1096.21 ms per token |