AI

LLaMA

https://files.catbox.moe/lvefgy.json

python server.py --model llama-7b-4bit --wbits 4
python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128

Should be worse than Q4_1 (which uses QK=32), but several PRs in the works should improve quantization accuracy in general:

https://github.com/ggerganov/llama.cpp/pull/729

https://github.com/ggerganov/llama.cpp/pull/835

https://github.com/ggerganov/llama.cpp/pull/896
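
A rough sketch of why block size matters for 4-bit accuracy, assuming a Q4_1-style scheme that stores one scale and one minimum per block. This is an illustration in plain Python, not llama.cpp's actual code; the function names and the random test weights are made up for the example.

```python
import random

def quantize_block(block):
    """Map one block of floats to 4-bit codes plus a (scale, minimum) pair."""
    lo, hi = min(block), max(block)
    # 15 = 2**4 - 1 steps; fall back to 1.0 so a constant block does not divide by zero
    scale = (hi - lo) / 15 or 1.0
    codes = [min(15, max(0, round((x - lo) / scale))) for x in block]
    return codes, scale, lo

def dequantize_block(codes, scale, lo):
    """Reconstruct approximate floats from the 4-bit codes."""
    return [c * scale + lo for c in codes]

def roundtrip_rmse(values, block_size):
    """Root-mean-square error after quantizing and dequantizing block by block."""
    squared_error = 0.0
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        restored = dequantize_block(*quantize_block(block))
        squared_error += sum((a - b) ** 2 for a, b in zip(block, restored))
    return (squared_error / len(values)) ** 0.5

random.seed(0)
# Hypothetical stand-in for a weight row; real model weights are not Gaussian noise
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

rmse_32 = roundtrip_rmse(weights, 32)
rmse_128 = roundtrip_rmse(weights, 128)
print(f"block size 32:  RMSE {rmse_32:.4f}")
print(f"block size 128: RMSE {rmse_128:.4f}")
```

A larger block spans a wider min-to-max range, so each of the 16 levels covers more ground and the rounding error grows. That is the general trade-off behind comparing a 32-wide block against a 128-wide group.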

Benchmarks


Interface	Model	GPTQ backend	Xformers	Hardware	Load (s)	Speed (tokens/s)
text-gen	anonVic13B	GPTQ-for-LLaMa-triton	yes	240GB SSD, 16GB, desktop off	10.53	7.97
text-gen	anonVic13B	GPTQ-for-LLaMa-triton	no	240GB SSD, 16GB, desktop off	10.22	7.55
text-gen	anonVic13B	GPTQ-for-LLaMa-cuda	no	240GB SSD, 16GB, desktop off	16.68	4.03
text-gen	anonVic13B	GPTQ-for-LLaMa-cuda	yes	240GB SSD, 16GB, desktop off	9.34	4.01
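
To compare the two backends at a glance, a small sketch that computes the triton-to-cuda speed ratio from the figures in the table above. The tuple layout and the helper name are just for this example, and the first load time is assumed to be in seconds like the others.

```python
# (backend, xformers enabled, load time in seconds, tokens per second),
# copied from the benchmark table; the first load time had no unit and is assumed to be seconds
runs = [
    ("triton", True, 10.53, 7.97),
    ("triton", False, 10.22, 7.55),
    ("cuda", False, 16.68, 4.03),
    ("cuda", True, 9.34, 4.01),
]

def speed(backend, xformers):
    """Return tokens/sec for the run matching the backend and xformers setting."""
    for name, xf, _load, tokens_per_sec in runs:
        if name == backend and xf == xformers:
            return tokens_per_sec
    raise KeyError((backend, xformers))

for xformers in (True, False):
    ratio = speed("triton", xformers) / speed("cuda", xformers)
    print(f"xformers={xformers}: triton is {ratio:.2f}x the cuda speed")
```

On these four runs the triton branch comes out at about 1.99x the cuda speed with xformers and about 1.87x without, while xformers itself changes the speed of either backend only slightly.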