AI
{| class="wikitable sortable"
{| class="wikitable sortable"
|+
|+
!Date
!Interface
!Interface
!Model
!Model
Line 96: Line 97:
!Speed
!Speed
|-
|-
|
|text-gen
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
Line 104: Line 106:
|7.97 tokens/s
|7.97 tokens/s
|-
|-
|
|text-gen
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
Line 112: Line 115:
|7.55 tokens/s
|7.55 tokens/s
|-
|-
|
|text-gen
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
Line 120: Line 124:
|4.03 tokens/s
|4.03 tokens/s
|-
|-
|
|text-gen
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
Line 128: Line 133:
|4.01 tokens/s
|4.01 tokens/s
|-
|-
|
|text-gen
|text-gen
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
Line 136: Line 142:
|0.67 tokens/s
|0.67 tokens/s
|-
|-
|
|text-gen
|text-gen
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
Line 144: Line 151:
|0.51 tokens/s
|0.51 tokens/s
|-
|-
|
|text-gen
|text-gen
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
Line 152: Line 160:
|0.68 tokens/s
|0.68 tokens/s
|-
|-
|
|text-gen
|text-gen
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
Line 160: Line 169:
|0.61 tokens/s
|0.61 tokens/s
|-
|-
|
|text-gen
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g'''-ggml'''
|anon8231489123-vicuna-13b-GPTQ-4bit-128g'''-ggml'''
Line 168: Line 178:
|1.17 tokens/s
|1.17 tokens/s
|-
|-
|
|text-gen
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
Line 176: Line 187:
|0.25 tokens/s
|0.25 tokens/s
|-
|-
|
|text-gen
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
Line 184: Line 196:
|9.63 tokens/s
|9.63 tokens/s
|-
|-
|
|'''llama.cpp'''
|'''llama.cpp'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
Line 192: Line 205:
|1096.21 ms per token
|1096.21 ms per token
|-
|-
|
|'''llama.cpp-modern-avx512'''
|'''llama.cpp-modern-avx512'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
Line 200: Line 214:
|1049.03 ms per token
|1049.03 ms per token
|-
|-
|
|'''llama.cpp-avx512-pr833'''
|'''llama.cpp-avx512-pr833'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
Line 208: Line 223:
|0.89 tokens/s, 820.68 ms per token
|0.89 tokens/s, 820.68 ms per token
|-
|-
|
|text-gen
|text-gen
|TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-'''q4_0'''
|TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-'''q4_0'''
Line 216: Line 232:
|0.78 tokens/s
|0.78 tokens/s
|-
|-
|
|text-gen+'''avx512-pr833'''
|text-gen+'''avx512-pr833'''
|TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0
|TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0
Line 223: Line 240:
|
|
|1.04 tokens/s
|1.04 tokens/s
|-
|2023-04-24
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|GPTQ-for-LLaMa-'''triton'''
|yes
|2TB SSD, 64gb, also running llama.cpp with another model
|16.36
|5.07 tokens/s
|}
|}

Revision as of 14:22, 23 April 2023

LLaMA

I think I made the Triton-GPTQ model work with the GPTQ-for-LLaMa (Triton branch) repository.

https://huggingface.co/tsumeone/llama-30b-supercot-4bit-128g-cuda

https://huggingface.co/ausboss/llama-13b-supercot-4bit-128g/tree/main

https://github.com/ggerganov/llama.cpp/pull/1044

https://huggingface.co/Melbourne/Alpacino-30b-ggml/tree/main

https://github.com/stochasticai/xturing/tree/main/examples/int4_finetuning

https://wiki.installgentoo.com/wiki/Home_server#Expanding_Your_Storage

https://rentry.org/llama-tard-v2

https://rentry.org/llamaaids

https://hackmd.io/@reneil1337/alpaca

https://boards.4channel.org/g/catalog#s=lmg%2F

https://find.4chan.org/?q=AI+Dynamic+Storytelling+General

https://find.4chan.org/?q=AI+Chatbot+General

https://find.4chan.org/?q=%2Flmg%2F (local models general)

https://boards.4channel.org/g/thread/92400764#p92400764


https://files.catbox.moe/lvefgy.json

https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/


python server.py --model llama-7b-4bit --wbits 4

python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128
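The --wbits and --groupsize flags are expected to match how the checkpoint was quantized; the "128g" suffix in the model folder name indicates group size 128, hence --groupsize 128 in the second command.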

https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/59 (workaround for an out-of-space error during installation)

https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode


https://github.com/pybind/pybind11/discussions/4566

https://lmsysvicuna.miraheze.org/wiki/How_to_use_Vicuna#Use_with_llama.cpp%3A

https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g


Here's the uncucked Vicuna model (trained on the dataset that doesn't have the moralistic bullshit anymore). Too bad it's just the CPU-quantized version.

Vicuna generating its own prompts


Should be worse than Q4_1 (which is QK=32), but there are several PRs in the works that should improve quantization accuracy in general (a rough Q4_0 vs. Q4_1 sketch follows the PR links below):

https://github.com/ggerganov/llama.cpp/pull/729

https://github.com/ggerganov/llama.cpp/pull/835

https://github.com/ggerganov/llama.cpp/pull/896
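
For context, a minimal Python sketch of the two 32-weight block formats involved (my own simplified illustration of ggml's Q4_0 and Q4_1 schemes as of April 2023, not code from the PRs above; the real kernels pack two nibbles per byte and differ in rounding details):

import numpy as np

QK = 32  # weights per quantization block

def quantize_q4_0(block):
    # Q4_0: one scale per block; stored nibbles q in [0, 15] reconstruct as d * (q - 8)
    amax = float(np.max(np.abs(block)))
    d = amax / 7.0 if amax > 0 else 1.0
    q = np.clip(np.round(block / d) + 8, 0, 15).astype(np.int32)
    return d, q

def quantize_q4_1(block):
    # Q4_1: scale plus block minimum; stored nibbles reconstruct as d * q + m
    lo, hi = float(block.min()), float(block.max())
    d = (hi - lo) / 15.0 if hi > lo else 1.0
    q = np.clip(np.round((block - lo) / d), 0, 15).astype(np.int32)
    return d, lo, q

def dequantize_q4_0(d, q):
    return d * (q.astype(np.float32) - 8)

def dequantize_q4_1(d, m, q):
    return d * q.astype(np.float32) + m

block = np.random.randn(QK).astype(np.float32)
err0 = np.abs(block - dequantize_q4_0(*quantize_q4_0(block))).mean()
err1 = np.abs(block - dequantize_q4_1(*quantize_q4_1(block))).mean()
print(f"mean abs error  q4_0: {err0:.4f}  q4_1: {err1:.4f}")

Q4_1's extra per-block minimum is why it usually reconstructs with lower error, at the cost of a slightly larger file.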

benchmark_threads.txt

https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g - python3 llama.py vicuna-AlekseyKorshuk-7B c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors vicuna-AlekseyKorshuk-7B-GPTQ-4bit-128g.safetensors
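As I understand GPTQ-for-LLaMa's options (not stated on the page above): --true-sequential quantizes the layers inside each transformer block one after another, --act-order processes weight columns in order of decreasing activation magnitude, and --groupsize 128 gives each group of 128 weights its own quantization parameters, which is where the "128g" in the filename comes from.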


≈65% speedup of the AVX-512 implementation of ggml_vec_dot_q4_0() #933
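
What that kernel computes, written out as a plain Python reference (an illustration of the block-wise arithmetic only; the real ggml_vec_dot_q4_0() is hand-written C with SIMD intrinsics, and the PR's speedup comes from vectorizing exactly this loop):

import numpy as np

QK = 32  # elements per q4_0 block

def vec_dot_q4_0_ref(dx, qx, dy, qy):
    # dx/dy: per-block scales; qx/qy: per-block arrays of 32 nibbles in [0, 15].
    # Each block contributes d_x * d_y * sum((q_x - 8) * (q_y - 8)),
    # so the hot loop is integer multiply-accumulate plus one float multiply per block.
    total = 0.0
    for d1, q1, d2, q2 in zip(dx, qx, dy, qy):
        total += d1 * d2 * float(np.dot(q1 - 8, q2 - 8))
    return total

rng = np.random.default_rng(0)
n_blocks = 4
dx, dy = rng.random(n_blocks), rng.random(n_blocks)
qx = rng.integers(0, 16, (n_blocks, QK))
qy = rng.integers(0, 16, (n_blocks, QK))

# cross-check against explicit dequantize-then-dot
reference = sum(np.dot(d1 * (q1 - 8), d2 * (q2 - 8)) for d1, q1, d2, q2 in zip(dx, qx, dy, qy))
assert np.isclose(vec_dot_q4_0_ref(dx, qx, dy, qy), reference)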

"Speaking of which, for any 30b anons struggling with context size, I figured something out. If you use the Triton branch on WSL, go into GPTQ_loader.py and comment out make_quant_attn like so" from here


Just grab the CUDA branch of qwop's GPTQ-for-LLaMa (or Triton if you want to be a dickhole), or if you have the webui installed, go into its GPTQ folder. Make sure all the requirements are installed and run this line:

python llama.py /path/to30b c4 --wbits 4 --true-sequential --groupsize 128 --save_safetensors alpacino-4bit-128g.safetensors

And it'll run. For Windows, obviously flip the slashes the right way. And for Linux, you may need to add CUDA_VISIBLE_DEVICES=0 to the front of the command.
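
For example, on Linux (same placeholder path as above):

CUDA_VISIBLE_DEVICES=0 python llama.py /path/to30b c4 --wbits 4 --true-sequential --groupsize 128 --save_safetensors alpacino-4bit-128g.safetensors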

Benchmarks

{| class="wikitable sortable"
!Date
!Interface
!Model
!GPTQ
!Xformers?
!HW
!Load
!Speed
|-
|
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|GPTQ-for-LLaMa-triton
|yes
|240gb SSD, 16gb, desktop off
|10.53
|7.97 tokens/s
|-
|
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|GPTQ-for-LLaMa-triton
|No xformers
|240gb SSD, 16gb, desktop off
|10.22s
|7.55 tokens/s
|-
|
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|GPTQ-for-LLaMa-cuda
|No xformers
|240gb SSD, 16gb, desktop off
|16.68s
|4.03 tokens/s
|-
|
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|GPTQ-for-LLaMa-cuda
|yes
|240gb SSD, 16gb, desktop off
|9.34s
|4.01 tokens/s
|-
|
|text-gen
|llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml
|no
|no
|2TB SSD, 64gb
|?
|0.67 tokens/s
|-
|
|text-gen
|llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml
|no
|no
|2TB SSD, 64gb, --threads 8
|maybe 30s?
|0.51 tokens/s
|-
|
|text-gen
|llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml
|no
|no
|2TB SSD, 64gb, --threads 7
|
|0.68 tokens/s
|-
|
|text-gen
|llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml
|no
|no
|2TB SSD, 64gb, --threads 6
|
|0.61 tokens/s
|-
|
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g-ggml
|no
|no
|2TB SSD, 64gb
|
|1.17 tokens/s
|-
|
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|GPTQ-for-LLaMa-triton
|yes
|2TB SSD, 64gb, --pre_layer 25
|45.69
|0.25 tokens/s
|-
|
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|GPTQ-for-LLaMa-triton
|yes
|2TB SSD, 64gb
|36.47
|9.63 tokens/s
|-
|
|llama.cpp
|llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml
|
|
|2TB SSD, 64gb
|10317.90 ms
|1096.21 ms per token
|-
|
|llama.cpp-modern-avx512
|llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml
|
|
|2TB SSD, 64gb
|9288.69 ms
|1049.03 ms per token
|-
|
|llama.cpp-avx512-pr833
|llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml
|
|
|2TB SSD, 64gb
|13864.06 ms
|0.89 tokens/s, 820.68 ms per token
|-
|
|text-gen
|TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0
|
|
|2TB SSD, 64gb
|
|0.78 tokens/s
|-
|
|text-gen+avx512-pr833
|TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0
|
|
|2TB SSD, 64gb
|
|1.04 tokens/s
|-
|2023-04-24
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|GPTQ-for-LLaMa-triton
|yes
|2TB SSD, 64gb, also running llama.cpp with another model
|16.36
|5.07 tokens/s
|}