AI
{| class="wikitable sortable"
{| class="wikitable sortable"
|+
|+
!Date
!Interface
!Interface
!Model
!Model
Line 96: Line 97:
!Speed
!Speed
|-
|-
|
|text-gen
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
Line 104: Line 106:
|7.97 tokens/s
|7.97 tokens/s
|-
|-
|
|text-gen
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
Line 112: Line 115:
|7.55 tokens/s
|7.55 tokens/s
|-
|-
|
|text-gen
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
Line 120: Line 124:
|4.03 tokens/s
|4.03 tokens/s
|-
|-
|
|text-gen
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
Line 128: Line 133:
|4.01 tokens/s
|4.01 tokens/s
|-
|-
|
|text-gen
|text-gen
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
Line 136: Line 142:
|0.67 tokens/s
|0.67 tokens/s
|-
|-
|
|text-gen
|text-gen
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
Line 144: Line 151:
|0.51 tokens/s
|0.51 tokens/s
|-
|-
|
|text-gen
|text-gen
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
Line 152: Line 160:
|0.68 tokens/s
|0.68 tokens/s
|-
|-
|
|text-gen
|text-gen
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
Line 160: Line 169:
|0.61 tokens/s
|0.61 tokens/s
|-
|-
|
|text-gen
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g'''-ggml'''
|anon8231489123-vicuna-13b-GPTQ-4bit-128g'''-ggml'''
Line 168: Line 178:
|1.17 tokens/s
|1.17 tokens/s
|-
|-
|
|text-gen
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
Line 176: Line 187:
|0.25 tokens/s
|0.25 tokens/s
|-
|-
|
|text-gen
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
Line 184: Line 196:
|9.63 tokens/s
|9.63 tokens/s
|-
|-
|
|'''llama.cpp'''
|'''llama.cpp'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
Line 192: Line 205:
|1096.21 ms per token
|1096.21 ms per token
|-
|-
|
|'''llama.cpp-modern-avx512'''
|'''llama.cpp-modern-avx512'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
Line 200: Line 214:
|1049.03 ms per token
|1049.03 ms per token
|-
|-
|
|'''llama.cpp-avx512-pr833'''
|'''llama.cpp-avx512-pr833'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
|llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''
Line 208: Line 223:
|0.89 tokens/s, 820.68 ms per token
|0.89 tokens/s, 820.68 ms per token
|-
|-
|
|text-gen
|text-gen
|TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-'''q4_0'''
|TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-'''q4_0'''
Line 216: Line 232:
|0.78 tokens/s
|0.78 tokens/s
|-
|-
|
|text-gen+'''avx512-pr833'''
|text-gen+'''avx512-pr833'''
|TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0
|TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0
Line 223: Line 240:
|
|
|1.04 tokens/s
|1.04 tokens/s
|-
|2023-04-24
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|GPTQ-for-LLaMa-'''triton'''
|yes
|2TB SSD, 64gb, also running llama.cpp with another model
|16.36
|5.07 tokens/s
|}
|}

Revision as of 14:22, 23 April 2023

LLaMA

I think I made the Triton-GPTQ model work with the GPTQ-for-LLaMa (Triton branch) repository.

https://huggingface.co/tsumeone/llama-30b-supercot-4bit-128g-cuda

https://huggingface.co/ausboss/llama-13b-supercot-4bit-128g/tree/main

https://github.com/ggerganov/llama.cpp/pull/1044

https://huggingface.co/Melbourne/Alpacino-30b-ggml/tree/main

https://github.com/stochasticai/xturing/tree/main/examples/int4_finetuning

https://wiki.installgentoo.com/wiki/Home_server#Expanding_Your_Storage

https://rentry.org/llama-tard-v2

https://rentry.org/llamaaids

https://hackmd.io/@reneil1337/alpaca

https://boards.4channel.org/g/catalog#s=lmg%2F

https://find.4chan.org/?q=AI+Dynamic+Storytelling+General

https://find.4chan.org/?q=AI+Chatbot+General

https://find.4chan.org/?q=%2Flmg%2F (local models general)

https://boards.4channel.org/g/thread/92400764#p92400764


https://files.catbox.moe/lvefgy.json

https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/


python server.py --model llama-7b-4bit --wbits 4

python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128
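The --wbits and --groupsize flags are expected to match how the checkpoint was quantized; the "128g" suffix in the model folder name indicates group size 128, hence --groupsize 128 in the second command.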

https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/59 (workaround for an out-of-space error during installation)

https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode


https://github.com/pybind/pybind11/discussions/4566

https://lmsysvicuna.miraheze.org/wiki/How_to_use_Vicuna#Use_with_llama.cpp%3A

https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g


Here's the uncucked Vicuna model (trained on the dataset that doesn't have the moralistic bullshit anymore). Too bad it's just the CPU-quantized version.

Vicuna generating its own prompts


Should be worse than Q4_1 (which is QK=32), but there are several PRs in the works that should improve quantization accuracy in general (a rough Q4_0 vs. Q4_1 sketch follows the PR links below):

https://github.com/ggerganov/llama.cpp/pull/729

https://github.com/ggerganov/llama.cpp/pull/835

https://github.com/ggerganov/llama.cpp/pull/896
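
For context, a minimal Python sketch of the two 32-weight block formats involved (my own simplified illustration of ggml's Q4_0 and Q4_1 schemes as of April 2023, not code from the PRs above; the real kernels pack two nibbles per byte and differ in rounding details):

import numpy as np

QK = 32  # weights per quantization block

def quantize_q4_0(block):
    # Q4_0: one scale per block; stored nibbles q in [0, 15] reconstruct as d * (q - 8)
    amax = float(np.max(np.abs(block)))
    d = amax / 7.0 if amax > 0 else 1.0
    q = np.clip(np.round(block / d) + 8, 0, 15).astype(np.int32)
    return d, q

def quantize_q4_1(block):
    # Q4_1: scale plus block minimum; stored nibbles reconstruct as d * q + m
    lo, hi = float(block.min()), float(block.max())
    d = (hi - lo) / 15.0 if hi > lo else 1.0
    q = np.clip(np.round((block - lo) / d), 0, 15).astype(np.int32)
    return d, lo, q

def dequantize_q4_0(d, q):
    return d * (q.astype(np.float32) - 8)

def dequantize_q4_1(d, m, q):
    return d * q.astype(np.float32) + m

block = np.random.randn(QK).astype(np.float32)
err0 = np.abs(block - dequantize_q4_0(*quantize_q4_0(block))).mean()
err1 = np.abs(block - dequantize_q4_1(*quantize_q4_1(block))).mean()
print(f"mean abs error  q4_0: {err0:.4f}  q4_1: {err1:.4f}")

Q4_1's extra per-block minimum is why it usually reconstructs with lower error, at the cost of a slightly larger file.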

benchmark_threads.txt

https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g - python3 llama.py vicuna-AlekseyKorshuk-7B c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors vicuna-AlekseyKorshuk-7B-GPTQ-4bit-128g.safetensors
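As I understand GPTQ-for-LLaMa's options (not stated on the page above): --true-sequential quantizes the layers inside each transformer block one after another, --act-order processes weight columns in order of decreasing activation magnitude, and --groupsize 128 gives each group of 128 weights its own quantization parameters, which is where the "128g" in the filename comes from.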


≈65% speedup of the AVX-512 implementation of ggml_vec_dot_q4_0() #933
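
What that kernel computes, written out as a plain Python reference (an illustration of the block-wise arithmetic only; the real ggml_vec_dot_q4_0() is hand-written C with SIMD intrinsics, and the PR's speedup comes from vectorizing exactly this loop):

import numpy as np

QK = 32  # elements per q4_0 block

def vec_dot_q4_0_ref(dx, qx, dy, qy):
    # dx/dy: per-block scales; qx/qy: per-block arrays of 32 nibbles in [0, 15].
    # Each block contributes d_x * d_y * sum((q_x - 8) * (q_y - 8)),
    # so the hot loop is integer multiply-accumulate plus one float multiply per block.
    total = 0.0
    for d1, q1, d2, q2 in zip(dx, qx, dy, qy):
        total += d1 * d2 * float(np.dot(q1 - 8, q2 - 8))
    return total

rng = np.random.default_rng(0)
n_blocks = 4
dx, dy = rng.random(n_blocks), rng.random(n_blocks)
qx = rng.integers(0, 16, (n_blocks, QK))
qy = rng.integers(0, 16, (n_blocks, QK))

# cross-check against explicit dequantize-then-dot
reference = sum(np.dot(d1 * (q1 - 8), d2 * (q2 - 8)) for d1, q1, d2, q2 in zip(dx, qx, dy, qy))
assert np.isclose(vec_dot_q4_0_ref(dx, qx, dy, qy), reference)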

"Speaking of which, for any 30b anons struggling with context size, I figured something out. If you use the Triton branch on WSL, go into GPTQ_loader.py and comment out make_quant_attn like so" from here


Just grab the CUDA branch of qwop's GPTQ-for-LLaMa (or Triton if you want to be a dickhole), or if you have the webui installed, go into its GPTQ folder. Make sure all the requirements are installed and run this line:

python llama.py /path/to30b c4 --wbits 4 --true-sequential --groupsize 128 --save_safetensors alpacino-4bit-128g.safetensors

And it'll run. For Windows, obviously flip the slashes the right way. And for Linux, you may need to add CUDA_VISIBLE_DEVICES=0 to the front of the command.
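
For example, on Linux (same placeholder path as above):

CUDA_VISIBLE_DEVICES=0 python llama.py /path/to30b c4 --wbits 4 --true-sequential --groupsize 128 --save_safetensors alpacino-4bit-128g.safetensors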

Benchmarks

{| class="wikitable sortable"
!Date
!Interface
!Model
!GPTQ
!Xformers?
!HW
!Load
!Speed
|-
|
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|GPTQ-for-LLaMa-triton
|yes
|240gb SSD, 16gb, desktop off
|10.53
|7.97 tokens/s
|-
|
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|GPTQ-for-LLaMa-triton
|No xformers
|240gb SSD, 16gb, desktop off
|10.22s
|7.55 tokens/s
|-
|
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|GPTQ-for-LLaMa-cuda
|No xformers
|240gb SSD, 16gb, desktop off
|16.68s
|4.03 tokens/s
|-
|
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|GPTQ-for-LLaMa-cuda
|yes
|240gb SSD, 16gb, desktop off
|9.34s
|4.01 tokens/s
|-
|
|text-gen
|llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml
|no
|no
|2TB SSD, 64gb
|?
|0.67 tokens/s
|-
|
|text-gen
|llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml
|no
|no
|2TB SSD, 64gb, --threads 8
|maybe 30s?
|0.51 tokens/s
|-
|
|text-gen
|llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml
|no
|no
|2TB SSD, 64gb, --threads 7
|
|0.68 tokens/s
|-
|
|text-gen
|llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml
|no
|no
|2TB SSD, 64gb, --threads 6
|
|0.61 tokens/s
|-
|
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g-ggml
|no
|no
|2TB SSD, 64gb
|
|1.17 tokens/s
|-
|
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|GPTQ-for-LLaMa-triton
|yes
|2TB SSD, 64gb, --pre_layer 25
|45.69
|0.25 tokens/s
|-
|
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|GPTQ-for-LLaMa-triton
|yes
|2TB SSD, 64gb
|36.47
|9.63 tokens/s
|-
|
|llama.cpp
|llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml
|
|
|2TB SSD, 64gb
|10317.90 ms
|1096.21 ms per token
|-
|
|llama.cpp-modern-avx512
|llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml
|
|
|2TB SSD, 64gb
|9288.69 ms
|1049.03 ms per token
|-
|
|llama.cpp-avx512-pr833
|llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml
|
|
|2TB SSD, 64gb
|13864.06 ms
|0.89 tokens/s, 820.68 ms per token
|-
|
|text-gen
|TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0
|
|
|2TB SSD, 64gb
|
|0.78 tokens/s
|-
|
|text-gen+avx512-pr833
|TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0
|
|
|2TB SSD, 64gb
|
|1.04 tokens/s
|-
|2023-04-24
|text-gen
|anon8231489123-vicuna-13b-GPTQ-4bit-128g
|GPTQ-for-LLaMa-triton
|yes
|2TB SSD, 64gb, also running llama.cpp with another model
|16.36
|5.07 tokens/s
|}