== Local Benchmarks ==
TODO: Try different cublas batch sizes

=== Main ===

==== Older ====
{| class="wikitable sortable"
!Date
!Interface
!Model
!GPTQ
!Xformers?
!HW
!Load
!Speed
|-
| || text-gen || anon8231489123-vicuna-13b-GPTQ-4bit-128g || GPTQ-for-LLaMa-'''triton''' || yes || 240gb SSD, 16gb, desktop off || 10.53s || 7.97 tokens/s
|-
| || text-gen || anon8231489123-vicuna-13b-GPTQ-4bit-128g || GPTQ-for-LLaMa-'''triton''' || No xformers || 240gb SSD, 16gb, desktop off || 10.22s || 7.55 tokens/s
|-
| || text-gen || anon8231489123-vicuna-13b-GPTQ-4bit-128g || GPTQ-for-LLaMa-'''cuda''' || No xformers || 240gb SSD, 16gb, desktop off || 16.68s || 4.03 tokens/s
|-
| || text-gen || anon8231489123-vicuna-13b-GPTQ-4bit-128g || GPTQ-for-LLaMa-'''cuda''' || yes || 240gb SSD, 16gb, desktop off || 9.34s || 4.01 tokens/s
|-
| || text-gen || llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml''' || no || no || 2TB SSD, 64gb || ? || 0.67 tokens/s
|-
| || text-gen || llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml''' || no || no || 2TB SSD, 64gb, '''--threads 8''' || maybe 30s? || 0.51 tokens/s
|-
| || text-gen || llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml''' || no || no || 2TB SSD, 64gb, '''--threads 7''' || || 0.68 tokens/s
|-
| || text-gen || llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml''' || no || no || 2TB SSD, 64gb, '''--threads 6''' || || 0.61 tokens/s
|-
| || text-gen || anon8231489123-vicuna-13b-GPTQ-4bit-128g'''-ggml''' || no || no || 2TB SSD, 64gb || || 1.17 tokens/s
|-
| || text-gen || anon8231489123-vicuna-13b-GPTQ-4bit-128g || GPTQ-for-LLaMa-'''triton''' || yes || 2TB SSD, 64gb, '''--pre_layer 25''' || 45.69s || 0.25 tokens/s
|-
| || text-gen || anon8231489123-vicuna-13b-GPTQ-4bit-128g || GPTQ-for-LLaMa-'''triton''' || yes || 2TB SSD, 64gb || 36.47s || 9.63 tokens/s
|-
| || '''llama.cpp''' || llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml''' || || || 2TB SSD, 64gb || 10317.90 ms || 1096.21 ms per token
|-
| || '''llama.cpp-modern-avx512''' || llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml''' || || || 2TB SSD, 64gb || 9288.69 ms || 1049.03 ms per token
|-
| || '''llama.cpp-avx512-pr833''' || llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml''' || || || 2TB SSD, 64gb || 13864.06 ms || 0.89 tokens/s, 820.68 ms per token
|-
| || text-gen || TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-'''q4_0''' || || || 2TB SSD, 64gb || || 0.78 tokens/s
|-
| || text-gen+'''avx512-pr833''' || TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0 || || || 2TB SSD, 64gb || || 1.04 tokens/s
|-
| 2023-04-24 || text-gen || anon8231489123-vicuna-13b-GPTQ-4bit-128g || GPTQ-for-LLaMa-'''triton''' || yes || 2TB SSD, 64gb, also running llama.cpp with another model || 16.36s || 5.07 tokens/s
|-
| 2023-04-26 || koboldcpp || gozfarb-llama-30b-supercot-ggml/ggml-model-q4_0.bin || '''clblast''' || n/a || 2TB SSD, 64gb, '''--threads 8''' || || 1073ms/T
|-
| 2023-04-29 || koboldcpp || Alpacino-30b-q4_0.bin || '''clblast''' || n/a || 2TB SSD, 64gb || || 700ms/T
|-
| 2023-07-13 || koboldcpp || llama-33b-supercot-ggml-'''q5_1''' (complains about '''old format''') || '''cublas''' || n/a || 2TB SSD, 64gb, '''--nommap --smartcontext --usecublas --gpulayers 18''' || || 643ms/T, 1.4T/s
|-
| 2023-07-13 || koboldcpp || llama-33b-supercot-ggml-'''q5_1''' (complains about '''old format''') || '''clblast''' || n/a || 2TB SSD, 64gb, --nommap --smartcontext '''--useclblast 0 0''' --gpulayers 18 || || 685ms/T, 1.2T/s
|-
| 2023-07-13 || koboldcpp || '''airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin''' || '''cublas''' || n/a || 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 18 '''(probably space for more)''' || || 652ms/T, 1.5T/s
|-
| 2023-07-13 || koboldcpp || airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin || cublas || n/a || 2TB SSD, 64gb, --nommap --smartcontext --usecublas '''--gpulayers 26 (I note 3 threads are set by default)''' || || 593ms/T, 1.6T/s
|-
| 2023-07-13 || koboldcpp || airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin || cublas || n/a || 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 26 '''--psutil_set_threads (4 threads)''' || || 514ms/T, 1.8T/s
|-
| 2023-07-13 || koboldcpp || airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin || cublas || n/a || 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --psutil_set_threads '''(removed nommap)''' || || 508ms/T, 1.9T/s
|-
| 2023-07-13 || koboldcpp || airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin || cublas || n/a || 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 '''--threads 5''' || || 454ms/T, 2.1T/s
|-
| 2023-07-13 || koboldcpp || airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin || cublas || n/a || 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 '''--threads 6''' || || '''422ms/T, 2.2T/s'''
|-
| 2023-07-13 || koboldcpp || airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin || cublas || n/a || 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 '''--threads 7''' || || 509ms/T, 1.8T/s
|-
| 2023-07-13 || koboldcpp || airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin || cublas || n/a || 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 '''--threads 8''' || || 494ms/T, 1.7T/s
|-
| 2023-07-13 || koboldcpp || airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin || cublas || n/a || 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 '''--threads 6 --linearrope (no difference, needs supercot?)''' || || 425ms/T, 2.2T/s
|-
| 2023-07-13 || koboldcpp || airoboros-33b-gpt4-'''1.4'''.ggmlv3.q4_K_M.bin || cublas || n/a || 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 '''--threads 6''' || || 400ms/T, 2.3T/s
|-
| 2023-07-13 || koboldcpp || airoboros-'''65b'''-gpt4-1.4.ggmlv3.q4_K_M.bin || cublas || n/a || 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 13 --threads 6 || || 1366ms/T, 0.7T/s
|-
| 2023-07-14 || koboldcpp || airoboros-65b-gpt4-1.4.ggmlv3.'''q2_K'''.bin || cublas || n/a || 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 13 --threads 6 || || 765ms/T, 1.2T/s
|-
| 2023-09-06 || koboldcpp || '''guanaco-33B.ggmlv3.q4_K_M.bin''' || cublas || n/a || 2TB SSD, 64gb, --stream --smartcontext --usecublas --gpulayers 29 --threads 6 || || 562ms/T, 1.3T/s
|-
| 2023-09-06 || koboldcpp || guanaco-33B.ggmlv3.q4_K_M.bin || cublas || n/a || 2TB SSD, 64gb, '''--nommap''' --stream --smartcontext --usecublas --gpulayers 29 --threads 6 || || 567ms/T, Total: 70.7s (1.4T/s)
|-
| 2023-09-06 || koboldcpp || guanaco-33B.ggmlv3.q4_K_M.bin || cublas || n/a || 2TB SSD, 64gb, '''--nommap''' --stream --smartcontext --usecublas --gpulayers 25 --threads 6 || || 563ms/T, Total: 70.2s (1.4T/s)
|-
| 2023-12-03 || koboldcpp || guanaco-33B.q4_K_M.'''gguf''' || cublas || n/a || 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 6 '''--gpulayers 27''' || || 330.7ms/T, Total: 40.79s (2.94T/s)
|-
| 2023-12-07 || koboldcpp || guanaco-33B.q4_K_M.gguf || cublas || n/a || '''7950x3d''', 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 6 --gpulayers 27 || || 202.1ms/T, 4.78T/s
|-
| 2023-12-07 || koboldcpp || guanaco-33B.q4_K_M.gguf || cublas || n/a || 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas '''--threads 32''' --gpulayers 27 || || 360.8ms/T, 2.68T/s
|-
| 2023-12-07 || koboldcpp || guanaco-33B.q4_K_M.gguf || cublas || n/a || 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas '''--threads 16''' --gpulayers 27 || || 202.6ms/T, 4.82T/s
|-
| 2023-12-07 || koboldcpp || guanaco-33B.q4_K_M.gguf || cublas || n/a || 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas '''--threads 15''' --gpulayers 27 || || 195.0ms/T, 5.03T/s
|-
| 2023-12-16 || koboldcpp || '''mistral-7b-instruct-v0.2.Q8_0.gguf''' || cublas || n/a || 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 15 '''--gpulayers 33''' || || 22.9ms/T, 42.90T/s
|-
| 2023-12-17 || koboldcpp || '''mixtral-8x7b-moe-rp-story.Q8_0.gguf''' || cublas || n/a || 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 15 '''--gpulayers 6''' || || 214.9ms/T, 4.47T/s
|-
| 2024-02-04 || SillyTavern || miqu 70b || || || gpu layers 9 || || 1.4T/s
|}

==== Old (2080ti) ====
{| class="wikitable sortable"
!Date
!Interface
!Model
!Ram
!VRam
!Parameters
!Speed
|-
| 2025-08-01 || ollama || qwen3-'''coder''':30b-a3b-q4_K_M || || || || 17.01 T/s
|-
| 2025-08-01 || ollama || hf.co/'''unsloth'''/Qwen3-'''Coder'''-30B-A3B-Instruct-GGUF:Q4_K_M || || || || 16.84 T/s
|-
| 2025-08-01 || ollama || hf.co/'''unsloth'''/Qwen3-30B-A3B-'''Thinking'''-2507-GGUF:Q4_K_M || || || || 14.39 T/s
|-
| 2025-08-01 || ollama || qwen3:30b-a3b-thinking-2507-'''q4_K_M''' || || || || 13.95 T/s
|-
| 2025-08-02 || ollama || Qwen3:30b-a3b-thinking-2507-'''q8_0''' || || || || 10.93 T/s
|-
| 2025-08-13 || '''llama.cpp-cuda''' || unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf || || || No gpu, --ctx-size 32684 || '''20.62 T/s'''
|-
| 2025-08-13 || llama.cpp-cuda || unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf || || 9287MiB / 11264MiB || -ngl 22 --ctx-size 12684 || '''31.54 T/s'''
|-
| 2025-08-13 || llama.cpp-cuda || unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q6_K_XL.gguf || || || -ngl 15 --ctx-size 12684 || '''21.51 T/s'''
|-
| 2025-08-14 || llama.cpp-cuda || unsloth_gemma-3-27b-it-qat-GGUF_gemma-3-27b-it-qat-Q4_K_M.gguf || || || default || like 3? idk
|-
| 2025-08-14 || llama.cpp-cuda || unsloth_gemma-3-27b-it-qat-GGUF_gemma-3-27b-it-qat-Q4_K_M.gguf || || || -fa --ctx-size 12684 || 2.76 T/s
|-
| 2025-08-14 || llama.cpp-cuda || unsloth_gemma-3-27b-it-qat-GGUF_gemma-3-27b-it-qat-Q4_K_M.gguf || || || -fa -ngl 25 --ctx-size 12684 || 4.87 T/s
|-
| 2025-08-14 || llama.cpp-cuda || unsloth_gemma-3-27b-it-qat-GGUF_gemma-3-27b-it-qat-Q4_K_M.gguf || || || -ngl 22 --ctx-size 12684 || 4.10 T/s
|-
| 2025-08-14 || llama.cpp-cuda || unsloth_gemma-3-27b-it-qat-GGUF_gemma-3-27b-it-qat-Q4_K_M.gguf || || || -ngl 22 --ctx-size 12684 '''-fa''' || 4.43 T/s
|-
| 2025-08-14 || llama.cpp-cuda || unsloth/'''gemma-3n-E4B'''-it-GGUF:UD-Q4_K_XL || || || -fa -ngl 99 --ctx-size 12684 || 23.14 T/s
|-
| 2025-08-14 || llama.cpp-cuda || unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF:UD-Q4_K_XL || || || --ctx-size 12684 --flash-attn --jinja --temp 0.15 --top-k -1 --top-p 1.00 -ngl 20 || 7.48 T/s
|-
| 2025-08-14 || llama.cpp-cuda || unsloth/Qwen3-4B-Instruct-2507-GGUF:UD-Q8_K_XL || || || --flash-attn -ngl 99 --jinja --ctx-size 12684 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 -c 22684 -n 32768 --no-context-shift || 48.69 T/s
|-
| 2025-08-14 || ollama || hf.co/bartowski/L3.3-MS-Nevoria-70b-GGUF:Q4_K_M || || || Defaults || 1.74 tokens/s
|-
| 2025-08-14 || llama.cpp-cuda || hf.co/bartowski/L3.3-MS-Nevoria-70b-GGUF:Q4_K_M || || || --flash-attn -ngl 15 --ctx-size 6000 || 1.82 T/s
|-
| 2025-08-14 || llama.cpp-cuda || hf.co/bartowski/L3.3-MS-Nevoria-70b-GGUF:Q4_K_M || || || --hf-repo-draft unsloth/Llama-3.2-1B-Instruct-GGUF:Q4_K_M --flash-attn -ngl 13 --ctx-size 6000 --gpu-layers-draft 99 || 2.98 T/s
|-
| 2025-08-14 || llama.cpp-cuda || unsloth_gemma-3-27b-it-qat-GGUF_gemma-3-27b-it-qat-Q4_K_M.gguf || || || +270m draft model || 7.19 T/s
|-
| 2025-08-16 || llama.cpp-cuda || qwen3-coder-30b:Q4_K_XL-GPU || || 8879MiB / 11264MiB || --flash-attn -ctk q4_0 -ctv q4_0 --jinja -ngl 22 --ctx-size 12684 || 33.77 T/s
|-
| 2025-08-16 || llama.cpp-cuda || qwen3-coder-30b:Q4_K_XL-GPU || || 8907MiB / 11264MiB || --flash-attn -ctk q8_0 -ctv q8_0 --jinja -ngl 22 --ctx-size 12684 || 35.71 T/s
|-
| 2025-08-16 || llama.cpp-cuda || qwen3-coder-30b:Q4_K_XL-GPU || || 10400MiB / 11264MiB || --flash-attn -ctk q4_0 -ctv q4_0 --jinja '''-ngl 26''' --ctx-size 12684 || 37.55 T/s
|-
| 2025-08-16 || llama.cpp-cuda || qwen3-coder-30b:Q4_K_XL-GPU || || 10371MiB / 11264MiB || --flash-attn -ctk q8_0 -ctv q8_0 --jinja -ngl 26 --ctx-size 12684 || 37.08 T/s
|-
| 2025-08-21 || llama.cpp-cuda || ggml-org/gpt-oss-20b-GGUF:mxfp4 || || 7907MiB / 11264MiB || --ctx-size 32768 --jinja -ub 2048 -b 2048 -ngl 99 -fa --n-cpu-moe 16 || 38.15 T/s
|-
| 2025-08-23 || llama.cpp-cuda || qwen3-coder-30b:'''Q6'''_K_XL-GPU || || 10511MiB / 11264MiB || --flash-attn -ctk q8_0 -ctv q8_0 --n-cpu-moe 33 -ngl 99 --ctx-size 12684 || 32.5 t/s
|-
| 2025-08-25 || llama.cpp-cuda || unsloth/gemma-3-12b-it-qat-GGUF:UD-Q4_K_XL || || 6030MiB / 11264MiB || --flash-attn -ngl 22 --ctx-size 12684 || 12.94 T/s
|-
| 2025-08-25 || llama.cpp-cuda || unsloth/gemma-3-12b-it-qat-GGUF:UD-Q4_K_XL || || 10174MiB / 11264MiB || --flash-attn -ngl 99 --ctx-size 12684 || 48.3 t/s
|-
| 2025-08-25 || llama.cpp-cuda || unsloth/gemma-3-12b-it-qat-GGUF:UD-Q4_K_XL || || 10674MiB / 11264MiB || --flash-attn -ngl 99 --ctx-size 12684 '''-ctk q8_0 -ctv q8_0''' || '''37.45 T/s'''
|-
| 2025-08-25 || llama.cpp-cuda || unsloth/gemma-3-12b-it-qat-GGUF:UD-Q4_K_XL || || 10728MiB / 11264MiB || --flash-attn -ngl 99 '''--ctx-size 21845''' || 49.20 T/s
|-
| 2025-08-25 || llama.cpp-cuda || QuantStack/InternVL3_5-30B-A3B-gguf:IQ4_XS 👁 || || 9220MiB / 11264MiB || ${KV_CACHE} --flash-attn --n-cpu-moe 30 -ngl 99 --ctx-size 12684 || 47.1 t/s
|-
| 2025-08-28 || llama.cpp-cuda || mradermacher_Dolphin-Mistral-24B-Venice-Edition-i1-GGUF_Dolphin-Mistral-24B-Venice-Edition.i1-Q4_K_M.gguf || || 9550MiB / 11264MiB || --ctx-size 8192 --flash-attn --jinja --temp 0.15 --top-p 0.95 --top-k 20 --min-p 0 -ngl 20 || 7.9 t/s
|-
| 2025-08-28 || llama.cpp-cuda || mradermacher_Dolphin-Mistral-24B-Venice-Edition-i1-GGUF_Dolphin-Mistral-24B-Venice-Edition.i1-Q4_K_M.gguf || || 9107MiB / 11264MiB || '''${KV_CACHE}''' --ctx-size 8192 --flash-attn --jinja --temp 0.15 --top-p 0.95 --top-k 20 --min-p 0 -ngl 20 || 7.8 t/s
|-
| 2025-08-28 || llama.cpp-cuda || mradermacher_Dolphin-Mistral-24B-Venice-Edition-i1-GGUF_Dolphin-Mistral-24B-Venice-Edition.i1-Q4_K_M.gguf || || 9766MiB / 11264MiB || ${KV_CACHE} --ctx-size 8192 --flash-attn --jinja --temp 0.15 --top-p 0.95 --top-k 20 --min-p 0 '''-ngl 22''' || 8.3 t/s
|-
| 2025-08-28 || llama.cpp-cuda || mradermacher_Dolphin-Mistral-24B-Venice-Edition-i1-GGUF_Dolphin-Mistral-24B-Venice-Edition.i1-Q4_K_M.gguf || || 10429MiB / 11264MiB || ${KV_CACHE} --ctx-size 8192 --flash-attn --jinja --temp 0.15 --top-p 0.95 --top-k 20 --min-p 0 '''-ngl 24''' || 8.9 t/s
|-
| 2025-08-28 || llama.cpp-cuda || mradermacher_Dolphin-Mistral-24B-Venice-Edition-i1-GGUF_Dolphin-Mistral-24B-Venice-Edition.i1-Q4_K_M.gguf || || 10542MiB / 11264MiB || '''--hf-repo-draft''' bartowski/alamios_Mistral-Small-3.1-DRAFT-0.5B-GGUF:Q4_K_M --jinja --temp 0.15 --top-p 0.95 --top-k 20 --min-p 0 --ctx-size 8192 ${KV_CACHE} --flash-attn '''-ngl 22''' --gpu-layers-draft 99 || 9.7 t/s
|-
| 2025-08-29 || llama.cpp-cuda || unsloth/GLM-4.5-Air-UD-Q2_K_XL.gguf || 8864MiB || 10253MiB / 11264MiB || ${KV_CACHE} --flash-attn --n-cpu-moe 42 -ngl 99 --ctx-size 8192 || 17.1 t/s
|-
| 2025-08-30 || llama.cpp-cuda || unsloth/gpt-oss-20b:f16 || 8293M || 7876MiB / 11264MiB || --ctx-size 32768 -ub 2048 -b 2048 -ngl 99 --n-cpu-moe 16 ${KV_CACHE} || 39.4 t/s
|-
| 2025-08-30 || llama.cpp-cuda || unsloth/gpt-oss-20b:f16 || 13.6G || 10320MiB / 11264MiB || --ctx-size 32768 -ub 2048 -b 2048 -ngl 99 --n-cpu-moe 12 ${KV_CACHE} || 42.8 t/s
|-
| 2025-08-30 || llama.cpp-cuda || unsloth/gpt-oss-'''120b''':f16 || 33.6G || 9411MiB / 11264MiB || --ctx-size 32768 -ub 2048 -b 2048 ${KV_CACHE} || 6.0 t/s
|-
| 2025-08-30 || llama.cpp-cuda || unsloth/Seed-OSS-36B-Instruct-UD-Q4_K_XL || 14.5G || 10327MiB / 11264MiB || ${KV_CACHE} --flash-attn --jinja --n-cpu-moe 40 -ngl 24 --ctx-size 12684 || 4.2 t/s
|-
| 2025-09-01 || llama.cpp-cuda || unsloth/gpt-oss-'''120b''':f16 || 44.4G || 9541MiB / 11264MiB || ${'''UNSLOTH_GPT_OSS_PARAMS'''} (k_top 100 not 0) --jinja --ctx-size '''12768''' -ub 2048 -b 2048 -ngl 99 --n-cpu-moe 34 --flash-attn ${KV_CACHE_Q4} || 11.7 t/s
|-
| 2025-09-01 || llama.cpp-cuda || unsloth/gpt-oss-120b:f16 || 45.9G || 9576MiB / 11264MiB || ${'''UNSLOTH_GPT_OSS_PARAMS'''} (k_top 100 not 0) --jinja --ctx-size '''22768''' -ub 2048 -b 2048 -ngl 99 --n-cpu-moe 34 --flash-attn ${KV_CACHE_Q4} || 12.9 t/s
|-
| 2025-09-05 || llama.cpp-cuda || '''Hermes-4-14B-GGUF:Q4_K_M''' || 853.63 MiB || 10625MiB / 11264MiB || ${KV_CACHE} --flash-attn on --jinja -ngl 99 --ctx-size 12684 || '''42.7 t/s'''
|-
| 2025-09-10 || llama.cpp-cuda || gabriellarson/'''ERNIE-4.5-21B-A3B-Thinking-GGUF:Q4_K_M''' || || '''2742MiB''' / 11264MiB || ${KV_CACHE} --flash-attn on --n-cpu-moe 35 -ngl 99 --jinja --ctx-size 12684 || '''31.1 t/s'''
|-
| 2025-09-10 || llama.cpp-cuda || gabriellarson/ERNIE-4.5-21B-A3B-Thinking-GGUF:Q4_K_M || || 10759MiB / 11264MiB || ${KV_CACHE} --flash-attn on --n-cpu-moe '''10''' -ngl 99 --jinja --ctx-size 12684 || '''49.0 t/s'''
|-
| 2025-09-10 || llama.cpp-cuda || gabriellarson/ERNIE-4.5-21B-A3B-Thinking-GGUF:Q8_0 || 15.7G || 9408MiB / 11264MiB || ${KV_CACHE} --flash-attn on --n-cpu-moe '''20''' -ngl 99 --jinja --ctx-size 12684 || '''28.5 t/s'''
|-
| 2025-09-18 || llama.cpp-cuda || unsloth_Magistral-Small-2509-GGUF_Magistral-Small-2509-UD-Q4_K_XL.gguf 👁 || 8900M || 10237MiB / 11264MiB || --special --ctx-size 12684 --flash-attn on -ngl 20 --jinja --temp 0.7 --top-k -1 --top-p 0.95 || 7.9 t/s
|-
| 2025-09-18 || llama.cpp-cuda || glm-4.5-air:Q3_K_XL || 42.6G || 8881MiB / 11264MiB || ${KV_CACHE} --flash-attn on --jinja --n-cpu-moe 44 -ngl 99 --ctx-size 8192 || 7.6 t/s
|-
| 2025-09-26 || llama.cpp-cuda || qwen3-coder-30b:Q4_K_XL-GPU || 10.5G || 8632MiB || ${KV_CACHE} --flash-attn on --jinja --n-cpu-moe 30 -ngl 99 --ctx-size 25000 || 49.33 t/s
|-
| 2025-10-03 || llama.cpp-cuda || granite-4.0-h-small:Q4_K_XL ('''32B-9B''') || 11.4G || 9324MiB || ${KV_CACHE} --flash-attn on --n-cpu-moe 25 -ngl 99 --ctx-size 12684 || 25.23 t/s
|-
| 2025-10-03 || llama.cpp-cuda || granite-4.0-h-tiny:Q4_K_XL ('''7B-1B''') || 0 || 5487MiB || ${KV_CACHE} --flash-attn on --n-cpu-moe 0 -ngl 99 --ctx-size 16384 || 112.28 t/s
|-
| 2025-10-03 || llama.cpp-cuda || granite-4.0-h-micro:Q4_K_XL ('''3B''') || 0 || 3431MiB || ${KV_CACHE} --flash-attn on --n-cpu-moe 0 -ngl 99 --ctx-size 16384 || 108.58 t/s
|-
| 2025-10-31 || llama.cpp-cuda || unsloth/Qwen3-VL-30B-A3B-Thinking-UD-Q4_K_XL.gguf + unsloth/Qwen3-VL-30B-A3B-Thinking-GGUF-mmproj-F16.gguf || 10.6G || 10855MiB || ${KV_CACHE} --flash-attn on --ctx-size 12684 --n-cpu-moe 30 -ngl 99 || 42.42 t/s
|-
| 2025-11-06 || llama.cpp-cuda || Face314/Qwen3-Coder-30B-A3B-Instruct-'''MXFP4_MOE'''.gguf || 10582MiB || 8542MiB || ${KV_CACHE} --flash-attn on --n-cpu-moe 30 -ngl 99 --ctx-size 25000 || 42.72 tokens/s
|-
| 2025-11-25 || llama.cpp-cuda || noctrex/Qwen3-VL-30B-A3B-Instruct-1M-'''MXFP4_MOE'''.gguf + unsloth/Qwen3-VL-30B-A3B-Thinking-GGUF-mmproj-F16.gguf || 10.3G || 9328MiB || ${KV_CACHE} --flash-attn on --ctx-size 12684 --n-cpu-moe 30 -ngl 99 || 43.72 tokens/s
|-
| 2025-11-27 || llama.cpp-cuda || unsloth/glm-4.5-air:Q3_K_XL || 46.2G || 7632MiB || ${KV_CACHE_Q4} --flash-attn on --jinja --n-cpu-moe 44 -ngl 99 --ctx-size 12000 || 13.0 tokens/s
|-
| 2025-11-27 || llama.cpp-cuda || mradermacher/GLM-4.5-Air-Derestricted.i1-IQ4_XS.gguf || 51G || 7226MiB || ${KV_CACHE_Q4} --flash-attn on --jinja --n-cpu-moe 44 -ngl 99 --ctx-size 12000 || '''13.50''' tokens/s
|-
| 2025-11-27 || llama.cpp-cuda || agiccodingman/Qwen3-VL-'''32B'''-Thinking-unsloth-MXFP4_MOE-Q6_K.gguf (no vision or draft) || 24.6G || 9644MiB || ${KV_CACHE} --flash-attn on --ctx-size 8192 -ngl 17 # 502MiB per-layer || '''2.4''' tokens/s
|-
| 2025-11-29 || llama.cpp-cuda || unsloth/'''Qwen3-Next-80B-A3B'''-Instruct-UD-Q4_K_XL.gguf || 41.3G || 9358MiB || ${KV_CACHE} --flash-attn on --jinja --ctx-size 12684 --n-cpu-moe '''40''' -ngl 99 || '''22.78''' tokens/s
|}
deepseek-r1-distill-qwen-32b - 4.20 tok/sec
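Many rows above are repeated runs of the same model with different `-ngl` values, hunting for the largest offload that fits the card (11264MiB total), and one row notes roughly 502MiB per layer for a Q6_K 32B model. A minimal sketch of that arithmetic, assuming a hypothetical helper name and an assumed fixed reservation for KV cache and CUDA scratch buffers:

```python
def max_gpu_layers(total_vram_mib: int, per_layer_mib: int,
                   reserved_mib: int = 1500) -> int:
    """Estimate the largest -ngl value likely to fit in VRAM.

    reserved_mib is an assumed budget for KV cache, CUDA context and
    scratch buffers -- tune it against real nvidia-smi readings.
    """
    usable_mib = total_vram_mib - reserved_mib
    return max(0, usable_mib // per_layer_mib)

# With the ~502MiB/layer noted in the Qwen3-VL-32B Q6_K row on an
# 11264MiB card, this estimates 19 layers; the table settled on -ngl 17,
# presumably leaving extra headroom for context.
print(max_gpu_layers(11264, 502))  # -> 19
```

In practice the per-layer cost varies with quantization and context size, so this only narrows the binary search that the table rows perform by hand.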
==== Newer (5080) ====
{| class="wikitable sortable"
!Date
!Interface
!Model
!Ram
!VRam
!Parameters
!Speed
|-
| 2025-12-01 || llama.cpp-cuda || unsloth/glm-4.5-air:Q3_K_XL || 49.5G || 7768MiB || ${KV_CACHE} --flash-attn on --jinja --n-cpu-moe 44 -ngl 99 --ctx-size 8192 || 16.34 tokens/s
|-
| 2025-12-01 || llama.cpp-cuda || unsloth/glm-4.5-air:Q3_K_XL || 49.5G || 14302MiB || ${KV_CACHE} --flash-attn on --jinja --n-cpu-moe '''38''' -ngl 99 --ctx-size 8192 || '''18.20 tokens/s'''
|-
| 2025-12-05 || llama.cpp-cuda || Qwen3-Next-80B-A3B-Thinking || 41.0G || 9528MiB || ${KV_CACHE} --flash-attn on --jinja --ctx-size 12684 --n-cpu-moe '''40''' -ngl 99 || 23.14 tokens/s
|-
| 2025-12-05 || llama.cpp-cuda || Qwen3-Next-80B-A3B-Thinking || 41.0G || 9528MiB || ${KV_CACHE} --flash-attn on --jinja --ctx-size 12684 --n-cpu-moe '''34''' -ngl 99 || 23.75 tokens/s
|-
| 2025-12-05 || llama.cpp-cuda ('''updated to 7278''' with Qwen3-Next fixes) || Qwen3-Next-80B-A3B-Thinking || 41.0G || 14628MiB || ${KV_CACHE} --flash-attn on --jinja --ctx-size 12684 --n-cpu-moe 34 -ngl 99 || '''26.40 tokens/s'''
|}

=== Hextor ===
https://artificialanalysis.ai/models/comparisons/qwen3-30b-a3b-2507-vs-gemma-3-27b

https://artificialanalysis.ai/models/comparisons/qwen3-30b-a3b-2507-vs-deepseek-r1-0120

[https://arxiv.org/html/2507.20984v2#S5 Smallthinker comparison, compares gemma3, older qwen3]

{| class="wikitable"
!Date
!Engine
!Model
!Thinking
!Notes
!Total Time
!Eval Tokens
!Prompt (tokens/s)
!Eval (tokens/s)
|-
| 2025-08-01 || ollama || gemma3:1B || ❌ || || 20s || 573 || 91.61 || 28.04
|-
| 2025-08-01 || ollama || mashriram/gemma3nTools:e4b || ❌ || || 1m12s || 537 || 22.12 || 7.95
|-
| 2025-08-01 || ollama || gemma3:12b-it-qat || ❌ || || 4m2s || 716 || 9.79 || 2.99
|-
| 2025-08-01 || ollama || gemma3:27b-it-qat || ❌ || || 9m11s || 778 || 4.16 || 1.43
|-
| 2025-08-01 || ollama || deepseek-r1:8b || ✅ || || 6m9s || 1807 || 13.53 || 4.90
|-
| 2025-08-01 || ollama || qwen3:30b-a3b-instruct-2507-q4_K_M || ❌ || || 2m10s || 1454 || 26.60 || 11.17
|-
| 2025-08-01 || ollama || hf.co/unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_M || ❌ || || 56s || 623 || 21.37 || 11.26
|-
| 2025-08-01 || ollama || hf.co/unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF:Q4_K_M || ✅ || || 1m50s || 1203 || 25.88 || 10.87
|-
| 2025-08-01 || ollama || hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M || ❌ || || 36s || 504 || 136.43 || 11.33
|-
| 2025-08-01 || ollama || qwq:latest || ✅ || || 14m17s || 1111 || 3.21 || 1.30
|-
| 2025-08-01 || ollama || hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL || ✅ || || 6m9s || 1860 || 12.93 || 5.05
|-
| 2025-08-13 || llama.cpp || unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf || ❌ || || 30903.17 ms || 357 || 30.27 || 11.70
|-
| 2025-08-15 || llama.cpp || unsloth_Qwen3-30B-A3B-Instruct-2507-GGUF_Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf || ❌ || || 14703.99 ms || 177 || 30.83 || 12.04
|-
| 2025-08-15 || llama.cpp || unsloth_Qwen3-4B-Instruct-2507-GGUF_Qwen3-4B-Instruct-2507-UD-Q8_K_XL.gguf || ❌ || || 28738.32 ms || 147 || 28.28 || 5.22
|-
| 2025-08-15 || llama.cpp || unsloth_Qwen3-4B-Instruct-2507-GGUF_Qwen3-4B-Instruct-2507-UD-'''Q4'''_K_XL.gguf || ❌ || || 8580.48 ms || 79 || 31.71 || 9.82
|-
| 2025-08-16 || llama.cpp || unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf || ❌ || kv_cache q8 || || || || 12.47, 11.73
|-
| 2025-08-16 || llama.cpp || unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf || ❌ || kv_cache q4 || || || || 11.89, 11.94
|-
| 2025-08-30 || llama.cpp || unsloth/gpt-oss-20b-F16.gguf || ❌ || no fa? || || || || 11.7 t/s
|-
| 2025-10-04 || llama.cpp || unsloth/granite-4.0-h-small:Q4_K_XL || ❌ || || || || || 4.17 tokens/s
|}
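The tables above mix two unit styles for generation speed: koboldcpp-era rows report milliseconds per token (ms/T) while later rows report tokens per second (T/s). The two are simple reciprocals, which makes cross-table comparison easy; a minimal conversion sketch (hypothetical function names):

```python
def ms_per_token_to_tps(ms_per_token: float) -> float:
    """Convert a koboldcpp-style ms/T figure to tokens per second."""
    return 1000.0 / ms_per_token

def tps_to_ms_per_token(tokens_per_sec: float) -> float:
    """Inverse conversion: tokens/s back to milliseconds per token."""
    return 1000.0 / tokens_per_sec

# e.g. the best airoboros-33b koboldcpp row, 422ms/T, works out to
# about 2.37 T/s, consistent with the logged "2.2T/s" (which also
# includes prompt-processing overhead).
print(round(ms_per_token_to_tps(422), 2))  # -> 2.37
```

This also explains why a row's two figures rarely agree exactly: the ms/T value measures generation alone, while the T/s value typically divides total tokens by wall-clock time including prompt processing.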