AI

Revision as of 00:25, 27 September 2025

See also: AI Art

New Shit

https://www.librechat.ai/ https://github.com/danny-avila/LibreChat - Replace open-webui

geechan - SillyTavern GLM System prompts

https://github.com/github/spec-kit https://www.youtube.com/watch?v=em3vIT9aUsg

https://joeyagreco.medium.com/reverse-engineering-the-hottest-new-game-5362cfe7c452

https://blog.plasticlabs.ai/blog/YouSim%3B-Explore-The-Multiverse-of-Identity?utm_source=chatgpt.com

https://worldsim.nousresearch.com/console

残心 / Zanshin - Navigate through media by speaker

LLM from Scratch Tutorial – Code & Train Qwen 3

miniflux-ai

glm_45_preset

simstudioai/sim Install Sim Locally with Ollama: AI Agent Workflow Builder

sdxl-emoji

gabber

RamaLama - Ollama alternative

qwen-code-cli-wrapper

Docling, DOTS OCR, and Ollama OCR

https://agents.md/

awesome-comfyui http rest
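
For driving ComfyUI over HTTP, a minimal sketch (assuming a default local instance on port 8188 and a workflow exported with "Save (API Format)" to workflow_api.json):

curl -X POST http://127.0.0.1:8188/prompt \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": $(cat workflow_api.json)}"
curl http://127.0.0.1:8188/history   # check queued/finished jobs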

https://github.com/sakalond/StableGen - generate textures for blender

what_upscaler_is_the_best_now/

You can use `-ngl 49` and just pass `--n-cpu-moe 20`. Also add `-fa` and `-ctk q8_0 -ctv q8_0`.
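
Putting those flags together, a sketch of a full llama-server invocation (the model file and layer counts are illustrative; tune -ngl and --n-cpu-moe to your VRAM):

llama-server -m Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf \
  -ngl 49 --n-cpu-moe 20 \
  -fa -ctk q8_0 -ctv q8_0 \
  --ctx-size 12684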

PIP_BREAK_SYSTEM_PACKAGES=1 comfy install

https://overpass-turbo.eu/

Qwen-3 Coder CLI Forgets Everything. I Gave It a Perfect Memory.

https://docs.vllm.ai/en/latest/getting_started/installation/cpu.html#related-runtime-environment-variables

https://modal.com/blog/fast-cheap-batch-transcription

Make your AI Agents 10x Smarter with GraphRAG (n8n)

https://huggingface.co/rednote-hilab/dots.ocr

YOLOE: Next Gen Computer Vision - Zero Training Required!

Cipher - an open-source memory layer designed specifically for coding agents.

https://smcleod.net/2024/12/bringing-k/v-context-quantisation-to-ollama/
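
Per that post, K/V cache quantisation in Ollama is toggled with environment variables (flash attention must be enabled for it to apply):

OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve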

Local LightRAG: A GraphRAG Alternative but Fully Local with Ollama

Graph RAG Evolved: PathRAG (Relational Reasoning Paths)

UIGEN-X-4B-0729

The Only Embedding Model You Need for RAG

https://ollama.com/library/smallthinker - Can be used as a draft model for QwQ-32B, giving a 70% speed-up.
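
Ollama itself doesn't expose speculative decoding, so a llama-server sketch of the same pairing (filenames hypothetical; --gpu-layers-draft as used in the benchmark tables below):

llama-server -m QwQ-32B-Q4_K_M.gguf \
  -md SmallThinker-3B-Q8_0.gguf \
  -ngl 40 --gpu-layers-draft 99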

sqrt(params * active) - A rule of thumb to calculate the equivalent number of parameters that a dense model would have.
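
For example, a 30B-A3B MoE (30B total, 3B active) comes out to sqrt(30 × 3) ≈ 9.5, i.e. roughly a 9.5B dense model.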

Direct3D‑S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention

Make RAG 100x Better with Real-Time Knowledge Graphs

dots1

mikupad

GPT‐SoVITS‐features (features of each version)

Are there any free working voice cloning AIs?

privategpt - IMHO the best for RAG if you need the source: it not only lists the PDF used for the answer but also the page, and is quite precise. For study and searching a library, it's the best I know.

FLUX Model Quantization Challenge

dont_offload_gguf_layers_offload_tensors_200_gen
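
The idea from that thread: rather than offloading whole layers, pin only the large MoE expert tensors to CPU and send everything else to the GPU with llama.cpp's --override-tensor; --n-cpu-moe (used elsewhere on this page) is the newer shorthand for the same trick. A rough sketch with a hypothetical model file:

llama-server -m some-moe-model.gguf -ngl 99 \
  -ot "\.ffn_.*_exps\.=CPU"   # keep expert tensors in system RAM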

I Built the Ultimate RAG MCP Server for AI Coding (Better than Context7)

NEW FramePack F1 Model - Much Better Results - Bonus How to Install Sage

Qwen3-30B-A6B-16-Extreme

https://docs.google.com/document/d/12ATcyjCEKh8T-MPDZ-VMiQ1XMa9FUvvk2QazrsKoiR8/edit?tab=t.0

This AI Model has me excited about the future of Local LLM's | Qwen3-30B-A3B

https://www.reddit.com/r/LocalLLaMA/comments/1ev8n2s/exclude_top_choices_xtc_a_sampler_that_boosts/

https://blog.runpod.io/upscaling-videos-using/

koboldcpp wiki

Mixtral for Retards


koboldcpp.exe 13B-HyperMantis.ggmlv3.q4_K_M.bin --debug --usecublas --usemlock --contextsize 8192 --blasbatchsize 512 --psutil_set_threads --threads 6 --blasthreads 10 --gpulayers 5 --highpriority --stream --unbantokens --smartcontext

If you're running 1.35 and a superHOT model, you should also add --linearrope which should make them perform better.

RP recommended models: "superhot, airoboros, wizard-vicuna, guanaco, chronos are a few commonly discussed models off the top of my head. For me, it's superhot or guanaco (one or the other, not the merge though)"

Programs

UI

jan.ai

open webui

CLI

ngpt

mcp-client-cli

mcptools - For inspecting mcp servers.

Client Name | Description | Key Features | Implementation | URL
oterm | A text-based terminal client for Ollama with MCP tools, prompts, and sampling. | Supports MCP tools, prompts, sampling; Streamable HTTP & WebSocket transports. | TUI (Terminal UI) | GitHub
ollama-mcp-client | Python-based client for integrating local Ollama models with MCP servers. | Seamless MCP integration, Git operations support, tool discovery. | Python CLI | GitHub
mcp-client-for-ollama | TUI client for interacting with MCP servers using Ollama, offering interactivity. | Multi-server support, streaming responses, fuzzy autocomplete. | TUI (Terminal UI) | GitHub
Mcp-cli | General-purpose CLI for interacting with MCP servers, supporting Ollama. | Supports multiple providers, modular chat, context-aware completions. | Command-line | Source
Mcp Client Ollama | Python-based CLI for connecting Ollama to MCP servers, focusing on tool execution. | stdio and SSE transports, JSON configuration, multiple server support. | Python CLI | Source
GUI

Tasksel

Not sure if this is common knowledge, but some advice to all fellow VRAMlets who are offloading to RAM: setting the number of threads is not good enough; you can get extra speed by manually setting core affinity.

For context:

I have a 13600K which has 6 P-cores. I had read that you should set --threads to that number, so I would run koboldcpp with --threads 6 and from some testing this was indeed the best option with that argument alone.

BUT, I looked at which cores were actually used and found e-cores also being used sometimes.

So the next step was to set the core affinity to just P-cores. Each P-core has two threads and CPU0-11 was P-cores, CPU12-19 was E-cores. Thus, I ran koboldcpp with one thread from each core:

taskset -c 0,2,4,6,8,10 python koboldcpp.py [args]

My speed running command-r went from ~2.3 T/s to 2.67 T/s. Pretty good. But what if I use them fully, I thought. So I set --threads 12 and taskset -c 0,1,2,3,4,5,6,7,8,9,10,11, and I get a generation at 3.09 T/s. That's a whopping 33% increase from my initial speed. Hope this is helpful.
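
To find which logical CPUs belong to which physical cores on your own machine (core layout varies by CPU, so treat the numbers above as one example):

lscpu --all --extended   # the CPU and CORE columns show which logical CPUs share a physical core
taskset -c 0-11 python koboldcpp.py --threads 12 [args]   # pin to both threads of the six P-cores, per the post above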

Benchmarks

LMArena

MCPMark

https://scale.com/leaderboard

https://artificialanalysis.ai/

https://livebench.ai/#/

https://eqbench.com/creative_writing.html

terminal-bench

BFCL: From Tool Use to Agentic Evaluation of Large Language Models

MTEB Leaderboard - Embedding models

Open ASR Leaderboard

📢UGI-Leaderboard - Uncensored General Intelligence

Artificial Analysis Long Context Reasoning Benchmark Leaderboard

AI Model Evaluations

Vision Benchmarks

https://huggingface.co/spaces/opencompass/open_vlm_leaderboard

https://dubesor.de/visionbench

MCP Servers

Context7

awesome-mcp-clients/

https://smithery.ai/

https://mcp.so/servers

mcptools

https://www.reddit.com/r/RooCode/comments/1ijgk2x/roo_code_mcps_best_mcp_configs/mbej58g/

Prompts

https://cookbook.openai.com/examples/enhance_your_prompts_with_meta_prompting

https://www.prompthub.us/blog/a-complete-guide-to-meta-prompting

rubric prompting

https://generateprompt.ai/en/

awesome-chatgpt-prompts

https://alidocs.dingtalk.com/i/nodes/EpGBa2Lm8aZxe5myC99MelA2WgN7R35y

https://github.com/elder-plinius/L1B3RT4S

Video

ComfyUI - Wan 2.2 & FFLF with Flux Kontext for Quick Keyframes for Video

🤿 One-Step Video Upscaling: Complete ComfyUI SeedVR2 Guide (Free workflow included) | AInVFX July 11

Upscaling Maximizing VRAM | Free ComfyUI Workflow!

Easy Creation with One Click - AI Videos - Wan cheatsheet.

TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation

Voice

Step-Audio-2-mini - an 8 billion parameter (8B) speech-to-speech model; reportedly outperforms GPT-4o-Audio.

There are other TTS models that are much better, like Spark-TTS and Higgs-TTS. Keep in mind that the full Higgs TTS model with voice cloning needs 18GB of VRAM and is much slower than Spark-TTS.

https://www.reddit.com/r/speechtech/

marvis-tts

https://github.com/nuvious/coqui-ai-api

Microsoft VibeVoice TTS LOCAL Testing – A Multi-Speaker Podcast TTS!

Dia, Fish Speech

Agentic AI Dungeon

n8n - Web tool for AI agents

chroma - Vector database

pgvector

lmstudio

LLaMA

/lmg/

Models

  • Qwen3-Next
  • Ring-Flash-2.0 - https://huggingface.co/inclusionAI/Ring-flash-2.0-GGUF

Misc

llama.cpp server?

"There was a tokenizer caching error, some people said. Redownload the hf_output files from the repo or just change the use_cache line in the config.json to say: "use_cache": true" - for the Vicuna13B-free. https://github.com/stochasticai/xturing/tree/main/examples/int4_finetuning

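One way to apply that config.json fix from the shell (assuming jq is installed):

jq '.use_cache = true' config.json > config.json.tmp && mv config.json.tmp config.json
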
https://wiki.installgentoo.com/wiki/Home_server#Expanding_Your_Storage

https://rentry.org/llama-tard-v2

https://rentry.org/llamaaids

https://hackmd.io/@reneil1337/alpaca


https://find.4chan.org/?q=AI+Dynamic+Storytelling+General

https://find.4chan.org/?q=AI+Chatbot+General

https://find.4chan.org/?q=%2Flmg%2F (local models general)

https://boards.4channel.org/g/thread/92400764#p92400764


https://files.catbox.moe/lvefgy.json

https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/


python server.py --model llama-7b-4bit --wbits 4

python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128

https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/59 - for fixing an out-of-space error during install

https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode


https://github.com/pybind/pybind11/discussions/4566

https://lmsysvicuna.miraheze.org/wiki/How_to_use_Vicuna#Use_with_llama.cpp%3A

https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g


Here's the uncucked Vicuna model (trained on the dataset that doesn't have the moralistic bullshit anymore). Too bad it's just the CPU quantized version.

Vicuna generating its own prompts

https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g - python3 llama.py vicuna-AlekseyKorshuk-7B c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors vicuna-AlekseyKorshuk-7B-GPTQ-4bit-128g.safetensors


≈65% speedup of the AVX-512 implementation of ggml_vec_dot_q4_0() #933

"Speaking of which, for any 30b anons struggling with context size, I figured something out. If you use the Triton branch on WSL, go into GPTQ_loader.py and comment out make_quant_attn like so" from here


Just grab the CUDA branch of qwop's GPTQ for LLaMA (or Triton if you want to be a dickhole) or if you have webui installed, go into the folder for GPTQ. Make sure all the requirements are installed and run this line:

python llama.py /path/to30b c4 --wbits 4 --true-sequential --groupsize 128 --save_safetensors alpacino-4bit-128g.safetensors

And it'll run. For Windows, obviously flip the slashes the right way. And for Linux, you may need to add CUDA_VISIBLE_DEVICES=0 to the front of the command.
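
Putting that together for Linux:

CUDA_VISIBLE_DEVICES=0 python llama.py /path/to30b c4 --wbits 4 --true-sequential --groupsize 128 --save_safetensors alpacino-4bit-128g.safetensors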

GGML Quantization

Some tables on Reddit

Relative quantization

Papers

Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

http://attentionviz.com/

Local Benchmarks

TODO: Try different cublas batch sizes

Main

Older

Date | Interface | Model | GPTQ | Xformers | HW / Flags | Load | Speed
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 240gb SSD, 16gb, desktop off | 10.53s | 7.97 tokens/s
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | no | 240gb SSD, 16gb, desktop off | 10.22s | 7.55 tokens/s
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-cuda | no | 240gb SSD, 16gb, desktop off | 16.68s | 4.03 tokens/s
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-cuda | yes | 240gb SSD, 16gb, desktop off | 9.34s | 4.01 tokens/s
| text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb | ? | 0.67 tokens/s
| text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb, --threads 8 | maybe 30s? | 0.51 tokens/s
| text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb, --threads 7 | | 0.68 tokens/s
| text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb, --threads 6 | | 0.61 tokens/s
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g-ggml | no | no | 2TB SSD, 64gb | | 1.17 tokens/s
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64gb, --pre_layer 25 | 45.69s | 0.25 tokens/s
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64gb | 36.47s | 9.63 tokens/s
| llama.cpp | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | | | 2TB SSD, 64gb | 10317.90 ms | 1096.21 ms per token
| llama.cpp-modern-avx512 | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | | | 2TB SSD, 64gb | 9288.69 ms | 1049.03 ms per token
| llama.cpp-avx512-pr833 | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | | | 2TB SSD, 64gb | 13864.06 ms | 0.89 tokens/s, 820.68 ms per token
| text-gen | TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0 | | | 2TB SSD, 64gb | | 0.78 tokens/s
| text-gen+avx512-pr833 | TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0 | | | 2TB SSD, 64gb | | 1.04 tokens/s
2023-04-24 | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64gb, also running llama.cpp with another model | 16.36s | 5.07 tokens/s
2023-04-26 | koboldcpp | gozfarb-llama-30b-supercot-ggml/ggml-model-q4_0.bin | clblast | n/a | 2TB SSD, 64gb, --threads 8 | | 1073ms/T
2023-04-29 | koboldcpp | Alpacino-30b-q4_0.bin | clblast | n/a | 2TB SSD, 64gb | | 700ms/T
2023-07-13 | koboldcpp | llama-33b-supercot-ggml-q5_1 (complains about old format) | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 18 | | 643ms/T, 1.4T/s
2023-07-13 | koboldcpp | llama-33b-supercot-ggml-q5_1 (complains about old format) | clblast | n/a | 2TB SSD, 64gb, --nommap --smartcontext --useclblast 0 0 --gpulayers 18 | | 685ms/T, 1.2T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 18 (probably space for more) | | 652ms/T, 1.5T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 26 (I note 3 threads are set by default) | | 593ms/T, 1.6T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 26 --psutil_set_threads (4 threads) | | 514ms/T, 1.8T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --psutil_set_threads (removed nommap) | | 508ms/T, 1.9T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 5 | | 454ms/T, 2.1T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 6 | | 422ms/T, 2.2T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 7 | | 509ms/T, 1.8T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 8 | | 494ms/T, 1.7T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 6 --linearrope (no difference, needs supercot?) | | 425ms/T, 2.2T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.4.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 6 | | 400ms/T, 2.3T/s
2023-07-13 | koboldcpp | airoboros-65b-gpt4-1.4.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 13 --threads 6 | | 1366ms/T, 0.7T/s
2023-07-14 | koboldcpp | airoboros-65b-gpt4-1.4.ggmlv3.q2_K.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 13 --threads 6 | | 765ms/T, 1.2T/s
2023-09-06 | koboldcpp | guanaco-33B.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --stream --smartcontext --usecublas --gpulayers 29 --threads 6 | | 562ms/T, 1.3T/s
2023-09-06 | koboldcpp | guanaco-33B.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --stream --smartcontext --usecublas --gpulayers 29 --threads 6 | | 567ms/T, Total: 70.7s, 1.4T/s
2023-09-06 | koboldcpp | guanaco-33B.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --stream --smartcontext --usecublas --gpulayers 25 --threads 6 | | 563ms/T, Total: 70.2s, 1.4T/s
2023-12-03 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 6 --gpulayers 27 | | 330.7ms/T, Total: 40.79s, 2.94T/s
2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 6 --gpulayers 27 | | 202.1ms/T, 4.78T/s
2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 32 --gpulayers 27 | | 360.8ms/T, 2.68T/s
2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 16 --gpulayers 27 | | 202.6ms/T, 4.82T/s
2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 15 --gpulayers 27 | | 195.0ms/T, 5.03T/s
2023-12-16 | koboldcpp | mistral-7b-instruct-v0.2.Q8_0.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 15 --gpulayers 33 | | 22.9ms/T, 42.90T/s
2023-12-17 | koboldcpp | mixtral-8x7b-moe-rp-story.Q8_0.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 15 --gpulayers 6 | | 214.9ms/T, 4.47T/s
2024-02-04 | SillyTavern | miqu 70b | | | gpu layers 9 | | 1.4T/s


Newer

Date | Interface | Model | Mem (model / VRAM) | Flags / Notes | Speed
2025-08-01 | ollama | qwen3-coder:30b-a3b-q4_K_M | | | 17.01 T/s
2025-08-01 | ollama | hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M | | | 16.84 T/s
2025-08-01 | ollama | hf.co/unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF:Q4_K_M | | | 14.39 T/s
2025-08-01 | ollama | qwen3:30b-a3b-thinking-2507-q4_K_M | | | 13.95 T/s
2025-08-02 | ollama | Qwen3:30b-a3b-thinking-2507-q8_0 | | | 10.93 T/s
2025-08-13 | llama.cpp-cuda | unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf | no GPU | --ctx-size 32684 | 20.62 T/s
2025-08-13 | llama.cpp-cuda | unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf | 9287MiB / 11264MiB | -ngl 22 --ctx-size 12684 | 31.54 T/s
2025-08-13 | llama.cpp-cuda | unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q6_K_XL.gguf | | -ngl 15 --ctx-size 12684 | 21.51 T/s
2025-08-14 | llama.cpp-cuda | unsloth_gemma-3-27b-it-qat-GGUF_gemma-3-27b-it-qat-Q4_K_M.gguf | | default | like 3? idk
2025-08-14 | llama.cpp-cuda | unsloth_gemma-3-27b-it-qat-GGUF_gemma-3-27b-it-qat-Q4_K_M.gguf | | -fa --ctx-size 12684 | 2.76 T/s
2025-08-14 | llama.cpp-cuda | unsloth_gemma-3-27b-it-qat-GGUF_gemma-3-27b-it-qat-Q4_K_M.gguf | | -fa -ngl 25 --ctx-size 12684 | 4.87 T/s
2025-08-14 | llama.cpp-cuda | unsloth_gemma-3-27b-it-qat-GGUF_gemma-3-27b-it-qat-Q4_K_M.gguf | | -ngl 22 --ctx-size 12684 | 4.10 T/s
2025-08-14 | llama.cpp-cuda | unsloth_gemma-3-27b-it-qat-GGUF_gemma-3-27b-it-qat-Q4_K_M.gguf | | -ngl 22 --ctx-size 12684 -fa | 4.43 T/s
2025-08-14 | llama.cpp-cuda | unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL | | -fa -ngl 99 --ctx-size 12684 | 23.14 T/s
2025-08-14 | llama.cpp-cuda | unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF:UD-Q4_K_XL | | --ctx-size 12684 --flash-attn --jinja --temp 0.15 --top-k -1 --top-p 1.00 -ngl 20 | 7.48 T/s
2025-08-14 | llama.cpp-cuda | unsloth/Qwen3-4B-Instruct-2507-GGUF:UD-Q8_K_XL | | --flash-attn -ngl 99 --jinja --ctx-size 12684 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 -c 22684 -n 32768 --no-context-shift | 48.69 T/s
2025-08-14 | ollama | hf.co/bartowski/L3.3-MS-Nevoria-70b-GGUF:Q4_K_M | | defaults | 1.74 tokens/s
2025-08-14 | llama.cpp-cuda | hf.co/bartowski/L3.3-MS-Nevoria-70b-GGUF:Q4_K_M | | --flash-attn -ngl 15 --ctx-size 6000 | 1.82 T/s
2025-08-14 | llama.cpp-cuda | hf.co/bartowski/L3.3-MS-Nevoria-70b-GGUF:Q4_K_M | | --hf-repo-draft unsloth/Llama-3.2-1B-Instruct-GGUF:Q4_K_M --flash-attn -ngl 13 --ctx-size 6000 --gpu-layers-draft 99 | 2.98 T/s
2025-08-14 | llama.cpp-cuda | unsloth_gemma-3-27b-it-qat-GGUF_gemma-3-27b-it-qat-Q4_K_M.gguf | | +270m draft model | 7.19 T/s
2025-08-16 | llama.cpp-cuda | qwen3-coder-30b:Q4_K_XL-GPU | 8879MiB / 11264MiB | --flash-attn -ctk q4_0 -ctv q4_0 --jinja -ngl 22 --ctx-size 12684 | 33.77 T/s
2025-08-16 | llama.cpp-cuda | qwen3-coder-30b:Q4_K_XL-GPU | 8907MiB / 11264MiB | --flash-attn -ctk q8_0 -ctv q8_0 --jinja -ngl 22 --ctx-size 12684 | 35.71 T/s
2025-08-16 | llama.cpp-cuda | qwen3-coder-30b:Q4_K_XL-GPU | 10400MiB / 11264MiB | --flash-attn -ctk q4_0 -ctv q4_0 --jinja -ngl 26 --ctx-size 12684 | 37.55 T/s
2025-08-16 | llama.cpp-cuda | qwen3-coder-30b:Q4_K_XL-GPU | 10371MiB / 11264MiB | --flash-attn -ctk q8_0 -ctv q8_0 --jinja -ngl 26 --ctx-size 12684 | 37.08 T/s
2025-08-21 | llama.cpp-cuda | ggml-org/gpt-oss-20b-GGUF:mxfp4 | 7907MiB / 11264MiB | --ctx-size 32768 --jinja -ub 2048 -b 2048 -ngl 99 -fa --n-cpu-moe 16 | 38.15 T/s
2025-08-23 | llama.cpp-cuda | qwen3-coder-30b:Q6_K_XL-GPU | 10511MiB / 11264MiB | --flash-attn -ctk q8_0 -ctv q8_0 --n-cpu-moe 33 -ngl 99 --ctx-size 12684 | 32.5 t/s
2025-08-25 | llama.cpp-cuda | unsloth/gemma-3-12b-it-qat-GGUF:UD-Q4_K_XL | 6030MiB / 11264MiB | --flash-attn -ngl 22 --ctx-size 12684 | 12.94 T/s
2025-08-25 | llama.cpp-cuda | unsloth/gemma-3-12b-it-qat-GGUF:UD-Q4_K_XL | 10174MiB / 11264MiB | --flash-attn -ngl 99 --ctx-size 12684 | 48.3 t/s
2025-08-25 | llama.cpp-cuda | unsloth/gemma-3-12b-it-qat-GGUF:UD-Q4_K_XL | 10674MiB / 11264MiB | --flash-attn -ngl 99 --ctx-size 12684 -ctk q8_0 -ctv q8_0 | 37.45 T/s
2025-08-25 | llama.cpp-cuda | unsloth/gemma-3-12b-it-qat-GGUF:UD-Q4_K_XL | 10728MiB / 11264MiB | --flash-attn -ngl 99 --ctx-size 21845 | 49.20 T/s
2025-08-25 | llama.cpp-cuda | QuantStack/InternVL3_5-30B-A3B-gguf:IQ4_XS 👁 | 9220MiB / 11264MiB | ${KV_CACHE} --flash-attn --n-cpu-moe 30 -ngl 99 --ctx-size 12684 | 47.1 t/s
2025-08-28 | llama.cpp-cuda | mradermacher_Dolphin-Mistral-24B-Venice-Edition-i1-GGUF_Dolphin-Mistral-24B-Venice-Edition.i1-Q4_K_M.gguf | 9550MiB / 11264MiB | --ctx-size 8192 --flash-attn --jinja --temp 0.15 --top-p 0.95 --top-k 20 --min-p 0 -ngl 20 | 7.9 t/s
2025-08-28 | llama.cpp-cuda | mradermacher_Dolphin-Mistral-24B-Venice-Edition-i1-GGUF_Dolphin-Mistral-24B-Venice-Edition.i1-Q4_K_M.gguf | 9107MiB / 11264MiB | ${KV_CACHE} --ctx-size 8192 --flash-attn --jinja --temp 0.15 --top-p 0.95 --top-k 20 --min-p 0 -ngl 20 | 7.8 t/s
2025-08-28 | llama.cpp-cuda | mradermacher_Dolphin-Mistral-24B-Venice-Edition-i1-GGUF_Dolphin-Mistral-24B-Venice-Edition.i1-Q4_K_M.gguf | 9766MiB / 11264MiB | ${KV_CACHE} --ctx-size 8192 --flash-attn --jinja --temp 0.15 --top-p 0.95 --top-k 20 --min-p 0 -ngl 22 | 8.3 t/s
2025-08-28 | llama.cpp-cuda | mradermacher_Dolphin-Mistral-24B-Venice-Edition-i1-GGUF_Dolphin-Mistral-24B-Venice-Edition.i1-Q4_K_M.gguf | 10429MiB / 11264MiB | ${KV_CACHE} --ctx-size 8192 --flash-attn --jinja --temp 0.15 --top-p 0.95 --top-k 20 --min-p 0 -ngl 24 | 8.9 t/s
2025-08-28 | llama.cpp-cuda | mradermacher_Dolphin-Mistral-24B-Venice-Edition-i1-GGUF_Dolphin-Mistral-24B-Venice-Edition.i1-Q4_K_M.gguf | 10542MiB / 11264MiB | --hf-repo-draft bartowski/alamios_Mistral-Small-3.1-DRAFT-0.5B-GGUF:Q4_K_M --jinja --temp 0.15 --top-p 0.95 --top-k 20 --min-p 0 --ctx-size 8192 ${KV_CACHE} --flash-attn -ngl 22 --gpu-layers-draft 99 | 9.7 t/s
2025-08-29 | llama.cpp-cuda | unsloth/GLM-4.5-Air-UD-Q2_K_XL.gguf | 8864MiB, 10253MiB / 11264MiB | ${KV_CACHE} --flash-attn --n-cpu-moe 42 -ngl 99 --ctx-size 8192 | 17.1 t/s
2025-08-30 | llama.cpp-cuda | unsloth/gpt-oss-20b:f16 | 8293M, 7876MiB / 11264MiB | --ctx-size 32768 -ub 2048 -b 2048 -ngl 99 --n-cpu-moe 16 ${KV_CACHE} | 39.4 t/s
2025-08-30 | llama.cpp-cuda | unsloth/gpt-oss-20b:f16 | 13.6G, 10320MiB / 11264MiB | --ctx-size 32768 -ub 2048 -b 2048 -ngl 99 --n-cpu-moe 12 ${KV_CACHE} | 42.8 t/s
2025-08-30 | llama.cpp-cuda | unsloth/gpt-oss-120b:f16 | 33.6G, 9411MiB / 11264MiB | --ctx-size 32768 -ub 2048 -b 2048 ${KV_CACHE} | 6.0 t/s
2025-08-30 | llama.cpp-cuda | unsloth/Seed-OSS-36B-Instruct-UD-Q4_K_XL | 14.5G, 10327MiB / 11264MiB | ${KV_CACHE} --flash-attn --jinja --n-cpu-moe 40 -ngl 24 --ctx-size 12684 | 4.2 t/s
2025-09-01 | llama.cpp-cuda | unsloth/gpt-oss-120b:f16 | 44.4G, 9541MiB / 11264MiB | ${UNSLOTH_GPT_OSS_PARAMS} (k_top 100 not 0) --jinja --ctx-size 12768 -ub 2048 -b 2048 -ngl 99 --n-cpu-moe 34 --flash-attn ${KV_CACHE_Q4} | 11.7 t/s
2025-09-01 | llama.cpp-cuda | unsloth/gpt-oss-120b:f16 | 45.9G, 9576MiB / 11264MiB | ${UNSLOTH_GPT_OSS_PARAMS} (k_top 100 not 0) --jinja --ctx-size 22768 -ub 2048 -b 2048 -ngl 99 --n-cpu-moe 34 --flash-attn ${KV_CACHE_Q4} | 12.9 t/s
2025-09-05 | llama.cpp-cuda | Hermes-4-14B-GGUF:Q4_K_M | 853.63 MiB, 10625MiB / 11264MiB | ${KV_CACHE} --flash-attn on --jinja -ngl 99 --ctx-size 12684 | 42.7 t/s
2025-09-10 | llama.cpp-cuda | gabriellarson/ERNIE-4.5-21B-A3B-Thinking-GGUF:Q4_K_M | 2742MiB / 11264MiB | ${KV_CACHE} --flash-attn on --n-cpu-moe 35 -ngl 99 --jinja --ctx-size 12684 | 31.1 t/s
2025-09-10 | llama.cpp-cuda | gabriellarson/ERNIE-4.5-21B-A3B-Thinking-GGUF:Q4_K_M | 10759MiB / 11264MiB | ${KV_CACHE} --flash-attn on --n-cpu-moe 10 -ngl 99 --jinja --ctx-size 12684 | 49.0 t/s
2025-09-10 | llama.cpp-cuda | gabriellarson/ERNIE-4.5-21B-A3B-Thinking-GGUF:Q8_0 | 15.7G, 9408MiB / 11264MiB | ${KV_CACHE} --flash-attn on --n-cpu-moe 20 -ngl 99 --jinja --ctx-size 12684 | 28.5 t/s
2025-09-18 | llama.cpp-cuda | unsloth_Magistral-Small-2509-GGUF_Magistral-Small-2509-UD-Q4_K_XL.gguf 👁 | 8900M, 10237MiB / 11264MiB | --special --ctx-size 12684 --flash-attn on -ngl 20 --jinja --temp 0.7 --top-k -1 --top-p 0.95 | 7.9 t/s
2025-09-18 | llama.cpp-cuda | glm-4.5-air:Q3_K_XL | 42.6G, 8881MiB / 11264MiB | ${KV_CACHE} --flash-attn on --jinja --n-cpu-moe 44 -ngl 99 --ctx-size 8192 | 7.6 t/s
2025-09-26 | llama.cpp-cuda | qwen3-coder-30b:Q4_K_XL-GPU | 10.5G, 8632MiB | ${KV_CACHE} --flash-attn on --jinja --n-cpu-moe 30 -ngl 99 --ctx-size 25000 | 49.33 t/s

deepseek-r1-distill-qwen-32b - 4.20 tok/sec

Hextor

https://artificialanalysis.ai/models/comparisons/qwen3-30b-a3b-2507-vs-gemma-3-27b

https://artificialanalysis.ai/models/comparisons/qwen3-30b-a3b-2507-vs-deepseek-r1-0120

Smallthinker comparison; also covers gemma3 and older qwen3 models.

Date | Engine | Model | Thinking | Total Time | Eval Tokens | Prompt (tokens/s) | Eval (tokens/s)
2025-08-01 | ollama | gemma3:1B | | 20s | 573 | 91.61 | 28.04
2025-08-01 | ollama | mashriram/gemma3nTools:e4b | | 1m12s | 537 | 22.12 | 7.95
2025-08-01 | ollama | gemma3:12b-it-qat | | 4m2s | 716 | 9.79 | 2.99
2025-08-01 | ollama | gemma3:27b-it-qat | | 9m11s | 778 | 4.16 | 1.43
2025-08-01 | ollama | deepseek-r1:8b | | 6m9s | 1807 | 13.53 | 4.90
2025-08-01 | ollama | qwen3:30b-a3b-instruct-2507-q4_K_M | | 2m10s | 1454 | 26.60 | 11.17
2025-08-01 | ollama | hf.co/unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_M | | 56s | 623 | 21.37 | 11.26
2025-08-01 | ollama | hf.co/unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF:Q4_K_M | | 1m50s | 1203 | 25.88 | 10.87
2025-08-01 | ollama | hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M | | 36s | 504 | 136.43 | 11.33
2025-08-01 | ollama | qwq:latest | | 14m17s | 1111 | 3.21 | 1.30
2025-08-01 | ollama | hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL | | 6m9s | 1860 | 12.93 | 5.05
2025-08-13 | llama.cpp | unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf | | 30903.17 ms | 357 | 30.27 | 11.70
2025-08-15 | llama.cpp | unsloth_Qwen3-30B-A3B-Instruct-2507-GGUF_Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf | | 14703.99 ms | 177 | 30.83 | 12.04
2025-08-15 | llama.cpp | unsloth_Qwen3-4B-Instruct-2507-GGUF_Qwen3-4B-Instruct-2507-UD-Q8_K_XL.gguf | | 28738.32 ms | 147 | 28.28 | 5.22
2025-08-15 | llama.cpp | unsloth_Qwen3-4B-Instruct-2507-GGUF_Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf | | 8580.48 ms | 79 | 31.71 | 9.82
2025-08-16 | llama.cpp | unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf (kv_cache q8) | | | | | 12.47, 11.73
2025-08-16 | llama.cpp | unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf (kv_cache q4) | | | | | 11.89, 11.94
2025-08-30 | llama.cpp | unsloth/gpt-oss-20b-F16.gguf (no fa?) | | | | | 11.7 t/s

Prompts

File sorting

You are an AI whose purpose is the sorting of files/folders recently downloaded from BitTorrent.

RULE: All interactions with the system must be one of the following "SYSTEM COMMANDS". You may also think by putting thoughts between 'think' HTML block tags.

*   ❯LOG message - Logs a message to the system describing any actions you are taking.

*   ❯ABORT optional reason - Aborts the operation. Use this if there is an error.

*   ❯MOVE [SOURCE] [DESTINATION] - Moves the file. [SOURCE] is the fullpath of the filename and [DESTINATION] is a folder.

*   ❯MKDIR [PATH] - Makes a directory. For example '❯MKDIR "/mnt/storage12/Videos/Series/Example Show"'.

*   ❯REQUEST_HELP message - Requests help from a human operator.

You may also think by putting it between html block tags with the name 'think'.

Put quotes around all file/folder names to ensure spaces are handled correctly.

All "SYSTEM COMMANDS" are prefixed with a ❯ character. Each system command should be on a line of its own.

All "SYSTEM COMMANDS" (other than "❯LOG") should be preceded by a "LOG" command explaining what action is being taken and why.

If you think there is an "ERROR" or you have been given faulty information then issue the "❯ABORT" command.

If a task is too complex, you're confused, or you require more information, then issue a "❯REQUEST_HELP" command to request help or information from a "HUMAN OPERATOR".

Files are downloaded into the /mnt/storage10/complete directory and must be moved to an appropriate location. There are multiple hard drives on the server, mounted as /mnt/storage0, /mnt/storage1, /mnt/storage2, and so on, ending with /mnt/storage12. Your goal is to move the file to an appropriate location.

Each drive has directories for "/mnt/storage#/Videos/Series", "/mnt/storage#/Videos/Anime" and "/mnt/storage#/Videos/Movies"; you must move the file/folder to one of these locations. A file categorized as a "TV Series" goes in the folder named "Series", NOT "TV Series".

You will be given the filename of a download, file size, information about the amount of space on each drive and possible relevant folders.

You should start by classifying the file as a 'Movie', 'TV Series', 'Anime' or 'Other'. If a movie is also an Anime, it should be classified as 'Anime'. Both 'Anime' and 'TV Series' can have a season, so that information by itself isn't enough to determine that it's a 'TV Series'. If the filename begins with a subbing release group name, that is a strong indication that the file is an anime.

If the category of a file is "Other", do not attempt to move it. Instead output "ABORT" all in capital letters and stop all other output.

Extract the season number if there is season information in the filename.

When ready to move a file issue the command "❯MOVE [SOURCE] [DESTINATION]"

Movies should just be moved to a drive with the least free space, into the "/mnt/storage#/Videos/Movies/" directory. Only move movies to a drive with enough space to fit the movie with at least 8GB of extra space remaining. Do not rename the file/folder; leave the filename intact.

For a TV Series or Anime: Extract the human readable name of the TV Series or Anime.

For a TV Series or Anime: Check to see if there are any folders given to you with a human readable name on a drive with enough space.

For a TV Series or Anime: The file/folder can be either an individual episode or a folder with multiple episodes.

For a TV Series or Anime: If there is a season number then put it into a subfolder with that season number; for example, "foundation.s03e03.1080p.web.h264-successfulcrab[EZTVx.to].mkv" might be put in "/mnt/storage#/Videos/Series/Foundation/Season 03/".

For Anime: If the filename is a folder and there is no season information, assume it is a complete series and just put it into the /Anime/ directory on the chosen drive.

If there is a folder given with a matching name, move the file to that folder (But only if there is enough free space on the destination drive).

If there are multiple appropriate folders given then choose the one on a drive with the least amount of free space.

For a TV Series/Anime, if there is no given folder with the correct name, or all the appropriate folders are on a drive without enough space then create a folder on the drive with the most free space.

In order to avoid a situation where a "Movie" and a "TV Series" have the same name, if a folder exists, check that it has the correct category in its path.

Does the drive have enough space to move the episode to? If not, try another folder.

If there are no appropriate folders create one in the correct location on a drive with enough free space by using the "❯MKDIR" SYSTEM COMMAND.

Here is the disk free space information:

╭────────────────────────╮
│ 13 local devices       │
├────────────────┬───────┤
│ MOUNTED ON     │ AVAIL │
├────────────────┼───────┤
│ /mnt/storage0  │  8.9G │
│ /mnt/storage1  │ 37.3G │
│ /mnt/storage10 │ 63.7G │
│ /mnt/storage11 │ 10.6G │
│ /mnt/storage12 │  1.8T │
│ /mnt/storage2  │ 81.8G │
│ /mnt/storage3  │ 10.9G │
│ /mnt/storage4  │ 11.9G │
│ /mnt/storage5  │ 59.0G │
│ /mnt/storage6  │ 23.4G │
│ /mnt/storage7  │ 33.2G │
│ /mnt/storage8  │ 19.4G │
│ /mnt/storage9  │  6.7G │
╰────────────────┴───────╯

There are currently no folders that match the filename.

Here is the filename: "/mnt/storage10/complete/[Bolshevik] Killing Bites [BD 1080p x264 10-bit FLAC]"

The filesize of the file/folder is 18G.
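
A plausible response following the rules above (a sketch, not captured model output: the subbing-group prefix marks it as Anime, the lack of an extension and season info makes it a complete-series folder, and with no matching folders given, /mnt/storage12 has the most free space):

<think>"[Bolshevik]" is a subbing release group, so this is Anime. No extension, so it's a folder; no season info, so treat it as a complete series. No matching folders were given, and 18G fits easily on /mnt/storage12 (1.8T free).</think>
❯LOG Classified "[Bolshevik] Killing Bites [BD 1080p x264 10-bit FLAC]" as Anime (complete series, no season info); moving it to /mnt/storage12, the drive with the most free space.
❯MOVE "/mnt/storage10/complete/[Bolshevik] Killing Bites [BD 1080p x264 10-bit FLAC]" "/mnt/storage12/Videos/Anime/"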

File sorting 2

You are an AI whose purpose is the sorting of files/folders recently downloaded from BitTorrent.

RULE: All interactions with the system must be one of the following "SYSTEM COMMANDS". You may also think by putting thoughts between 'think' HTML block tags.

*   ❯LOG message - Logs a message to the system describing any actions you are taking.

*   ❯ABORT optional reason - Aborts the operation. Use this if there is an error.

*   ❯MOVE [SOURCE] [DESTINATION] - Moves the file. [SOURCE] is the fullpath of the filename and [DESTINATION] is a folder.

*   ❯MKDIR [PATH] - Makes a directory. For example '❯MKDIR "/mnt/storage12/Videos/Series/Example Show"'.

*   ❯REQUEST_HELP message - Requests help from a human operator.

Filenames should be quoted to ensure spaces are handled correctly.

All "SYSTEM COMMANDS" are prefixed with a ❯ character. Each system command should be on a line of its own.

All "SYSTEM COMMANDS" (other than "❯LOG") should be preceded by a "LOG" command explaining what action is being taken and why.

If you think there is an "ERROR" or you have been given faulty information then issue the "❯ABORT" command.

If a task is too complex, you're confused, or you require more information, then issue a "❯REQUEST_HELP" command to request help or information from a "HUMAN OPERATOR".

Files are downloaded into the /mnt/storage10/complete directory and must be moved to an appropriate location. There are multiple hard drives on the server, mounted as /mnt/storage0, /mnt/storage1, /mnt/storage2, and so on, ending with /mnt/storage12. Your goal is to move the file to an appropriate location.

Each drive has directories for "/mnt/storage#/Videos/Series", "/mnt/storage#/Videos/Anime" and "/mnt/storage#/Videos/Movies"; you must move the file/folder to one of these locations. A file categorized as a "TV Series" goes in the folder named "Series", NOT "TV Series".

You will be given the filename of a download, file size, information about the amount of space on each drive and possible relevant folders.

STEP 1: Start by classifying the file as a 'Movie', 'TV Series', 'Anime' or 'Other'. If a movie is also an Anime, it should be classified as 'Anime'. Both 'Anime' and 'TV Series' can have a season, so that information by itself isn't enough to determine that it's a 'TV Series'. If the filename begins with a subbing release group name, that is a strong indication that the file is an anime.

If the category of a file is "Other", do not attempt to move it. Instead output "ABORT" all in capital letters and stop all other output.

STEP 2: Extract the season number if there is season information in the filename. Note if there is no season information.

STEP 3: Determine whether the filename refers to a file or a folder. If there is no extension, then it is a folder.

When ready to move a file issue the command "❯MOVE [SOURCE] [DESTINATION]"

STEP 4: Determine the destination for the move

Movies should just be moved to a drive with the least free space, into the "/mnt/storage#/Videos/Movies/" directory. Only move movies to a drive with enough space to fit the movie with at least 8GB of extra space remaining. Do not rename the file/folder; leave the filename intact.

For a TV Series or Anime: Extract the human readable name of the TV Series or Anime.

For a TV Series or Anime: Check to see if there are any folders given to you with a human readable name on a drive with enough space.

For a TV Series or Anime: The input filename could be a file with an individual episode or a folder with multiple episodes.

For a TV Series or Anime: If there is a season number then put it into a subfolder with that season number; for example, "foundation.s03e03.1080p.web.h264-successfulcrab[EZTVx.to].mkv" might be put in "/mnt/storage#/Videos/Series/Foundation/Season 03/".

For Anime: If the filename is a folder and there is no season information, assume it is a complete series and just put it into the /Anime/ directory on the chosen drive.

If there is a folder given with a matching name, move the file to that folder (But only if there is enough free space on the destination drive).

If there are multiple appropriate folders given then choose the one on a drive with the least amount of free space.

For a TV Series/Anime, if there is no given folder with the correct name, or all the appropriate folders are on a drive without enough space then create a folder on the drive with the most free space.

In order to avoid a situation where a "Movie" and a "TV Series" have the same name, if a folder exists, check that it has the correct category in its path.

Does the drive have enough space to move the episode to? If not, try another folder.

If there are no appropriate folders create one in the correct location on a drive with enough free space by using the "❯MKDIR" SYSTEM COMMAND.

STEP 5: Make a note of the following information

  • Will a folder need to be created?
  • What is the destination path?

STEP 6: Output the sequence of commands.

IMPORTANT: Once you have thought through a valid sequence of commands, give the solution based on that thinking directly.


Here is the disk free space information:

╭────────────────────────╮
│ 13 local devices       │
├────────────────┬───────┤
│ MOUNTED ON     │ AVAIL │
├────────────────┼───────┤
│ /mnt/storage0  │  8.9G │
│ /mnt/storage1  │ 37.3G │
│ /mnt/storage10 │ 63.7G │
│ /mnt/storage11 │ 10.6G │
│ /mnt/storage12 │  1.8T │
│ /mnt/storage2  │ 81.8G │
│ /mnt/storage3  │ 10.9G │
│ /mnt/storage4  │ 11.9G │
│ /mnt/storage5  │ 59.0G │
│ /mnt/storage6  │ 23.4G │
│ /mnt/storage7  │ 33.2G │
│ /mnt/storage8  │ 19.4G │
│ /mnt/storage9  │  6.7G │
╰────────────────┴───────╯

There are currently no folders that match the filename.

Here is the filename: "/mnt/storage10/complete/[Bolshevik] Killing Bites [BD 1080p x264 10-bit FLAC]"

The filesize of the file/folder is 18G.