AI

See also: AI Art

New Shit

https://smcleod.net/2024/12/bringing-k/v-context-quantisation-to-ollama/

Local LightRAG: A GraphRAG Alternative but Fully Local with Ollama

Graph RAG Evolved: PathRAG (Relational Reasoning Paths)

UIGEN-X-4B-0729

The Only Embedding Model You Need for RAG

https://ollama.com/library/smallthinker - Can be used as a draft model for QwQ-32B, giving a 70% speed-up.

sqrt(params * active) - A rule of thumb for estimating how many parameters a dense model would need to match a MoE model: the square root of its total parameter count times its active parameter count.
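
A quick worked example of that rule of thumb (a minimal sketch; the model sizes below are only illustrative):

from math import sqrt

# Rule of thumb: dense-equivalent size of a MoE model ~ sqrt(total_params * active_params).
def dense_equivalent(total_b: float, active_b: float) -> float:
    """Rough dense-equivalent parameter count, in billions."""
    return sqrt(total_b * active_b)

print(dense_equivalent(30, 3))       # ~9.5  -> a 30B-A3B MoE behaves roughly like a ~9-10B dense model
print(dense_equivalent(46.7, 12.9))  # ~24.5 -> Mixtral-8x7B-ish numbers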

Direct3D‑S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention

Make RAG 100x Better with Real-Time Knowledge Graphs

dots1

mikupad

GPT‐SoVITS‐features (features of each version)

Are there any free working voice cloning AIs?

privategpt - imho the best for RAG if you need the source: it not only lists the PDF used for the answer but also the page, and it's quite precise. For studying and searching through a library it's the best I know.

FLUX Model Quantization Challenge

dont_offload_gguf_layers_offload_tensors_200_gen

I Built the Ultimate RAG MCP Server for AI Coding (Better than Context7)

NEW FramePack F1 Model - Much Better Results - Bonus How to Install Sage

Qwen3-30B-A6B-16-Extreme

https://docs.google.com/document/d/12ATcyjCEKh8T-MPDZ-VMiQ1XMa9FUvvk2QazrsKoiR8/edit?tab=t.0

https://www.youtube.com/watch?v=LMH62T_XCF4

Dia, Fish Speech

https://www.reddit.com/r/LocalLLaMA/comments/1ev8n2s/exclude_top_choices_xtc_a_sampler_that_boosts/

https://blog.runpod.io/upscaling-videos-using/

koboldcpp wiki

Mixtral for Retards


koboldcpp.exe 13B-HyperMantis.ggmlv3.q4_K_M.bin --debug --usecublas --usemlock --contextsize 8192 --blasbatchsize 512 --psutil_set_threads --threads 6 --blasthreads 10 --gpulayers 5 --highpriority --stream --unbantokens --smartcontext

If you're running 1.35 and a superHOT model, you should also add --linearrope which should make them perform better.

RP recommended models: "superhot, airoboros, wizard-vicuna, guanaco, chronos are a few commonly discussed models off the top of my head. For me, it's superhot or guanaco (one or the other, not the merge though)"

UI

jan.ai

open webui

CLI

ngpt

mcp-client-cli

mcptools - For inspecting mcp servers.

Client Name | Description | Key Features | Implementation | URL
oterm | A text-based terminal client for Ollama with MCP tools, prompts, and sampling. | Supports MCP tools, prompts, sampling; Streamable HTTP & WebSocket transports. | TUI (Terminal UI) | GitHub
ollama-mcp-client | Python-based client for integrating local Ollama models with MCP servers. | Seamless MCP integration, Git operations support, tool discovery. | Python CLI | GitHub
mcp-client-for-ollama | TUI client for interacting with MCP servers using Ollama, offering interactivity. | Multi-server support, streaming responses, fuzzy autocomplete. | TUI (Terminal UI) | GitHub
Mcp-cli | General-purpose CLI for interacting with MCP servers, supporting Ollama. | Supports multiple providers, modular chat, context-aware completions. | Command-line | Source
Mcp Client Ollama | Python-based CLI for connecting Ollama to MCP servers, focusing on tool execution. | stdio and SSE transports, JSON configuration, multiple server support. | Python CLI | Source
GUI

taskset

Not sure if this is common knowledge, but here's some advice for all fellow VRAMlets who are offloading to RAM: setting the number of threads is not enough; you can get extra speed by manually setting core affinity.

For context:

I have a 13600K which has 6 P-cores. I had read that you should set --threads to that number, so I would run koboldcpp with --threads 6 and from some testing this was indeed the best option with that argument alone.

BUT, I looked at which cores were actually used and found e-cores also being used sometimes.

So the next step was to set the core affinity to just the P-cores. Each P-core has two threads; CPUs 0-11 were the P-core threads and CPUs 12-19 were the E-cores. Thus, I ran koboldcpp with one thread from each P-core:

taskset -c 0,2,4,6,8,10 python koboldcpp.py [args]

My speed running command-r went from ~2.3 T/s to 2.67 T/s. Pretty good. But what if I use the P-cores fully, I thought. So I set --threads 12 and taskset -c 0,1,2,3,4,5,6,7,8,9,10,11, and I get generation at 3.09 T/s. That's a whopping 33% increase over my initial speed. Hope this is helpful.
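
For reference, a minimal Python sketch of the same pinning trick, assuming Linux and the 13600K layout described above (CPUs 0-11 are P-core threads, 12-19 are E-cores). It's just the taskset command in script form; adjust the core set to your own topology (lscpu --extended shows it):

import os
import subprocess

P_CORES = {0, 2, 4, 6, 8, 10}      # one hardware thread per P-core on this 13600K
os.sched_setaffinity(0, P_CORES)   # pid 0 = this process; spawned children inherit the affinity

# Launch koboldcpp pinned to the P-cores (add your usual arguments).
subprocess.run(["python", "koboldcpp.py", "--threads", str(len(P_CORES))])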

Benchmarks

https://artificialanalysis.ai/

https://eqbench.com/creative_writing.html

https://huggingface.co/spaces/hf-audio/open_asr_leaderboard

https://modal.com/blog/fast-cheap-batch-transcription

MCP Servers

Context7

awesome-mcp-clients/

https://smithery.ai/

https://mcp.so/servers

https://www.reddit.com/r/RooCode/comments/1ijgk2x/roo_code_mcps_best_mcp_configs/mbej58g/

Agentic AI Dungeon

n8n - Web tool for AI agents

chroma - Vector database

pgvector

lmstudio

LLaMA

/lmg/

llama.cpp pull requests

cuBLAS: use host pinned memory and dequantize while copying

Add NVIDIA cuBLAS support (Merged now)

TextGeneration-webui pull requests

I think I made the Triton-GPTQ model work with the GPTQ-for-LLaMa (Triton branch) repository. (slower than non-triton)

Tried more Vicuna + superCOT lora and I am fairly impressed at how well it understands me for a 13B model in terms of followup instructions.

Models

VicUnlocked-30B

q5 models (reddit)

q5 models

https://huggingface.co/Melbourne/Alpacino-30b-ggml/tree/main

Misc

GPT4-x-Alpaca-30B is particularly good at writing stories

llama.cpp server?

"There was a tokenizer caching error, some people said. Redownload the hf_output files from the repo or just change the use_cache line in the config.json to say: "use_cache": true," for the Vicuna13B-free https://github.com/stochasticai/xturing/tree/main/examples/int4_finetuning

https://wiki.installgentoo.com/wiki/Home_server#Expanding_Your_Storage

https://rentry.org/llama-tard-v2

https://rentry.org/llamaaids

https://hackmd.io/@reneil1337/alpaca


https://find.4chan.org/?q=AI+Dynamic+Storytelling+General

https://find.4chan.org/?q=AI+Chatbot+General

https://find.4chan.org/?q=%2Flmg%2F (local models general)

https://boards.4channel.org/g/thread/92400764#p92400764


https://files.catbox.moe/lvefgy.json

https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/


python server.py --model llama-7b-4bit --wbits 4

python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128

https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/59 for installing with out of space error

https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode


https://github.com/pybind/pybind11/discussions/4566

https://lmsysvicuna.miraheze.org/wiki/How_to_use_Vicuna#Use_with_llama.cpp%3A

https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g


Here's the uncucked Vicuna model (trained on the dataset that doesn't have the moralistic bullshit anymore). Too bad it's just the CPU-quantized version.

Vicuna generating its own prompts


Should be worse than Q4_1 (which is QK=32), but there are several PRs in the works that should improve quantization accuracy in general:

https://github.com/ggerganov/llama.cpp/pull/729

https://github.com/ggerganov/llama.cpp/pull/835

https://github.com/ggerganov/llama.cpp/pull/896

benchmark_threads.txt

https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g - python3 llama.py vicuna-AlekseyKorshuk-7B c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors vicuna-AlekseyKorshuk-7B-GPTQ-4bit-128g.safetensors


≈65% speedup of the AVX-512 implementation of ggml_vec_dot_q4_0() #933

"Speaking of which, for any 30b anons struggling with context size, I figured something out. If you use the Triton branch on WSL, go into GPTQ_loader.py and comment out make_quant_attn like so" from here


Just grab the CUDA branch of qwop's GPTQ for LLaMA (or Triton if you want to be a dickhole) or if you have webui installed, go into the folder for GPTQ. Make sure all the requirements are installed and run this line:

python llama.py /path/to30b c4 --wbits 4 --true-sequential --groupsize 128 --save_safetensors alpacino-4bit-128g.safetensors

And it'll run. For Windows, obviously flip the slashes the right way. And for Linux, you may need to add CUDA_VISIBLE_DEVICES=0 to the front of the command.

GGML Quantization

Some tables on Reddit

Relative quantization

Papers

Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

http://attentionviz.com/

Local Benchmarks

TODO: Try different cublas batch sizes

Main

Date | Interface | Model | GPTQ / backend | Xformers? | HW / flags | Load | Speed
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 240gb SSD, 16gb, desktop off | 10.53 | 7.97 tokens/s
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | No xformers | 240gb SSD, 16gb, desktop off | 10.22s | 7.55 tokens/s
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-cuda | No xformers | 240gb SSD, 16gb, desktop off | 16.68s | 4.03 tokens/s
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-cuda | yes | 240gb SSD, 16gb, desktop off | 9.34s | 4.01 tokens/s
| text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb | ? | 0.67 tokens/s
| text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb, --threads 8 | maybe 30s? | 0.51 tokens/s
| text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb, --threads 7 | | 0.68 tokens/s
| text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb, --threads 6 | | 0.61 tokens/s
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g-ggml | no | no | 2TB SSD, 64gb | | 1.17 tokens/s
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64gb, --pre_layer 25 | 45.69 | 0.25 tokens/s
| text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64gb | 36.47 | 9.63 tokens/s
| llama.cpp | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | | | 2TB SSD, 64gb | 10317.90 ms | 1096.21 ms per token
| llama.cpp-modern-avx512 | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | | | 2TB SSD, 64gb | 9288.69 ms | 1049.03 ms per token
| llama.cpp-avx512-pr833 | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | | | 2TB SSD, 64gb | 13864.06 ms | 0.89 tokens/s, 820.68 ms per token
| text-gen | TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0 | | | 2TB SSD, 64gb | | 0.78 tokens/s
| text-gen+avx512-pr833 | TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0 | | | 2TB SSD, 64gb | | 1.04 tokens/s
2023-04-24 | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64gb, also running llama.cpp with another model | 16.36 | 5.07 tokens/s
2023-04-26 | koboldcpp | gozfarb-llama-30b-supercot-ggml/ggml-model-q4_0.bin | clblast | n/a | 2TB SSD, 64gb, --threads 8 | | 1073ms/T
2023-04-29 | koboldcpp | Alpacino-30b-q4_0.bin | clblast | n/a | 2TB SSD, 64gb | | 700ms/T
2023-07-13 | koboldcpp | llama-33b-supercot-ggml-q5_1 (complains about old format) | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 18 | | 643ms/T, 1.4T/s
2023-07-13 | koboldcpp | llama-33b-supercot-ggml-q5_1 (complains about old format) | clblast | n/a | 2TB SSD, 64gb, --nommap --smartcontext --useclblast 0 0 --gpulayers 18 | | 685ms/T, 1.2T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 18 (probably space for more) | | 652ms/T, 1.5T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 26 (I note 3 threads are set by default) | | 593ms/T, 1.6T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 26 --psutil_set_threads (4 threads) | | 514ms/T, 1.8T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --psutil_set_threads (removed nommap) | | 508ms/T, 1.9T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 5 | | 454ms/T, 2.1T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 6 | | 422ms/T, 2.2T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 7 | | 509ms/T, 1.8T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 8 | | 494ms/T, 1.7T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 6 --linearrope (no difference, needs supercot?) | | 425ms/T, 2.2T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.4.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 6 | | 400ms/T, 2.3T/s
2023-07-13 | koboldcpp | airoboros-65b-gpt4-1.4.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 13 --threads 6 | | 1366ms/T, 0.7T/s
2023-07-14 | koboldcpp | airoboros-65b-gpt4-1.4.ggmlv3.q2_K.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 13 --threads 6 | | 765ms/T, 1.2T/s
2023-09-06 | koboldcpp | guanaco-33B.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --stream --smartcontext --usecublas --gpulayers 29 --threads 6 | | 562ms/T, 1.3T/s
2023-09-06 | koboldcpp | guanaco-33B.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --stream --smartcontext --usecublas --gpulayers 29 --threads 6 | | 567ms/T, Total: 70.7s, 1.4T/s
2023-09-06 | koboldcpp | guanaco-33B.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --stream --smartcontext --usecublas --gpulayers 25 --threads 6 | | 563ms/T, Total: 70.2s, 1.4T/s
2023-12-03 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 6 --gpulayers 27 | | 330.7ms/T, Total: 40.79s, 2.94T/s
2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 6 --gpulayers 27 | | 202.1ms/T, 4.78T/s
2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 32 --gpulayers 27 | | 360.8ms/T, 2.68T/s
2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 16 --gpulayers 27 | | 202.6ms/T, 4.82T/s
2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 15 --gpulayers 27 | | 195.0ms/T, 5.03T/s
2023-12-16 | koboldcpp | mistral-7b-instruct-v0.2.Q8_0.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 15 --gpulayers 33 | | 22.9ms/T, 42.90T/s
2023-12-17 | koboldcpp | mixtral-8x7b-moe-rp-story.Q8_0.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 15 --gpulayers 6 | | 214.9ms/T, 4.47T/s
2024-02-04 | SillyTavern | miqu 70b | | | gpu layers 9 | | 1.4T/s
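
The speed figures above mix ms/T and T/s; they are just reciprocals (tokens per second = 1000 / ms per token), e.g.:

def ms_per_token_to_tps(ms: float) -> float:
    """Convert milliseconds-per-token into tokens-per-second."""
    return 1000.0 / ms

print(round(ms_per_token_to_tps(454), 1))    # 2.2  (matches the --threads 5 airoboros row)
print(round(ms_per_token_to_tps(195.0), 2))  # 5.13 (close to the 5.03T/s reported for the 2023-12-07 --threads 15 row)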

Hextor

https://artificialanalysis.ai/models/comparisons/qwen3-30b-a3b-2507-vs-gemma-3-27b

https://artificialanalysis.ai/models/comparisons/qwen3-30b-a3b-2507-vs-deepseek-r1-0120

Smallthinker comparison, compares gemma3, older qwen3

Date | Engine | Model | Thinking | Total Time | Eval Tokens | Prompt (tokens/s) | Eval (tokens/s)
2025-08-01 | ollama | mashriram/gemma3nTools:e4b | | 1m12s | 537 | 22.12 | 7.95
2025-08-01 | ollama | gemma3:12b-it-qat | | 4m2s | 716 | 9.79 | 2.99
2025-08-01 | ollama | gemma3:27b-it-qat | | 9m11s | 778 | 4.16 | 1.43
2025-08-01 | ollama | deepseek-r1:8b | | 6m9s | 1807 | 13.53 | 4.90
2025-08-01 | ollama | hf.co/unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_M | | 56s | 623 | 21.37 | 11.26
2025-08-01 | ollama | hf.co/unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF:Q4_K_M | | 1m50s | 1203 | 25.88 | 10.87
2025-08-01 | ollama | hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M | | 36s | 504 | 136.43 | 11.33
2025-08-01 | ollama | alibayram/Qwen3-30B-A3B-Instruct-2507:latest (dupes first word) | | 1m16s | 242 | 32.20 | 10.67
2025-08-01 | ollama | qwq:latest | | 14m17s | 1111 | 3.21 | 1.30
2025-08-01 | ollama | hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL | | 6m9s | 1860 | 12.93 | 5.05