AI
See also: AI Art
New Shit
https://smcleod.net/2024/12/bringing-k/v-context-quantisation-to-ollama/
Local LightRAG: A GraphRAG Alternative but Fully Local with Ollama
Graph RAG Evolved: PathRAG (Relational Reasoning Paths)
The Only Embedding Model You Need for RAG
https://ollama.com/library/smallthinker - Can be used as a draft model for QwQ-32B, giving a ~70% speed-up.
sqrt(total params * active params) - A rule of thumb for estimating the number of parameters a dense model would need to match a given MoE model.
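A quick worked example of the rule above, using the commonly cited figures for Mixtral 8x7B (roughly 46.7B total and 12.9B active parameters per token; treat the numbers as approximate):

```python
import math

# Rule of thumb: dense-equivalent params ≈ sqrt(total_params * active_params)
# Approximate Mixtral 8x7B figures: ~46.7B total, ~12.9B active per token.
total_params = 46.7e9
active_params = 12.9e9

dense_equivalent = math.sqrt(total_params * active_params)
print(f"Dense-equivalent size: {dense_equivalent / 1e9:.1f}B parameters")  # ~24.5B
```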
Direct3D‑S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention
Make RAG 100x Better with Real-Time Knowledge Graphs
Are there any free working voice cloning AIs?
privategpt - IMHO the best for RAG if you need sources: it not only lists the PDF used for the answer but also the page, and it's quite precise. For studying and searching a library it's the best I know of.
FLUX Model Quantization Challenge
Don't offload GGUF layers, offload tensors (200% gen speed)
I Built the Ultimate RAG MCP Server for AI Coding (Better than Context7)
NEW FramePack F1 Model - Much Better Results - Bonus How to Install Sage
https://docs.google.com/document/d/12ATcyjCEKh8T-MPDZ-VMiQ1XMa9FUvvk2QazrsKoiR8/edit?tab=t.0
https://www.youtube.com/watch?v=LMH62T_XCF4
Dia, Fish Speech
https://www.reddit.com/r/LocalLLaMA/comments/1ev8n2s/exclude_top_choices_xtc_a_sampler_that_boosts/
https://blog.runpod.io/upscaling-videos-using/
koboldcpp.exe 13B-HyperMantis.ggmlv3.q4_K_M.bin --debug --usecublas --usemlock --contextsize 8192 --blasbatchsize 512 --psutil_set_threads --threads 6 --blasthreads 10 --gpulayers 5 --highpriority --stream --unbantokens --smartcontext
If you're running koboldcpp 1.35 and a SuperHOT model, you should also add --linearrope, which should make them perform better.
RP recommended models: "superhot, airoboros, wizard-vicuna, guanaco, chronos are a few commonly discussed models off the top of my head. For me, it's superhot or guanaco (one or the other, not the merge though)"
UI
jan.ai
open webui
CLI
mcptools - For inspecting MCP servers.
| Client Name | Description | Key Features | Implementation | URL |
|---|---|---|---|---|
| oterm | A text-based terminal client for Ollama with MCP tools, prompts, and sampling. | Supports MCP tools, prompts, sampling; Streamable HTTP & WebSocket transports. | TUI (Terminal UI) | GitHub |
| ollama-mcp-client | Python-based client for integrating local Ollama models with MCP servers. | Seamless MCP integration, Git operations support, tool discovery. | Python CLI | GitHub |
| mcp-client-for-ollama | TUI client for interacting with MCP servers using Ollama, offering interactivity. | Multi-server support, streaming responses, fuzzy autocomplete. | TUI (Terminal UI) | GitHub |
| Mcp-cli | General-purpose CLI for interacting with MCP servers, supporting Ollama. | Supports multiple providers, modular chat, context-aware completions. | Command-line | Source |
| Mcp Client Ollama | Python-based CLI for connecting Ollama to MCP servers, focusing on tool execution. | stdio and SSE transports, JSON configuration, multiple server support. | Python CLI | Source |
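For context on what these clients are doing under the hood, here is a minimal sketch of talking to an MCP server over stdio using the official `mcp` Python SDK. The server command (`uvx mcp-server-fetch`) is just an example; any stdio MCP server can be substituted.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch an MCP server as a subprocess and speak the protocol over stdio.
    # "uvx mcp-server-fetch" is one of the reference servers; swap in your own.
    params = StdioServerParameters(command="uvx", args=["mcp-server-fetch"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discover the tools the server exposes (what oterm etc. hand to Ollama).
            tools = await session.list_tools()
            for tool in tools.tools:
                print(f"{tool.name}: {tool.description}")

asyncio.run(main())
```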
GUI
taskset
Not sure if this is common knowledge, but some advice to all fellow VRAMlets who are offloading to RAM: setting the number of threads is not good enough; you can get extra speed by manually setting core affinity.
For context:
I have a 13600K, which has 6 P-cores. I had read that you should set --threads to that number, so I would run koboldcpp with --threads 6, and from some testing this was indeed the best option with that argument alone.
BUT, I looked at which cores were actually used and found E-cores also being used sometimes.
So the next step was to set the core affinity to just P-cores. Each P-core has two threads; CPU0-11 were P-cores and CPU12-19 were E-cores. Thus, I ran koboldcpp with one thread from each P-core:
taskset -c 0,2,4,6,8,10 python koboldcpp.py [args]
My speed running command-r went from ~2.3 T/s to 2.67 T/s. Pretty good. But what if I use the P-cores fully, I thought. So I set --threads 12 and taskset -c 0,1,2,3,4,5,6,7,8,9,10,11, and I get a generation speed of 3.09 T/s. That's a whopping 33% increase from my initial setup. Hope this is helpful.
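If you'd rather pin affinity from a launcher script than via taskset, a minimal sketch on Linux looks like the following. The core IDs assume a 13600K-style layout (check yours with lscpu), and the koboldcpp arguments are only illustrative.

```python
import os
import subprocess

# Assumed 13600K-style layout: CPUs 0-11 are P-core hyperthreads, 12-19 are E-cores.
P_CORE_CPUS = set(range(12))

# Pin this process to the P-cores only; children inherit the affinity,
# so this has the same effect as `taskset -c 0-11 ...`.
os.sched_setaffinity(0, P_CORE_CPUS)

subprocess.run(["python", "koboldcpp.py", "model.gguf", "--threads", "12"])
```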
Benchmarks
https://artificialanalysis.ai/
https://eqbench.com/creative_writing.html
https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
https://modal.com/blog/fast-cheap-batch-transcription
MCP Servers
https://www.reddit.com/r/RooCode/comments/1ijgk2x/roo_code_mcps_best_mcp_configs/mbej58g/
Agentic AI Dungeon
n8n - Web tool for AI agents
chroma - Vector database (minimal usage sketch below)
pgvector
lmstudio
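As referenced above, a minimal sketch of using chroma as a local vector store for RAG. The collection name, documents, and persistence path are made up for illustration; chroma uses its default embedding function unless you pass one.

```python
import chromadb

# Persist the index on disk so it survives restarts.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("notes")

# Add a couple of documents; chroma embeds them with its default embedding function.
collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "koboldcpp supports offloading layers to the GPU with --gpulayers.",
        "Speculative decoding uses a small draft model to speed up a larger one.",
    ],
)

# Query by text; returns the nearest documents plus their ids and distances.
results = collection.query(query_texts=["how do I offload layers?"], n_results=1)
print(results["documents"][0][0])
```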
LLaMA
llama.cpp pull requests
cuBLAS: use host pinned memory and dequantize while copying
Add NVIDIA cuBLAS support (Merged now)
text-generation-webui pull requests
I think I made the Triton-GPTQ model work with the GPTQ-for-LLaMa (Triton branch) repository. (slower than non-triton)
Tried more Vicuna + SuperCOT LoRA, and I am fairly impressed at how well it understands me for a 13B model in terms of follow-up instructions.
Models
https://huggingface.co/Melbourne/Alpacino-30b-ggml/tree/main
Misc
GPT4-x-Alpaca-30B is particularly good at writing stories
"There was a tokenizer caching error, some people said. Redownload the hf_output files from the repo or just change the use_cache line in the config.json to say: "use_cache": true," for the Vicuna13B-free https://github.com/stochasticai/xturing/tree/main/examples/int4_finetuning
https://wiki.installgentoo.com/wiki/Home_server#Expanding_Your_Storage
https://rentry.org/llama-tard-v2
https://hackmd.io/@reneil1337/alpaca
https://find.4chan.org/?q=AI+Dynamic+Storytelling+General
https://find.4chan.org/?q=AI+Chatbot+General
https://find.4chan.org/?q=%2Flmg%2F (local models general)
https://boards.4channel.org/g/thread/92400764#p92400764
https://files.catbox.moe/lvefgy.json
https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/
python server.py --model llama-7b-4bit --wbits 4
python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128
https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/59 - for the "out of space" error when installing
https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode
https://github.com/pybind/pybind11/discussions/4566
https://lmsysvicuna.miraheze.org/wiki/How_to_use_Vicuna#Use_with_llama.cpp%3A
https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g
Vicuna generating its own prompts
Should be worse than Q4_1 (which is QK=32), but there are several PRs in the works that should improve quantization accuracy in general:
https://github.com/ggerganov/llama.cpp/pull/729
https://github.com/ggerganov/llama.cpp/pull/835
https://github.com/ggerganov/llama.cpp/pull/896
https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g - python3 llama.py vicuna-AlekseyKorshuk-7B c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors vicuna-AlekseyKorshuk-7B-GPTQ-4bit-128g.safetensors
≈65% speedup of the AVX-512 implementation of ggml_vec_dot_q4_0() #933
"Speaking of which, for any 30b anons struggling with context size, I figured something out. If you use the Triton branch on WSL, go into GPTQ_loader.py and comment out make_quant_attn like so" from here
GGML Quantization
Papers
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
Local Benchmarks
TODO: Try different cublas batch sizes
Main
| Date | Interface | Model | Backend | Xformers? | HW / flags | Load | Speed |
|---|---|---|---|---|---|---|---|
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 240gb SSD, 16gb, desktop off | 10.53s | 7.97 tokens/s |
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | no | 240gb SSD, 16gb, desktop off | 10.22s | 7.55 tokens/s |
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-cuda | no | 240gb SSD, 16gb, desktop off | 16.68s | 4.03 tokens/s |
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-cuda | yes | 240gb SSD, 16gb, desktop off | 9.34s | 4.01 tokens/s |
| | text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb | ? | 0.67 tokens/s |
| | text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb, --threads 8 | maybe 30s? | 0.51 tokens/s |
| | text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb, --threads 7 | | 0.68 tokens/s |
| | text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb, --threads 6 | | 0.61 tokens/s |
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g-ggml | no | no | 2TB SSD, 64gb | | 1.17 tokens/s |
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64gb, --pre_layer 25 | 45.69s | 0.25 tokens/s |
| | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64gb | 36.47s | 9.63 tokens/s |
| | llama.cpp | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | | | 2TB SSD, 64gb | 10317.90 ms | 1096.21 ms per token |
| | llama.cpp-modern-avx512 | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | | | 2TB SSD, 64gb | 9288.69 ms | 1049.03 ms per token |
| | llama.cpp-avx512-pr833 | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | | | 2TB SSD, 64gb | 13864.06 ms | 0.89 tokens/s, 820.68 ms per token |
| | text-gen | TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0 | | | 2TB SSD, 64gb | | 0.78 tokens/s |
| | text-gen+avx512-pr833 | TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0 | | | 2TB SSD, 64gb | | 1.04 tokens/s |
| 2023-04-24 | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64gb, also running llama.cpp with another model | 16.36s | 5.07 tokens/s |
| 2023-04-26 | koboldcpp | gozfarb-llama-30b-supercot-ggml/ggml-model-q4_0.bin | clblast | n/a | 2TB SSD, 64gb, --threads 8 | | 1073ms/T |
| 2023-04-29 | koboldcpp | Alpacino-30b-q4_0.bin | clblast | n/a | 2TB SSD, 64gb | | 700ms/T |
| 2023-07-13 | koboldcpp | llama-33b-supercot-ggml-q5_1 (complains about old format) | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 18 | | 643ms/T, 1.4T/s |
| 2023-07-13 | koboldcpp | llama-33b-supercot-ggml-q5_1 (complains about old format) | clblast | n/a | 2TB SSD, 64gb, --nommap --smartcontext --useclblast 0 0 --gpulayers 18 | | 685ms/T, 1.2T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 18 (probably space for more) | | 652ms/T, 1.5T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 26 (I note 3 threads are set by default) | | 593ms/T, 1.6T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 26 --psutil_set_threads (4 threads) | | 514ms/T, 1.8T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --psutil_set_threads (removed --nommap) | | 508ms/T, 1.9T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 5 | | 454ms/T, 2.1T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 6 | | 422ms/T, 2.2T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 7 | | 509ms/T, 1.8T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 8 | | 494ms/T, 1.7T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 6 --linearrope (no difference, needs supercot?) | | 425ms/T, 2.2T/s |
| 2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.4.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 6 | | 400ms/T, 2.3T/s |
| 2023-07-13 | koboldcpp | airoboros-65b-gpt4-1.4.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 13 --threads 6 | | 1366ms/T, 0.7T/s |
| 2023-07-14 | koboldcpp | airoboros-65b-gpt4-1.4.ggmlv3.q2_K.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 13 --threads 6 | | 765ms/T, 1.2T/s |
| 2023-09-06 | koboldcpp | guanaco-33B.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --stream --smartcontext --usecublas --gpulayers 29 --threads 6 | | 562ms/T, 1.3T/s |
| 2023-09-06 | koboldcpp | guanaco-33B.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --stream --smartcontext --usecublas --gpulayers 29 --threads 6 | | 567ms/T, total 70.7s, 1.4T/s |
| 2023-09-06 | koboldcpp | guanaco-33B.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --stream --smartcontext --usecublas --gpulayers 25 --threads 6 | | 563ms/T, total 70.2s, 1.4T/s |
| 2023-12-03 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 6 --gpulayers 27 | | 330.7ms/T, total 40.79s, 2.94T/s |
| 2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 6 --gpulayers 27 | | 202.1ms/T, 4.78T/s |
| 2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 32 --gpulayers 27 | | 360.8ms/T, 2.68T/s |
| 2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 16 --gpulayers 27 | | 202.6ms/T, 4.82T/s |
| 2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 15 --gpulayers 27 | | 195.0ms/T, 5.03T/s |
| 2023-12-16 | koboldcpp | mistral-7b-instruct-v0.2.Q8_0.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 15 --gpulayers 33 | | 22.9ms/T, 42.90T/s |
| 2023-12-17 | koboldcpp | mixtral-8x7b-moe-rp-story.Q8_0.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 15 --gpulayers 6 | | 214.9ms/T, 4.47T/s |
| 2024-02-04 | SillyTavern | miqu 70b | | | gpu layers 9 | | 1.4T/s |
Hextor
| Date | Engine | Model | Thinking | Total Time | Eval Tokens | Prompt (tokens/s) | Eval (tokens/s) |
|---|---|---|---|---|---|---|---|
| 2025-08-01 | ollama | mashriram/gemma3nTools:e4b | ❌ | 1m12s | 537 | 22.12 | 7.95 |
| 2025-08-01 | ollama | gemma3:12b-it-qat | ❌ | 4m2s | 716 | 9.79 | 2.99 |
| 2025-08-01 | ollama | gemma3:27b-it-qat | ❌ | 9m11s | 778 | 4.16 | 1.43 |
| 2025-08-01 | ollama | deepseek-r1:8b | ✅ | 6m9s | 1807 | 13.53 | 4.90 |
| 2025-08-01 | ollama | hf.co/unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF:Q4_K_M | ✅ | 1m50s | 1203 | 25.88 | 10.87 |
| 2025-08-01 | ollama | hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M | ❌ | 36s | 504 | 136.43 | 11.33 |
| 2025-08-01 | ollama | alibayram/Qwen3-30B-A3B-Instruct-2507:latest (dupes first word) | ❌ | 1m16s | 242 | 32.20 | 10.67 |
| 2025-08-01 | ollama | qwq:latest | ✅ | 14m17s | 1111 | 3.21 | 1.30 |