See also: [[AI Art]]

== New Shit ==
* "[https://old.reddit.com/r/LocalLLaMA/comments/1owskm6/windows_llamacpp_is_20_faster/ The bf16 support is a big difference. It will eventually end up on your machine next year]"
* [https://www.youtube.com/@stanfordonline/videos Stanford Online] - [https://www.youtube.com/watch?v=Ub3GoFaUcds Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 1 - Transformer]
* [https://www.onyx.app/ Onyx] (https://github.com/onyx-dot-app/onyx) - a feature-rich, self-hostable chat UI (another pseudo-open project)
* https://github.com/microsoft/ai-agents-for-beginners
* https://github.com/dwmkerr/terminal-ai
* https://medevel.com/15-free-open-source-ai-terminal-assistants/
* [https://old.reddit.com/r/LocalLLaMA/comments/1nt2c38/llamacpp_moe_models_find_best_ncpumoe_value/ MoE benchmark: finding the best --n-cpu-moe value]
* [https://www.librechat.ai/ LibreChat] (https://github.com/danny-avila/LibreChat) - candidate replacement for open-webui; see the [https://www.librechat.ai/docs/features/code_interpreter code interpreter docs]
* [https://rentry.org/geechan geechan - SillyTavern GLM system prompts]
* https://github.com/github/spec-kit
* https://www.youtube.com/watch?v=em3vIT9aUsg
* https://joeyagreco.medium.com/reverse-engineering-the-hottest-new-game-5362cfe7c452
* https://blog.plasticlabs.ai/blog/YouSim%3B-Explore-The-Multiverse-of-Identity
* https://worldsim.nousresearch.com/console
* [https://www.reddit.com/r/LocalLLaMA/comments/1n6od0s/%E6%AE%8B%E5%BF%83_zanshin_navigate_through_media_by_speaker/ 残心 / Zanshin - navigate through media by speaker]
* [https://www.youtube.com/watch?v=Jaj_SQsF-BI LLM from Scratch Tutorial - Code & Train Qwen 3]
* [https://github.com/Qetesh/miniflux-ai miniflux-ai]
* [https://www.reddit.com/r/SillyTavernAI/comments/1msz8ao/glm_45_preset/ GLM 4.5 preset]
* [https://github.com/simstudioai/sim simstudioai/sim] - [https://www.youtube.com/watch?v=mFKyiyGPu1I Install Sim Locally with Ollama: AI Agent Workflow Builder]
* [https://huggingface.co/fofr/sdxl-emoji sdxl-emoji]
* [https://github.com/gabber-dev/gabber gabber]
* [https://www.youtube.com/watch?v=JbHKMibTb5Q RamaLama] - Ollama alternative
* [https://github.com/GewoonJaap/qwen-code-cli-wrapper qwen-code-cli-wrapper]
* Docling, DOTS OCR, and Ollama OCR
* https://agents.md/
* [https://github.com/ComfyUI-Workflow/awesome-comfyui awesome-comfyui]
* http rest
* https://github.com/sakalond/StableGen - generate textures for Blender
* [https://www.reddit.com/r/comfyui/comments/1mvqeyw/what_upscaler_is_the_best_now/ What upscaler is the best now?]
* [https://www.reddit.com/r/LocalLLaMA/comments/1mi7bem/new_llamacpp_options_make_moe_offloading_trivial/ New llama.cpp options make MoE offloading trivial]: "You can use <code>-ngl 49</code> and just pass <code>--n-cpu-moe 20</code>. Also add <code>-fa</code> and <code>-ctk q8_0 -ctv q8_0</code>."
* <code>PIP_BREAK_SYSTEM_PACKAGES=1 comfy install</code>
* https://overpass-turbo.eu/
* [https://www.youtube.com/watch?v=mi2KmpV3Wvg Qwen-3 Coder CLI Forgets Everything. I Gave It a Perfect Memory.]
* https://docs.vllm.ai/en/latest/getting_started/installation/cpu.html#related-runtime-environment-variables
* https://modal.com/blog/fast-cheap-batch-transcription
* [https://www.youtube.com/watch?v=EUG65dIY-2k Make your AI Agents 10x Smarter with GraphRAG (n8n)]
* https://huggingface.co/rednote-hilab/dots.ocr
* [https://www.youtube.com/watch?v=yNPwsKa52zs YOLOE: Next Gen Computer Vision - Zero Training Required!]
* [https://github.com/campfirein/cipher Cipher] - an open-source memory layer designed specifically for coding agents.
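Putting the MoE offloading tip above in one place: a minimal llama-server sketch, assuming a roughly 10-12GB GPU and a quant like the Qwen3-Coder one used in the benchmark tables below. The model filename and the exact --n-cpu-moe value are illustrative assumptions; tune --n-cpu-moe down until the model just fits in VRAM.

<pre>
# Assumed example, not a recipe: keep all layers nominally on GPU (-ngl 99),
# then push the expert (MoE) tensors of the first 20 layers back to the CPU
# (--n-cpu-moe 20) until it fits in VRAM. -fa enables flash attention;
# -ctk/-ctv q8_0 quantise the KV cache to save more VRAM.
llama-server -m Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf \
  -ngl 99 --n-cpu-moe 20 \
  -fa -ctk q8_0 -ctv q8_0 \
  --jinja --ctx-size 12684
</pre>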
* https://smcleod.net/2024/12/bringing-k/v-context-quantisation-to-ollama/
* [https://www.youtube.com/watch?v=g21royNJ4fw Local LightRAG: A GraphRAG Alternative but Fully Local with Ollama]
* [https://www.youtube.com/watch?v=oetP9uksUwM Graph RAG Evolved: PathRAG (Relational Reasoning Paths)]
* [https://uigenoutput.tesslate.com/uigen-x-4b-0729 UIGEN-X-4B-0729]
* [https://www.youtube.com/watch?v=p7yRLIj9IyQ The Only Embedding Model You Need for RAG]
* https://ollama.com/library/smallthinker - can be used as a draft model for QwQ-32B, giving a 70% speed-up.
* sqrt(params × active) - a rule of thumb for the number of parameters a dense model would need to match a MoE; e.g. a 30B-A3B MoE ≈ sqrt(30 × 3) ≈ a 9.5B dense model.
* [https://huggingface.co/wushuang98/Direct3D-S2 Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention]
* [https://www.youtube.com/watch?v=PxcOIINgiaA Make RAG 100x Better with Real-Time Knowledge Graphs]
* [https://huggingface.co/rednote-hilab/dots.llm1.inst dots1]
* [https://github.com/lmg-anon/mikupad mikupad]
* [https://github.com/RVC-Boss/GPT-SoVITS/wiki/GPT%E2%80%90SoVITS%E2%80%90features-(%E5%90%84%E7%89%88%E6%9C%AC%E7%89%B9%E6%80%A7) GPT-SoVITS features (per-version feature comparison)]
* [https://www.reddit.com/r/StableDiffusion/comments/1je3b9m/are_there_any_free_working_voice_cloning_ais/ Are there any free working voice cloning AIs?]
* [https://github.com/zylon-ai/private-gpt privategpt] - [https://www.reddit.com/r/LocalLLaMA/comments/1ktm248/what_model_should_i_choose/mtunwvz/ "privategpt IMHO is the best for RAG if you need the source: it not only lists the PDF used for the answer but also the page, and is quite precise. So for study and searching a library it's the best I know."]
* [https://diffusers-flux-quant.hf.space/ FLUX Model Quantization Challenge]
* [https://www.reddit.com/r/LocalLLaMA/comments/1ki7tg7/dont_offload_gguf_layers_offload_tensors_200_gen/mrdqlfr/ Don't offload GGUF layers, offload tensors (200%+ gen speed)]
* [https://www.youtube.com/watch?v=ZoyPqXvnnZ8 I Built the Ultimate RAG MCP Server for AI Coding (Better than Context7)]
* [https://www.youtube.com/watch?v=Dgo6dyPMv_Q NEW FramePack F1 Model - Much Better Results - Bonus How to Install Sage]
* [https://huggingface.co/DavidAU/Qwen3-30B-A6B-16-Extreme Qwen3-30B-A6B-16-Extreme]
* https://docs.google.com/document/d/12ATcyjCEKh8T-MPDZ-VMiQ1XMa9FUvvk2QazrsKoiR8/edit?tab=t.0
* [https://www.youtube.com/watch?v=LMH62T_XCF4 This AI Model has me excited about the future of Local LLM's | Qwen3-30B-A3B]
* [https://www.reddit.com/r/LocalLLaMA/comments/1ev8n2s/exclude_top_choices_xtc_a_sampler_that_boosts/ Exclude Top Choices (XTC): a sampler that boosts creativity]
* https://blog.runpod.io/upscaling-videos-using/
* [https://github.com/LostRuins/koboldcpp/wiki koboldcpp wiki]
* [https://rentry.org/MixtralForRetards Mixtral for Retards]

A known-good koboldcpp invocation:
<pre>
koboldcpp.exe 13B-HyperMantis.ggmlv3.q4_K_M.bin --debug --usecublas --usemlock --contextsize 8192 --blasbatchsize 512 --psutil_set_threads --threads 6 --blasthreads 10 --gpulayers 5 --highpriority --stream --unbantokens --smartcontext
</pre>
If you're running 1.35 and a superHOT model, you should also add --linearrope, which should make them perform better.

RP recommended models: "superhot, airoboros, wizard-vicuna, guanaco, chronos are a few commonly discussed models off the top of my head. For me, it's superhot or guanaco (one or the other, not the merge though)"

== Programs ==

=== UI ===
* jan.ai
* open webui

==== CLI ====
* [https://github.com/nazdridoy/ngpt ngpt]
* [https://github.com/adhikasp/mcp-client-cli mcp-client-cli]
* [https://github.com/f/mcptools mcptools] - for inspecting MCP servers.
{| class="wikitable" !'''Client Name''' !'''Description''' !'''Key Features''' !'''Implementation''' !'''URL''' |- |oterm |A text-based terminal client for Ollama with MCP tools, prompts, and sampling. |Supports MCP tools, prompts, sampling; Streamable HTTP & WebSocket transports. |TUI (Terminal UI) |[https://github.com/ggozad/oterm GitHub] |- |ollama-mcp-client |Python-based client for integrating local Ollama models with MCP servers. |Seamless MCP integration, Git operations support, tool discovery. |Python CLI |[https://github.com/mihirrd/ollama-mcp-client GitHub] |- |mcp-client-for-ollama |TUI client for interacting with MCP servers using Ollama, offering interactivity. |Multi-server support, streaming responses, fuzzy autocomplete. |TUI (Terminal UI) |[https://github.com/jonigl/mcp-client-for-ollama GitHub] |- |Mcp-cli |General-purpose CLI for interacting with MCP servers, supporting Ollama. |Supports multiple providers, modular chat, context-aware completions. |Command-line |[https://mcpmarket.com/server/mcp-cli Source] |- |Mcp Client Ollama |Python-based CLI for connecting Ollama to MCP servers, focusing on tool execution. |stdio and SSE transports, JSON configuration, multiple server support. |Python CLI |[https://mcpmarket.com/server/mcp-client-ollama Source] |} ===== GUI ===== === Tasksel === Not sure if this is common knowledge, but some advice to all fellow VRAMlets who are offloading to RAM. Setting the number of threads is not good enough, you can get extra speed by manually setting core affinity. For context: I have a 13600K which has 6 P-cores. I had read that you should set --threads to that number, so I would run koboldcpp with --threads 6 and from some testing this was indeed the best option with that argument alone. BUT, I looked at which cores were actually used and found e-cores also being used sometimes. So the next step was to set the core affinity to just P-cores. Each P-core has two threads and CPU0-11 was P-cores, CPU12-19 was E-cores. Thus, I ran koboldcpp with one thread from each core: taskset -c 0,2,4,6,8,10 python kobodcpp.py [args] My speed running command-r went from ~2.3 T/s to 2.67 T/s Pretty good. But, what if I use them fully, I thought. So I set --threads 12 and taskset -c 0,1,2,3,4,5,6,7,8,9,10,11 And I get a generation with 3.09 T/s That's a whooping 33% increase from my initial. 
== Benchmarks ==
* [https://lmarena.ai/leaderboard LMArena]
* https://llm-stats.com/benchmarks
* https://scale.com/leaderboard
* https://artificialanalysis.ai/
* https://livebench.ai/#/
* https://xbench.org/
* [https://mcpmark.ai/ MCPMark]
* https://eqbench.com/creative_writing.html
* [https://www.tbench.ai/ terminal-bench]
* [https://gorilla.cs.berkeley.edu/leaderboard.html BFCL: From Tool Use to Agentic Evaluation of Large Language Models]
* [https://huggingface.co/spaces/mteb/leaderboard MTEB Leaderboard] - embedding models
* [https://huggingface.co/spaces/hf-audio/open_asr_leaderboard Open ASR Leaderboard]
* [https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard UGI-Leaderboard] - Uncensored General Intelligence
* [https://artificialanalysis.ai/evaluations/artificial-analysis-long-context-reasoning Artificial Analysis Long Context Reasoning Benchmark Leaderboard]
* [https://artificialanalysis.ai/evaluations AI Model Evaluations]
* [https://old.reddit.com/r/LocalLLaMA/comments/1ok2kr3/support_for_qwen3_vl_has_been_merged_into_llamacpp/nm7k3z7/ Qwen3-VL vs Qwen3]

=== Vision Benchmarks ===
* https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
* https://dubesor.de/visionbench

== MCP Servers ==
* [https://www.youtube.com/watch?v=G7gK8H6u7Rs Context7]
* [https://github.com/punkpeye/awesome-mcp-clients/ awesome-mcp-clients]
* https://smithery.ai/
* https://mcp.so/servers
* [https://github.com/f/mcptools mcptools]
* https://www.reddit.com/r/RooCode/comments/1ijgk2x/roo_code_mcps_best_mcp_configs/mbej58g/

== Prompts ==
* [https://old.reddit.com/r/SillyTavernAI/comments/1pia5ng/glm_is_amazing/nt5w63a/ GLM 4.6 prompt]
* https://rentry.org/geechan#model-specific-presets
* [https://old.reddit.com/r/LocalLLaMA/comments/1obqkpe/best_local_llms_october_2025/nkhx25s/ GPT-OSS jailbreak]
* https://cookbook.openai.com/examples/enhance_your_prompts_with_meta_prompting
* https://www.prompthub.us/blog/a-complete-guide-to-meta-prompting
* [https://www.youtube.com/watch?v=bkra7oV5_I8 Rubric prompting]
* https://generateprompt.ai/en/
* [https://github.com/f/awesome-chatgpt-prompts awesome-chatgpt-prompts]
* https://alidocs.dingtalk.com/i/nodes/EpGBa2Lm8aZxe5myC99MelA2WgN7R35y
* https://github.com/elder-plinius/L1B3RT4S

== Video ==
* [https://www.youtube.com/watch?v=h9xVYraOZwM&t=3s ComfyUI - Wan 2.2 & FFLF with Flux Kontext for Quick Keyframes for Video]
* [https://www.youtube.com/watch?v=U3tx8_QPZLs One-Step Video Upscaling: Complete ComfyUI SeedVR2 Guide (Free workflow included) | AInVFX]
* [https://alidocs.dingtalk.com/i/nodes/EpGBa2Lm8aZxe5myC99MelA2WgN7R35y Easy Creation with One Click - AI Videos] - Wan cheatsheet.
* [https://www.reddit.com/r/comfyui/comments/1mulz85/tlbvfi_temporalaware_latent_brownian_bridge/ TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation]

== Voice ==
* [https://huggingface.co/stepfun-ai/Step-Audio-2-mini Step-Audio-2-mini] - an 8-billion-parameter (8B) speech-to-speech model; reportedly outperforms GPT-4o-Audio.
* [https://www.reddit.com/r/speechtech/comments/1mzyic2/vibevoice_opensource_texttospeech_from_microsoft/ VibeVoice discussion]: "There are other TTS that are much better, like Spark-TTS and Higgs-TTS. Keep in mind that the Higgs TTS full model with voice cloning needs 18GB of VRAM and is much slower than Spark-TTS."
* https://www.reddit.com/r/speechtech/
* [https://github.com/Marvis-Labs/marvis-tts marvis-tts]
* https://github.com/nuvious/coqui-ai-api
* [https://www.youtube.com/watch?v=X-_rG6OtgfU Microsoft VibeVoice TTS LOCAL Testing - A Multi-Speaker Podcast TTS!]
* Dia, [https://github.com/fishaudio/fish-speech Fish Speech]

== Agentic AI Dungeon ==
* n8n - web tool for AI agents
* chroma - vector database
* pgvector
* lmstudio

== LLaMA ==
[https://boards.4channel.org/g/catalog#s=lmg%2F /lmg/]

=== Models ===
* [https://old.reddit.com/r/LocalLLaMA/comments/1obrvab/support_for_ling_and_ring_models_1000b103b16b_has/ Support for Ling and Ring models (1000B/103B/16B)]
* https://github.com/PrimeIntellect-ai/prime-rl - claims to be a smarter finetune of GLM-4.5 Air

==== Abliteration / MXFP4 MoE ====
* [https://old.reddit.com/r/LocalLLaMA/comments/1p5epot/the_most_objectively_correct_way_to_abliterate_so/ The most objectively correct way to abliterate (so far)]
* [https://huggingface.co/noctrex/GLM-4.5-Air-Derestricted-MXFP4_MOE-GGUF noctrex/GLM-4.5-Air-Derestricted-MXFP4_MOE-GGUF]
* [https://old.reddit.com/r/LocalLLaMA/comments/1oypwa7/a_more_surgical_approach_to_abliteration/ A more surgical approach to abliteration]
* [https://www.reddit.com/r/LocalLLaMA/comments/1oymku1/heretic_fully_automatic_censorship_removal_for/ Heretic: fully automatic censorship removal]
* [https://old.reddit.com/r/LocalLLaMA/comments/1ozh8py/mxfp4_hybrid_dense_models_ready_to_share_near/ MXFP4 hybrid dense models, ready to share] - magiccodingman

==== Upcoming ====
* [https://github.com/ggml-org/llama.cpp/issues/16186 Qwen3-Omni-30B-A3B]
* [https://qwen.ai/blog?id=qwen3-omni-flash-20251201 Qwen3-Omni-Flash-2025-12-01]
* [https://huggingface.co/inclusionAI/Ring-flash-2.0-GGUF Ring-Flash-2.0]
* [https://old.reddit.com/r/LocalLLaMA/comments/1oh5asg/new_model_from_the_minimax_team_minimaxm2_an/ MiniMax-M2, an impressive 230B-A10B LLM] - not sure if this works with llama.cpp
* [https://old.reddit.com/r/LocalLLaMA/comments/1ojzekg/moonshotaikimilinear48ba3binstruct_hugging_face/ moonshotai/Kimi-Linear-48B-A3B-Instruct]
* [https://huggingface.co/unsloth/Olmo-3-32B-Think-GGUF Olmo 3]
* [https://github.com/ggml-org/llama.cpp/pull/17420 GigaChat3]
* [https://github.com/ggml-org/llama.cpp/issues/15512 ERNIE-4.5-VL-28B-A3B-Thinking]
* [https://old.reddit.com/r/LocalLLaMA/comments/1p6gsjh/llada20_103b16b_has_been_released/ LLaDA2.0 (103B/16B) has been released]

==== Other ====
* [https://huggingface.co/ibm-granite/models Granite 4.0]
* LLaDA-MoE-7B-A1B-Instruct
* OLMoE

=== Misc ===
* [https://boards.4channel.org/g/thread/93415313#p93421310 llamacpp server?] - "There was a tokenizer caching error, some people said. Redownload the hf_output files from the repo, or just change the use_cache line in config.json to say <code>"use_cache": true,</code>" (for the Vicuna13B-free)
* https://github.com/stochasticai/xturing/tree/main/examples/int4_finetuning
* https://wiki.installgentoo.com/wiki/Home_server#Expanding_Your_Storage
* https://rentry.org/llama-tard-v2
* https://rentry.org/llamaaids
* https://hackmd.io/@reneil1337/alpaca
* https://find.4chan.org/?q=AI+Dynamic+Storytelling+General
* https://find.4chan.org/?q=AI+Chatbot+General
* https://find.4chan.org/?q=%2Flmg%2F (local models general)
* https://boards.4channel.org/g/thread/92400764#p92400764
* <nowiki>https://files.catbox.moe/lvefgy.json</nowiki>
* https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/

<blockquote>python server.py --model llama-7b-4bit --wbits 4

python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128</blockquote>

* https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/59 - for installing when you hit an out-of-space error
* https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode
* https://github.com/pybind/pybind11/discussions/4566
* https://lmsysvicuna.miraheze.org/wiki/How_to_use_Vicuna#Use_with_llama.cpp%3A
* https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g
* [https://huggingface.co/ShreyasBrill/Vicuna-13B "Here's the uncucked Vicuna model (trained on the dataset that doesn't have the moralistic bullshit anymore). Too bad it's just the CPU quantized version."]
* [https://www.reddit.com/r/Oobabooga/comments/12hyini/vicuna_generating_its_own_prompts/jfrtvh3/ Vicuna generating its own prompts]
* https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g - <code>python3 llama.py vicuna-AlekseyKorshuk-7B c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors vicuna-AlekseyKorshuk-7B-GPTQ-4bit-128g.safetensors</code>
* [https://github.com/ggerganov/llama.cpp/pull/933 ~65% speedup of the AVX-512 implementation of <code>ggml_vec_dot_q4_0()</code> #933]
* "Speaking of which, for any 30b anons struggling with context size, I figured something out. If you use the Triton branch on WSL, go into GPTQ_loader.py and comment out make_quant_attn like so" [https://boards.4channel.org/g/thread/92835207#p92837143 from here]
* "Just grab the CUDA branch of qwop's GPTQ for LLaMA (or Triton if you want to be a dickhole), or if you have webui installed, go into the folder for GPTQ. Make sure all the requirements are installed and run this line: <code>python llama.py /path/to30b c4 --wbits 4 --true-sequential --groupsize 128 --save_safetensors alpacino-4bit-128g.safetensors</code> And it'll run. For Windows, obviously flip the slashes the right way. And for Linux, you may need to add CUDA_VISIBLE_DEVICES=0 to the front of the command." [https://boards.4channel.org/g/thread/92842505#p92845181 source]

== GGML Quantization ==
* [https://www.reddit.com/r/LocalLLaMA/comments/14gjz8h/i_have_multiple_doubts_about_kquant_models_and/jp69o4l/ Some tables on Reddit]
* [https://www.reddit.com/r/LocalLLaMA/comments/142q5k5/updated_relative_comparison_of_ggml_quantization/ Updated relative comparison of GGML quantization types]

== Papers ==
* [https://arxiv.org/abs/2305.03047 Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision]
* http://attentionviz.com/

== Local Benchmarks ==
TODO: Try different cublas batch sizes.

=== Main ===

==== Older ====
{| class="wikitable sortable"
!Date!!Interface!!Model!!GPTQ!!Xformers?!!HW!!Load!!Speed
|-
| ||text-gen||anon8231489123-vicuna-13b-GPTQ-4bit-128g||GPTQ-for-LLaMa-'''triton'''||yes||240gb SSD, 16gb, desktop off||10.53s||7.97 tokens/s
|-
| ||text-gen||anon8231489123-vicuna-13b-GPTQ-4bit-128g||GPTQ-for-LLaMa-'''triton'''||no xformers||240gb SSD, 16gb, desktop off||10.22s||7.55 tokens/s
|-
| ||text-gen||anon8231489123-vicuna-13b-GPTQ-4bit-128g||GPTQ-for-LLaMa-'''cuda'''||no xformers||240gb SSD, 16gb, desktop off||16.68s||4.03 tokens/s
|-
| ||text-gen||anon8231489123-vicuna-13b-GPTQ-4bit-128g||GPTQ-for-LLaMa-'''cuda'''||yes||240gb SSD, 16gb, desktop off||9.34s||4.01 tokens/s
|-
| ||text-gen||llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''||no||no||2TB SSD, 64gb||?||0.67 tokens/s
|-
| ||text-gen||llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''||no||no||2TB SSD, 64gb, '''--threads 8'''||maybe 30s?||0.51 tokens/s
|-
| ||text-gen||llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''||no||no||2TB SSD, 64gb, '''--threads 7'''|| ||0.68 tokens/s
|-
| ||text-gen||llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''||no||no||2TB SSD, 64gb, '''--threads 6'''|| ||0.61 tokens/s
|-
| ||text-gen||anon8231489123-vicuna-13b-GPTQ-4bit-128g'''-ggml'''||no||no||2TB SSD, 64gb|| ||1.17 tokens/s
|-
| ||text-gen||anon8231489123-vicuna-13b-GPTQ-4bit-128g||GPTQ-for-LLaMa-'''triton'''||yes||2TB SSD, 64gb, '''--pre_layer 25'''||45.69s||0.25 tokens/s
|-
| ||text-gen||anon8231489123-vicuna-13b-GPTQ-4bit-128g||GPTQ-for-LLaMa-'''triton'''||yes||2TB SSD, 64gb||36.47s||9.63 tokens/s
|-
| ||'''llama.cpp'''||llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''|| || ||2TB SSD, 64gb||10317.90 ms||1096.21 ms per token
|-
| ||'''llama.cpp-modern-avx512'''||llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''|| || ||2TB SSD, 64gb||9288.69 ms||1049.03 ms per token
|-
| ||'''llama.cpp-avx512-pr833'''||llama-30b-sft-oa-alpaca-epoch-2-4bit'''-ggml'''|| || ||2TB SSD, 64gb||13864.06 ms||0.89 tokens/s, 820.68 ms per token
|-
| ||text-gen||TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-'''q4_0'''|| || ||2TB SSD, 64gb|| ||0.78 tokens/s
|-
| ||text-gen+'''avx512-pr833'''||TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0|| || ||2TB SSD, 64gb|| ||1.04 tokens/s
|-
|2023-04-24||text-gen||anon8231489123-vicuna-13b-GPTQ-4bit-128g||GPTQ-for-LLaMa-'''triton'''||yes||2TB SSD, 64gb, also running llama.cpp with another model||16.36s||5.07 tokens/s
|-
|2023-04-26||koboldcpp||gozfarb-llama-30b-supercot-ggml/ggml-model-q4_0.bin||'''clblast'''||n/a||2TB SSD, 64gb, '''--threads 8'''|| ||1073ms/T
|-
|2023-04-29||koboldcpp||Alpacino-30b-q4_0.bin||'''clblast'''||n/a||2TB SSD, 64gb|| ||700ms/T
|-
|2023-07-13||koboldcpp||llama-33b-supercot-ggml-'''q5_1''' (complains about '''old format''')||'''cublas'''||n/a||2TB SSD, 64gb, '''--nommap --smartcontext --usecublas --gpulayers 18'''|| ||643ms/T, 1.4T/s
|-
|2023-07-13||koboldcpp||llama-33b-supercot-ggml-'''q5_1''' (complains about '''old format''')||'''clblast'''||n/a||2TB SSD, 64gb, --nommap --smartcontext '''--useclblast 0 0''' --gpulayers 18|| ||685ms/T, 1.2T/s
|-
|2023-07-13||koboldcpp||'''airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin'''||'''cublas'''||n/a||2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 18 '''(probably space for more)'''|| ||652ms/T, 1.5T/s
|-
|2023-07-13||koboldcpp||airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin||cublas||n/a||2TB SSD, 64gb, --nommap --smartcontext --usecublas '''--gpulayers 26 (I note 3 threads are set by default)'''|| ||593ms/T, 1.6T/s
|-
|2023-07-13||koboldcpp||airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin||cublas||n/a||2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 26 '''--psutil_set_threads (4 threads)'''|| ||514ms/T, 1.8T/s
|-
|2023-07-13||koboldcpp||airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin||cublas||n/a||2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --psutil_set_threads '''(removed nommap)'''|| ||508ms/T, 1.9T/s
|-
|2023-07-13||koboldcpp||airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin||cublas||n/a||2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 '''--threads 5'''|| ||454ms/T, 2.1T/s
|-
|2023-07-13||koboldcpp||airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin||cublas||n/a||2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 '''--threads 6'''|| ||'''422ms/T, 2.2T/s'''
|-
|2023-07-13||koboldcpp||airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin||cublas||n/a||2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 '''--threads 7'''|| ||509ms/T, 1.8T/s
|-
|2023-07-13||koboldcpp||airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin||cublas||n/a||2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 '''--threads 8'''|| ||494ms/T, 1.7T/s
|-
|2023-07-13||koboldcpp||airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin||cublas||n/a||2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 '''--threads 6 --linearrope (no difference, needs supercot?)'''|| ||425ms/T, 2.2T/s
|-
|2023-07-13||koboldcpp||airoboros-33b-gpt4-'''1.4'''.ggmlv3.q4_K_M.bin||cublas||n/a||2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 '''--threads 6'''|| ||400ms/T, 2.3T/s
|-
|2023-07-13||koboldcpp||airoboros-'''65b'''-gpt4-1.4.ggmlv3.q4_K_M.bin||cublas||n/a||2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 13 --threads 6|| ||1366ms/T, 0.7T/s
|-
|2023-07-14||koboldcpp||airoboros-65b-gpt4-1.4.ggmlv3.'''q2_K'''.bin||cublas||n/a||2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 13 --threads 6|| ||765ms/T, 1.2T/s
|-
|2023-09-06||koboldcpp||'''guanaco-33B.ggmlv3.q4_K_M.bin'''||cublas||n/a||2TB SSD, 64gb, --stream --smartcontext --usecublas --gpulayers 29 --threads 6|| ||562ms/T, 1.3T/s
|-
|2023-09-06||koboldcpp||guanaco-33B.ggmlv3.q4_K_M.bin||cublas||n/a||2TB SSD, 64gb, '''--nommap''' --stream --smartcontext --usecublas --gpulayers 29 --threads 6|| ||567ms/T, total 70.7s (1.4T/s)
|-
|2023-09-06||koboldcpp||guanaco-33B.ggmlv3.q4_K_M.bin||cublas||n/a||2TB SSD, 64gb, '''--nommap''' --stream --smartcontext --usecublas --gpulayers 25 --threads 6|| ||563ms/T, total 70.2s (1.4T/s)
|-
|2023-12-03||koboldcpp||guanaco-33B.q4_K_M.'''gguf'''||cublas||n/a||2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 6 '''--gpulayers 27'''|| ||330.7ms/T, total 40.79s (2.94T/s)
|-
|2023-12-07||koboldcpp||guanaco-33B.q4_K_M.gguf||cublas||n/a||'''7950x3d''', 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 6 --gpulayers 27|| ||202.1ms/T, 4.78T/s
|-
|2023-12-07||koboldcpp||guanaco-33B.q4_K_M.gguf||cublas||n/a||7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas '''--threads 32''' --gpulayers 27|| ||360.8ms/T, 2.68T/s
|-
|2023-12-07||koboldcpp||guanaco-33B.q4_K_M.gguf||cublas||n/a||7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas '''--threads 16''' --gpulayers 27|| ||202.6ms/T, 4.82T/s
|-
|2023-12-07||koboldcpp||guanaco-33B.q4_K_M.gguf||cublas||n/a||7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas '''--threads 15''' --gpulayers 27|| ||195.0ms/T, 5.03T/s
|-
|2023-12-16||koboldcpp||'''mistral-7b-instruct-v0.2.Q8_0.gguf'''||cublas||n/a||7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 15 '''--gpulayers 33'''|| ||22.9ms/T, 42.90T/s
|-
|2023-12-17||koboldcpp||'''mixtral-8x7b-moe-rp-story.Q8_0.gguf'''||cublas||n/a||7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 15 '''--gpulayers 6'''|| ||214.9ms/T, 4.47T/s
|-
|2024-02-04||SillyTavern||miqu 70b|| || ||gpu layers 9|| ||1.4T/s
|}

==== Old (2080ti) ====
{| class="wikitable sortable"
!Date!!Interface!!Model!!Ram!!VRam!!Parameters!!Speed
|-
|2025-08-01||ollama||qwen3-'''coder''':30b-a3b-q4_K_M|| || || ||17.01 T/s
|-
|2025-08-01||ollama||hf.co/'''unsloth'''/Qwen3-'''Coder'''-30B-A3B-Instruct-GGUF:Q4_K_M|| || || ||16.84 T/s
|-
|2025-08-01||ollama||hf.co/'''unsloth'''/Qwen3-30B-A3B-'''Thinking'''-2507-GGUF:Q4_K_M|| || || ||14.39 T/s
|-
|2025-08-01||ollama||qwen3:30b-a3b-thinking-2507-'''q4_K_M'''|| || || ||13.95 T/s
|-
|2025-08-02||ollama||Qwen3:30b-a3b-thinking-2507-'''q8_0'''|| || || ||10.93 T/s
|-
|2025-08-13||'''llama.cpp-cuda'''||unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf|| || ||No gpu, --ctx-size 32684||'''20.62 T/s'''
|-
|2025-08-13||llama.cpp-cuda||unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf|| ||9287MiB / 11264MiB||-ngl 22 --ctx-size 12684||'''31.54 T/s'''
|-
|2025-08-13||llama.cpp-cuda||unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q6_K_XL.gguf|| || ||-ngl 15 --ctx-size 12684||'''21.51 T/s'''
|-
|2025-08-14||llama.cpp-cuda||unsloth_gemma-3-27b-it-qat-GGUF_gemma-3-27b-it-qat-Q4_K_M.gguf|| || ||default||like 3? idk
|-
|2025-08-14||llama.cpp-cuda||unsloth_gemma-3-27b-it-qat-GGUF_gemma-3-27b-it-qat-Q4_K_M.gguf|| || ||-fa --ctx-size 12684||2.76 T/s
|-
|2025-08-14||llama.cpp-cuda||unsloth_gemma-3-27b-it-qat-GGUF_gemma-3-27b-it-qat-Q4_K_M.gguf|| || ||-fa -ngl 25 --ctx-size 12684||4.87 T/s
|-
|2025-08-14||llama.cpp-cuda||unsloth_gemma-3-27b-it-qat-GGUF_gemma-3-27b-it-qat-Q4_K_M.gguf|| || ||-ngl 22 --ctx-size 12684||4.10 T/s
|-
|2025-08-14||llama.cpp-cuda||unsloth_gemma-3-27b-it-qat-GGUF_gemma-3-27b-it-qat-Q4_K_M.gguf|| || ||-ngl 22 --ctx-size 12684 '''-fa'''||4.43 T/s
|-
|2025-08-14||llama.cpp-cuda||unsloth/'''gemma-3n-E4B'''-it-GGUF:UD-Q4_K_XL|| || ||-fa -ngl 99 --ctx-size 12684||23.14 T/s
|-
|2025-08-14||llama.cpp-cuda||unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF:UD-Q4_K_XL|| || ||--ctx-size 12684 --flash-attn --jinja --temp 0.15 --top-k -1 --top-p 1.00 -ngl 20||7.48 T/s
|-
|2025-08-14||llama.cpp-cuda||unsloth/Qwen3-4B-Instruct-2507-GGUF:UD-Q8_K_XL|| || ||--flash-attn -ngl 99 --jinja --ctx-size 12684 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 -c 22684 -n 32768 --no-context-shift||48.69 T/s
|-
|2025-08-14||ollama||hf.co/bartowski/L3.3-MS-Nevoria-70b-GGUF:Q4_K_M|| || ||defaults||1.74 T/s
|-
|2025-08-14||llama.cpp-cuda||hf.co/bartowski/L3.3-MS-Nevoria-70b-GGUF:Q4_K_M|| || ||--flash-attn -ngl 15 --ctx-size 6000||1.82 T/s
|-
|2025-08-14||llama.cpp-cuda||hf.co/bartowski/L3.3-MS-Nevoria-70b-GGUF:Q4_K_M|| || ||--hf-repo-draft unsloth/Llama-3.2-1B-Instruct-GGUF:Q4_K_M --flash-attn -ngl 13 --ctx-size 6000 --gpu-layers-draft 99||2.98 T/s
|-
|2025-08-14||llama.cpp-cuda||unsloth_gemma-3-27b-it-qat-GGUF_gemma-3-27b-it-qat-Q4_K_M.gguf|| || ||+270m draft model||7.19 T/s
|-
|2025-08-16||llama.cpp-cuda||qwen3-coder-30b:Q4_K_XL-GPU|| ||8879MiB / 11264MiB||--flash-attn -ctk q4_0 -ctv q4_0 --jinja -ngl 22 --ctx-size 12684||33.77 T/s
|-
|2025-08-16||llama.cpp-cuda||qwen3-coder-30b:Q4_K_XL-GPU|| ||8907MiB / 11264MiB||--flash-attn -ctk q8_0 -ctv q8_0 --jinja -ngl 22 --ctx-size 12684||35.71 T/s
|-
|2025-08-16||llama.cpp-cuda||qwen3-coder-30b:Q4_K_XL-GPU|| ||10400MiB / 11264MiB||--flash-attn -ctk q4_0 -ctv q4_0 --jinja '''-ngl 26''' --ctx-size 12684||37.55 T/s
|-
|2025-08-16||llama.cpp-cuda||qwen3-coder-30b:Q4_K_XL-GPU|| ||10371MiB / 11264MiB||--flash-attn -ctk q8_0 -ctv q8_0 --jinja -ngl 26 --ctx-size 12684||37.08 T/s
|-
|2025-08-21||llama.cpp-cuda||ggml-org/gpt-oss-20b-GGUF:mxfp4|| ||7907MiB / 11264MiB||--ctx-size 32768 --jinja -ub 2048 -b 2048 -ngl 99 -fa --n-cpu-moe 16||38.15 T/s
|-
|2025-08-23||llama.cpp-cuda||qwen3-coder-30b:'''Q6'''_K_XL-GPU|| ||10511MiB / 11264MiB||--flash-attn -ctk q8_0 -ctv q8_0 --n-cpu-moe 33 -ngl 99 --ctx-size 12684||32.5 T/s
|-
|2025-08-25||llama.cpp-cuda||unsloth/gemma-3-12b-it-qat-GGUF:UD-Q4_K_XL|| ||6030MiB / 11264MiB||--flash-attn -ngl 22 --ctx-size 12684||12.94 T/s
|-
|2025-08-25||llama.cpp-cuda||unsloth/gemma-3-12b-it-qat-GGUF:UD-Q4_K_XL|| ||10174MiB / 11264MiB||--flash-attn -ngl 99 --ctx-size 12684||48.3 T/s
|-
|2025-08-25||llama.cpp-cuda||unsloth/gemma-3-12b-it-qat-GGUF:UD-Q4_K_XL|| ||10674MiB / 11264MiB||--flash-attn -ngl 99 --ctx-size 12684 '''-ctk q8_0 -ctv q8_0'''||'''37.45 T/s'''
|-
|2025-08-25||llama.cpp-cuda||unsloth/gemma-3-12b-it-qat-GGUF:UD-Q4_K_XL|| ||10728MiB / 11264MiB||--flash-attn -ngl 99 '''--ctx-size 21845'''||49.20 T/s
|-
|2025-08-25||llama.cpp-cuda||QuantStack/InternVL3_5-30B-A3B-gguf:IQ4_XS|| ||9220MiB / 11264MiB||${KV_CACHE} --flash-attn --n-cpu-moe 30 -ngl 99 --ctx-size 12684||47.1 T/s
|-
|2025-08-28||llama.cpp-cuda||mradermacher_Dolphin-Mistral-24B-Venice-Edition-i1-GGUF_Dolphin-Mistral-24B-Venice-Edition.i1-Q4_K_M.gguf|| ||9550MiB / 11264MiB||--ctx-size 8192 --flash-attn --jinja --temp 0.15 --top-p 0.95 --top-k 20 --min-p 0 -ngl 20||7.9 T/s
|-
|2025-08-28||llama.cpp-cuda||mradermacher_Dolphin-Mistral-24B-Venice-Edition-i1-GGUF_Dolphin-Mistral-24B-Venice-Edition.i1-Q4_K_M.gguf|| ||9107MiB / 11264MiB||'''${KV_CACHE}''' --ctx-size 8192 --flash-attn --jinja --temp 0.15 --top-p 0.95 --top-k 20 --min-p 0 -ngl 20||7.8 T/s
|-
|2025-08-28||llama.cpp-cuda||mradermacher_Dolphin-Mistral-24B-Venice-Edition-i1-GGUF_Dolphin-Mistral-24B-Venice-Edition.i1-Q4_K_M.gguf|| ||9766MiB / 11264MiB||${KV_CACHE} --ctx-size 8192 --flash-attn --jinja --temp 0.15 --top-p 0.95 --top-k 20 --min-p 0 '''-ngl 22'''||8.3 T/s
|-
|2025-08-28||llama.cpp-cuda||mradermacher_Dolphin-Mistral-24B-Venice-Edition-i1-GGUF_Dolphin-Mistral-24B-Venice-Edition.i1-Q4_K_M.gguf|| ||10429MiB / 11264MiB||${KV_CACHE} --ctx-size 8192 --flash-attn --jinja --temp 0.15 --top-p 0.95 --top-k 20 --min-p 0 '''-ngl 24'''||8.9 T/s
|-
|2025-08-28||llama.cpp-cuda||mradermacher_Dolphin-Mistral-24B-Venice-Edition-i1-GGUF_Dolphin-Mistral-24B-Venice-Edition.i1-Q4_K_M.gguf|| ||10542MiB / 11264MiB||'''--hf-repo-draft''' bartowski/alamios_Mistral-Small-3.1-DRAFT-0.5B-GGUF:Q4_K_M --jinja --temp 0.15 --top-p 0.95 --top-k 20 --min-p 0 --ctx-size 8192 ${KV_CACHE} --flash-attn '''-ngl 22''' --gpu-layers-draft 99||9.7 T/s
|-
|2025-08-29||llama.cpp-cuda||unsloth/GLM-4.5-Air-UD-Q2_K_XL.gguf||8864MiB||10253MiB / 11264MiB||${KV_CACHE} --flash-attn --n-cpu-moe 42 -ngl 99 --ctx-size 8192||17.1 T/s
|-
|2025-08-30||llama.cpp-cuda||unsloth/gpt-oss-20b:f16||8293M||7876MiB / 11264MiB||--ctx-size 32768 -ub 2048 -b 2048 -ngl 99 --n-cpu-moe 16 ${KV_CACHE}||39.4 T/s
|-
|2025-08-30||llama.cpp-cuda||unsloth/gpt-oss-20b:f16||13.6G||10320MiB / 11264MiB||--ctx-size 32768 -ub 2048 -b 2048 -ngl 99 --n-cpu-moe 12 ${KV_CACHE}||42.8 T/s
|-
|2025-08-30||llama.cpp-cuda||unsloth/gpt-oss-'''120b''':f16||33.6G||9411MiB / 11264MiB||--ctx-size 32768 -ub 2048 -b 2048 ${KV_CACHE}||6.0 T/s
|-
|2025-08-30||llama.cpp-cuda||unsloth/Seed-OSS-36B-Instruct-UD-Q4_K_XL||14.5G||10327MiB / 11264MiB||${KV_CACHE} --flash-attn --jinja --n-cpu-moe 40 -ngl 24 --ctx-size 12684||4.2 T/s
|-
|2025-09-01||llama.cpp-cuda||unsloth/gpt-oss-'''120b''':f16||44.4G||9541MiB / 11264MiB||${'''UNSLOTH_GPT_OSS_PARAMS'''} (k_top 100 not 0) --jinja --ctx-size '''12768''' -ub 2048 -b 2048 -ngl 99 --n-cpu-moe 34 --flash-attn ${KV_CACHE_Q4}||11.7 T/s
|-
|2025-09-01||llama.cpp-cuda||unsloth/gpt-oss-120b:f16||45.9G||9576MiB / 11264MiB||${'''UNSLOTH_GPT_OSS_PARAMS'''} (k_top 100 not 0) --jinja --ctx-size '''22768''' -ub 2048 -b 2048 -ngl 99 --n-cpu-moe 34 --flash-attn ${KV_CACHE_Q4}||12.9 T/s
|-
|2025-09-05||llama.cpp-cuda||'''Hermes-4-14B-GGUF:Q4_K_M'''||853.63 MiB||10625MiB / 11264MiB||${KV_CACHE} --flash-attn on --jinja -ngl 99 --ctx-size 12684||'''42.7 T/s'''
|-
|2025-09-10||llama.cpp-cuda||gabriellarson/'''ERNIE-4.5-21B-A3B-Thinking-GGUF:Q4_K_M'''|| ||'''2742MiB''' / 11264MiB||${KV_CACHE} --flash-attn on --n-cpu-moe 35 -ngl 99 --jinja --ctx-size 12684||'''31.1 T/s'''
|-
|2025-09-10||llama.cpp-cuda||gabriellarson/ERNIE-4.5-21B-A3B-Thinking-GGUF:Q4_K_M|| ||10759MiB / 11264MiB||${KV_CACHE} --flash-attn on --n-cpu-moe '''10''' -ngl 99 --jinja --ctx-size 12684||'''49.0 T/s'''
|-
|2025-09-10||llama.cpp-cuda||gabriellarson/ERNIE-4.5-21B-A3B-Thinking-GGUF:Q8_0||15.7G||9408MiB / 11264MiB||${KV_CACHE} --flash-attn on --n-cpu-moe '''20''' -ngl 99 --jinja --ctx-size 12684||'''28.5 T/s'''
|-
|2025-09-18||llama.cpp-cuda||unsloth_Magistral-Small-2509-GGUF_Magistral-Small-2509-UD-Q4_K_XL.gguf||8900M||10237MiB / 11264MiB||--special --ctx-size 12684 --flash-attn on -ngl 20 --jinja --temp 0.7 --top-k -1 --top-p 0.95||7.9 T/s
|-
|2025-09-18||llama.cpp-cuda||glm-4.5-air:Q3_K_XL||42.6G||8881MiB / 11264MiB||${KV_CACHE} --flash-attn on --jinja --n-cpu-moe 44 -ngl 99 --ctx-size 8192||7.6 T/s
|-
|2025-09-26||llama.cpp-cuda||qwen3-coder-30b:Q4_K_XL-GPU||10.5G||8632MiB||${KV_CACHE} --flash-attn on --jinja --n-cpu-moe 30 -ngl 99 --ctx-size 25000||49.33 T/s
|-
|2025-10-03||llama.cpp-cuda||granite-4.0-h-small:Q4_K_XL ('''32B-9B''')||11.4G||9324MiB||${KV_CACHE} --flash-attn on --n-cpu-moe 25 -ngl 99 --ctx-size 12684||25.23 T/s
|-
|2025-10-03||llama.cpp-cuda||granite-4.0-h-tiny:Q4_K_XL ('''7B-1B''')||0||5487MiB||${KV_CACHE} --flash-attn on --n-cpu-moe 0 -ngl 99 --ctx-size 16384||112.28 T/s
|-
|2025-10-03||llama.cpp-cuda||granite-4.0-h-micro:Q4_K_XL ('''3B''')||0||3431MiB||${KV_CACHE} --flash-attn on --n-cpu-moe 0 -ngl 99 --ctx-size 16384||108.58 T/s
|-
|2025-10-31||llama.cpp-cuda||unsloth/Qwen3-VL-30B-A3B-Thinking-UD-Q4_K_XL.gguf + unsloth/Qwen3-VL-30B-A3B-Thinking-GGUF-mmproj-F16.gguf||10.6G||10855MiB||${KV_CACHE} --flash-attn on --ctx-size 12684 --n-cpu-moe 30 -ngl 99||42.42 T/s
|-
|2025-11-06||llama.cpp-cuda||Face314/Qwen3-Coder-30B-A3B-Instruct-'''MXFP4_MOE'''.gguf||10582MiB||8542MiB||${KV_CACHE} --flash-attn on --n-cpu-moe 30 -ngl 99 --ctx-size 25000||42.72 T/s
|-
|2025-11-25||llama.cpp-cuda||noctrex/Qwen3-VL-30B-A3B-Instruct-1M-'''MXFP4_MOE'''.gguf + unsloth/Qwen3-VL-30B-A3B-Thinking-GGUF-mmproj-F16.gguf||10.3G||9328MiB||${KV_CACHE} --flash-attn on --ctx-size 12684 --n-cpu-moe 30 -ngl 99||43.72 T/s
|-
|2025-11-27||llama.cpp-cuda||unsloth/glm-4.5-air:Q3_K_XL||46.2G||7632MiB||${KV_CACHE_Q4} --flash-attn on --jinja --n-cpu-moe 44 -ngl 99 --ctx-size 12000||13.0 T/s
|-
|2025-11-27||llama.cpp-cuda||mradermacher/GLM-4.5-Air-Derestricted.i1-IQ4_XS.gguf||51G||7226MiB||${KV_CACHE_Q4} --flash-attn on --jinja --n-cpu-moe 44 -ngl 99 --ctx-size 12000||'''13.50 T/s'''
|-
|2025-11-27||llama.cpp-cuda||magiccodingman/Qwen3-VL-'''32B'''-Thinking-unsloth-MXFP4_MOE-Q6_K.gguf (no vision or draft)||24.6G||9644MiB||${KV_CACHE} --flash-attn on --ctx-size 8192 -ngl 17 (502MiB per layer)||'''2.4 T/s'''
|-
|2025-11-29||llama.cpp-cuda||unsloth/'''Qwen3-Next-80B-A3B'''-Instruct-UD-Q4_K_XL.gguf||41.3G||9358MiB||${KV_CACHE} --flash-attn on --jinja --ctx-size 12684 --n-cpu-moe '''40''' -ngl 99||'''22.78 T/s'''
|}

deepseek-r1-distill-qwen-32b - 4.20 tok/sec

==== Newer (5080) ====
{| class="wikitable sortable"
!Date!!Interface!!Model!!Ram!!VRam!!Parameters!!Speed
|-
|2025-12-01||llama.cpp-cuda||unsloth/glm-4.5-air:Q3_K_XL||49.5G||7768MiB||${KV_CACHE} --flash-attn on --jinja --n-cpu-moe 44 -ngl 99 --ctx-size 8192||16.34 T/s
|-
|2025-12-01||llama.cpp-cuda||unsloth/glm-4.5-air:Q3_K_XL||49.5G||14302MiB||${KV_CACHE} --flash-attn on --jinja --n-cpu-moe '''38''' -ngl 99 --ctx-size 8192||'''18.20 T/s'''
|-
|2025-12-05||llama.cpp-cuda||Qwen3-Next-80B-A3B-Thinking||41.0G||9528MiB||${KV_CACHE} --flash-attn on --jinja --ctx-size 12684 --n-cpu-moe '''40''' -ngl 99||23.14 T/s
|-
|2025-12-05||llama.cpp-cuda||Qwen3-Next-80B-A3B-Thinking||41.0G||9528MiB||${KV_CACHE} --flash-attn on --jinja --ctx-size 12684 --n-cpu-moe '''34''' -ngl 99||23.75 T/s
|-
|2025-12-05||llama.cpp-cuda ('''updated to 7278''' with Qwen3-Next fixes)||Qwen3-Next-80B-A3B-Thinking||41.0G||14628MiB||${KV_CACHE} --flash-attn on --jinja --ctx-size 12684 --n-cpu-moe 34 -ngl 99||'''26.40 T/s'''
|}

=== Hextor ===
* https://artificialanalysis.ai/models/comparisons/qwen3-30b-a3b-2507-vs-gemma-3-27b
* https://artificialanalysis.ai/models/comparisons/qwen3-30b-a3b-2507-vs-deepseek-r1-0120
* [https://arxiv.org/html/2507.20984v2#S5 SmallThinker comparison (compares gemma3 and older qwen3)]

{| class="wikitable"
!Date!!Engine!!Model!!Thinking!!Notes!!Total Time!!Eval Tokens!!Prompt (tokens/s)!!Eval (tokens/s)
|-
|2025-08-01||ollama||gemma3:1B||✘|| ||20s||573||91.61||28.04
|-
|2025-08-01||ollama||mashriram/gemma3nTools:e4b||✘|| ||1m12s||537||22.12||7.95
|-
|2025-08-01||ollama||gemma3:12b-it-qat||✘|| ||4m2s||716||9.79||2.99
|-
|2025-08-01||ollama||gemma3:27b-it-qat||✘|| ||9m11s||778||4.16||1.43
|-
|2025-08-01||ollama||deepseek-r1:8b||✔|| ||6m9s||1807||13.53||4.90
|-
|2025-08-01||ollama||qwen3:30b-a3b-instruct-2507-q4_K_M||✘|| ||2m10s||1454||26.60||11.17
|-
|2025-08-01||ollama||hf.co/unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_M||✘|| ||56s||623||21.37||11.26
|-
|2025-08-01||ollama||hf.co/unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF:Q4_K_M||✔|| ||1m50s||1203||25.88||10.87
|-
|2025-08-01||ollama||hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M||✘|| ||36s||504||136.43||11.33
|-
|2025-08-01||ollama||qwq:latest||✔|| ||14m17s||1111||3.21||1.30
|-
|2025-08-01||ollama||hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL||✔|| ||6m9s||1860||12.93||5.05
|-
|2025-08-13||llama.cpp||unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf||✘|| ||30903.17 ms||357||30.27||11.70
|-
|2025-08-15||llama.cpp||unsloth_Qwen3-30B-A3B-Instruct-2507-GGUF_Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf||✘|| ||14703.99 ms||177||30.83||12.04
|-
|2025-08-15||llama.cpp||unsloth_Qwen3-4B-Instruct-2507-GGUF_Qwen3-4B-Instruct-2507-UD-Q8_K_XL.gguf||✘|| ||28738.32 ms||147||28.28||5.22
|-
|2025-08-15||llama.cpp||unsloth_Qwen3-4B-Instruct-2507-GGUF_Qwen3-4B-Instruct-2507-UD-'''Q4'''_K_XL.gguf||✘|| ||8580.48 ms||79||31.71||9.82
|-
|2025-08-16||llama.cpp||unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf||✘||kv_cache q8|| || || ||12.47, 11.73
|-
|2025-08-16||llama.cpp||unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf||✘||kv_cache q4|| || || ||11.89, 11.94
|-
|2025-08-30||llama.cpp||unsloth/gpt-oss-20b-F16.gguf||✔||no fa?|| || || ||11.7
|-
|2025-10-04||llama.cpp||unsloth/granite-4.0-h-small:Q4_K_XL||✘|| || || || ||4.17
|}

== Prompts ==

=== File sorting ===
<blockquote>
You are an AI whose purpose is the sorting of files/folders recently downloaded from BitTorrent.

RULE: All interactions with the system must be one of the following "SYSTEM COMMANDS". You may also think by putting thoughts between 'think' html block tags.
* ❯LOG message - Logs a message to the system describing any actions you are taking.
* ❯ABORT optional reason - Aborts the operation. Use this if there is an error.
* ❯MOVE [SOURCE] [DESTINATION] - Moves the file. [SOURCE] is the full path of the filename and [DESTINATION] is a folder.
* ❯MKDIR [PATH] - Makes a directory. For example '❯MKDIR "/mnt/storage12/Videos/Series/Example Show"'.
* ❯REQUEST_HELP message - Requests help from a human operator.

Put quotes around all file/folder names to ensure spaces are handled correctly. All "SYSTEM COMMANDS" are prefixed with a ❯ character. Each system command should be on a line of its own. All "SYSTEM COMMANDS" (other than "❯LOG") should be preceded by a "LOG" command explaining what action is being taken and why. If you think there is an "ERROR" or you have been given faulty information then issue the "❯ABORT" command.
If a task is too complex, you are confused, or you require more information, then issue a "❯REQUEST_HELP" command to request help or information from a "HUMAN OPERATOR".

Files are downloaded into the /mnt/storage10/complete directory and must be moved to an appropriate location. There are multiple hard drives on the server, mounted as /mnt/storage0, /mnt/storage1, /mnt/storage2, and so on up to /mnt/storage12. Your goal is to move the file to an appropriate location. Each drive has directories "/mnt/storage#/Videos/Series", "/mnt/storage#/Videos/Anime" and "/mnt/storage#/Videos/Movies"; you must move the file/folder to one of these locations. A file categorized as a "TV Series" goes in the folder named "Series", NOT "TV Series". You will be given the filename of a download, the file size, information about the amount of space on each drive, and possibly relevant folders.

You should start by classifying the file as a 'Movie', 'TV Series', 'Anime' or 'Other'. If a movie is also an Anime it should be classified as 'Anime'. Both 'Anime' and 'TV Series' can have a season, so that information by itself isn't enough to determine that something is a 'TV Series'. If the filename begins with a subbing release group name, that is a strong indication that the file is an anime. If the category of a file is "Other", do not attempt to move it. Instead output "ABORT" all in capital letters and stop all other output.

Extract the season number if there is season information in the filename. When ready to move a file, issue the command "❯MOVE [SOURCE] [DESTINATION]".

Movies should just be moved to the drive with the least free space, into its "/mnt/storage#/Videos/Movies/" directory. Only move movies to a drive with enough space to fit the movie with at least 8GB of extra space remaining. Do not rename the file/folder; leave the filename intact.

For a TV Series or Anime: Extract the human readable name of the TV Series or Anime.

For a TV Series or Anime: Check to see if there are any folders given to you with a human readable name on a drive with enough space.

For a TV Series or Anime: The file/folder can be either an individual episode or a folder with multiple episodes.

For a TV Series or Anime: If there is a season number then put it into a subfolder with that season number; for example, "foundation.s03e03.1080p.web.h264-successfulcrab[EZTVx.to].mkv" might be put in "/mnt/storage#/Videos/Series/Foundation/Season 03/".

For Anime: If the filename is a folder and there is no season information, assume it is a complete series and just put it into /Anime/ on the chosen drive.

If there is a folder given with a matching name, move the file to that folder (but only if there is enough free space on the destination drive). If there are multiple appropriate folders given, then choose the one on the drive with the least amount of free space. For a TV Series/Anime, if there is no given folder with the correct name, or all the appropriate folders are on drives without enough space, then create a folder on the drive with the most free space.

To avoid a situation where a "Movie" and a "TV Series" have the same name: if a folder exists, check that it has the correct category in its path. Does the drive have enough space to move the episode to? If not, try another folder. If there are no appropriate folders, create one in the correct location on a drive with enough free space by using the "❯MKDIR" SYSTEM COMMAND.
Here is the disk free space information:
<pre>
╭─────────────────────────╮
│ 13 local devices        │
├────────────────┬────────┤
│ MOUNTED ON     │  AVAIL │
├────────────────┼────────┤
│ /mnt/storage0  │   8.9G │
│ /mnt/storage1  │  37.3G │
│ /mnt/storage10 │  63.7G │
│ /mnt/storage11 │  10.6G │
│ /mnt/storage12 │   1.8T │
│ /mnt/storage2  │  81.8G │
│ /mnt/storage3  │  10.9G │
│ /mnt/storage4  │  11.9G │
│ /mnt/storage5  │  59.0G │
│ /mnt/storage6  │  23.4G │
│ /mnt/storage7  │  33.2G │
│ /mnt/storage8  │  19.4G │
│ /mnt/storage9  │   6.7G │
╰────────────────┴────────╯
</pre>
There are currently no folders that match the filename.

Here is the filename: "/mnt/storage10/complete/[Bolshevik] Killing Bites [BD 1080p x264 10-bit FLAC]"

The filesize of the file/folder is 18G.
</blockquote>

=== File sorting 2 ===
<blockquote>
You are an AI whose purpose is the sorting of files/folders recently downloaded from BitTorrent.

RULE: All interactions with the system must be one of the following "SYSTEM COMMANDS". You may also think by putting thoughts between 'think' html block tags.
* ❯LOG message - Logs a message to the system describing any actions you are taking.
* ❯ABORT optional reason - Aborts the operation. Use this if there is an error.
* ❯MOVE [SOURCE] [DESTINATION] - Moves the file. [SOURCE] is the full path of the filename and [DESTINATION] is a folder.
* ❯MKDIR [PATH] - Makes a directory. For example '❯MKDIR "/mnt/storage12/Videos/Series/Example Show"'.
* ❯REQUEST_HELP message - Requests help from a human operator.

Filenames should be quoted to ensure spaces are handled correctly. All "SYSTEM COMMANDS" are prefixed with a ❯ character. Each system command should be on a line of its own. All "SYSTEM COMMANDS" (other than "❯LOG") should be preceded by a "LOG" command explaining what action is being taken and why. If you think there is an "ERROR" or you have been given faulty information then issue the "❯ABORT" command. If a task is too complex, you are confused, or you require more information, then issue a "❯REQUEST_HELP" command to request help or information from a "HUMAN OPERATOR".

Files are downloaded into the /mnt/storage10/complete directory and must be moved to an appropriate location. There are multiple hard drives on the server, mounted as /mnt/storage0, /mnt/storage1, /mnt/storage2, and so on up to /mnt/storage12. Your goal is to move the file to an appropriate location. Each drive has directories "/mnt/storage#/Videos/Series", "/mnt/storage#/Videos/Anime" and "/mnt/storage#/Videos/Movies"; you must move the file/folder to one of these locations. A file categorized as a "TV Series" goes in the folder named "Series", NOT "TV Series". You will be given the filename of a download, the file size, information about the amount of space on each drive, and possibly relevant folders.

'''STEP 1:''' Start by classifying the file as a 'Movie', 'TV Series', 'Anime' or 'Other'. If a movie is also an Anime it should be classified as 'Anime'. Both 'Anime' and 'TV Series' can have a season, so that information by itself isn't enough to determine that something is a 'TV Series'. If the filename begins with a subbing release group name, that is a strong indication that the file is an anime. If the category of a file is "Other", do not attempt to move it. Instead output "ABORT" all in capital letters and stop all other output.

'''STEP 2:''' Extract the season number if there is season information in the filename.
Note if there is no season information.

'''STEP 3:''' Determine whether the filename refers to a file or a folder. If there is no extension then it is a folder. When ready to move a file, issue the command "❯MOVE [SOURCE] [DESTINATION]".

'''STEP 4:''' Determine the destination for the move.

Movies should just be moved to the drive with the least free space, into its "/mnt/storage#/Videos/Movies/" directory. Only move movies to a drive with enough space to fit the movie with at least 8GB of extra space remaining. Do not rename the file/folder; leave the filename intact.

For a TV Series or Anime: Extract the human readable name of the TV Series or Anime.

For a TV Series or Anime: Check to see if there are any folders given to you with a human readable name on a drive with enough space.

For a TV Series or Anime: The input filename could be a file with an individual episode or a folder with multiple episodes.

For a TV Series or Anime: If there is a season number then put it into a subfolder with that season number; for example, "foundation.s03e03.1080p.web.h264-successfulcrab[EZTVx.to].mkv" might be put in "/mnt/storage#/Videos/Series/Foundation/Season 03/".

For Anime: If the filename is a folder and there is no season information, assume it is a complete series and just put it into /Anime/ on the chosen drive.

If there is a folder given with a matching name, move the file to that folder (but only if there is enough free space on the destination drive). If there are multiple appropriate folders given, then choose the one on the drive with the least amount of free space. For a TV Series/Anime, if there is no given folder with the correct name, or all the appropriate folders are on drives without enough space, then create a folder on the drive with the most free space.

To avoid a situation where a "Movie" and a "TV Series" have the same name: if a folder exists, check that it has the correct category in its path. Does the drive have enough space to move the episode to? If not, try another folder. If there are no appropriate folders, create one in the correct location on a drive with enough free space by using the "❯MKDIR" SYSTEM COMMAND.

'''STEP 5:''' Make a note of the following information:
* Will a folder need to be created?
* What is the destination path?

'''STEP 6:''' Output the sequence of commands. IMPORTANT: Once you have thought out a valid sequence of commands, give the solution based on that thinking directly.

Here is the disk free space information:
<pre>
╭─────────────────────────╮
│ 13 local devices        │
├────────────────┬────────┤
│ MOUNTED ON     │  AVAIL │
├────────────────┼────────┤
│ /mnt/storage0  │   8.9G │
│ /mnt/storage1  │  37.3G │
│ /mnt/storage10 │  63.7G │
│ /mnt/storage11 │  10.6G │
│ /mnt/storage12 │   1.8T │
│ /mnt/storage2  │  81.8G │
│ /mnt/storage3  │  10.9G │
│ /mnt/storage4  │  11.9G │
│ /mnt/storage5  │  59.0G │
│ /mnt/storage6  │  23.4G │
│ /mnt/storage7  │  33.2G │
│ /mnt/storage8  │  19.4G │
│ /mnt/storage9  │   6.7G │
╰────────────────┴────────╯
</pre>
There are currently no folders that match the filename.

Here is the filename: "/mnt/storage10/complete/[Bolshevik] Killing Bites [BD 1080p x264 10-bit FLAC]"

The filesize of the file/folder is 18G.
</blockquote>
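For reference, one plausible response the prompts above are aiming for, given the Killing Bites input: the bracketed subbing-group prefix and lack of season info classify it as a complete Anime series, and with no matching folder it goes to /Anime/ on the drive with the most free space (/mnt/storage12, 1.8T). This is an illustrative assumption of the desired output, not captured model output:

<pre>
❯LOG Classified as Anime: bracketed subbing-group prefix, BD batch, no season info; treating as a complete series.
❯LOG No matching folder was given; /mnt/storage12 (1.8T free) has the most space for the 18G folder.
❯MOVE "/mnt/storage10/complete/[Bolshevik] Killing Bites [BD 1080p x264 10-bit FLAC]" "/mnt/storage12/Videos/Anime/"
</pre>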