== LLaMA ==
[https://boards.4channel.org/g/catalog#s=lmg%2F /lmg/]

=== Models ===
* [https://old.reddit.com/r/LocalLLaMA/comments/1obrvab/support_for_ling_and_ring_models_1000b103b16b_has/ Support for Ling and Ring models (1000B/103B/16B)]
* https://github.com/PrimeIntellect-ai/prime-rl - claims to be a smarter finetune of GLM-4.5 Air

==== Abliteration / MXFP4 MoE ====
* [https://old.reddit.com/r/LocalLLaMA/comments/1p5epot/the_most_objectively_correct_way_to_abliterate_so/ The most objectively correct way to abliterate]
* [https://huggingface.co/noctrex/GLM-4.5-Air-Derestricted-MXFP4_MOE-GGUF noctrex/GLM-4.5-Air-Derestricted-MXFP4_MOE-GGUF]
* [https://old.reddit.com/r/LocalLLaMA/comments/1oypwa7/a_more_surgical_approach_to_abliteration/ A more surgical approach to abliteration]
* [https://www.reddit.com/r/LocalLLaMA/comments/1oymku1/heretic_fully_automatic_censorship_removal_for/ Heretic: fully automatic censorship removal]
* [https://old.reddit.com/r/LocalLLaMA/comments/1ozh8py/mxfp4_hybrid_dense_models_ready_to_share_near/ MXFP4 hybrid dense models ready to share] - magiccodingman

==== Upcoming ====
* [https://github.com/ggml-org/llama.cpp/issues/16186 Qwen3-Omni-30B-A3B]
* [https://qwen.ai/blog?id=qwen3-omni-flash-20251201 Qwen3-Omni-Flash-2025-12-01]
* [https://huggingface.co/inclusionAI/Ring-flash-2.0-GGUF Ring-Flash-2.0]
* [https://old.reddit.com/r/LocalLLaMA/comments/1oh5asg/new_model_from_the_minimax_team_minimaxm2_an/ MiniMax-M2, a 230B-A10B LLM from the MiniMax team] - not sure if this works with llama.cpp yet
* [https://old.reddit.com/r/LocalLLaMA/comments/1ojzekg/moonshotaikimilinear48ba3binstruct_hugging_face/ moonshotai/Kimi-Linear-48B-A3B-Instruct]
* [https://huggingface.co/unsloth/Olmo-3-32B-Think-GGUF Olmo 3]
* [https://github.com/ggml-org/llama.cpp/pull/17420 GigaChat3]
* [https://github.com/ggml-org/llama.cpp/issues/15512 ERNIE-4.5-VL-28B-A3B-Thinking]
* [https://old.reddit.com/r/LocalLLaMA/comments/1p6gsjh/llada20_103b16b_has_been_released/ LLaDA 2.0 (103B/16B) has been released]

==== Other ====
* [https://huggingface.co/ibm-granite/models Granite 4.0]
* LLaDA-MoE-7B-A1B-Instruct
* OLMoE

=== Misc ===
[https://boards.4channel.org/g/thread/93415313#p93421310 llama.cpp server?] - "There was a tokenizer caching error, some people said. Redownload the hf_output files from the repo, or just change the use_cache line in config.json to say <code>"use_cache": true</code>" (for Vicuna13B-free).
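A minimal sketch of that config.json edit, assuming a standard Hugging Face model folder; the path below is illustrative, so point it at whichever directory holds your download:

<pre>
# Flip "use_cache" to true in a model's config.json.
# The path is an assumption; adjust it to your model directory.
import json
from pathlib import Path

cfg_path = Path("models/Vicuna13B-free/config.json")
cfg = json.loads(cfg_path.read_text())
cfg["use_cache"] = True
cfg_path.write_text(json.dumps(cfg, indent=2))
print("use_cache is now", cfg["use_cache"])
</pre>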
* https://github.com/stochasticai/xturing/tree/main/examples/int4_finetuning
* https://wiki.installgentoo.com/wiki/Home_server#Expanding_Your_Storage
* https://rentry.org/llama-tard-v2
* https://rentry.org/llamaaids
* https://hackmd.io/@reneil1337/alpaca
* https://find.4chan.org/?q=AI+Dynamic+Storytelling+General
* https://find.4chan.org/?q=AI+Chatbot+General
* https://find.4chan.org/?q=%2Flmg%2F (local models general)
* https://boards.4channel.org/g/thread/92400764#p92400764
* <nowiki>https://files.catbox.moe/lvefgy.json</nowiki>
* https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/

Running in 4-bit mode (see https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode):
<blockquote>python server.py --model llama-7b-4bit --wbits 4
python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128</blockquote>

* https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/59 - for installing when you hit an out-of-space error
* https://github.com/pybind/pybind11/discussions/4566
* https://lmsysvicuna.miraheze.org/wiki/How_to_use_Vicuna#Use_with_llama.cpp%3A
* https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g
* [https://huggingface.co/ShreyasBrill/Vicuna-13B Uncensored Vicuna model, trained on the dataset with the moralizing responses stripped out; unfortunately only the CPU-quantized version is available]
* [https://www.reddit.com/r/Oobabooga/comments/12hyini/vicuna_generating_its_own_prompts/jfrtvh3/ Vicuna generating its own prompts]
* https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g - <code>python3 llama.py vicuna-AlekseyKorshuk-7B c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors vicuna-AlekseyKorshuk-7B-GPTQ-4bit-128g.safetensors</code>
* [https://github.com/ggerganov/llama.cpp/pull/933 ~65% speedup of the AVX-512 implementation of <code>ggml_vec_dot_q4_0()</code>] (#933)
* "Speaking of which, for any 30B anons struggling with context size, I figured something out. If you use the Triton branch on WSL, go into GPTQ_loader.py and comment out make_quant_attn" ([https://boards.4channel.org/g/thread/92835207#p92837143 from here]); see the sketch after this list.
* [https://boards.4channel.org/g/thread/92842505#p92845181 Quantizing a 30B model yourself]: grab the CUDA branch of qwop's GPTQ-for-LLaMa (or the Triton branch), or, if you have the webui installed, go into its GPTQ folder. Make sure all the requirements are installed and run <code>python llama.py /path/to/30b c4 --wbits 4 --true-sequential --groupsize 128 --save_safetensors alpacino-4bit-128g.safetensors</code>. On Windows, flip the slashes the other way; on Linux you may need to prefix the command with <code>CUDA_VISIBLE_DEVICES=0</code>.
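A rough sketch of the GPTQ_loader.py edit mentioned above, assuming the Triton-branch loader calls <code>make_quant_attn()</code> right after the quantized model is loaded; the function names, arguments, and surrounding lines are illustrative and will differ between checkouts:

<pre>
# Hypothetical excerpt from GPTQ_loader.py on the Triton branch (under WSL).
# Your actual file will look different; the point is only to skip the
# fused-attention pass, which the post above reports helps with context size.
model = load_quant(model_path, checkpoint_path, 4, 128)  # wbits=4, groupsize=128

# make_quant_attn(model)  # commented out per the tip above

model = model.to("cuda")
</pre>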