AI: Difference between revisions

From Hegemon Wiki

Revision as of 20:48, 6 June 2025

See also: AI Art

New Shit

Direct3D‑S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention

Make RAG 100x Better with Real-Time Knowledge Graphs

dots1

Are there any free working voice cloning AIs?

privategpt - "IMHO privategpt is the best for RAG if you need the source. It not only lists the PDF used for the answer but also the page, and it is quite precise. So for study and search in a library it is the best I know."

FLUX Model Quantization Challenge

dont_offload_gguf_layers_offload_tensors_200_gen

I Built the Ultimate RAG MCP Server for AI Coding (Better than Context7)

NEW FramePack F1 Model - Much Better Results - Bonus How to Install Sage

Qwen3-30B-A6B-16-Extreme

https://docs.google.com/document/d/12ATcyjCEKh8T-MPDZ-VMiQ1XMa9FUvvk2QazrsKoiR8/edit?tab=t.0

https://www.youtube.com/watch?v=LMH62T_XCF4

Dia, Fish Speech

https://www.reddit.com/r/LocalLLaMA/comments/1ev8n2s/exclude_top_choices_xtc_a_sampler_that_boosts/

https://blog.runpod.io/upscaling-videos-using/

koboldcpp wiki

Mixtral for Retards


koboldcpp.exe 13B-HyperMantis.ggmlv3.q4_K_M.bin --debug --usecublas --usemlock --contextsize 8192 --blasbatchsize 512 --psutil_set_threads --threads 6 --blasthreads 10 --gpulayers 5 --highpriority --stream --unbantokens --smartcontext

If you're running 1.35 and a superHOT model, you should also add --linearrope which should make them perform better.
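As a rough illustration of what linear RoPE scaling means in general (this is not koboldcpp's code): token positions are multiplied by a factor below 1 so that a longer context maps back into the position range the model was trained on. The trained context of 2048 and the helper name `rope_angles` are assumptions for this sketch; 8192 matches the --contextsize in the command above.

```python
def rope_angles(position, head_dim=8, base=10000.0, scale=1.0):
    """Rotation angles for one token position under linearly scaled RoPE.

    head_dim, base and the function name are illustrative assumptions.
    """
    scaled_pos = position * scale  # linear scaling: compress the position index
    return [scaled_pos * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


trained_ctx = 2048   # assumed original training context
target_ctx = 8192    # matches --contextsize 8192 above
scale = trained_ctx / target_ctx  # 0.25

# With scaling, the last position of the long context gets the same angles
# as the last position the model saw during training.
print(rope_angles(target_ctx, scale=scale) == rope_angles(trained_ctx))
```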

RP recommended models: "superhot, airoboros, wizard-vicuna, guanaco, chronos are a few commonly discussed models off the top of my head. For me, it's superhot or guanaco (one or the other, not the merge though)"

Taskset

Not sure if this is common knowledge, but some advice to all fellow VRAMlets who are offloading to RAM: setting the number of threads is not good enough. You can get extra speed by manually setting core affinity.

For context:

I have a 13600K which has 6 P-cores. I had read that you should set --threads to that number, so I would run koboldcpp with --threads 6 and from some testing this was indeed the best option with that argument alone.

BUT, I looked at which cores were actually used and found E-cores also being used sometimes.

So the next step was to set the core affinity to just P-cores. Each P-core has two threads; CPU0-11 were P-cores and CPU12-19 were E-cores. Thus, I ran koboldcpp with one thread from each core:

taskset -c 0,2,4,6,8,10 python koboldcpp.py [args]

My speed running command-r went from ~2.3 T/s to 2.67 T/s. Pretty good. But what if I use them fully, I thought. So I set --threads 12 and taskset -c 0,1,2,3,4,5,6,7,8,9,10,11 and got a generation at 3.09 T/s. That's a whopping 33% increase from my initial speed. Hope this is helpful, it actually had my basedfacing captcha: pic rel
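A small sketch of how the CPU lists in the post can be derived. It assumes the layout described there (P-core logical CPUs numbered first, with the two threads of each core adjacent); check the real layout with lscpu before relying on it. The helper names are made up for illustration.

```python
def p_core_cpus(n_p_cores, threads_per_core=2, one_per_core=True):
    """Logical CPU ids for the P-cores, assuming they are numbered first."""
    step = threads_per_core if one_per_core else 1
    return list(range(0, n_p_cores * threads_per_core, step))


def taskset_command(cpus, program_args):
    """Build the argv for running a program pinned to the given CPUs."""
    return ["taskset", "-c", ",".join(map(str, cpus))] + list(program_args)


one_each = p_core_cpus(6)                           # one thread per P-core
all_threads = p_core_cpus(6, one_per_core=False)    # both threads of every P-core

print(" ".join(taskset_command(one_each, ["python", "koboldcpp.py", "--threads", "6"])))
print(" ".join(taskset_command(all_threads, ["python", "koboldcpp.py", "--threads", "12"])))
```

The two printed lines correspond to the two configurations compared in the post (6 threads on CPUs 0,2,...,10 and 12 threads on CPUs 0-11).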

MCP Servers

Context7

awesome-mcp-clients/

https://smithery.ai/

https://mcp.so/servers

https://www.reddit.com/r/RooCode/comments/1ijgk2x/roo_code_mcps_best_mcp_configs/mbej58g/

Agentic AI Dungeon

n8n - Web tool for AI agents

chroma - Vector database

pgvector

lmstudio

LLaMA

/lmg/

llama.cpp pull requests

cuBLAS: use host pinned memory and dequantize while copying

Add NVIDIA cuBLAS support (Merged now)

TextGeneration-webui pull requests

I think I made the Triton-GPTQ model work with the GPTQ-for-LLaMa (Triton branch) repository. (slower than non-triton)

Tried more Vicuna + superCOT lora and I am fairly impressed at how well it understands me for a 13B model in terms of followup instructions.

Models

VicUnlocked-30B

q5 models (reddit)

q5 models

https://huggingface.co/Melbourne/Alpacino-30b-ggml/tree/main

Misc

GPT4-x-Alpaca-30B is particularly good at writing stories

llama.cpp server?

"There was a tokenizer caching error, some people said. Redownload the hf_output files from the repo or just change the use_cache line in the config.json to say: "use_cache": true," for the Vicuna13B-free https://github.com/stochasticai/xturing/tree/main/examples/int4_finetuning
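The config.json change from the quote can also be scripted. This is a minimal sketch; the function name and the example path are placeholders, not something from the quoted post.

```python
import json
from pathlib import Path


def set_use_cache(config_path, value=True):
    """Set "use_cache" in a model's config.json and write the file back."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["use_cache"] = value  # the key named in the quote above
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Example (hypothetical path):
# set_use_cache("models/vicuna-13b/config.json")
```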

https://wiki.installgentoo.com/wiki/Home_server#Expanding_Your_Storage

https://rentry.org/llama-tard-v2

https://rentry.org/llamaaids

https://hackmd.io/@reneil1337/alpaca


https://find.4chan.org/?q=AI+Dynamic+Storytelling+General

https://find.4chan.org/?q=AI+Chatbot+General

https://find.4chan.org/?q=%2Flmg%2F (local models general)

https://boards.4channel.org/g/thread/92400764#p92400764



https://files.catbox.moe/lvefgy.json

https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/


python server.py --model llama-7b-4bit --wbits 4

python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128

https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/59 for installing with out of space error

https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode


https://github.com/pybind/pybind11/discussions/4566

https://lmsysvicuna.miraheze.org/wiki/How_to_use_Vicuna#Use_with_llama.cpp%3A

https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g


Here's the uncucked Vicuna model (trained on the dataset that doesn't have the moralistic bullshit anymore). Too bad it's just the CPU quantized version.

Vicuna generating its own prompts


should be worse than Q4_1 (which is QK=32) but there are several PRs in the works that should improve quantization accuracy in general

https://github.com/ggerganov/llama.cpp/pull/729

https://github.com/ggerganov/llama.cpp/pull/835

https://github.com/ggerganov/llama.cpp/pull/896
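To make the QK remark concrete, here is a simplified sketch of Q4_1-style block quantization: each block of QK values stores a scale and a minimum plus one 4-bit integer per value. It only illustrates the idea; it is not ggml's actual code or bit layout, and the function names are made up.

```python
def quantize_block(block):
    """Quantize one block to 4-bit integers with a per-block scale and minimum."""
    lo, hi = min(block), max(block)
    scale = (hi - lo) / 15 if hi > lo else 1.0  # 4 bits give levels 0..15
    q = [min(15, max(0, round((x - lo) / scale))) for x in block]
    return scale, lo, q


def dequantize_block(scale, lo, q):
    """Reconstruct approximate values: x is roughly scale * q + minimum."""
    return [scale * v + lo for v in q]


def roundtrip(values, qk=32):
    """Quantize and dequantize a list in blocks of qk values."""
    out = []
    for i in range(0, len(values), qk):
        out.extend(dequantize_block(*quantize_block(values[i:i + qk])))
    return out


values = [i / 10 for i in range(64)]
for qk in (32, 64):
    restored = roundtrip(values, qk)
    worst = max(abs(a - b) for a, b in zip(values, restored))
    print(qk, worst)
```

A larger block has to cover a wider value range with the same 16 levels, so its worst-case error grows, which is why a bigger QK would be expected to quantize less accurately.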

benchmark_threads.txt

https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g - python3 llama.py vicuna-AlekseyKorshuk-7B c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors vicuna-AlekseyKorshuk-7B-GPTQ-4bit-128g.safetensors


≈65% speedup of the AVX-512 implementation of ggml_vec_dot_q4_0() #933

"Speaking of which, for any 30b anons struggling with context size, I figured something out. If you use the Triton branch on WSL, go into GPTQ_loader.py and comment out make_quant_attn like so" from here


Just grab the CUDA branch of qwop's GPTQ for LLaMA (or Triton if you want to be a dickhole) or if you have webui installed, go into the folder for GPTQ. Make sure all the requirements are installed and run this line:

python llama.py /path/to30b c4 --wbits 4 --true-sequential --groupsize 128 --save_safetensors alpacino-4bit-128g.safetensors

And it'll run. For Windows, obviously flip the slashes the right way. And for Linux, you may need to add CUDA_VISIBLE_DEVICES=0 to the front of the command.
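If the quantization is launched from a script, the same command line and the CUDA_VISIBLE_DEVICES setting can be assembled like this. The flags are the ones from the command above; the helper names and the example model path are placeholders.

```python
import os


def build_gptq_command(model_dir, output_file, wbits=4, groupsize=128, dataset="c4"):
    """Assemble the llama.py argv using the flags from the command above."""
    return [
        "python", "llama.py", model_dir, dataset,
        "--wbits", str(wbits),
        "--true-sequential",
        "--groupsize", str(groupsize),
        "--save_safetensors", output_file,
    ]


def env_with_gpu(gpu_index=0):
    """Copy the current environment and restrict CUDA to one GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


cmd = build_gptq_command("/path/to/model", "alpacino-4bit-128g.safetensors")
print(" ".join(cmd))
# To actually run it from the GPTQ-for-LLaMa folder:
# import subprocess; subprocess.run(cmd, env=env_with_gpu(0), check=True)
```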

GGML Quantization

Some tables on Reddit

Relative quantization

Papers

Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

http://attentionviz.com/

Benchmarks

TODO: Try different cublas batch sizes

Date | Interface | Model | GPTQ | Xformers? | HW | Load | Speed
 | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 240gb SSD, 16gb, desktop off | 10.53 | 7.97 tokens/s
 | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | no | 240gb SSD, 16gb, desktop off | 10.22s | 7.55 tokens/s
 | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-cuda | no | 240gb SSD, 16gb, desktop off | 16.68s | 4.03 tokens/s
 | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-cuda | yes | 240gb SSD, 16gb, desktop off | 9.34s | 4.01 tokens/s
 | text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb | ? | 0.67 tokens/s
 | text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb, --threads 8 | maybe 30s? | 0.51 tokens/s
 | text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb, --threads 7 | | 0.68 tokens/s
 | text-gen | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | no | no | 2TB SSD, 64gb, --threads 6 | | 0.61 tokens/s
 | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g-ggml | no | no | 2TB SSD, 64gb | | 1.17 tokens/s
 | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64gb, --pre_layer 25 | 45.69 | 0.25 tokens/s
 | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64gb | 36.47 | 9.63 tokens/s
 | llama.cpp | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | | | 2TB SSD, 64gb | 10317.90 ms | 1096.21 ms per token
 | llama.cpp-modern-avx512 | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | | | 2TB SSD, 64gb | 9288.69 ms | 1049.03 ms per token
 | llama.cpp-avx512-pr833 | llama-30b-sft-oa-alpaca-epoch-2-4bit-ggml | | | 2TB SSD, 64gb | 13864.06 ms | 0.89 tokens/s, 820.68 ms per token
 | text-gen | TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0 | | | 2TB SSD, 64gb | | 0.78 tokens/s
 | text-gen+avx512-pr833 | TheBloke-gpt4-alpaca-lora-30B-4bit-GGML/ggml-model-q4_0 | | | 2TB SSD, 64gb | | 1.04 tokens/s
2023-04-24 | text-gen | anon8231489123-vicuna-13b-GPTQ-4bit-128g | GPTQ-for-LLaMa-triton | yes | 2TB SSD, 64gb, also running llama.cpp with another model | 16.36 | 5.07 tokens/s
2023-04-26 | koboldcpp | gozfarb-llama-30b-supercot-ggml/ggml-model-q4_0.bin | clblast | n/a | 2TB SSD, 64gb, --threads 8 | | 1073ms/T
2023-04-29 | koboldcpp | Alpacino-30b-q4_0.bin | clblast | n/a | 2TB SSD, 64gb | | 700ms/T
2023-07-13 | koboldcpp | llama-33b-supercot-ggml-q5_1 (complains about old format) | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 18 | | 643ms/T, 1.4T/s
2023-07-13 | koboldcpp | llama-33b-supercot-ggml-q5_1 (complains about old format) | clblast | n/a | 2TB SSD, 64gb, --nommap --smartcontext --useclblast 0 0 --gpulayers 18 | | 685ms/T, 1.2T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 18 (probably space for more) | | 652ms/T, 1.5T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 26 (I note 3 threads are set by default) | | 593ms/T, 1.6T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 26 --psutil_set_threads (4 threads) | | 514ms/T, 1.8T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --psutil_set_threads (removed nommap) | | 508ms/T, 1.9T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 5 | | 454ms/T, 2.1T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 6 | | 422ms/T, 2.2T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 7 | | 509ms/T, 1.8T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 8 | | 494ms/T, 1.7T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.2.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 6 --linearrope (no difference, needs supercot?) | | 425ms/T, 2.2T/s
2023-07-13 | koboldcpp | airoboros-33b-gpt4-1.4.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --smartcontext --usecublas --gpulayers 26 --threads 6 | | 400ms/T, 2.3T/s
2023-07-13 | koboldcpp | airoboros-65b-gpt4-1.4.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 13 --threads 6 | | 1366ms/T, 0.7T/s
2023-07-14 | koboldcpp | airoboros-65b-gpt4-1.4.ggmlv3.q2_K.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --gpulayers 13 --threads 6 | | 765ms/T, 1.2T/s
2023-09-06 | koboldcpp | guanaco-33B.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --stream --smartcontext --usecublas --gpulayers 29 --threads 6 | | 562ms/T, 1.3T/s
2023-09-06 | koboldcpp | guanaco-33B.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --stream --smartcontext --usecublas --gpulayers 29 --threads 6 | | 567ms/T, total 70.7s, 1.4T/s
2023-09-06 | koboldcpp | guanaco-33B.ggmlv3.q4_K_M.bin | cublas | n/a | 2TB SSD, 64gb, --nommap --stream --smartcontext --usecublas --gpulayers 25 --threads 6 | | 563ms/T, total 70.2s, 1.4T/s
2023-12-03 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 6 --gpulayers 27 | | 330.7ms/T, total 40.79s, 2.94T/s
2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 6 --gpulayers 27 | | 202.1ms/T, 4.78T/s
2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 32 --gpulayers 27 | | 360.8ms/T, 2.68T/s
2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 16 --gpulayers 27 | | 202.6ms/T, 4.82T/s
2023-12-07 | koboldcpp | guanaco-33B.q4_K_M.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 15 --gpulayers 27 | | 195.0ms/T, 5.03T/s
2023-12-16 | koboldcpp | mistral-7b-instruct-v0.2.Q8_0.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 15 --gpulayers 33 | | 22.9ms/T, 42.90T/s
2023-12-17 | koboldcpp | mixtral-8x7b-moe-rp-story.Q8_0.gguf | cublas | n/a | 7950x3d, 2TB SSD, 64gb, --nommap --smartcontext --usecublas --threads 15 --gpulayers 6 | | 214.9ms/T, 4.47T/s
2024-02-04 | SillyTavern | miqu 70b | | | gpu layers 9 | | 1.4T/s
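A small helper sketch for comparing rows that report different units. The conversion is plain arithmetic; note that where a row lists both ms/T and T/s, the two figures do not always match this conversion, presumably because they measure different things (for example generation only versus the total including prompt processing). The thread numbers below are copied from the 2023-07-13 airoboros-33b 1.2 rows with --gpulayers 26; the function names are made up.

```python
def ms_per_token_to_tps(ms_per_token):
    """Convert milliseconds per token to tokens per second."""
    return 1000.0 / ms_per_token


def best_setting(results):
    """Return the setting (dict key) with the highest tokens per second."""
    return max(results, key=results.get)


# --threads value -> reported T/s
threads_to_tps = {5: 2.1, 6: 2.2, 7: 1.8, 8: 1.7}

print(best_setting(threads_to_tps))               # 6
print(round(ms_per_token_to_tps(1096.21), 2))     # 0.91
```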