Bonsai Ternary LLM on the CPU — no GPU at all
A full 8-billion-parameter language model that runs entirely on your processor. No GPU, no CUDA, no VRAM — just a couple of gigabytes of ordinary RAM. Prism ML’s ternary (1.58-bit) quantization shrinks the weights to 2 GB, and the CPU does the rest. Prefill is genuinely fast; generation is the honest tradeoff. This guide is about running it well on hardware you already own.
An 8B model with zero GPU
The point of this guide isn’t speed; it’s independence from the GPU. Bonsai 8B loads into about 2.5 GB of RAM and runs on the CPU alone: no NVIDIA driver, no CUDA install, no VRAM budget, no layers offloaded with -ngl. If a machine can run llama.cpp, it can run this: a headless server, a laptop, a NAS, a cheap VPS, or a workstation whose GPU is busy with something else.
The trick is ternary quantization: every weight is stored as just −1, 0, or +1 (1.58 bits). That collapses the model to an eighth of its FP16 size and turns the heavy matrix math into add-only operations — no per-weight multiply. On a CPU that pays off most during prefill (reading your prompt), where a Ryzen 3900X hits ~440 tok/s, competitive with a modern laptop GPU.
The honest catch is generation. Producing new tokens one at a time is memory-bandwidth-bound, not compute-bound, so a desktop CPU lands around 3 tok/s for the 8B. That’s fine for batch jobs, background summarization, scripted pipelines, and drafting where you don’t sit and watch — and too slow for snappy interactive chat. If you need real-time typing speed, a GPU is still the answer. If you need an 8B model to exist at all on a GPU-less box, this is how. The FAQ explains why prefill flies and generation crawls.
Three steps, no GPU anywhere
One-step script (Linux): installs deps, builds a clean CPU-only Prism fork (-DGGML_CUDA=OFF, so the binary has zero CUDA dependencies), downloads the model, and runs a test.
curl -sL https://ndgold.com/guides/bonsai-llm/setup.sh | bash
Or download manually: setup.sh — defaults to the Q2_0 ternary build; pass --quant q4_0 for the fork-free Q4_0-lossless model. Prefer to do it by hand? The three steps below are exactly what the script automates.
1. Build the Prism llama.cpp fork (CPU-only)
The native ternary kernel (GGUF Q2_0 with g128 grouping) is not in mainstream llama.cpp yet — you need the Prism fork to load Q2_0. Build it with -DGGML_CUDA=OFF so the binary is pure CPU with no CUDA runtime to chase down later.
git clone --depth 1 -b prism https://github.com/PrismML-Eng/llama.cpp.git
cd llama.cpp
cmake -B build-cpu -DGGML_CUDA=OFF -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-cpu -j"$(nproc)" --target llama-cli llama-bench llama-server
-DGGML_NATIVE=ON lets the compiler target your CPU’s exact instruction set (AVX2 on the Ryzen), which the ternary kernel leans on. Confirm the build has no GPU baggage: ldd build-cpu/bin/llama-cli | grep -ci cuda should print 0 (nothing CUDA-related is linked), and ldd build-cpu/bin/llama-cli | grep -c "not found" should also print 0 (every shared library resolves). -j"$(nproc)" uses every core; drop to -j4 if you want to keep the machine responsive during the build.
Already have a CUDA-linked llama.cpp? You can still run CPU-only with -ngl 0, but a CUDA build references libcudart.so at startup and will refuse to launch with error while loading shared libraries: libcudart.so.12 unless CUDA is on your library path. Either rebuild with -DGGML_CUDA=OFF as above, or prefix every run with LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH. The clean CPU build sidesteps the whole problem.
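If you want to script that check, here is a minimal Python sketch that classifies ldd output. The function name, the list of GPU-related substrings, and the sample text are all illustrative assumptions, not part of the setup script.

```python
def inspect_ldd(ldd_output: str) -> dict:
    """Split ldd output into GPU-related libraries and unresolved libraries."""
    gpu_markers = ("cuda", "cublas", "nvidia")  # assumed substrings, extend as needed
    gpu_libs, missing = [], []
    for line in ldd_output.splitlines():
        line = line.strip()
        if not line:
            continue
        name = line.split()[0]  # first field is the library name
        if any(marker in name.lower() for marker in gpu_markers):
            gpu_libs.append(name)
        if "not found" in line:
            missing.append(name)
    return {"gpu_libs": gpu_libs, "missing": missing}


# Hand-written sample resembling a CUDA-linked binary on a box without CUDA.
cuda_sample = """\
    linux-vdso.so.1 (0x00007ffc00000000)
    libcudart.so.12 => not found
    libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f0000000000)
"""

# Hand-written sample resembling a clean CPU-only binary.
cpu_sample = """\
    linux-vdso.so.1 (0x00007ffc00000000)
    libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f0000000000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0000100000)
"""

print(inspect_ldd(cuda_sample))
print(inspect_ldd(cpu_sample))
```

To run it against a real binary, pass in the text from subprocess.run(["ldd", "build-cpu/bin/llama-cli"], capture_output=True, text=True).stdout. Both lists should come back empty for the clean CPU build.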
2. Download a model
Recommended — Q2_0 ternary (2.03 GB, native 1.58-bit — needs the Prism fork):
mkdir -p ~/models/bonsai
curl -L --progress-bar -o ~/models/bonsai/Ternary-Bonsai-8B-Q2_0.gguf \
"https://huggingface.co/prism-ml/Ternary-Bonsai-8B-gguf/resolve/main/Ternary-Bonsai-8B-Q2_0.gguf?download=1"
The real ternary format: smallest footprint (2 GB), best prefill throughput on CPU. This is what the benchmarks below were measured on. --progress-bar shows progress; add -C - to resume if the download is interrupted. Prefer the smaller 4B? Swap 8B for 4B in both the URL and filename (1.0 GB) — but see the FAQ on why 8B is usually the better pick.
Or Q4_0-lossless (4.3 GB, no fork needed — runs on any llama.cpp):
mkdir -p ~/models/bonsai
curl -L --progress-bar -o ~/models/bonsai/Bonsai-8B-Q4_0-lossless.gguf \
https://huggingface.co/Minarut/Ternary-Bonsai-8B-GGUF-llamacpp-compatible/resolve/main/Ternary-Bonsai-8B-Q4_0-lossless.gguf
A lossless re-encoding of the same ternary weights in the standard 4-bit format, so it loads on any llama.cpp build — upstream included — with no Prism fork. Larger on disk and in RAM (~4.3 GB), same generation ceiling on CPU. Reach for it if you’d rather skip the fork entirely.
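An interrupted or redirected download can leave a file that is not a model at all (an HTML error page, for example). GGUF files begin with the four bytes GGUF followed by a little-endian 32-bit version number, so a quick sanity check is cheap. The sketch below is illustrative: the function name is made up, and it checks only the header, not the integrity of the whole file.

```python
import struct
import tempfile
from pathlib import Path

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path) -> int:
    """Return the GGUF version from the file header, or raise ValueError."""
    with open(path, "rb") as f:
        head = f.read(8)
    if len(head) < 8 or head[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    (version,) = struct.unpack("<I", head[4:8])  # little-endian uint32
    return version


# Self-check against two throwaway files, so no real model is needed.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.gguf"
    good.write_bytes(GGUF_MAGIC + struct.pack("<I", 3) + b"\x00" * 16)
    bad = Path(tmp) / "bad.gguf"
    bad.write_bytes(b"<!DOCTYPE html><html>not a model</html>")

    print("version:", read_gguf_version(good))
    try:
        read_gguf_version(bad)
    except ValueError as err:
        print("rejected:", err)
```

Point it at the file under ~/models/bonsai/ after the download finishes. A ValueError usually means the download should be retried.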
3. Run it (CPU-only)
# One-shot prompt, 12 threads, no GPU layers
./build-cpu/bin/llama-cli \
-m ~/models/bonsai/Ternary-Bonsai-8B-Q2_0.gguf \
-ngl 0 -t 12 -fa 1 \
-p "Explain ternary (1.58-bit) neural networks in 3 sentences." \
-n 150 -e
# Interactive chat (multiline, history, /regen, /clear)
./build-cpu/bin/llama-cli \
-m ~/models/bonsai/Ternary-Bonsai-8B-Q2_0.gguf \
-ngl 0 -t 12 -fa 1 -c 4096
-ngl 0 is the whole point. It keeps every layer on the CPU, the opposite of a GPU guide. -t 12 sets the thread count to your physical cores (the Ryzen 3900X has 12; going higher runs into SMT contention and slows things down). -fa 1 turns on flash attention, which avoids materializing the full attention score matrix, so attention working memory grows linearly with context length rather than quadratically; the KV cache itself still grows linearly with -c. -e tells the CLI to interpret backslash escapes in the prompt. Not sure which quant you grabbed? See the FAQ.
-e alone doesn’t exit cleanly. On Prism fork c85e97a, a one-shot -p prompt posts its answer but then drops into the interactive > REPL rather than returning to the shell. For scripting, append --no-display-prompt and pipe through head -n 1, or send Ctrl-D (or type /exit) to leave the REPL. The setup script’s test run already uses --no-display-prompt for exactly this reason.
No nvidia-smi to check here — watch htop instead. During prefill you’ll see all 12 threads light up; during generation the load drops to roughly one busy core, which is exactly why new tokens come out slowly. Set your thread count to your physical core count, not the SMT/thread count.
Use it day-to-day: an OpenAI-compatible API
Because generation is slow, the most practical setup is to run Bonsai as a persistent local server and talk to it from your tools — fire off a request, let it work, collect the result. llama-server exposes a full OpenAI-compatible HTTP API and ships with a built-in web chat UI.
Start the server
./build-cpu/bin/llama-server \
-m ~/models/bonsai/Ternary-Bonsai-8B-Q2_0.gguf \
-ngl 0 -t 12 -fa 1 -c 4096 \
--host 0.0.0.0 --port 8080
# → "server listening on http://0.0.0.0:8080"
Open http://localhost:8080/ for the built-in web UI — chat history, parameter tuning, and Markdown rendering, with the chat template auto-detected from the GGUF metadata. --host 0.0.0.0 makes it reachable from other machines (see remote access below); drop it to keep the server local-only.
Call the API
Any tool that speaks the OpenAI API — LangChain, Continue.dev, your own scripts — can point at it. Standard chat completions:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Ternary-Bonsai-8B",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What makes ternary neural networks special?"}
],
"temperature": 0.7,
"max_tokens": 150
}'
Add "stream": true to get token-by-token server-sent events instead of waiting for the whole response — useful when generation is slow and you want output as it arrives.
# Python — point any OpenAI client at the local server
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
model="Ternary-Bonsai-8B",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=100,
)
print(resp.choices[0].message.content)
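For the "stream": true mode, here is a rough sketch of pulling the text out of the event stream using only the standard library. It assumes the server emits OpenAI-style chunks: data: lines carrying JSON with choices[0].delta.content, ending with data: [DONE]. The sample lines are hand-written for illustration, not captured server output.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        piece = (choices[0].get("delta") or {}).get("content")
        if piece:
            yield piece


# Hand-written sample in the assumed chunk format.
sample_lines = [
    b'data: {"choices":[{"delta":{"role":"assistant","content":null}}]}',
    b"",
    b'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    b"",
    b'data: {"choices":[{"delta":{"content":" world"}}]}',
    b"",
    b"data: [DONE]",
]

for fragment in iter_stream_text(sample_lines):
    print(fragment, end="", flush=True)
print()
```

Against a live server, the response object returned by urllib.request.urlopen can be passed in directly, since it iterates over the body line by line. At a few tokens per second, printing fragments as they arrive makes the wait much easier to live with.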
Keep it running (systemd)
To keep the server up across reboots and SSH disconnects, run it as a service:
sudo tee /etc/systemd/system/bonsai.service << 'EOF'
[Unit]
Description=Bonsai ternary LLM (llama-server, CPU-only)
After=network.target
[Service]
Type=simple
User=%i
WorkingDirectory=%h/llama.cpp/build-cpu
ExecStart=%h/llama.cpp/build-cpu/bin/llama-server \
-m %h/models/bonsai/Ternary-Bonsai-8B-Q2_0.gguf \
-ngl 0 -t 12 -fa 1 -c 4096 --host 0.0.0.0 --port 8080
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now bonsai.service
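In a plain (non-template) system unit like this one, %i expands to an empty string and %h resolves to root’s home directory (/root), not yours, so replace %i with a concrete user name and each %h with that user’s home path before enabling the service. Alternatively, install it as a user service under ~/.config/systemd/user/, where %h is your own home; in that case remove the User= line and use WantedBy=default.target. Because the build is CPU-only, there’s no LD_LIBRARY_PATH to set, so the unit stays as simple as it looks.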
Reach it remotely
Started with --host 0.0.0.0, the server is reachable from any machine that can route to it. Two clean options:
- Tailscale: if both machines are on the same Tailnet, run tailscale ip -4 on the server and open http://100.x.x.x:8080 from anywhere. Zero config, no port forwarding, encrypted mesh.
- SSH tunnel: no Tailscale? Forward the port from your laptop with ssh -L 8080:localhost:8080 user@server, then use http://localhost:8080 locally. In that case you can leave off --host 0.0.0.0 and keep the server bound to localhost.
The honest CPU numbers
Measured with llama-bench on a Ryzen 9 3900X (12 cores / 24 threads, DDR4), -ngl 0 -fa 1, no GPU. PP512 = 512-token prompt processing (prefill); TG128 = 128-token generation.
| Model | Quant | Size | Prefill (PP512) | Generation (TG128) |
|---|---|---|---|---|
| Bonsai 8B | Q2_0 (ternary) | 2.03 GB | 439.5 tok/s | 3.2 tok/s |
| Bonsai 4B | Q2_0 (ternary) | 1.0 GB | 704.6 tok/s | 4.9 tok/s |
Two things jump out. Prefill is fast: at ~440 tok/s the model reads a 512-token prompt in a little over a second. Generation is slow and memory-bandwidth-bound: because producing each token means streaming all of the weights from RAM, the smaller 4B (4.85 ± 0.04 tok/s) generates about 50% faster than the 8B (3.22 ± 0.23 tok/s), since it reads half the bytes per token. The 8B is still usually the model to run: you get double the parameters at only a ~35% generation-speed penalty, which is not free but is cheap. Reach for the 4B when generation throughput or RAM matters more than quality.
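The "about 50% faster" and "~35% penalty" figures describe the same gap from the two sides. A quick check using the mean generation rates quoted above (the variable names are just for illustration):

```python
tg_8b = 3.22  # tok/s, 8B generation (TG128 mean)
tg_4b = 4.85  # tok/s, 4B generation (TG128 mean)

speedup_4b = tg_4b / tg_8b - 1   # how much faster the 4B is, relative to the 8B
penalty_8b = 1 - tg_8b / tg_4b   # how much slower the 8B is, relative to the 4B

print(f"4B generates {speedup_4b:.0%} faster than the 8B")
print(f"8B generates {penalty_8b:.0%} slower than the 4B")
```

The two percentages differ only because each one uses a different model as its baseline.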
How prefill scales with threads (8B, PP512)
| Threads | Prefill tok/s | Notes |
|---|---|---|
| 1 | 356.6 | Single-core baseline |
| 2 | 394.4 | Marginal gain |
| 4 | 424.3 | Good scaling |
| 8 | 439.6 | Sweet spot |
| 12 | 442.5 | Peak — one thread per physical core |
| 16 | 442.1 | No further gain |
| 24 | 422.8 | Regression — SMT contention |
Prefill scales modestly — a single core already does 80% of peak because each AVX2 core is doing serious work. The sweet spot is 8–12 threads, one per physical core; pushing to all 24 SMT threads actually regresses. Generation, meanwhile, is essentially single-core-bound no matter how many threads you give it.
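To see the scaling at a glance, the table can be expressed as a fraction of peak throughput. This sketch only re-uses the measured numbers from the table above:

```python
# Prefill tok/s by thread count, copied from the table (8B, PP512).
prefill = {1: 356.6, 2: 394.4, 4: 424.3, 8: 439.6, 12: 442.5, 16: 442.1, 24: 422.8}

peak_threads = max(prefill, key=prefill.get)
peak = prefill[peak_threads]

for threads, rate in prefill.items():
    print(f"{threads:>2} threads: {rate:6.1f} tok/s  {rate / peak:6.1%} of peak")

print(f"peak at {peak_threads} threads")
```

The single-thread row works out to roughly 80% of peak, and the 24-thread row falls back below the 8-thread result.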
CPU vs Apple Silicon vs GPU
Where does a desktop x86 CPU land? Same 8B Q2_0 model, same PP512/TG128 test:
| Hardware | Backend | Prefill | Generation |
|---|---|---|---|
| Ryzen 3900X (this guide) | CPU · AVX2, 12t | 440 tok/s | 3.2 tok/s |
| Apple M4 Pro 48 GB | CPU · NEON, 10t | 146 tok/s | 32 tok/s |
| Apple M4 Pro 48 GB | Metal (GPU) | 455 tok/s | 76 tok/s |
The Ryzen’s prefill (440 tok/s) is neck-and-neck with the M4’s GPU (455 tok/s) and 3× the M4’s CPU — the add-only ternary matmul is well-optimized for x86 AVX2. But generation is ~10× slower than the M4 CPU (3.2 vs 32 tok/s): Apple Silicon’s huge unified-memory bandwidth wins the autoregressive loop, which the x86 Q2_0 path hasn’t caught up on. The takeaway for CPU users: budget your prompts freely, but treat generation as a background job, not an interactive one.
Ternary, threads, and common issues
What is ternary quantization? How does 1.58-bit work?
Most LLM quantization goes 16-bit → 8-bit → 4-bit. Bonsai goes all the way to 1.58-bit: each weight is stored as one of three values: −1, 0, or +1. That’s “ternary”: a three-state value carries log₂ 3 ≈ 1.58 bits of information.
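The arithmetic behind those figures is short. The sketch below assumes a round 8 billion parameters and 2 stored bits per weight (as the Q2_0 name suggests), and it ignores per-group scales and any tensors kept at higher precision, so treat the sizes as back-of-envelope estimates rather than exact file sizes.

```python
import math

bits_per_ternary_weight = math.log2(3)  # information in a three-state value
print(f"{bits_per_ternary_weight:.3f} bits per ternary weight")

params = 8e9  # assumed round parameter count

fp16_gb = params * 16 / 8 / 1e9    # 16 bits per weight
packed_gb = params * 2 / 8 / 1e9   # 2 bits per weight (assumed storage width)

print(f"FP16: {fp16_gb:.1f} GB, 2-bit packed: {packed_gb:.1f} GB")
print(f"ratio: {fp16_gb / packed_gb:.0f}x smaller")
```

That ratio is where the "eighth of its FP16 size" figure comes from. Storing 2 bits for a value that needs only about 1.58 wastes a little space in exchange for simple, byte-aligned unpacking.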
The payoff is that matrix multiplications become addition-only — multiplying an activation by −1, 0, or +1 is just a sign flip, a skip, or a copy, so no real multiply is needed. On a CPU that’s a big win for prefill, where you multiply the weight matrix against many token vectors at once (compute-bound work the add-only path speeds up). It helps less for generation, which is dominated by streaming the whole weight matrix from RAM for each single token — a bandwidth problem the quantization can’t fix. The other tradeoff is precision: an 8B ternary model still beats a much smaller full-precision model on most benchmarks, but a 4-bit 8B will edge it on hard reasoning.
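To make "addition-only" concrete, here is a toy matrix-vector product over ternary weights in plain Python. It is an illustration of the idea, not the Prism kernel: real implementations work on packed, vectorized data and typically apply a per-group scale factor afterwards, which this sketch leaves out.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary matrix by a vector using only adds and subtracts."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi   # +1: add the activation
            elif w == -1:
                acc -= xi   # -1: subtract the activation
            # 0: skip the activation entirely
        out.append(acc)
    return out


def reference_matvec(weights, x):
    """Ordinary multiply-accumulate version, for comparison."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


W = [[1, 0, -1],
     [-1, -1, 1]]
x = [0.5, 2.0, -1.0]

print(ternary_matvec(W, x))
print(reference_matvec(W, x))
```

Both functions return the same vector; the ternary version simply never multiplies an activation by a weight.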
Why is prefill so fast but generation so slow?
They stress different resources. Prefill (reading your prompt) processes many tokens in parallel, so it’s compute-bound — and the add-only ternary matmul is exactly what AVX2 does well, so all 12 cores stay busy at ~440 tok/s. Generation produces one token at a time, and each token requires reading the entire 2 GB of weights from RAM. That makes it memory-bandwidth-bound: adding cores doesn’t help because they’re all waiting on the same memory bus. Hence ~3 tok/s regardless of thread count — and why the 4B, with only half the weights to stream per token, generates about 50% faster (4.9 vs 3.2 tok/s) rather than matching the 8B. It’s a property of CPU generation in general, sharpened by the on-the-fly 2-bit unpacking.
Is 3 tok/s actually usable?
Depends on the job. For interactive chat where you watch it type, no — a 200-token answer takes about a minute, and a 500-token one around 2.5 minutes. For batch and background work, absolutely: overnight summarization, scripted extraction/classification pipelines, drafting where you queue a request and come back, or any tool that calls the API and does something else while it waits. Prefill being fast means large prompts (long documents, big system prompts) barely add latency — the cost is almost entirely in how many tokens you ask it to write. Keep -n tight and it stays practical.
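A rough latency budget follows directly from the two benchmark rates. This sketch assumes the PP512 and TG128 figures hold at other prompt and output lengths, which is optimistic for very long contexts, so read the results as lower bounds. The function name is illustrative.

```python
PREFILL_TOK_S = 439.5  # 8B Q2_0, PP512 from the benchmark table
GEN_TOK_S = 3.22       # 8B Q2_0, TG128 from the benchmark table


def estimate_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Estimate wall-clock seconds for one request at the benchmarked rates."""
    return prompt_tokens / PREFILL_TOK_S + output_tokens / GEN_TOK_S


for prompt, output in [(50, 200), (50, 500), (2000, 200)]:
    secs = estimate_seconds(prompt, output)
    print(f"{prompt:>5} prompt + {output:>3} output tokens: ~{secs:5.0f} s")
```

Going from a 50-token to a 2,000-token prompt adds only a few seconds, while each extra 100 output tokens adds about half a minute.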
Q2_0 vs Q4_0-lossless: which should I use on CPU?
🌲 Q2_0 — native ternary
2.03 GB · 440 tok/s prefill
- Smallest footprint — ~2.5 GB RAM total
- Best prefill throughput on CPU
- The real 1.58-bit format the benchmarks use
- Requires the Prism fork to load
📦 Q4_0-lossless — fork-free
4.29 GB · same gen ceiling
- Lossless re-encoding of ternary weights in 4-bit
- Loads on any llama.cpp build — no fork
- Larger on disk and in RAM (~4.3 GB)
- Generation speed is the same on CPU (bandwidth-bound)
Short version: if you’re building the Prism fork anyway (this guide does), use Q2_0 — it’s smaller, prefills faster, and is the real thing. Choose Q4_0-lossless only if you want to avoid the fork and run on a stock llama.cpp you already have.
8B or 4B on CPU?
Usually the 8B. On CPU, generation speed is set by memory bandwidth, so the 4B — with half the weights to stream per token — does generate faster (4.9 vs 3.2 tok/s, about 50%) and prefills faster (705 vs 440 tok/s). But the 8B is the stronger model, and you get double the parameters for only a ~35% generation-speed penalty, so it’s the better default for general use. Pick the 4B when you’re on a RAM-constrained box where 1 GB vs 2 GB matters, or when generation throughput matters more than quality — e.g. prompt-heavy, generation-light pipelines.
How many threads should I use?
Set -t to your physical core count — 12 on the Ryzen 3900X. Prefill peaks at 8–12 threads; going up to all 24 SMT threads actually slows it down from SMT contention. Generation is single-core-bound and won’t speed up with more threads at all, so don’t bother oversubscribing. If you’re on a different CPU, match -t to physical cores and leave one free if the machine is doing other work.
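On Linux, one way to find the physical core count is to count unique (physical id, core id) pairs in /proc/cpuinfo. The sketch below assumes the x86 layout of that file; on platforms that omit those fields it returns 0, and you would fall back to another method such as lscpu. The sample text is hand-written to mimic a 2-core, 4-thread CPU.

```python
def count_physical_cores(cpuinfo_text: str) -> int:
    """Count unique (physical id, core id) pairs in /proc/cpuinfo text."""
    cores = set()
    physical_id = core_id = None
    # A trailing blank line guarantees the final record is flushed.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            if core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "physical id":
            physical_id = value.strip()
        elif key == "core id":
            core_id = value.strip()
    return len(cores)


# Hand-written sample: four logical CPUs sharing two physical cores.
sample_cpuinfo = """\
processor   : 0
physical id : 0
core id     : 0

processor   : 1
physical id : 0
core id     : 1

processor   : 2
physical id : 0
core id     : 0

processor   : 3
physical id : 0
core id     : 1
"""

print(count_physical_cores(sample_cpuinfo))
```

On a real machine, pass in open("/proc/cpuinfo").read() and use the result as the value for -t.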
llama-bench hangs / produces no output on a CPU-only build
On some builds (Prism fork c85e97a), a CPU-only llama-bench loads the model but then hangs without printing any results. If you hit this, benchmark with a CUDA-linked build run with -ngl 0 instead, or just skip benchmarking and use the published numbers above. It doesn’t affect inference — llama-cli and llama-server work fine on the clean CPU build; only the benchmarking tool is affected.
error while loading shared libraries: libcudart.so.12
Your llama.cpp was built with CUDA linkage, so the binary references the CUDA runtime at startup even though you’re running -ngl 0. Two fixes: rebuild cleanly with -DGGML_CUDA=OFF (recommended — the binary then has zero CUDA deps), or point the loader at your CUDA runtime with LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH (or the lib dir inside your conda env) before the command. The setup script builds CPU-only, so you won’t hit this if you use it.
unknown model architecture / failed to load model on Q2_0
You’re loading a Q2_0 ternary model with upstream llama.cpp instead of the Prism fork. The native ternary kernel (Q2_0 with g128 grouping) only exists in PrismML-Eng/llama.cpp on the prism branch. Clone and build that (step 1), or switch to the Q4_0-lossless quant, which any build can load.
Do I need a GPU or CUDA installed at all?
No. That’s the entire point of this guide. Build with -DGGML_CUDA=OFF and run with -ngl 0 and nothing touches the GPU — no driver, no CUDA toolkit, no VRAM. It runs the same on a GPU-less server or VPS as it does on a workstation. If a GPU is present, it’s simply ignored.
Going further
The setup above is everything you need to run an 8B model on any CPU. If you do have a Pascal GPU and want interactive speed, the other guides cover the GPU path — Gemma-4 vision for multimodal, or Krea 2 Turbo for image generation — to round out a local stack.