3 images generated · June 30, 2026 · on a GTX 1070

Krea 2 Turbo on an 8 GB Pascal GPU

Image generation with a 12-billion-parameter diffusion model on a graphics card from 2016. It works, but there are five undocumented traps that could cost you hours. This guide walks past all of them.

Model: Krea 2 Turbo Q2_K

GPU: GTX 1070 · 8 GB · Pascal

Per frame: ~155 seconds

Resolution: 640 × 640


What it does for you

Generate images locally. No cloud.

Krea 2 Turbo generates 640×640 images from a text prompt on your 8 GB Pascal card — about 2.5 minutes per image at good quality. No cloud credits, no API keys, no upload of your prompts to anyone. The model runs entirely on the GPU you already own.

The catch is setup. A GTX 1070 and an RTX 3060 Ti both have 8 GB of VRAM, but Pascal has memory-allocation quirks the newer card hides from you. Get past them once with the script below and you have a local image generator that costs nothing per image. The benchmarks, the pipeline internals, and every trap are in the FAQ — you don’t need them to start.

Proof it works

Three frames generated with a “hand-drawn colored pencil” style prompt. 640×640, 20 refinement steps, Euler sampler, no LoRAs.

Thomas pulling Annie and Clarabel carriages through the Sodor countryside, hand-drawn colored pencil style — Thomas & Annie & Clarabel — seed 1001 · 970 KB · 154s

Thomas at Tidmouth Sheds with Gordon, James, and Percy — Tidmouth Sheds — seed 1002 · 978 KB · 155s

Thomas crossing a stone viaduct over a river — Viaduct crossing — seed 1003 · 946 KB · 155s

Setup

Three steps to your first image

The one-step script handles everything from the CUDA check through building stable-diffusion.cpp and a test generation. Budget 15–30 minutes, mostly download and compile time.

1. Get the setup script

Linux:

Download krea2_pascal_setup.sh

chmod +x krea2_pascal_setup.sh && ./krea2_pascal_setup.sh

Requires an NVIDIA driver and sudo for system packages.

Windows (PowerShell as Administrator):

Download krea2_pascal_setup.ps1

Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
.\krea2_pascal_setup.ps1

Installs VS Build Tools, CMake, and Miniconda automatically.

Building by hand? The one flag that matters on Pascal is the CUDA architecture target. Without it, the binary compiles fine and runs — on the CPU, at ~20 minutes per frame, with no error message:

cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="61"
cmake --build build --config Release -j$(nproc)
# Binary at: build/bin/sd-cli

Full manual install (system deps, Miniconda) is in the script. The FAQ explains why this flag is invisible if you miss it.
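Before building by hand, it can help to confirm which compute capability the driver reports, so the architecture value is read off the card rather than guessed. The sketch below assumes a driver recent enough that nvidia-smi supports the compute_cap query field; cap_to_arch is a hypothetical helper written for this example, not part of stable-diffusion.cpp.

```shell
# Hypothetical helper: turn a compute capability such as "6.1"
# into the value CMAKE_CUDA_ARCHITECTURES expects ("61").
cap_to_arch() {
  printf '%s\n' "$1" | tr -d '.'
}

# Only query the driver when nvidia-smi is installed.
# Assumption: this driver version supports the compute_cap field.
if command -v nvidia-smi >/dev/null 2>&1; then
  cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
  echo "Detected compute capability: ${cap}"
  echo "Suggested flag: -DCMAKE_CUDA_ARCHITECTURES=\"$(cap_to_arch "$cap")\""
fi
```

On a GTX 1070 the reported capability should be 6.1, which maps to the "61" used in the cmake command above.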

2. Download the models

Three files, ~7.1 GB total. Pull them straight from HuggingFace into the paths below:

mkdir -p ~/models/krea2 ~/models/vae ~/models/text-encoders

# Krea 2 Turbo diffusion model — Q2_K GGUF (~4.55 GB)
curl -L -C - --progress-bar -o ~/models/krea2/krea2_turbo-Q2_K.gguf \
  https://huggingface.co/vantagewithai/Krea-2-Turbo-GGUF/resolve/main/krea2_turbo-Q2_K.gguf

# Qwen3-4B text encoder — Q4_K_M GGUF (~2.33 GB)
curl -L -C - --progress-bar -o ~/models/text-encoders/Qwen3-4B-Instruct-2507-Q4_K_M.gguf \
  https://huggingface.co/lmstudio-community/Qwen3-4B-Instruct-2507-GGUF/resolve/main/Qwen3-4B-Instruct-2507-Q4_K_M.gguf

# qwen_image VAE — safetensors (~242 MB)
curl -L -C - --progress-bar -o ~/models/vae/qwen_image_vae.safetensors \
  https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/vae/qwen_image_vae.safetensors

The diffusion repo (vantagewithai/Krea-2-Turbo-GGUF) carries the full quant ladder — Q2_K through Q8_0 — if you want to trade VRAM for quality on a larger card. The text encoder stays on the CPU at run time (--backend "llm=cpu"), so only ~4.8 GB of the 7.1 GB ever touches VRAM. The setup script downloads all three for you.
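An interrupted download leaves a truncated file that only fails later, at model load. A minimal sanity check, assuming the paths from the commands above: check_model_file is a hypothetical helper, and the byte thresholds are rough lower bounds picked for illustration, not published sizes or checksums.

```shell
# Hypothetical helper: succeed only if the file exists and is at least
# min_bytes long. A much smaller file usually means a partial download.
check_model_file() {
  path="$1"
  min_bytes="$2"
  if [ ! -f "$path" ]; then
    echo "missing: $path"
    return 1
  fi
  size="$(wc -c < "$path" | tr -d ' ')"
  if [ "$size" -lt "$min_bytes" ]; then
    echo "too small (${size} bytes): $path"
    return 1
  fi
  echo "ok: $path (${size} bytes)"
}

# Check all three files; thresholds are illustrative lower bounds.
check_all_models() {
  status=0
  check_model_file "$HOME/models/krea2/krea2_turbo-Q2_K.gguf" 4000000000 || status=1
  check_model_file "$HOME/models/text-encoders/Qwen3-4B-Instruct-2507-Q4_K_M.gguf" 2000000000 || status=1
  check_model_file "$HOME/models/vae/qwen_image_vae.safetensors" 200000000 || status=1
  return "$status"
}
```

Run check_all_models after the downloads finish. If a file is reported as too small, re-run its curl command; the -C - flag resumes where it stopped.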

3. Run your first image

fuser ships in the psmisc package, which isn’t installed on a minimal Ubuntu image. If you see fuser: command not found, run sudo apt install psmisc first (the setup script does this for you).

# Clear any lingering GPU processes first
fuser /dev/nvidia* 2>/dev/null | xargs -r kill 2>/dev/null
sleep 2

~/stable-diffusion.cpp/build/bin/sd-cli \
  --diffusion-model ~/models/krea2/krea2_turbo-Q2_K.gguf \
  --vae ~/models/vae/qwen_image_vae.safetensors \
  --llm ~/models/text-encoders/Qwen3-4B-Instruct-2507-Q4_K_M.gguf \
  --backend "llm=cpu" --vae-tiling --diffusion-fa \
  --cfg-scale 1.0 --steps 20 --sampling-method euler \
  -H 640 -W 640 --seed 42 \
  --output ~/test.png \
  -p "A cute robot painting a picture in a garden, \
Pixar style 3D render, Disney CG animation, \
cinematic lighting, subsurface scattering, \
detailed textured skin, volumetric rim light, \
cgi film still, highly detailed, sharp focus"

This takes roughly 2.5 minutes — about 145–155 s on a GTX 1070, depending on thermals and what else is on the card. A 640×640 PNG with no errors means you’re set:

file ~/test.png   # should say "PNG image data, 640 x 640"

Want to be sure the GPU did the work and not the CPU? Run watch -n 0.5 nvidia-smi in a second terminal during generation — you should see sd-cli holding ~4.9 GB of VRAM with GPU-Util above 0%. If VRAM stays near zero, the build is silently on the CPU (see the FAQ on the architecture flag).
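If you would rather log the check than eyeball it, nvidia-smi can print just the two numbers that matter. A sketch, assuming a single GPU: gpu_looks_busy and watch_gpu are hypothetical helpers, and the 1000 MiB threshold is an arbitrary illustration value well below the ~4.9 GB you should see.

```shell
# Hypothetical helper: given one CSV line "used_mib, util_percent",
# succeed if the GPU holds real memory and is doing work.
gpu_looks_busy() {
  used="$(printf '%s' "$1" | cut -d',' -f1 | tr -d ' ')"
  util="$(printf '%s' "$1" | cut -d',' -f2 | tr -d ' ')"
  [ "$used" -gt 1000 ] && [ "$util" -gt 0 ]
}

# Sample once per second while sd-cli runs in another terminal.
# Stop with Ctrl-C.
watch_gpu() {
  nvidia-smi --query-gpu=memory.used,utilization.gpu \
    --format=csv,noheader,nounits -l 1 |
  while IFS= read -r line; do
    if gpu_looks_busy "$line"; then
      echo "GPU active: $line"
    else
      echo "GPU idle:   $line"
    fi
  done
}
```

Call watch_gpu in a second terminal during generation. A run that prints only "GPU idle" lines for the whole generation points to the CPU-only build described in the FAQ.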

Prompt engineering

Working prompts for Pascal

The full prompts and seeds for the three Thomas frames above. The scene description and style are separated by a comma — this matters because the text encoder treats them as distinct concepts.

The style suffix

hand drawn colored pencil illustration, soft textured paper grain,
visible pencil strokes and crosshatching, crayon texture,
childrens book illustration, warm earth tones, sketchbook quality,
soft shading, gentle color palette, storybook style, cozy nostalgic feel

Frame prompts

Frame	Seed	Scene prompt (before the style suffix)
1	1001	`Thomas the Tank Engine pulling blue Annie and Clarabel carriages along a railway track through the green countryside of Sodor, rolling hills and meadows, blue sky with fluffy clouds, distant farmhouse`
2	1002	`Thomas the Tank Engine parked at Tidmouth engine sheds roundhouse with other steam engines, Gordon the big blue engine, James the red engine, Percy the little green engine, railway yard setting`
3	1003	`Thomas the Tank Engine crossing a stone railway viaduct bridge over a sparkling river in Sodor, forested hills in background, steam puffing from funnel, sunny day`

Style tips: Longer prompts (~40 words for colored pencil vs ~27 for Pixar) don’t cost VRAM since the text encoder is on CPU. Append style with a , (comma+space) separator. Without the comma, the text encoder fuses the scene and style into a single run-on concept.
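The scene-plus-style structure is easy to get wrong by hand when you generate many frames, so it can be worth building the prompt in the shell. A minimal sketch: build_prompt is a hypothetical helper, and STYLE is the style suffix from this guide joined into one line.

```shell
# The style suffix from this guide, on a single line.
STYLE="hand drawn colored pencil illustration, soft textured paper grain, visible pencil strokes and crosshatching, crayon texture, childrens book illustration, warm earth tones, sketchbook quality, soft shading, gentle color palette, storybook style, cozy nostalgic feel"

# Hypothetical helper: join scene and style with the ", " separator.
build_prompt() {
  printf '%s, %s\n' "$1" "$2"
}

# Example: print the full prompt for frame 3 (seed 1003).
build_prompt "Thomas the Tank Engine crossing a stone railway viaduct bridge over a sparkling river in Sodor, forested hills in background, steam puffing from funnel, sunny day" "$STYLE"
```

Pass the printed string to sd-cli with -p, together with the matching --seed value from the table.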

Performance

What you get on a 2016 GPU

The 20-step refinement phase is memory-bandwidth-bound. The GTX 1070’s 256 GB/s GDDR5 runs at ~6.9 seconds per step. An RTX 3060 Ti (448 GB/s GDDR6) runs at ~3–4 seconds — roughly 2× faster for the same quality.

Metric	GTX 1070	RTX 3060 Ti
Memory bandwidth	256 GB/s	448 GB/s
Refinement speed	~6.9 s/it	~3–4 s/it
Frame time (20 steps)	~155 s	~80 s
Diffusion throughput	~3.75 GB/s	~7+ GB/s

Tuning frame time: The --steps N flag controls only the refinement phase. Lower it to 10 for ~90 s frames (lower quality), or raise it to 30–40 for better quality at ~200–270 s. The 430 internal diffusion steps are fixed by the model and don’t change with this flag.
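To get a feel for the trade-off before committing to a long batch, a linear estimate is usually enough. The sketch below assumes frame time is a fixed overhead plus 6.9 s per refinement step. The 17 s overhead is an assumption, derived as 155 s minus 20 × 6.9 s, not a measured value, and estimate_frame_seconds is a hypothetical helper.

```shell
# Rough linear model: frame_time = overhead + steps * seconds_per_step.
# 6.9 s/step is the GTX 1070 figure quoted in this guide.
# The 17 s overhead is an assumption (155 s - 20 * 6.9 s).
estimate_frame_seconds() {
  awk -v steps="$1" 'BEGIN { printf "%.0f\n", 17 + steps * 6.9 }'
}

# Print estimates for a few step counts.
for steps in 10 20 30; do
  echo "--steps ${steps}: ~$(estimate_frame_seconds "$steps") s"
done
```

Treat the results as ballpark figures. Thermals and other load on the card move the real numbers by several seconds per frame.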

FAQ

Troubleshooting & the Pascal traps

Every problem that cost hours of trial and error, one entry each. Jump to the one you need.

Why does it take 155 seconds? What are the phases?

One frame runs through five phases. Knowing them helps you read the progress bars and spot where a failure happened:

Phase	Steps	Time	What it does
1. Model load	386 tensors	~1.3 s	Load Krea2 weights + VAE + LLM into VRAM/RAM
2. Text conditioning	—	~2.7 s	Qwen3-4B encodes your prompt (on CPU)
3. Internal diffusion	430 steps	~20 s	Fixed by model architecture — raw latent generation
4. Refinement	20 steps	~138 s	`--steps 20` controls this. ~6.9 s/it on Pascal
5. VAE decode	16 tiled steps	~14 s	Decodes latent to 640×640 PNG

Phase 4 dominates, and it’s the only one --steps changes. The 430 internal steps in phase 3 are fixed by the model.

cudaMalloc: out of memory during generation

The error comes from a plain cudaMalloc call, which never spills into system RAM: if the driver cannot satisfy the request, the allocation fails outright, opaque and final. With only 8 GB on the card there is little headroom, so work through this list:

1. Kill all GPU processes: fuser /dev/nvidia* 2>/dev/null | xargs -r kill
2. Wait 3 seconds, then confirm nvidia-smi shows < 100 MiB used
3. Close browser tabs, games, or any GPU-accelerated apps
4. Verify you did not add --params-backend cpu or --offload-to-cpu (see below)
5. Running multiple frames in a script? Make sure each sd-cli invocation starts fresh; don’t reuse a long-running process
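The "confirm nvidia-smi shows < 100 MiB used" step in the checklist above is easy to automate, so a batch script refuses to start on a busy card. A sketch, assuming a single GPU: vram_is_clear and preflight are hypothetical helpers, and the 100 MiB default mirrors the checklist.

```shell
# Hypothetical helper: succeed if the reported VRAM use (in MiB)
# is below the limit. The default limit is 100 MiB.
vram_is_clear() {
  used_mib="$1"
  limit_mib="${2:-100}"
  [ "$used_mib" -lt "$limit_mib" ]
}

# Pre-flight check to run before each sd-cli invocation.
preflight() {
  used="$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1 | tr -d ' ')"
  if vram_is_clear "$used"; then
    echo "VRAM clear (${used} MiB used), safe to start"
  else
    echo "VRAM busy (${used} MiB used), free the GPU first" >&2
    return 1
  fi
}
```

In a batch script, calling preflight || exit 1 before every frame stops the run at the first sign of a leftover process instead of failing mid-generation.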

Build succeeds but runs on CPU (the CMAKE_CUDA_ARCHITECTURES flag)

cmake auto-detect skips Pascal (CC 6.1) because the default targets are CC 7.0+. Your binary compiles, links, and runs without errors — at CPU speeds, ~20 minutes per frame. There’s no warning. Rebuild with the architecture flag:

cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="61"

Verify the build output shows sm_61 in the compiler flags, not sm_70 or sm_75. Also confirm your library path includes the CUDA libraries:

export PATH="$HOME/miniconda/bin:$PATH"
export LD_LIBRARY_PATH="$HOME/miniconda/lib:$HOME/miniconda/targets/x86_64-linux/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
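If the compiler output has already scrolled away, the configured architecture can also be read back from the build directory. CMake records cache variables in CMakeCache.txt, so a sketch like the one below can confirm the flag took effect. cached_cuda_arch is a hypothetical helper, and the build path is an assumption based on the sd-cli path used in this guide.

```shell
# Hypothetical helper: print the CUDA architecture recorded in a
# CMakeCache.txt file. The entry looks like
# CMAKE_CUDA_ARCHITECTURES:STRING=61 (the type label can vary).
cached_cuda_arch() {
  sed -n 's/^CMAKE_CUDA_ARCHITECTURES:[^=]*=//p' "$1" | head -n 1
}

# Assumption: the build directory is ~/stable-diffusion.cpp/build.
cache="$HOME/stable-diffusion.cpp/build/CMakeCache.txt"
if [ -f "$cache" ]; then
  arch="$(cached_cuda_arch "$cache")"
  if [ "$arch" = "61" ]; then
    echo "Build is configured for sm_61"
  else
    echo "Build is configured for '${arch}', expected 61; reconfigure and rebuild"
  fi
fi
```

This only checks what was configured, not what the binary does at run time, so the nvidia-smi check during generation remains the final word.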

The --params-backend cpu / --offload-to-cpu trap

You’d think offloading model weights to CPU RAM would help an 8 GB GPU. On Ampere it does. On Pascal it causes an instant OOM: the refinement phase tries to allocate a second 4.6 GB workspace in VRAM, and with weights already evicted, the workspace can’t reuse their memory region.

The fix: let weights load into VRAM (~4613 MB). The compute workspace reuses the same memory. Counterintuitive, but it’s the only way it works — do not pass --offload-to-cpu or --params-backend cpu.

Memory fragmentation — free VRAM but allocation still fails

On Pascal, even after nvidia-smi shows 8 GB free, the driver may not be able to allocate a contiguous 4.6 GB block. Fragmentation from zombie CUDA contexts is invisible and persistent. On Ampere the driver coalesces free pages and remaps virtual addresses transparently; Pascal does not.

Practical consequences on the GTX 1070:

- Kill GPU processes between runs: fuser /dev/nvidia* 2>/dev/null | xargs -r kill
- Any leftover process fragments memory
- Even Xorg (50 MiB) can break the contiguous allocation

SSH session kills the batch after a few frames

A foreground SSH session can drop after ~3 minutes without output, and the sd-cli process inside it is killed along with it. Always detach with nohup, or run the batch inside a terminal multiplexer such as tmux:

ssh user@host 'nohup bash generate.sh > run.log 2>&1 & echo PID=$!'

Generation completes but no file appears

sd-cli prints “DONE” even when VAE decode fails after diffusion. Check the full log for lines containing [ERROR]. The most common cause is an OOM during the refinement-to-VAE transition — see the cudaMalloc item above.
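Because the "DONE" message alone proves nothing, a batch script can check both the output file and the log after every frame. A minimal sketch, assuming each sd-cli run has its stdout and stderr redirected to its own log file; frame_ok is a hypothetical helper.

```shell
# Hypothetical helper: a frame counts as good only if the PNG exists,
# is non-empty, and the log has no line containing "[ERROR]".
frame_ok() {
  png="$1"
  log="$2"
  # -s is true only for an existing file with size greater than zero.
  [ -s "$png" ] || return 1
  # -F matches the literal string, so the brackets are not a regex.
  if grep -F -q '[ERROR]' "$log" 2>/dev/null; then
    return 1
  fi
  return 0
}
```

After each generation, something like frame_ok ~/test.png run.log || echo "frame failed" flags a silent failure right away, before the next frame starts.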

sd.cpp vs the prism fork — which do I build?

Krea 2 Turbo needs upstream stable-diffusion.cpp, not the prism fork or llama.cpp. The prism fork does not support the Krea2 architecture. The sd-cli binary is separate from llama.cpp and the two don’t share model files.

What’s next

Going further

The setup script and sample prompts above are enough to start generating. For multi-frame storyboard sequences (10–100+ frames with narrative structure), check the full storyboard pipeline skill in the Hermes Agent skill library.

For more frames in the colored pencil style, vary the seeds and scene descriptions in the same prompt structure. The style suffix stays the same across all frames for consistency.
