whichllm: Hardware-Aware LLM Rankings in One Command
Forget parameter count. This open-source CLI auto-detects your GPU, pulls live HuggingFace data, and ranks local models by real benchmark scores weighted for your exact rig.
Picking a local LLM used to mean consulting a spreadsheet, squinting at VRAM numbers, and hoping the model you grabbed wasn't already a generation stale. whichllm short-circuits that process: give it your hardware (or let it auto-detect), and it returns a ranked list of the best models you can actually run — scored on real evals, not raw parameter count.
The project has reached 3.3k GitHub stars and needs no setup step: the entire workflow starts with a single uvx invocation.
uvx whichllm@latest
What's Wrong with "What Fits in My VRAM?"
The tool's README makes the core argument with a concrete example. On an RTX 4090, a naive "biggest model that fits" heuristic would hand you the 32B Qwen3 variant. whichllm ranks the 27.8B Qwen3.6 first:
#1 Qwen/Qwen3.6-27B 27.8B Q5_K_M score 92.8 27 t/s
#2 Qwen/Qwen3-32B 32.0B Q4_K_M score 83.0 31 t/s
#3 Qwen/Qwen3-30B-A3B 30.0B Q5_K_M score 82.7 102 t/s
The 27B model is a newer generation and outscores the 32B on benchmarks — size alone doesn't tell you that. The #3 slot is a MoE model running at 102 t/s because whichllm scores speed on active parameters while scoring quality on total parameters, which is the correct split for mixture-of-experts architectures.
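A rough sketch of why active parameters dominate speed, assuming decoding is purely bandwidth-bound (each generated token reads every active weight once). This is not whichllm's actual formula: the function name, the 0.6 efficiency factor, and the bits-per-weight figures (about 4.8 for Q4_K_M, about 5.5 for Q5_K_M) are illustrative assumptions. The 1008 GB/s figure is the RTX 4090's published memory bandwidth, and the 3B active count is inferred from the "A3B" suffix.

```python
def tokens_per_second(active_params_b: float, bits_per_weight: float,
                      bandwidth_gb_s: float, efficiency: float = 0.6) -> float:
    """Bandwidth-bound decode estimate (hypothetical helper, not whichllm's code).

    active_params_b: billions of parameters touched per token
                     (total for dense models, active experts for MoE).
    efficiency:      assumed fraction of peak bandwidth actually achieved.
    """
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return efficiency * bandwidth_gb_s / gb_read_per_token


# Dense 32B at ~4.8 bits/weight versus a MoE with ~3B active at ~5.5 bits/weight.
dense_tps = tokens_per_second(32.0, 4.8, 1008.0)
moe_tps = tokens_per_second(3.0, 5.5, 1008.0)

print(f"dense: {dense_tps:.1f} t/s, MoE: {moe_tps:.1f} t/s")
```

The MoE estimate comes out several times faster even though both models are of similar total size. The tool's reported figures will differ, since it also applies per-quant and per-backend factors that this sketch leaves out.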
Benchmark scores are drawn from a merged pool: LiveBench, Artificial Analysis, Aider, multimodal/vision evaluations, Chatbot Arena Elo, and the Open LLM Leaderboard. Every score is tagged with a confidence grade (direct, variant, base, interpolated, or self-reported) and discounted accordingly. The tool actively rejects fabricated uploader claims and cross-family score inheritance (a small fine-tune borrowing its base model's numbers). Stale leaderboard entries are demoted along each model's lineage so a 2024 score can't outrank a current-generation one.
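Confidence discounting can be pictured as a per-grade multiplier applied before ranking. The grade names below come from the description above. The multiplier values, the `discounted_score` helper, and the sample model names are hypothetical and only illustrate the idea, not the tool's real weights.

```python
# Hypothetical multipliers: only the grade names are taken from the article.
CONFIDENCE_DISCOUNT = {
    "direct": 1.00,         # score measured on this exact model
    "variant": 0.95,        # measured on a close variant
    "base": 0.90,           # inherited from the base model
    "interpolated": 0.80,   # estimated from neighbouring models
    "self-reported": 0.60,  # claimed by the uploader
}


def discounted_score(raw_score: float, grade: str) -> float:
    """Scale a raw benchmark score by how much its provenance is trusted."""
    return raw_score * CONFIDENCE_DISCOUNT[grade]


# Illustrative candidates: (name, raw score, confidence grade).
candidates = [
    ("model-a", 90.0, "self-reported"),
    ("model-b", 85.0, "direct"),
]
ranked = sorted(candidates, key=lambda c: discounted_score(c[1], c[2]), reverse=True)
print([name for name, _, _ in ranked])
```

With these assumed weights, a directly measured 85 outranks a self-reported 90, which is the behaviour the grading scheme is meant to produce.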
VRAM and Speed Modeling
The VRAM calculation isn't a lookup table. It sums weights, GQA KV-cache, activations, and overhead. Speed estimation is bandwidth-bound and accounts for per-quant efficiency, per-backend factors, MoE active/total splits, and whether you're on unified memory (Apple Silicon) versus discrete PCIe. That last distinction matters: the same model can have meaningfully different practical throughput on an M3 Max 36 GB versus a 3090 24 GB even though the M3 Max has more addressable memory.
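The additive VRAM model described above can be sketched as follows. The KV-cache term uses the standard formula for grouped-query attention (keys and values, per layer, per KV head). The function names, the fixed activation and overhead allowances, and the example architecture (64 layers, 8 KV heads, head dimension 128) are assumptions for illustration, not values taken from whichllm.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size in GB. The factor 2 covers keys and values.

    With GQA, n_kv_heads is smaller than the number of query heads,
    which is what keeps this term manageable at long context.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9


def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     n_layers: int, n_kv_heads: int, head_dim: int,
                     context_len: int,
                     activations_gb: float = 0.5,   # assumed allowance
                     overhead_gb: float = 1.0) -> float:  # assumed allowance
    """Sum of weights, KV cache, activations, and runtime overhead."""
    weights_gb = params_b * bits_per_weight / 8
    cache_gb = kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len)
    return weights_gb + cache_gb + activations_gb + overhead_gb


# Hypothetical 32B dense model at ~4.8 bits/weight with an 8192-token context.
total_gb = estimate_vram_gb(32.0, 4.8, n_layers=64, n_kv_heads=8,
                            head_dim=128, context_len=8192)
print(f"estimated VRAM: {total_gb:.2f} GB")
```

Because the KV-cache term grows linearly with context length, the same model can fit on a 24 GB card at one context size and overflow it at a larger one, which a static lookup table would miss.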
A snapshot from the README (live data will differ):
| Hardware | VRAM | Top pick | Speed |
|---|---|---|---|
| RTX 5090 | 32 GB | Qwen3.6-27B · Q6_K · score 94.7 | ~40 t/s |
| RTX 4090 / 3090 | 24 GB | Qwen3.6-27B · Q5_K_M · score 92.8 | ~27 t/s |
| RTX 4060 | 8 GB | Qwen3-14B · Q3_K_M · score 71.0 | ~22 t/s |
| Apple M3 Max | 36 GB | Qwen3.6-27B · Q5_K_M · score 89.4 | ~9 t/s |
| CPU only | — | gpt-oss-20b (MoE) · Q4_K_M · score 45.2 | ~6 t/s |
Beyond the Ranking: The Full CLI Surface
whichllm covers several workflows beyond the default recommendation:
- GPU simulation: `whichllm --gpu "RTX 4090"` lets you test any card before buying it.
- Reverse lookup: `whichllm plan "llama 3 70b"` tells you what GPU you'd need for a specific model.
- Upgrade comparison: `whichllm upgrade "RTX 4090" "RTX 5090" "H100"` diffs candidates side by side.
- One-command chat: `whichllm run "qwen 2.5 1.5b gguf"` spins up an isolated environment via `uv`, downloads the model, and drops you into an interactive session. Supports GGUF (via `llama-cpp-python`), AWQ, and GPTQ.
- Code generation: `whichllm snippet "qwen 7b"` prints copy-paste Python for the chosen model.
- Scripting: `--json` output makes every command pipeline-friendly.
- Task profiles: filter results by `general`, `coding`, `vision`, or `math`.
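For scripting, the JSON output can be consumed from Python with the standard library. The sketch below uses only the `--gpu` and `--json` flags mentioned above. Combining them in one invocation, the flag order, and the helper names are assumptions, and since the article does not describe the JSON schema, the parsed result is returned as-is rather than indexed by field name.

```python
import json
import subprocess
from typing import Any, List, Optional


def build_command(gpu: Optional[str] = None) -> List[str]:
    """Assemble the argv for a JSON ranking (hypothetical helper)."""
    cmd = ["uvx", "whichllm@latest", "--json"]
    if gpu is not None:
        cmd += ["--gpu", gpu]  # simulate a card instead of auto-detecting
    return cmd


def run_ranking(gpu: Optional[str] = None) -> Any:
    """Run whichllm and parse its stdout as JSON; raises if the command fails."""
    result = subprocess.run(build_command(gpu), capture_output=True,
                            text=True, check=True)
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Requires uv to be installed and network access on first run.
    print(json.dumps(run_ranking("RTX 4090"), indent=2))
```

Passing the arguments as a list avoids shell quoting problems with GPU names that contain spaces.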
Data comes from the HuggingFace API with curated frozen fallbacks for offline or rate-limited environments. The benchmark snapshot date is printed under every ranking, so a stale recommendation is visible rather than silently trusted.
Install via brew install andyyyy64/whichllm/whichllm, pip install whichllm, or uv tool install whichllm. For one-offs, uvx whichllm@latest requires nothing persistent.