TurboVec: A Rust-Powered Quantised Vector Index That Fits 10M Docs in 4 GB

Built on Google Research's TurboQuant algorithm, TurboVec offers online ingest, SIMD-accelerated filtered search, and drop-in replacements for LangChain, LlamaIndex, and Haystack vector stores — no database required.

DevClubHouse Curation

Jun 8, 2026 · 4 min read

If you're building a RAG pipeline and reaching for a managed vector database mostly because the in-memory alternatives are too slow or too hungry, TurboVec is worth a look. The Rust-native library, already at 8.3k GitHub stars, pairs Google Research's TurboQuant quantisation algorithm with hand-written SIMD kernels and first-class Python bindings. It compresses a 10-million-document corpus from 31 GB (float32) down to 4 GB while matching or beating FAISS's fastest PQ scan on search speed.

What TurboQuant Actually Does

Most quantisation schemes, including FAISS's ProductQuantizer, require an offline training phase to build a codebook from your data. TurboQuant is data-oblivious: it gets within a small constant factor of the Shannon lower bound on distortion without fitting a codebook to your data, which means there's no train step, no parameter tuning, and no forced rebuild when your corpus grows. You add vectors and they're immediately searchable.
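
To make "data-oblivious" concrete, here is a rough sketch of the general idea as we understand it: apply a fixed random rotation, then quantise each coordinate against a grid that was chosen without looking at any data. This is not TurboVec's implementation. The Gram-Schmidt rotation, the uniform grid (standing in for the optimal scalar codebook the algorithm actually derives), and the names `random_rotation`, `encode`, and `decode` are all assumptions made for illustration, and inputs are assumed to be unit-normalised.

```python
import math
import random


def random_rotation(dim, seed=0):
    """Build a random orthogonal matrix by Gram-Schmidt on Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:  # remove the components along rows already chosen
            proj = sum(a * b for a, b in zip(v, r))
            v = [a - proj * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows


def encode(x, rot, bits):
    """Rotate a unit vector, then snap each coordinate to a fixed uniform grid."""
    dim = len(x)
    limit = 3.0 / math.sqrt(dim)  # rotated coordinates concentrate near 1/sqrt(dim)
    levels = 1 << bits
    step = 2.0 * limit / levels
    rotated = [sum(a * b for a, b in zip(row, x)) for row in rot]
    # Out-of-range coordinates are clipped to the first or last grid cell.
    return [min(levels - 1, max(0, int((y + limit) // step))) for y in rotated]


def decode(codes, rot, bits):
    """Map codes back to grid-cell centres and undo the rotation."""
    dim = len(codes)
    limit = 3.0 / math.sqrt(dim)
    step = 2.0 * limit / (1 << bits)
    rotated = [-limit + (c + 0.5) * step for c in codes]
    # The inverse of an orthogonal matrix is its transpose.
    return [sum(rot[i][j] * rotated[i] for i in range(dim)) for j in range(dim)]


def reconstruction_error(x, rot, bits):
    """Euclidean distance between a vector and its quantised reconstruction."""
    approx = decode(encode(x, rot, bits), rot, bits)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, approx)))


rot = random_rotation(32, seed=1)
spike = [1.0] + [0.0] * 31  # unit vector with all its energy in one coordinate
codes_4bit = encode(spike, rot, bits=4)
```

Nothing in `encode` depends on previously seen vectors, so a new vector can be quantised the moment it arrives. The rotation is what makes a fixed grid workable: it spreads the energy of a spiky vector across all coordinates so they fall in a predictable range.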

In recall benchmarks on 100K vectors at k=64, TurboQuant beats FAISS IndexPQ (LUT256, nbits=8) by 0.4–3.4 points at R@1 across OpenAI's d=1536 and d=3072 embedding dimensions at both 2-bit and 4-bit widths. High-dimensional embeddings are where TurboQuant's asymptotic Beta distribution assumption holds tightest; lower-dimensional spaces like GloVe d=200 show narrower (and occasionally reversed) margins at 2-bit.
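
For readers unfamiliar with the metric, the sketch below shows one common way such a number is computed, assuming "R@1 at k=64" means the fraction of queries whose exact nearest neighbour appears somewhere in the top 64 approximate results. That reading is an assumption, and `recall_at_1` is a hypothetical helper rather than part of TurboVec or FAISS.

```python
def recall_at_1(true_nn, retrieved, k):
    """Fraction of queries whose exact nearest neighbour is in the top-k list.

    true_nn   -- one ground-truth ID per query (from exact search)
    retrieved -- one ranked list of candidate IDs per query (from the index)
    """
    hits = sum(1 for truth, cand in zip(true_nn, retrieved) if truth in cand[:k])
    return hits / len(true_nn)


# Toy data: three queries, ground truth found for the first two only.
truth = [7, 2, 9]
ranked = [[7, 1, 4], [5, 2, 8], [3, 6, 0]]
score = recall_at_1(truth, ranked, k=3)
```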

The SIMD Story

Quantisation only helps if the distance computation stays fast. TurboVec ships hand-written NEON kernels for ARM and AVX-512BW kernels for x86. Against FAISS IndexPQFastScan — the current fastest production-grade PQ scan in FAISS — TurboVec is 12–20% faster on ARM and matches or beats it on x86.

Filtered search is implemented directly inside the SIMD loop rather than as a post-processing step. Passing an allowlist of external IDs (or a slot bitmask) to search() causes the kernel to short-circuit entire 32-vector blocks with no allowed slots before touching any LUT lookup or scoring work. Selective filters therefore skip most SIMD cost rather than paying it and discarding results:

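from turbovec import IdMapIndex
import numpy as np

# `vectors`, `ids`, `query`, `db` and `t` are assumed to be defined
# elsewhere in your pipeline.
idx = IdMapIndex(dim=1536, bit_width=4)
idx.add_with_ids(vectors, ids)

# Narrow to a tenant's documents via SQL, BM25, ACL, etc.
# fetchall() returns one-element rows, so unpack them into a flat ID array.
rows = db.execute("SELECT id FROM docs WHERE tenant=?", (t,)).fetchall()
allowed = np.array([row[0] for row in rows], dtype=np.uint64)

# Use a new name for the hits so the original `ids` array is not shadowed.
scores, hit_ids = idx.search(query, k=10, allowlist=allowed)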

The result length is min(k, len(allowed)) — no over-fetching, no padded fallbacks when the allowlist is smaller than k.
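
The block-level short-circuit can be illustrated without any SIMD. The sketch below is a plain-Python model of the idea, not TurboVec's kernel: `build_block_masks`, `filtered_search`, and the `scores_fn` callback are hypothetical names, and the per-slot scoring stands in for the LUT work the real kernel would do.

```python
import heapq

BLOCK = 32  # vectors per block, matching the block size described above


def build_block_masks(allowed_slots, n_slots):
    """One integer per block; bit i set means slot block*32+i is allowed."""
    masks = [0] * ((n_slots + BLOCK - 1) // BLOCK)
    for slot in allowed_slots:
        masks[slot // BLOCK] |= 1 << (slot % BLOCK)
    return masks


def filtered_search(scores_fn, n_slots, allowed_slots, k):
    """Return (top-k (score, slot) pairs, number of blocks actually scored)."""
    masks = build_block_masks(allowed_slots, n_slots)
    candidates = []
    blocks_scored = 0
    for b, mask in enumerate(masks):
        if mask == 0:
            continue  # no allowed slot in this block: skip all scoring work
        blocks_scored += 1
        for i in range(BLOCK):
            if (mask >> i) & 1:
                slot = b * BLOCK + i
                candidates.append((scores_fn(slot), slot))
    # nlargest never pads, so the result has min(k, len(allowed_slots)) entries.
    return heapq.nlargest(k, candidates), blocks_scored


# 320 slots form 10 blocks; the three allowed slots touch only blocks 0 and 6.
top, scored = filtered_search(lambda s: float(s), 320, [3, 5, 200], k=10)
```

With a selective filter, most blocks have an all-zero mask and cost a single comparison each, which is where the saving over filter-after-search comes from.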

IdMapIndex also supports O(1) deletes by external uint64 ID, which TurboQuantIndex (the slot-addressed variant) doesn't provide.
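
One common way to get constant-time deletes by external ID is swap-remove: keep a hash map from ID to slot and, on delete, move the last slot into the hole. The article does not say how TurboVec implements this, so the sketch below is only an illustration of that general technique, and the `IdMap` class is hypothetical.

```python
class IdMap:
    """Maps external IDs to dense slots, with O(1) add and remove."""

    def __init__(self):
        self.slot_of = {}  # external ID -> slot
        self.id_at = []    # slot -> external ID

    def add(self, ext_id):
        if ext_id in self.slot_of:
            raise KeyError(f"duplicate id {ext_id}")
        slot = len(self.id_at)
        self.slot_of[ext_id] = slot
        self.id_at.append(ext_id)
        return slot

    def remove(self, ext_id):
        """Free the slot for ext_id by moving the last entry into it."""
        slot = self.slot_of.pop(ext_id)
        last_id = self.id_at.pop()
        if last_id != ext_id:
            # A real index would move the quantised codes for last_id as well.
            self.id_at[slot] = last_id
            self.slot_of[last_id] = slot
        return slot


m = IdMap()
for ext_id in (10, 20, 30):
    m.add(ext_id)
m.remove(10)  # 30 moves into slot 0; slots stay dense
```

Keeping slots dense means the scan never has to step over tombstones, at the cost of slot numbers not being stable across deletes.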

Dropping Into Your Existing Stack

TurboVec ships optional extras that act as drop-in replacements for the stock vector stores in four Python RAG and agent frameworks:

Framework	Install	Replaces
LangChain	`pip install turbovec[langchain]`	`InMemoryVectorStore`
LlamaIndex	`pip install turbovec[llama-index]`	`SimpleVectorStore`
Haystack	`pip install turbovec[haystack]`	`InMemoryDocumentStore`
Agno	`pip install turbovec[agno]`	`LanceDb`

Same public surface, same persistence semantics — swap the import and keep the rest of your pipeline intact.

When It Makes Sense

TurboVec occupies a specific niche: high-throughput similarity search at single-machine scale, particularly where memory budget, latency, or data privacy rules out a hosted vector DB. The pure-local design means no data leaves your machine or VPC, making it a natural fit for air-gapped RAG stacks paired with local embedding models. The tradeoff is that you're taking on the operational responsibility that a managed service would otherwise handle — replication, persistence beyond a single node, and anything requiring distributed search across shards isn't in scope here.

For teams already fighting memory pressure on a large embedding corpus, the 7–8× compression ratio from 4-bit quantisation alone may justify the switch from a float32 in-memory store.
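
The headline numbers are easy to sanity-check with back-of-envelope arithmetic. The article does not state the embedding dimension behind the 31 GB figure, so the sketch below assumes d=768, which happens to reproduce it. The 12 bytes of per-vector overhead (an 8-byte ID plus a 4-byte norm) is likewise a hypothetical figure, not a documented TurboVec layout.

```python
def index_bytes(n_vectors, dim, bits_per_dim, overhead_bytes_per_vector=0):
    """Approximate storage for n_vectors codes of dim coordinates each."""
    code_bytes = (dim * bits_per_dim + 7) // 8  # round up to whole bytes
    return n_vectors * (code_bytes + overhead_bytes_per_vector)


N, DIM = 10_000_000, 768  # DIM is an assumption, see above

full = index_bytes(N, DIM, 32)    # float32 baseline
packed = index_bytes(N, DIM, 4)   # 4-bit codes only
packed_meta = index_bytes(N, DIM, 4, overhead_bytes_per_vector=12)

print(f"float32:          {full / 1e9:.2f} GB")
print(f"4-bit codes:      {packed / 1e9:.2f} GB ({full / packed:.1f}x)")
print(f"4-bit + metadata: {packed_meta / 1e9:.2f} GB ({full / packed_meta:.2f}x)")
```

The codes alone give exactly 8×. Any per-vector metadata pulls the ratio below that, which is consistent with the 7–8× range quoted above.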

