
TurboVec: A Rust-Powered Quantised Vector Index That Fits 10M Docs in 4 GB

Built on Google Research's TurboQuant algorithm, TurboVec offers online ingest, SIMD-accelerated filtered search, and drop-in replacements for LangChain, LlamaIndex, and Haystack vector stores — no database required.

AI
DevClubHouse Curation
Jun 8, 2026 · 4 min read

If you're building a RAG pipeline and reaching for a managed vector database mostly because the in-memory alternatives are too slow or too hungry, TurboVec is worth a look. The Rust-native library — already at 8.3k GitHub stars — pairs Google Research's TurboQuant quantisation algorithm with hand-written SIMD kernels and first-class Python bindings, compressing a 10-million-document corpus from 31 GB (float32) down to 4 GB while searching it faster than FAISS.

What TurboQuant Actually Does

Most quantisation schemes, including FAISS's ProductQuantizer, require an offline training phase to build a codebook from your data. TurboQuant is data-oblivious: it achieves near-optimal distortion, within a small constant factor of the Shannon lower bound, without ever training on your data. That means there's no train step, no parameter tuning, and no forced rebuild when your corpus grows. You add vectors and they're immediately searchable.
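To make "data-oblivious" concrete, here is a minimal sketch of a quantiser that needs no training pass. It is a simplified stand-in, not TurboQuant itself: it applies a random rotation and then snaps every coordinate to a fixed uniform grid, whereas the real algorithm uses more carefully designed per-coordinate quantisers. The function names (`random_rotation`, `quantise`, `dequantise`) and the clipping range are assumptions made for illustration.

```python
import numpy as np


def random_rotation(dim: int, seed: int = 0) -> np.ndarray:
    """Draw a random orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Fix the column signs so the result is uniformly distributed.
    return q * np.sign(np.diag(r))


def quantise(x: np.ndarray, rot: np.ndarray, bits: int = 4) -> np.ndarray:
    """Rotate unit vectors, then snap each coordinate to a fixed grid."""
    dim = x.shape[1]
    levels = 2 ** bits
    # After rotation, coordinates of a unit vector concentrate around 0 with
    # spread ~1/sqrt(dim), so the grid range is chosen from dim alone and
    # never from the data (assumed range: three standard deviations).
    half_range = 3.0 / np.sqrt(dim)
    y = np.clip(x @ rot.T, -half_range, half_range)
    codes = np.round((y + half_range) / (2 * half_range) * (levels - 1))
    return codes.astype(np.uint8)


def dequantise(codes: np.ndarray, rot: np.ndarray, bits: int = 4) -> np.ndarray:
    """Map codes back to grid values and undo the rotation."""
    dim = codes.shape[1]
    levels = 2 ** bits
    half_range = 3.0 / np.sqrt(dim)
    y = codes.astype(np.float32) / (levels - 1) * 2 * half_range - half_range
    return y @ rot
```

Because the grid depends only on `dim` and `bits`, a new vector can be encoded the moment it arrives, which is the property that removes the train step and the rebuild-on-growth problem.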

In recall benchmarks on 100K vectors at k=64, TurboQuant beats FAISS IndexPQ (LUT256, nbits=8) by 0.4–3.4 points at R@1 across OpenAI's d=1536 and d=3072 embedding dimensions at both 2-bit and 4-bit widths. High-dimensional embeddings are where TurboQuant's asymptotic Beta distribution assumption holds tightest; lower-dimensional spaces like GloVe d=200 show narrower (and occasionally reversed) margins at 2-bit.

The SIMD Story

Quantisation only helps if the distance computation stays fast. TurboVec ships hand-written NEON kernels for ARM and AVX-512BW kernels for x86. Against FAISS IndexPQFastScan — the current fastest production-grade PQ scan in FAISS — TurboVec is 12–20% faster on ARM and matches or beats it on x86.

Filtered search is implemented directly inside the SIMD loop rather than as a post-processing step. Passing an allowlist of external IDs (or a slot bitmask) to search() causes the kernel to short-circuit entire 32-vector blocks with no allowed slots before touching any LUT lookup or scoring work. Selective filters therefore skip most SIMD cost rather than paying it and discarding results:

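from turbovec import IdMapIndex
import numpy as np

# Assumes `vectors`, `ids`, `query`, `db`, and tenant `t` already exist.
idx = IdMapIndex(dim=1536, bit_width=4)
idx.add_with_ids(vectors, ids)

# Narrow to a tenant's documents via SQL, BM25, ACL, etc.
# fetchall() returns 1-tuples, so unpack each row to get a flat 1-D array.
rows = db.execute("SELECT id FROM docs WHERE tenant=?", (t,)).fetchall()
allowed = np.array([row[0] for row in rows], dtype=np.uint64)

# Use a new name for the hits so the input `ids` array is not overwritten.
scores, found_ids = idx.search(query, k=10, allowlist=allowed)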

The result length is min(k, len(allowed)) — no over-fetching, no padded fallbacks when the allowlist is smaller than k.
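The block-skipping idea can be illustrated in plain NumPy. This is a sketch of the control flow only, assuming brute-force inner-product scoring on uncompressed vectors; it is not TurboVec's kernel, and `filtered_search` and its return values are hypothetical names. The 32-vector block size comes from the description above.

```python
import numpy as np

BLOCK = 32  # block size taken from the article's description


def filtered_search(vectors, query, k, allowed_slots):
    """Score only the blocks that contain at least one allowed slot."""
    n = vectors.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[allowed_slots] = True

    cand_scores, cand_slots = [], []
    blocks_scored = 0
    for start in range(0, n, BLOCK):
        block_mask = mask[start:start + BLOCK]
        if not block_mask.any():
            continue  # short-circuit: no scoring work for this block
        blocks_scored += 1
        scores = vectors[start:start + BLOCK] @ query
        keep = np.flatnonzero(block_mask)
        cand_scores.append(scores[keep])
        cand_slots.append(start + keep)

    if not cand_scores:
        empty = np.empty(0)
        return empty, np.empty(0, dtype=np.int64), 0
    scores = np.concatenate(cand_scores)
    slots = np.concatenate(cand_slots)
    top = np.argsort(-scores)[:k]  # at most min(k, number of allowed slots)
    return scores[top], slots[top], blocks_scored
```

With three allowed slots that fall into two blocks, only those two blocks are scored and the result holds three hits even when `k=10`, mirroring the `min(k, len(allowed))` behaviour described above.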

IdMapIndex also supports O(1) deletes by external uint64 ID, which TurboQuantIndex (the slot-addressed variant) doesn't provide.
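One common way to get O(1) deletes by external ID is a hash map combined with swap-remove. The article does not say how TurboVec implements this, so the sketch below is an assumption about the general technique, and the `IdMap` class is hypothetical.

```python
class IdMap:
    """Maps external IDs to dense slots; deletes in O(1) via swap-remove."""

    def __init__(self):
        self.slot_of = {}  # external id -> slot
        self.id_at = []    # slot -> external id

    def add(self, ext_id: int) -> int:
        if ext_id in self.slot_of:
            raise KeyError(f"duplicate id {ext_id}")
        self.slot_of[ext_id] = len(self.id_at)
        self.id_at.append(ext_id)
        return self.slot_of[ext_id]

    def remove(self, ext_id: int) -> None:
        slot = self.slot_of.pop(ext_id)
        last_id = self.id_at.pop()
        if last_id != ext_id:
            # Move the last entry into the freed slot so slots stay dense.
            # A real index would move that entry's vector codes as well.
            self.id_at[slot] = last_id
            self.slot_of[last_id] = slot
```

Keeping slots dense matters for a blocked scan: there are no tombstones to skip, at the cost of slot numbers no longer being stable across deletes, which is why callers address vectors by external ID.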

Dropping Into Your Existing Stack

TurboVec ships optional extras that act as drop-in replacements for the default in-memory vector stores in the major Python RAG frameworks:

| Framework | Install | Replaces |
|---|---|---|
| LangChain | pip install turbovec[langchain] | InMemoryVectorStore |
| LlamaIndex | pip install turbovec[llama-index] | SimpleVectorStore |
| Haystack | pip install turbovec[haystack] | InMemoryDocumentStore |
| Agno | pip install turbovec[agno] | LanceDb |

Same public surface, same persistence semantics — swap the import and keep the rest of your pipeline intact.

When It Makes Sense

TurboVec occupies a specific niche: high-throughput similarity search at single-machine scale, particularly where memory budget, latency, or data privacy rules out a hosted vector DB. The pure-local design means no data leaves your machine or VPC, making it a natural fit for air-gapped RAG stacks paired with local embedding models. The tradeoff is that you take on the operational responsibility a managed service would otherwise handle: replication, persistence beyond a single node, and distributed search across shards are all out of scope here.

For teams already fighting memory pressure on a large embedding corpus, the 7–8× compression ratio from 4-bit quantisation alone may justify the switch from a float32 in-memory store.
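The headline numbers can be sanity-checked with back-of-the-envelope arithmetic. The article does not state the embedding dimension behind the 31 GB figure; the sketch below assumes d=768, which happens to reproduce it, and counts only the raw vector payload.

```python
def index_bytes(n_vectors: int, dim: int, bits_per_dim: float) -> float:
    """Raw payload size, ignoring per-vector metadata and index overhead."""
    return n_vectors * dim * bits_per_dim / 8


n, dim = 10_000_000, 768  # dim is an assumption; the article does not state it
f32 = index_bytes(n, dim, 32)
q4 = index_bytes(n, dim, 4)

print(f"float32: {f32 / 1e9:.1f} GB")  # float32: 30.7 GB
print(f"4-bit:   {q4 / 1e9:.1f} GB")   # 4-bit:   3.8 GB
print(f"ratio:   {f32 / q4:.0f}x")     # ratio:   8x
```

The payload alone gives exactly 8×. Any per-vector metadata stored alongside the codes would pull the effective ratio slightly lower, which is presumably why the quoted range is 7–8×.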
