Serve an Open-Source LLM at Scale with vLLM on a Rented GPU Instance
Go from a bare cloud VM to a production-ready, OpenAI-compatible inference server in under an hour, using vLLM's continuous batching to hit thousands of output tokens per second on a single GPU.
What You'll Build
You'll deploy Llama 3.1 8B behind vLLM's OpenAI-compatible API on a rented GPU instance, then verify that continuous batching actually delivers an order-of-magnitude throughput advantage over sequential serving.
Prerequisites
- A cloud GPU instance with at least one NVIDIA A10G (24 GB VRAM). Lambda Labs (~$0.80/hr for A10), RunPod, and Vast.ai all work. An A100 40/80 GB handles larger models or higher concurrency.
- Ubuntu 22.04, CUDA 12.1+ (standard on ML-optimized images). Confirm with nvidia-smi.
- Python 3.10 or 3.11. vLLM 0.6.x is not fully validated on 3.12 yet.
- A Hugging Face account with the Meta Llama 3.1 license accepted at hf.co/meta-llama/Meta-Llama-3.1-8B-Instruct, plus a read-scoped access token.
- SSH access to the instance.
GPU driver installation and VPC networking are out of scope here.
1. Prepare the VM
Verify the GPU and CUDA runtime are visible before touching Python:
nvidia-smi
python3 --version
Create an isolated environment:
python3 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
pip install --upgrade pip
2. Install vLLM
pip install vllm
This pulls PyTorch 2.4+ compiled for CUDA 12.1 along with vLLM's paged-attention kernels. Expect 3-5 minutes and roughly 4 GB of downloads. Confirm the install:
python -c "import vllm; print(vllm.__version__)"
If your image ships CUDA 11.8 (uncommon on current offerings), the default vLLM 0.6.x wheel from PyPI will not work, because it is built against CUDA 12. Use the official vLLM Docker image instead: docker pull vllm/vllm-openai:latest.
3. Authenticate with Hugging Face
export HF_TOKEN="hf_YOUR_TOKEN_HERE"
Add this to ~/.profile for persistence, or store it in your cloud provider's secrets manager. Never commit tokens to source control or bake them into Docker layers.
4. Start the Inference Server
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--served-model-name llama3
Key flags at a glance:
| Flag | Effect |
|---|---|
| --tensor-parallel-size | Shards the model across N GPUs. Set to 2 on a dual-A10G node for near-linear throughput scaling. |
| --gpu-memory-utilization | Fraction of each GPU's VRAM that vLLM may claim for weights, activations, and the KV cache combined; whatever is left after loading the model goes to the KV cache. Leave headroom; 0.90 works well for most cases. |
| --max-model-len | Caps total sequence length (prompt plus generated tokens). Llama 3.1 supports 128k natively, but a 128k-token KV cache takes about 16 GiB at BF16, which does not fit on 24 GB alongside roughly 16 GB of weights. |
| --served-model-name | The model ID clients send in requests; decouples your API surface from the HF repo path. |
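The --max-model-len trade-off can be checked with a little arithmetic. The sketch below estimates the KV cache needed for one sequence, assuming Llama 3.1 8B's published architecture (32 layers, 8 key/value heads, head dimension 128) and 2-byte BF16 entries. The function name kv_cache_bytes is made up for this example and is not part of vLLM.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Estimate KV cache bytes for a single sequence of seq_len tokens.

    Defaults are assumptions matching Llama 3.1 8B in BF16.
    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len


GIB = 1024 ** 3

for tokens in (8192, 131072):
    print(f"{tokens:>6} tokens -> {kv_cache_bytes(tokens) / GIB:.1f} GiB")
```

Under these assumptions, one full 8,192-token sequence needs about 1 GiB of cache, while a 131,072-token sequence needs about 16 GiB before any weights are loaded.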
Weights download on first run (~16 GB for BF16). Subsequent starts read from ~/.cache/huggingface. The server is ready when you see:
INFO: Application startup complete.
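If you would rather script the wait than watch the log, a minimal readiness poll could look like the sketch below. It assumes the server from this step listens on localhost:8000 and registers the name llama3. It uses only the standard library and the /v1/models endpoint. The helper names fetch_model_ids and wait_until_ready are hypothetical, not part of vLLM.

```python
import json
import time
import urllib.request


def fetch_model_ids(base_url, timeout=5.0):
    """Return the model IDs listed by GET {base_url}/v1/models."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [model["id"] for model in payload.get("data", [])]


def wait_until_ready(base_url, model_id, deadline_s=600.0, interval_s=5.0,
                     fetch=fetch_model_ids, sleep=time.sleep):
    """Poll until model_id is served; return False once deadline_s of waiting has passed."""
    waited = 0.0
    while waited <= deadline_s:
        try:
            if model_id in fetch(base_url):
                return True
        except (OSError, ValueError):
            # Connection refused, timeout, HTTP error, or a non-JSON body:
            # treat all of these as "not ready yet" and retry.
            pass
        sleep(interval_s)
        waited += interval_s
    return False
```

Calling wait_until_ready("http://localhost:8000", "llama3") from a deploy script blocks until the model is registered or roughly ten minutes have passed. The fetch and sleep parameters exist only so the loop can be tested without a live server.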
5. Send Your First Request
From a second terminal (no venv needed for curl):
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3",
"messages": [{"role": "user", "content": "What is PagedAttention?"}],
"max_tokens": 150,
"temperature": 0.7
}'
The response schema is identical to OpenAI's. Point any OpenAI SDK at the server by changing base_url:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="ignored")
response = client.chat.completions.create(
model="llama3",
messages=[{"role": "user", "content": "Write a haiku about GPU memory."}],
max_tokens=60,
)
print(response.choices[0].message.content)
api_key can be any non-empty string by default. See step 7 for enforcing real authentication.
6. Load Test: Concurrent Requests
vLLM's continuous batching combines in-flight requests into a single forward pass on every scheduling step, rather than waiting to fill a static batch. The throughput difference is dramatic. Run this to see it:
# bench.py
import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="ignored")
PROMPT = "Summarize the history of neural networks in three sentences."
CONCURRENCY, TOTAL = 50, 200

async def one_request(sem):
    async with sem:
        t0 = time.monotonic()
        r = await client.chat.completions.create(
            model="llama3",
            messages=[{"role": "user", "content": PROMPT}],
            max_tokens=100,
        )
        return time.monotonic() - t0, r.usage.completion_tokens

async def main():
    sem = asyncio.Semaphore(CONCURRENCY)
    t_start = time.monotonic()
    results = await asyncio.gather(*[one_request(sem) for _ in range(TOTAL)])
    elapsed = time.monotonic() - t_start
    total_tok = sum(t for _, t in results)
    print(f"Requests: {TOTAL} | Wall time: {elapsed:.1f}s")
    print(f"Aggregate throughput: {total_tok / elapsed:.0f} tokens/sec")
    print(f"Avg latency: {sum(l for l, _ in results) / TOTAL:.2f}s")

asyncio.run(main())
pip install openai
python bench.py
A single A10G running Llama 3.1 8B in BF16 typically reaches 1,200-2,000 aggregate output tokens/sec under this load. Sequential, one-at-a-time serving on the same hardware delivers roughly 35-45 tokens/sec, because single-stream decoding is memory-bandwidth bound: every generated token requires reading all ~16 GB of weights from VRAM, and batching amortizes that read across every sequence in the step.
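A back-of-the-envelope estimate helps sanity-check the sequential figure. The sketch below assumes decoding one token streams every weight from VRAM once, and that the A10G offers about 600 GB/s of memory bandwidth (NVIDIA's published spec; real achievable bandwidth is lower). The function name is made up for this example, and the result is a rough ceiling, not a measurement.

```python
def decode_ceiling_tokens_per_sec(params_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound on single-stream decode speed when weight reads dominate.

    Each generated token is assumed to read all weights once, so the
    ceiling is memory bandwidth divided by total weight size.
    """
    weight_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / weight_gb


# Assumed inputs: 8B parameters, 2 bytes each (BF16), 600 GB/s bandwidth.
single_stream = decode_ceiling_tokens_per_sec(8.0, 2, 600.0)
print(f"Single-stream ceiling: {single_stream:.1f} tokens/sec")  # 37.5
```

A ceiling of about 37.5 tokens/sec lines up with the 35-45 range quoted above. Batched serving lifts aggregate throughput because one pass over the weights produces a token for every sequence in the batch.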
7. Production Hardening
Run vLLM as a systemd service so it survives SSH disconnects:
# /etc/systemd/system/vllm.service
[Unit]
Description=vLLM inference server
After=network.target
[Service]
Type=simple
User=ubuntu
Environment="HF_TOKEN=hf_YOUR_TOKEN_HERE"
Environment="PATH=/home/ubuntu/vllm-env/bin:/usr/bin:/bin"
ExecStart=/home/ubuntu/vllm-env/bin/vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 127.0.0.1 \
--port 8000 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--served-model-name llama3
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
The server binds to 127.0.0.1 here, not 0.0.0.0. Never expose port 8000 directly to the internet without authentication. The simplest secure access pattern from a dev machine:
ssh -L 8000:localhost:8000 ubuntu@your-gpu-host
For production, put nginx in front with TLS termination and auth_basic or a JWT validation block, or use Caddy with an API key middleware.
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable --now vllm
sudo journalctl -fu vllm
Verify It Works
Query the models endpoint to confirm the server registered correctly:
curl http://localhost:8000/v1/models | python3 -m json.tool
Expected output contains "id": "llama3" and "object": "model". The periodic stats lines in journalctl report generation throughput, running and pending request counts, and GPU KV cache usage. Watch nvidia-smi dmon in a separate pane to confirm the GPU is saturated during the load test.
Troubleshooting
torch.cuda.OutOfMemoryError or CUDA out of memory on startup. The weights plus the requested KV cache exceeded available VRAM. Lower --gpu-memory-utilization to 0.80, or reduce --max-model-len (halving it halves the KV cache needed to hold one full-length sequence). A 128k context at BF16 needs about 16 GiB of KV cache on top of roughly 16 GB of weights, which cannot fit on a 24 GB card.
High latency at low concurrency. Continuous batching optimizes for aggregate throughput, not time-to-first-token on isolated requests. For latency-sensitive single-request workloads, set --max-num-seqs 1 to disable multi-request batching.
OSError: You are trying to access a gated repo. Either HF_TOKEN is not set in the current shell (echo $HF_TOKEN to verify), or your account has not accepted the Llama 3.1 license on Hugging Face. The acceptance must be done on the model page, not just in account settings.
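Garbled or truncated outputs. Check that --max-model-len is larger than your prompt token count plus max_tokens. By default vLLM rejects a request whose combined length exceeds the configured limit with an HTTP 400 error rather than truncating the prompt, so an output that stops mid-sentence usually means generation hit max_tokens; look for finish_reason "length" in the response.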
Next Steps
- Quantization: --quantization fp8 runs on-the-fly FP8 quantization (vLLM 0.5+) and fits larger models into the same VRAM. For AWQ, you need a pre-quantized checkpoint from a community hub like Hugging Face; then pass --quantization awq pointing at that repo.
- Multi-GPU tensor parallelism: --tensor-parallel-size 2 on a dual-A100 node gives near-linear throughput scaling with no code changes.
- Structured outputs: --guided-decoding-backend outlines enforces JSON Schema constraints on generation, useful for tool-calling pipelines.
- Metrics: vLLM exposes a Prometheus-compatible endpoint at GET /metrics. Scrape it with a Grafana agent to track KV cache hit rates, queue depth, and inter-token latency percentiles in real time.
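Before wiring up a full scraper, it can help to see what the /metrics payload looks like. The sketch below parses Prometheus text exposition format with the standard library. The SAMPLE string is illustrative rather than captured from a real server, the metric names in it are assumptions that may differ across vLLM versions, and parse_prometheus is a hypothetical helper that assumes no trailing timestamps.

```python
def parse_prometheus(text):
    """Map each metric name to the list of sample values found in the text.

    Comment lines (# HELP, # TYPE) and blank lines are skipped. Labels are
    discarded; the value is taken as the last space-separated field.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        samples.setdefault(name, []).append(float(value))
    return samples


# Illustrative payload only; real metric names depend on the vLLM version.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="llama3"} 12.0
vllm:num_requests_waiting{model_name="llama3"} 3.0
vllm:gpu_cache_usage_perc{model_name="llama3"} 0.42
"""

metrics = parse_prometheus(SAMPLE)
print(metrics["vllm:num_requests_waiting"])  # [3.0]
```

Against a live server, the same function could be fed the body returned by GET /metrics on port 8000.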
Priya covers AI frameworks, developer productivity tooling, and the startup ecosystem across South and Southeast Asia, bringing a researcher's rigour and a practitioner's empathy to every story. She is deeply sceptical of benchmarks and asks hard questions so her readers don't have to.