Profile and Fix a Slow Python Service Using py-spy Flame Graphs
Attach py-spy to a live process with zero code changes and no restart, read the flame graph to pinpoint an O(n²) bottleneck, then apply the fix and verify a 1000x+ speedup.
What you'll build
You'll attach py-spy to a running Python process without editing a single line of application code, generate an interactive SVG flame graph, identify the hot function visually, apply a targeted algorithmic fix, and confirm the speedup with timeit.
Prerequisites
| Requirement | Version / Notes |
|---|---|
| Python | 3.8+ (CPython only — not PyPy) |
| py-spy | 0.3+ |
| OS | Linux or macOS; Windows is supported but has limitations |
| Browser | Any modern browser to open the SVG |
Install py-spy:
pip install py-spy
py-spy --version # should print py-spy 0.3.x
Linux: py-spy needs ptrace-level access to read another process's memory, so attaching to a running process usually requires sudo. macOS: py-spy always has to run as root.
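If you would rather confirm the install from Python, for example inside a deploy or runbook script, a minimal sketch could look like the following. The helper name py_spy_version is made up for illustration; it only shells out to the same py-spy --version command shown above.

```python
import shutil
import subprocess


def py_spy_version():
    """Return the output of `py-spy --version`, or None if py-spy is not on PATH.

    Hypothetical helper for illustration; it is not part of py-spy.
    """
    exe = shutil.which("py-spy")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    # An empty string is treated the same as "not found".
    return result.stdout.strip() or None


if __name__ == "__main__":
    version = py_spy_version()
    print(version if version else "py-spy not found on PATH")
```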
Step 1 — Write a realistically slow service
Save this as slow_service.py. It finds duplicates with a classic O(n²) nested loop:
# slow_service.py
import random


def find_duplicates(items):
    duplicates = []
    for i, item in enumerate(items):
        for j in range(len(items)):
            if i != j and items[j] == item and item not in duplicates:
                duplicates.append(item)
    return duplicates


def process_batch():
    data = [random.randint(0, 2500) for _ in range(5000)]
    return find_duplicates(data)


def run():
    batch = 0
    while True:
        result = process_batch()
        batch += 1
        print(f"Batch {batch}: {len(result)} duplicates")


if __name__ == "__main__":
    run()
Start it in one terminal — each batch takes several seconds:
python slow_service.py
Step 2 — Attach py-spy and record the flame graph
Open a second terminal and find the PID:
pgrep -f slow_service.py
Record 30 seconds of samples and write an SVG flame graph:
# Linux typically needs sudo for --pid; macOS always needs root
sudo py-spy record -o profile.svg --pid <PID> --duration 30
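py-spy samples the call stack 100 times per second by default, reading the target process's memory from the outside. By default it pauses the target very briefly for each sample so that the stack it reads is consistent; pass --nonblocking to skip the pause, at the cost of occasionally inaccurate stacks. When recording finishes, open profile.svg in your browser.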
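To make the idea of sampling concrete, here is a toy sampler written in pure Python. It is only an illustration and differs from py-spy in an important way: it runs inside the profiled process and uses sys._current_frames(), while py-spy reads memory from a separate process. The function names (sample_stacks, busy_work, profile_busy_work) are invented for this sketch.

```python
import collections
import sys
import threading
import time


def sample_stacks(thread_id, counts, stop_event, interval=0.01):
    """Every `interval` seconds, record the call stack of one thread."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(thread_id)
        names = []
        while frame is not None:
            names.append(frame.f_code.co_name)
            frame = frame.f_back
        if names:
            # Root first, leaf last, joined with ";" (the "collapsed stack" style).
            counts[";".join(reversed(names))] += 1
        time.sleep(interval)


def busy_work(deadline):
    """Burn CPU until the deadline so the sampler has something to see."""
    total = 0
    while time.perf_counter() < deadline:
        total += 1
    return total


def profile_busy_work(seconds=0.3):
    counts = collections.Counter()
    stop_event = threading.Event()
    sampler = threading.Thread(
        target=sample_stacks,
        args=(threading.get_ident(), counts, stop_event),
        daemon=True,
    )
    sampler.start()
    busy_work(time.perf_counter() + seconds)
    stop_event.set()
    sampler.join()
    return counts


if __name__ == "__main__":
    for stack, hits in profile_busy_work().most_common(3):
        print(hits, stack)
```

Most samples should end in busy_work, because that is where the thread spends its time. A flame graph is a picture of exactly this kind of counter.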
Alternatively, launch and profile in one command:
py-spy record -o profile.svg --duration 30 -- python slow_service.py
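For scripted captures, such as an incident runbook, the record command can be assembled in Python. This is a sketch under assumptions: build_record_command and record are hypothetical helper names, and the only py-spy flags used are -o, --pid, --duration, and --rate.

```python
import subprocess


def build_record_command(pid, output="profile.svg", duration=30, rate=100):
    """Build the argument list for `py-spy record` against a running process."""
    return [
        "py-spy", "record",
        "-o", output,
        "--pid", str(pid),
        "--duration", str(duration),
        "--rate", str(rate),  # samples per second; 100 is py-spy's default
    ]


def record(pid, use_sudo=False, **options):
    """Run py-spy and return its exit code (hypothetical wrapper)."""
    command = build_record_command(pid, **options)
    if use_sudo:
        command = ["sudo", *command]
    return subprocess.run(command, check=False).returncode
```

Passing the arguments as a list rather than a shell string avoids quoting problems if the output path contains spaces.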
Step 3 — Read the flame graph
A py-spy flame graph encodes two things:
| Axis | Meaning |
|---|---|
| Width (x) | Proportion of CPU samples — wider = more time spent there |
| Height (y) | Call stack depth: each stack runs from the entry point at the root to the currently executing function at the leaf. py-spy's SVG is typically drawn inverted (icicle style), with the root at the top |
How to spot the bottleneck: Ignore wide frames near the root, like run and process_batch. They are wide only because all execution flows through them. Look for the widest frame at the leaf end of a stack, with nothing wide stacked beyond it; that is what's actually burning CPU.
In this flame graph, find_duplicates nearly fills the x-axis and appears as a wide, flat leaf frame. py-spy profiles Python-level call stacks by default, and the built-in operations inside find_duplicates (the range(len(items)) iteration, the item not in duplicates list scan, and list.append) are implemented in C and do not push Python frames, so they never show up as separate frames. Their time is folded into the find_duplicates frame itself. That single wide frame is your hot path.
Tip: If you want C-extension and native frames to appear separately (e.g., to profile NumPy internals), pass --native to py-spy. Without --native, only Python frames are captured.
Click any frame in the SVG to zoom into that subtree; hover to see exact sample counts and percentages.
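The "wide base versus wide leaf" distinction is the difference between total time and self time. The sketch below computes both from collapsed-stack lines of the form "frame;frame;frame count". The sample lines are invented and simplified for illustration; real collapsed output (for example from py-spy record --format raw) also carries file names and line numbers in each frame.

```python
from collections import Counter


def self_and_total(folded_lines):
    """Return (self_counts, total_counts) per frame from collapsed stacks."""
    self_counts = Counter()
    total_counts = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # The sample count is whatever follows the last space.
        stack, _, count_text = line.rpartition(" ")
        count = int(count_text)
        frames = stack.split(";")
        self_counts[frames[-1]] += count   # only the leaf was executing
        for frame in set(frames):          # count recursive frames once
            total_counts[frame] += count
    return self_counts, total_counts


# Made-up sample data, not real py-spy output.
SAMPLE = [
    "<module>;run;process_batch;find_duplicates 2940",
    "<module>;run;process_batch;<listcomp>;randint 45",
    "<module>;run;process_batch 15",
]

self_counts, total_counts = self_and_total(SAMPLE)
print("total:", total_counts.most_common(3))
print("self: ", self_counts.most_common(3))
```

In this sample, run has the highest total count but zero self count, while find_duplicates has nearly all of the self count. That is the pattern to look for in the SVG.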
Step 4 — Fix the hot path
Replace the O(n²) implementation with a single-pass O(n) version:
def find_duplicates(items):
    seen = set()
    duplicates = set()
    for item in items:
        if item in seen:
            duplicates.add(item)
        seen.add(item)
    return list(duplicates)
Two changes drive the speedup:
- One loop instead of nested loops eliminates the O(n²) iteration count.
- item in seen is O(1) on average for a set, versus O(n) for a list.
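The second point is easy to check in isolation. This small sketch times a membership test for a value that is absent, which is the worst case for a list because every element gets compared. The function name membership_times and the sizes are arbitrary choices for illustration.

```python
import timeit


def membership_times(n=5000, repeats=2000):
    """Time `x in container` for a list and a set holding the same n integers."""
    as_list = list(range(n))
    as_set = set(as_list)
    missing = -1  # not present, so the list scan visits every element
    list_time = timeit.timeit(lambda: missing in as_list, number=repeats)
    set_time = timeit.timeit(lambda: missing in as_set, number=repeats)
    return list_time, set_time


if __name__ == "__main__":
    list_time, set_time = membership_times()
    print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

The exact numbers depend on the machine, but the list time should grow roughly linearly with n while the set time stays about flat.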
Step 5 — Benchmark the speedup
# benchmark.py
import timeit
import random

DATA = [random.randint(0, 2500) for _ in range(5000)]


def find_duplicates_slow(items):
    duplicates = []
    for i, item in enumerate(items):
        for j in range(len(items)):
            if i != j and items[j] == item and item not in duplicates:
                duplicates.append(item)
    return duplicates


def find_duplicates_fast(items):
    seen, duplicates = set(), set()
    for item in items:
        if item in seen:
            duplicates.add(item)
        seen.add(item)
    return list(duplicates)


slow = timeit.timeit(lambda: find_duplicates_slow(DATA), number=5)
fast = timeit.timeit(lambda: find_duplicates_fast(DATA), number=5)
print(f"Slow: {slow:.2f}s | Fast: {fast:.4f}s | Speedup: {slow/fast:.0f}x")
python benchmark.py
Verify it works
Expected output (numbers vary by hardware; each time is the total for 5 runs):
Slow: 12.41s | Fast: 0.0031s | Speedup: 4003x
The patched service should now print batches nearly instantly. Re-run py-spy on the updated service: find_duplicates should shrink to a narrow sliver, and the random.randint calls made from process_batch will likely dominate instead, which confirms that the original bottleneck is gone.
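Speed is only half of the verification; the fast version should also return the same duplicates. One difference to be aware of: the slow version returns values in first-seen order, while the fast version returns them in arbitrary set order, so the check below compares sorted results. The helper same_duplicates and the test inputs are illustrative choices.

```python
import random


def find_duplicates_slow(items):
    duplicates = []
    for i, item in enumerate(items):
        for j in range(len(items)):
            if i != j and items[j] == item and item not in duplicates:
                duplicates.append(item)
    return duplicates


def find_duplicates_fast(items):
    seen, duplicates = set(), set()
    for item in items:
        if item in seen:
            duplicates.add(item)
        seen.add(item)
    return list(duplicates)


def same_duplicates(items):
    """True if both implementations agree, ignoring order."""
    return sorted(find_duplicates_slow(items)) == sorted(find_duplicates_fast(items))


if __name__ == "__main__":
    random.seed(0)  # fixed seed so the check is repeatable
    sample = [random.randint(0, 50) for _ in range(200)]
    for case in ([], [1, 2, 3], [3, 1, 3, 2, 1], sample):
        assert same_duplicates(case)
    print("fast and slow versions agree")
```

If callers depend on first-seen order, keep that in mind before swapping the implementation.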
Troubleshooting
Operation not permitted on Linux
py-spy needs the ptrace capability. Run with sudo, or grant it permanently to the binary:
sudo setcap cap_sys_ptrace=eip $(which py-spy)
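On many Linux distributions the restriction comes from the Yama security module, whose ptrace_scope setting decides who may attach to whom. The sketch below reads that setting and explains it. The function names are made up for this example, and the file is simply absent on systems without Yama.

```python
from pathlib import Path

PTRACE_SCOPE_PATH = Path("/proc/sys/kernel/yama/ptrace_scope")

# Meanings as described in the Linux kernel's Yama documentation.
MEANINGS = {
    0: "classic: a process can attach to other processes of the same user",
    1: "restricted: only a parent can attach to its descendants without extra privileges",
    2: "admin-only: attaching requires CAP_SYS_PTRACE",
    3: "no attach: ptrace attach is disabled until reboot",
}


def read_ptrace_scope(path=PTRACE_SCOPE_PATH):
    """Return the current ptrace_scope as an int, or None if unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def describe_ptrace_scope(value):
    if value is None:
        return "Yama ptrace_scope is not available on this system"
    return MEANINGS.get(value, f"unknown value {value}")


if __name__ == "__main__":
    print(describe_ptrace_scope(read_ptrace_scope()))
```

A value of 1 would explain why attaching with --pid fails without sudo while the launch form works: in the launch form the profiled process is a child of py-spy.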
Permission denied inside Docker
Docker drops ptrace by default. Start your container with:
docker run --cap-add SYS_PTRACE ...
Avoid --privileged unless you have no alternative — it grants far more than you need.
Flame graph SVG is blank or shows only one frame
The process likely finished before enough samples were collected. Ensure the service is inside its hot loop during the recording window, or use the -- python slow_service.py launch form so py-spy owns the process lifetime.
macOS attach fails even with sudo
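System Integrity Protection stops even root from reading the memory of binaries under /usr/bin, which includes the Python that ships with macOS. Run the service with an interpreter installed elsewhere (Homebrew, pyenv, or a virtualenv), or try the launch form instead of --pid:
sudo py-spy record -o profile.svg --duration 30 -- python slow_service.py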
Next steps
- py-spy top: live htop-style view of hot functions with no output file; ideal for rapid triage on a running server.
- Speedscope format: pass --format speedscope to produce a JSON file for speedscope.app, which offers left-heavy and sandwich views that py-spy's SVG doesn't.
- Native extensions: add --native to include C/C++ frames from NumPy, pandas, or your own Cython code in the same flame graph.
- Continuous profiling: integrate py-spy with Pyroscope for always-on production profiling with time-series flame graph storage.
Mariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.