Profile and Fix a Slow Python Service Using py-spy Flame Graphs

Attach py-spy to a live process with zero code changes, read the flame graph to pinpoint an O(n²) bottleneck, and verify a 1000x+ speedup. Profiling needs no restart; only deploying the fix does.

Mariana Souza

Senior Editor · Jun 21, 2026 · 6 min read

What you'll build

You'll attach py-spy to a running Python process without editing a single line of application code, generate an interactive SVG flame graph, identify the hot function visually, apply a targeted algorithmic fix, and confirm the speedup with timeit.

Prerequisites

Requirement	Version / Notes
Python	3.8+ (CPython only — not PyPy)
py-spy	0.3+
OS	Linux or macOS; Windows is supported but has limitations
Browser	Any modern browser to open the SVG

Install py-spy:

pip install py-spy
py-spy --version   # should print py-spy 0.3.x or later

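Permissions: py-spy reads another process's memory, which the OS restricts. On Linux, attaching to an already-running process usually requires sudo; whether the launch form also needs it depends on your system's ptrace settings. On macOS, py-spy always has to run as root.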

Step 1 — Write a realistically slow service

Save this as slow_service.py. It finds duplicates with a classic O(n²) nested loop:

# slow_service.py
import random

def find_duplicates(items):
    duplicates = []
    for i, item in enumerate(items):
        for j in range(len(items)):
            if i != j and items[j] == item and item not in duplicates:
                duplicates.append(item)
    return duplicates

def process_batch():
    data = [random.randint(0, 2500) for _ in range(5000)]
    return find_duplicates(data)

def run():
    batch = 0
    while True:
        result = process_batch()
        batch += 1
        print(f"Batch {batch}: {len(result)} duplicates")

if __name__ == "__main__":
    run()

Start it in one terminal — each batch takes several seconds:

python slow_service.py

Step 2 — Attach py-spy and record the flame graph

Open a second terminal and find the PID:

pgrep -f slow_service.py

Record 30 seconds of samples and write an SVG flame graph:

# Linux (typically needs sudo)
sudo py-spy record -o profile.svg --pid <PID> --duration 30

# macOS (always needs root)
sudo py-spy record -o profile.svg --pid <PID> --duration 30

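py-spy samples the call stack 100 times per second by default, reading the target process's memory from outside the process. By default it briefly pauses the target for each sample so that the stack it reads is consistent; pass --nonblocking to skip the pause, at the cost of occasionally inaccurate samples. When recording finishes, open profile.svg in your browser.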
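To make the idea of stack sampling concrete, here is a minimal in-process sketch. It is not how py-spy is implemented: py-spy reads another process's memory from the outside, whereas this toy uses sys._current_frames() from a second thread inside the same interpreter. The names sample_stacks and busy_loop are hypothetical and exist only for this illustration.

```python
import collections
import sys
import threading
import time


def sample_stacks(target_thread_id, duration_s=0.5, rate_hz=100):
    """Collect call-stack samples of one thread at roughly rate_hz."""
    counts = collections.Counter()
    interval = 1.0 / rate_hz
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        # Snapshot of the innermost frame of every thread in this interpreter.
        frame = sys._current_frames().get(target_thread_id)
        stack = []
        while frame is not None:
            stack.append(frame.f_code.co_name)
            frame = frame.f_back
        if stack:
            # Store as "root;...;leaf", the usual folded-stack layout.
            counts[";".join(reversed(stack))] += 1
        time.sleep(interval)
    return counts


def busy_loop(stop_event):
    # Hypothetical CPU-bound workload standing in for the service.
    total = 0
    while not stop_event.is_set():
        total += sum(range(1000))
    return total


stop = threading.Event()
worker = threading.Thread(target=busy_loop, args=(stop,), daemon=True)
worker.start()
samples = sample_stacks(worker.ident, duration_s=0.5)
stop.set()
worker.join()

for stack, count in samples.most_common(3):
    print(f"{count:4d}  {stack}")
```

Each distinct stack gets a count, and those counts are what a flame graph draws as widths. A sampler like this only sees Python frames, which is also why C-level work such as sum() shows up inside busy_loop rather than as its own entry.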

Alternatively, launch and profile in one command:

py-spy record -o profile.svg --duration 30 -- python slow_service.py

Step 3 — Read the flame graph

A py-spy flame graph encodes two things:

Axis	Meaning
Width (x)	Proportion of CPU samples: wider = more time spent there
Height (y)	Call stack depth. py-spy draws its SVG inverted (icicle style), so the entry point is at the top and the frame that is actually executing is at the bottom of each stack

How to spot the bottleneck: Ignore wide root frames like run and process_batch; they're wide because all execution flows through them. Look for the widest leaf frame, the one with nothing stacked beyond it; that is what's actually burning CPU.

In this flame graph, find_duplicates nearly fills the x-axis and appears as a wide, flat leaf frame, because py-spy profiles Python-level call stacks by default. Built-in operations inside find_duplicates such as the range(len(items)) iteration, the item not in duplicates list scan, and list.append are all implemented in C and do not push Python frames onto the stack, so they do not show up as separate frames. All the time spent on those operations is folded directly into find_duplicates itself, making it the plateau you see. py-spy labels frames with the current line number by default, so the plateau may appear as a few adjacent find_duplicates boxes, one per source line; pass --function to merge them into one. That plateau is your hot path.

Tip: If you want C-extension and native frames to appear separately (e.g., to profile NumPy internals), pass --native to py-spy. Without --native, only Python frames are captured.

Click any frame in the SVG to zoom into that subtree; hover to see exact sample counts and percentages.
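The difference between a wide root frame and a wide leaf frame can be sketched in a few lines of Python. The sample counts below are made up for illustration, and frame_widths is a hypothetical helper, not part of py-spy. The input uses the folded-stack layout ("root;...;leaf" plus a count) that flame graph tools commonly consume.

```python
from collections import Counter

# Hypothetical folded-stack samples: "root;...;leaf" -> number of samples.
FOLDED = {
    "run;process_batch;find_duplicates": 2950,
    "run;process_batch;<listcomp>;randint": 30,
    "run;print": 20,
}


def frame_widths(folded):
    """Return (total, self) sample counts per function name."""
    total, self_time = Counter(), Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        for name in set(frames):        # count each function once per stack
            total[name] += count
        self_time[frames[-1]] += count  # only the leaf was executing
    return total, self_time


total, self_time = frame_widths(FOLDED)
n = sum(FOLDED.values())
for name, count in total.most_common():
    print(f"{name:16s} total {count / n:6.1%}  self {self_time[name] / n:6.1%}")
```

In this made-up data, run and process_batch have large totals but zero self time, while find_duplicates carries almost all of the self time. That is the numeric version of "ignore the wide frames everything flows through and look at the leaf".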

Step 4 — Fix the hot path

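Replace the O(n²) implementation with a single-pass O(n) version. Note that it returns the same duplicates in arbitrary order, whereas the original returned them in order of first occurrence: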

def find_duplicates(items):
    seen = set()
    duplicates = set()
    for item in items:
        if item in seen:
            duplicates.add(item)
        seen.add(item)
    return list(duplicates)

Two changes drive the speedup:

One loop instead of nested loops cuts the iteration count from n² to n.
item in seen is O(1) on average for a set, whereas item not in duplicates was an O(n) scan of a list.
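Before benchmarking, it is worth checking that the rewrite still finds the same duplicates. The sketch below compares the two versions on a small, seeded input. The set-based version does not preserve order, so the comparison is on sets. find_duplicates_ordered is a hypothetical extra variant, not part of the original service, for callers that rely on first-occurrence order.

```python
import random


def find_duplicates_slow(items):
    # Original nested-loop version from Step 1.
    duplicates = []
    for i, item in enumerate(items):
        for j in range(len(items)):
            if i != j and items[j] == item and item not in duplicates:
                duplicates.append(item)
    return duplicates


def find_duplicates_fast(items):
    # Set-based version from this step; result order is arbitrary.
    seen, duplicates = set(), set()
    for item in items:
        if item in seen:
            duplicates.add(item)
        seen.add(item)
    return list(duplicates)


def find_duplicates_ordered(items):
    # Hypothetical variant: still O(n), keeps first-occurrence order
    # because dicts preserve insertion order (Python 3.7+).
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return [item for item, c in counts.items() if c > 1]


random.seed(0)  # fixed seed so the check is repeatable
data = [random.randint(0, 250) for _ in range(500)]

slow_result = find_duplicates_slow(data)
assert set(find_duplicates_fast(data)) == set(slow_result)  # same elements
assert find_duplicates_ordered(data) == slow_result         # same order too
print(f"{len(slow_result)} duplicates; all three versions agree")
```

The input is deliberately small (500 items) so that the slow version finishes quickly.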

Step 5 — Benchmark the speedup

# benchmark.py
import timeit
import random

DATA = [random.randint(0, 2500) for _ in range(5000)]

def find_duplicates_slow(items):
    duplicates = []
    for i, item in enumerate(items):
        for j in range(len(items)):
            if i != j and items[j] == item and item not in duplicates:
                duplicates.append(item)
    return duplicates

def find_duplicates_fast(items):
    seen, duplicates = set(), set()
    for item in items:
        if item in seen:
            duplicates.add(item)
        seen.add(item)
    return list(duplicates)

slow = timeit.timeit(lambda: find_duplicates_slow(DATA), number=5)
fast = timeit.timeit(lambda: find_duplicates_fast(DATA), number=5)
print(f"Slow: {slow:.2f}s | Fast: {fast:.4f}s | Speedup: {slow/fast:.0f}x")

python benchmark.py
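A single timeit call can be skewed by whatever else the machine is doing, and the fast function finishes in well under a millisecond. One common way to reduce noise is to repeat the measurement and keep the minimum. The helper below is a sketch of that approach. best_of and sample_workload are hypothetical names, with sample_workload standing in for the function you want to time.

```python
import timeit


def best_of(func, repeat=5, number=1):
    """Return the fastest per-call time, in seconds, over several repeats."""
    # timeit.repeat returns one total time per repeat; the minimum is the
    # run least disturbed by other activity on the machine.
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


def sample_workload():
    # Hypothetical stand-in for find_duplicates_fast(DATA).
    return sorted(range(10_000), reverse=True)


print(f"best of 5: {best_of(sample_workload) * 1e3:.3f} ms per call")
```

For very fast functions, raising number so that each repeat runs for at least a few milliseconds tends to give steadier figures.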

Verify it works

Expected output (numbers vary by hardware):

Slow: 12.41s | Fast: 0.0031s | Speedup: 4003x

The patched service should now print batches nearly instantly. Re-run py-spy on the updated service: find_duplicates should shrink to near-invisible, and process_batch → random.randint will dominate instead — confirmation that the original bottleneck is gone.
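A rough step count shows why the measured speedup lands in the thousands and why it grows with input size. The sketch below only counts loop iterations. It assumes nothing about per-iteration cost, which differs between the two versions, so treat the ratio as an order-of-magnitude guide rather than a prediction. The function names are hypothetical.

```python
def inner_iterations_nested(n):
    # The nested-loop version evaluates its `if` once per (i, j) pair.
    return n * n


def iterations_single_pass(n):
    # The set-based version visits each item exactly once.
    return n


for n in (500, 5_000, 50_000):
    nested = inner_iterations_nested(n)
    single = iterations_single_pass(n)
    print(f"n={n:>6,}: {nested:>13,} vs {single:>6,} steps ({nested // single:,}x)")
```

For the 5,000-item batches used here, that is 25,000,000 inner iterations against 5,000, a ratio of 5,000, which is the same order of magnitude as the sample output above.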

Troubleshooting

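Operation not permitted on Linux: py-spy needs ptrace-level access to the target process. Run it with sudo, or grant the capability to the binary once (be aware that this lets any user who can execute that binary inspect other processes):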

sudo setcap cap_sys_ptrace=eip $(which py-spy)

Permission denied inside Docker: Docker does not grant the SYS_PTRACE capability by default. Start your container with:

docker run --cap-add SYS_PTRACE ...

Avoid --privileged unless you have no alternative — it grants far more than you need.

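Flame graph SVG is blank or shows only one frame: The process likely finished before enough samples were collected. Ensure the service is inside its hot loop during the recording window, or use the -- python slow_service.py launch form so py-spy owns the process lifetime. py-spy also skips idle threads by default, so a process that is mostly sleeping or waiting on I/O yields few samples; add --idle to include them.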

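macOS attach fails even with sudo: System Integrity Protection stops even root from reading the memory of the system Python under /usr/bin. Profile a Python installed elsewhere (for example from python.org, Homebrew, or pyenv), and if attaching by --pid still fails, try the launch form:

sudo py-spy record -o profile.svg --duration 30 -- python slow_service.py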

Next steps

py-spy top — live htop-style view of hot functions with no output file; ideal for rapid triage on a running server.
Speedscope format — pass --format speedscope to produce a JSON file for speedscope.app, which offers left-heavy and sandwich views that py-spy's SVG doesn't.
Native extensions — add --native to include C/C++ frames from NumPy, pandas, or your own Cython code in the same flame graph.
Continuous profiling — integrate py-spy with Pyroscope for always-on production profiling with time-series flame graph storage.


