A simple, visual guide to CPU-bound work, the GIL, and why ProcessPoolExecutor sometimes beats ThreadPoolExecutor by a lot.
You have a batch job that takes 8 seconds. You split it into 8 pieces. You throw 8 threads at it. It still takes 8 seconds.
That feels wrong the first time you see it. You reached for parallelism, paid the complexity cost, and got basically nothing back. Sometimes you even make it slower.
This is one of the most common Python performance footguns, and it happens because three ideas collide in exactly the wrong way:
- CPU-bound vs I/O-bound work
- threads vs processes
- the GIL
The good news is that once you see the pattern, you start spotting it everywhere. And the fix is often just one small change: stop using threads for CPU-bound Python work, and use processes instead.
Let’s make that visible with a tiny experiment.
Imagine a batch job that transforms 8 chunks of data. Each chunk takes about 1 second of pure Python CPU work. Run them one after another, and the total time is about 8 seconds.
So far, so normal.
Then you try to speed it up with a ThreadPoolExecutor.
Key Definitions
Before the experiment, here are a few quick anchors:
- CPU-bound: the program is slow because it is busy computing.
- I/O-bound: the program is slow because it is waiting on disk, network, APIs, or a database.
- Thread: a worker inside a process; threads share memory.
- Process: a separate Python interpreter with its own memory space.
- GIL: in CPython, the Global Interpreter Lock allows only one thread at a time to execute Python bytecode.
That is enough vocabulary to make the rest of the post click.
Setup
Now we need a small benchmark that you can actually run.
The important thing is that transform() must be real CPU work, not time.sleep(). If you use sleep, you accidentally simulate waiting, which makes threads look good for the wrong reason.
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def transform(chunk):
    total = 0
    for i in range(10_000_000):
        total += i * i
    return total + chunk

chunks = list(range(8))
On your machine, you may need to tune the loop count so each chunk takes about 1 second.
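One rough way to pick that loop count is to time a short probe run and scale it linearly. This is only a sketch: it assumes the loop cost grows roughly in proportion to the iteration count, and `burn` and `calibrate_iterations` are hypothetical helper names, not part of the benchmark itself.

```python
import time


def burn(n):
    """Same kind of pure Python arithmetic loop as transform()."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def calibrate_iterations(target_seconds=1.0, probe=200_000):
    """Estimate how many loop iterations take about target_seconds."""
    start = time.perf_counter()
    burn(probe)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from a coarse clock.
    elapsed = max(elapsed, 1e-9)
    return max(1, int(probe * target_seconds / elapsed))


if __name__ == "__main__":
    print(f"Try about {calibrate_iterations():,} iterations per chunk")
```

Treat the printed number as a starting point. A short probe is noisy, so round it and rerun the real benchmark to check.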
Runnable Benchmark
Here are the three versions side by side:
def main():
    # Sequential
    start = time.perf_counter()
    results = [transform(chunk) for chunk in chunks]
    print(f"Sequential: {time.perf_counter() - start:.2f}s")

    # Threads
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=8) as ex:
        results = list(ex.map(transform, chunks))
    print(f"Threads: {time.perf_counter() - start:.2f}s")

    # Processes
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as ex:
        results = list(ex.map(transform, chunks))
    print(f"Processes: {time.perf_counter() - start:.2f}s")

# Worker processes re-import this module on platforms that spawn
# (Windows, macOS), so the entry point must be guarded.
if __name__ == "__main__":
    main()
On a typical machine, the threaded version is the surprise: it often stays close to the sequential runtime, while the process-based version gets a real speedup.
That is the clue that this job is CPU-bound, not I/O-bound.
Why Threads Fail Here
The important detail is that this work is CPU-bound. Each worker is actively computing in Python, not waiting on the network, disk, or a database.
That runs straight into the GIL.
In CPython, the GIL allows only one thread at a time to execute Python bytecode. So even if you create 8 threads, they do not all crunch Python code at the same instant. They mostly take turns.
That means the threaded version behaves more like this:
one Python process
one GIL
thread 1 -> runs
thread 2 -> waits
thread 3 -> waits
thread 4 -> waits
...
the lock gets handed around
You still pay extra overhead for scheduling, context switching, futures, executor bookkeeping, and contention around shared interpreter state. So you end up with almost the same total runtime as the sequential version, and sometimes slightly worse.
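The turn-taking shows up if you give every thread its own task and watch the total time. The sketch below is a hedged illustration, not part of the original benchmark: `burn` and `timed_threads` are hypothetical names, and it assumes a standard CPython build with the GIL enabled.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def burn(n):
    """Pure Python CPU work: holds the GIL while it runs."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed_threads(n_threads, n_iterations=2_000_000):
    """Run one burn() task per thread; return (elapsed seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as ex:
        results = list(ex.map(burn, [n_iterations] * n_threads))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    for n in (1, 2, 4):
        elapsed, _ = timed_threads(n)
        print(f"{n} thread(s), {n} task(s): {elapsed:.2f}s")
```

If the threads ran in parallel, the elapsed time would stay roughly flat as `n` grows. When they are taking turns on the GIL, it tends to grow with the number of tasks instead.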
A simple way to picture the three approaches:
- Sequential: one core busy, one chunk after another.
- Threads: still mostly one core busy, because only one thread can hold the GIL at a time.
- Processes: multiple cores busy at once, because each process has its own interpreter and its own GIL.
The bug. Eight threads, but the GIL lets only one execute Python bytecode at a time, so they take turns instead of running side by side.
with ThreadPoolExecutor(max_workers=8) as ex: ...
In other words, the threads are real, but for pure Python CPU work, they still have to queue up for the interpreter.
Why Processes Work Better
ProcessPoolExecutor sidesteps the GIL by using separate processes instead of threads.
That matters because each process gets:
- its own Python interpreter
- its own memory space
- its own GIL
So on a 4-core machine, 4 chunks can genuinely run at the same time. That is real parallelism, not just concurrency.
Threads share one interpreter. Processes each bring their own.
The mental model becomes:
process 1 -> GIL 1 -> chunk 1
process 2 -> GIL 2 -> chunk 2
process 3 -> GIL 3 -> chunk 3
process 4 -> GIL 4 -> chunk 4
That is why ProcessPoolExecutor can produce a real speedup for this benchmark.
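A quick way to see the separate interpreters is to ask each worker for its process ID. This is a minimal sketch; `worker_pid` and `distinct_pids` are hypothetical helper names introduced only for illustration.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def worker_pid(_):
    """Report the ID of the process that ran this task."""
    return os.getpid()


def distinct_pids(executor_cls, n_tasks=8, n_workers=4):
    """Return the set of process IDs that handled the tasks."""
    with executor_cls(max_workers=n_workers) as ex:
        return set(ex.map(worker_pid, range(n_tasks)))


if __name__ == "__main__":
    print("parent:   ", os.getpid())
    print("threads:  ", distinct_pids(ThreadPoolExecutor))
    print("processes:", distinct_pids(ProcessPoolExecutor))
```

Thread workers all report the parent's PID because they live inside one interpreter. Process workers report other PIDs. Because these tasks are tiny, you may see fewer distinct PIDs than `n_workers`.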
Why the Speedup Is Real, But Not Perfect
The process version does not usually give you a perfect 8x win. Two limits matter.
1. Core Count Sets the Ceiling
If you have 8 tasks but only 4 cores, the work has to happen in two waves.
So the best possible speedup is closer to 4x than 8x.
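The "waves" arithmetic can be written down directly. This is an idealized sketch that assumes equal-length tasks and zero overhead; `best_case_speedup` is a hypothetical helper name.

```python
import math
import os


def best_case_speedup(n_tasks, n_workers):
    """Ideal speedup when equal tasks run in waves of n_workers."""
    waves = math.ceil(n_tasks / n_workers)
    return n_tasks / waves


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    for workers in sorted({1, 4, 8, cores}):
        speedup = best_case_speedup(8, workers)
        print(f"8 tasks on {workers} worker(s): at best {speedup:.2f}x")
```

With 8 tasks and 4 workers this gives 2 waves and a ceiling of 4x, matching the reasoning above. Real runs land below the ceiling because of the costs described next.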
2. Moving Data Between Processes Costs Time
Processes do not share Python objects directly the way threads do.
That means inputs and outputs have to be:
- serialized with pickle
- sent to worker processes
- sent back again
Starting processes is also heavier than starting threads.
That overhead is why process pools shine on big enough tasks, but can disappoint on tiny ones.
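You can get a feel for the data-movement cost by doing the pickle round trip yourself. This sketch only imitates what a process pool does to arguments and results; `round_trip` is a hypothetical helper, and the real executor adds inter-process transfer on top.

```python
import pickle


def round_trip(obj):
    """Serialize and deserialize obj, as happens across a process boundary."""
    payload = pickle.dumps(obj)   # bytes that would be sent to a worker
    copy = pickle.loads(payload)  # what the worker would actually receive
    return copy, len(payload)


if __name__ == "__main__":
    data = list(range(100_000))
    copy, size = round_trip(data)
    print(f"payload size: {size:,} bytes")
    print("equal:", copy == data, "| same object:", copy is data)
```

The worker gets an equal copy, never the original object, and every byte of that payload has to be produced, moved, and parsed. If that cost rivals the compute time per task, the pool will disappoint.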
Process pools buy real parallelism, but you pay for it in startup cost and data movement.
How to Recognize This Bug in Real Code
Usually the first clue is simple: adding threads does nothing.
If you increase max_workers and the runtime barely changes, that is a strong sign the slow part is not waiting on I/O. It is busy computing.
A few practical signs:
- One CPU core is pinned near 100%.
- The other cores are mostly quiet.
- Profiling shows time spent in your Python transform function, not in network or disk calls.
- More threads add overhead, but not speed.
If more threads do not help, stop asking “how do I parallelize this?” and start asking “what kind of work is this?”
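One hedged way to answer that question in code is to compare CPU time with wall-clock time for the slow call. The sketch below uses `time.process_time()`, which counts CPU time for the current process and excludes time spent sleeping; `cpu_fraction` and `busy` are hypothetical helper names.

```python
import time


def busy(n):
    """Stand-in for CPU-bound work."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def cpu_fraction(func, *args):
    """Fraction of wall-clock time that func spent using the CPU."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0


if __name__ == "__main__":
    print(f"busy loop: {cpu_fraction(busy, 2_000_000):.0%} CPU")
    print(f"sleep:     {cpu_fraction(time.sleep, 0.2):.0%} CPU")
```

A fraction near 100% suggests computing, and a fraction near 0% suggests waiting. Measure single-threaded code, since `process_time()` sums CPU time across all threads in the process.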
Threads Are Still Useful
This is not an anti-thread post. Threads are great when the program spends most of its time waiting.
For example, when it is waiting on:
- an API
- a database
- the filesystem
- the network
In those cases, one thread can wait while another thread makes progress. The GIL is much less painful here because a thread blocked on I/O typically releases it.
Same executor pattern, opposite result: threads are great for waiting, but poor for pure Python computing.
That is why this version would be misleading:
def transform(chunk):
time.sleep(1)
return chunk
If you benchmark that with threads, it will look great. But you did not test CPU-bound work. You tested waiting.
And threads are excellent at waiting.
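To see how well they wait, here is a small runnable sketch of that sleep-based version. It uses shorter sleeps to keep the run quick; `wait_task` and `timed_run` are hypothetical names, and `time.sleep` stands in for a real network or disk call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def wait_task(chunk, seconds=0.1):
    """Simulated I/O: the thread waits and does no computing."""
    time.sleep(seconds)
    return chunk


def timed_run(n_tasks=8, n_threads=8):
    """Run n_tasks waiting tasks on n_threads; return (elapsed, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as ex:
        results = list(ex.map(wait_task, range(n_tasks)))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    one, _ = timed_run(n_threads=1)
    eight, _ = timed_run(n_threads=8)
    print(f"1 thread:  {one:.2f}s")
    print(f"8 threads: {eight:.2f}s")
```

The 8-thread run finishes far sooner because the waits overlap. That is a real result about waiting, and it says nothing about how threads handle computation.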
The Rule of Thumb
When choosing between threads and processes, ask what your program is mostly doing.
If it is mostly waiting:
network, disk, database, API, sleep
Use threads or async I/O.
If it is mostly computing:
parsing, compression, image transforms, pure Python loops, feature extraction
Use processes, native extensions, vectorized libraries, or a distributed compute system.
The shortest version is:
Threads for waiting. Processes for working.
Or, by workload type:
- CPU-bound: usually use processes.
- I/O-bound: often use threads.
One More Caveat
Some CPU-heavy Python libraries do not behave like the toy benchmark above.
Libraries like NumPy, Polars, PyTorch, TensorFlow, and many compression or image-processing libraries often push the expensive work into native code. That native code may release the GIL or use its own thread pools.
So the real question is not just “is my task CPU-heavy?”
It is:
Is my CPU-heavy work executing as Python bytecode, or inside native code that can run outside the GIL?
For pure Python loops, the GIL matters a lot. For native libraries, benchmark before assuming.
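Here is a small sketch of what "benchmark before assuming" can look like. It uses `hashlib` from the standard library as a stand-in for a native library, which is my substitution and not one of the libraries named above. The `hashlib` documentation states that the GIL is released while hashing data larger than 2047 bytes. `digest` and `compare` are hypothetical helper names.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor


def digest(buf):
    """Hash a buffer; the heavy lifting happens in native code."""
    h = hashlib.sha256()
    h.update(buf)
    return h.hexdigest()


def compare(buffers, n_threads=4):
    """Return (sequential seconds, threaded seconds, results match)."""
    start = time.perf_counter()
    sequential = [digest(b) for b in buffers]
    seq_time = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as ex:
        threaded = list(ex.map(digest, buffers))
    thr_time = time.perf_counter() - start

    return seq_time, thr_time, sequential == threaded


if __name__ == "__main__":
    # Four 20 MB buffers; adjust the size to suit your machine.
    buffers = [bytes([i]) * 20_000_000 for i in range(4)]
    seq_time, thr_time, same = compare(buffers)
    print(f"sequential: {seq_time:.3f}s")
    print(f"threads:    {thr_time:.3f}s (results match: {same})")
```

If the threaded time comes out clearly lower on a multi-core machine, the native code is running outside the GIL and threads may be enough. If it does not, fall back to the pure Python rule and reach for processes.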
Final Takeaway
Adding threads does not automatically make Python CPU work run in parallel.
If your workload is pure Python computation, ThreadPoolExecutor can leave you staring at one busy core and seven disappointed expectations. ProcessPoolExecutor works better because each worker process gets its own interpreter and its own GIL.
That is the performance lesson:
- Use threads when the program is waiting.
- Use processes when the program is working.