A simple, visual guide to CPU-bound work, the GIL, and why ProcessPoolExecutor sometimes beats ThreadPoolExecutor by a lot.

You have a batch job that takes 8 seconds. You split it into 8 pieces. You throw 8 threads at it. It still takes 8 seconds.

That feels wrong the first time you see it. You reached for parallelism, paid the complexity cost, and got basically nothing back. Sometimes you even make it slower.

This is one of the most common Python performance footguns, and it happens because three ideas collide in exactly the wrong way:

  • CPU-bound vs I/O-bound work
  • threads vs processes
  • the GIL

The good news is that once you see the pattern, you start spotting it everywhere. And the fix is often just one small change: stop using threads for CPU-bound Python work, and use processes instead.

Let’s make that visible with a tiny experiment.

Imagine a batch job that transforms 8 chunks of data. Each chunk takes about 1 second of pure Python CPU work. Run them one after another, and the total time is about 8 seconds.

So far, so normal.

Then you try to speed it up with a ThreadPoolExecutor.

Key Definitions

Before the experiment, here are a few quick anchors:

  • CPU-bound: the program is slow because it is busy computing.
  • I/O-bound: the program is slow because it is waiting on disk, network, APIs, or a database.
  • Thread: a worker inside a process; threads share memory.
  • Process: a separate Python interpreter with its own memory space.
  • GIL: the Global Interpreter Lock. In CPython (apart from the optional free-threaded builds available since Python 3.13), it allows only one thread at a time to execute Python bytecode.

That is enough vocabulary to make the rest of the post click.

Setup

Now we need a small benchmark that you can actually run.

The important thing is that transform() must be real CPU work, not time.sleep(). If you use sleep, you accidentally simulate waiting, which makes threads look good for the wrong reason.

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

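def transform(chunk):
    """Burn CPU in pure Python, then fold in the chunk so each result differs."""
    total = 0
    for i in range(10_000_000):  # tune this count so one call takes about 1 second
        total += i * i
    return total + chunk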

chunks = list(range(8))

On your machine, you may need to tune the loop count so each chunk takes about 1 second.
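One way to find that loop count is to time a short probe run and scale from there. The sketch below uses hypothetical helpers (`busy_loop`, `time_once`, and `suggest_iterations` are not part of the benchmark above) and assumes the runtime grows roughly linearly with the iteration count, so treat the result as a starting estimate.

```python
import time

def busy_loop(iterations):
    """Same arithmetic as transform(), with the loop count as a parameter."""
    total = 0
    for i in range(iterations):
        total += i * i
    return total

def time_once(iterations):
    """Return the wall-clock seconds one busy_loop call takes."""
    start = time.perf_counter()
    busy_loop(iterations)
    return time.perf_counter() - start

def suggest_iterations(target_seconds=1.0, probe=200_000):
    """Estimate an iteration count for target_seconds, assuming linear scaling."""
    elapsed = max(time_once(probe), 1e-9)  # avoid dividing by zero on a coarse clock
    return max(1, int(probe * target_seconds / elapsed))

if __name__ == "__main__":
    n = suggest_iterations()
    print(f"Try range({n:_}) in transform()")
```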

Runnable Benchmark

Here are the three versions side by side:

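if __name__ == "__main__":
    # The guard is needed for ProcessPoolExecutor on platforms that start
    # workers by re-importing this file (Windows, macOS).

    # Sequential: one chunk after another
    start = time.perf_counter()
    results = [transform(chunk) for chunk in chunks]
    print(f"Sequential: {time.perf_counter() - start:.2f}s")

    # Threads: 8 workers, one shared GIL
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=8) as ex:
        results = list(ex.map(transform, chunks))
    print(f"Threads:    {time.perf_counter() - start:.2f}s")

    # Processes: 4 workers, one per core on a 4-core machine
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as ex:
        results = list(ex.map(transform, chunks))
    print(f"Processes:  {time.perf_counter() - start:.2f}s")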

On a typical machine, the threaded version is the surprise: it often stays close to the sequential runtime, while the process-based version gets a real speedup.

That is the clue that this job is CPU-bound, not I/O-bound.

Why Threads Fail Here

The important detail is that this work is CPU-bound. Each worker is actively computing in Python, not waiting on the network, disk, or a database.

That runs straight into the GIL.

In CPython, the GIL allows only one thread at a time to execute Python bytecode. So even if you create 8 threads, they do not all crunch Python code at the same instant. They mostly take turns.

That means the threaded version behaves more like this:

one Python process
one GIL

thread 1 -> runs
thread 2 -> waits
thread 3 -> waits
thread 4 -> waits
...

the lock gets handed around

You still pay extra overhead for scheduling, context switching, futures, executor bookkeeping, and contention around shared interpreter state. So you end up with almost the same total runtime as the sequential version, and sometimes slightly worse.

A simple way to picture the three approaches:

  • Sequential: one core busy, one chunk after another.
  • Threads: still mostly one core busy, because only one thread can hold the GIL at a time.
  • Processes: multiple cores busy at once, because each process has its own interpreter and its own GIL.

[Figure: CPU core usage comparison on four cores over an 8-second timeline. Sequential and threaded CPU-bound Python work mostly use one core, while process-based work uses multiple cores. Legend: computing (CPU), spawn / pickle (IPC), idle core.]

Threaded case, with ThreadPoolExecutor(max_workers=8): 8.0s total time, 1.0x speedup, 1 of 4 cores used. Eight threads, but the GIL lets only one execute Python bytecode at a time. They take turns on a single core.

In other words, the threads are real, but for pure Python CPU work, they still have to queue up for the interpreter.
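The turn-taking is visible from inside the interpreter. As a small sketch: `sys.getswitchinterval()` reports the time slice CPython aims to give a running thread before asking it to hand over the GIL (0.005 seconds by default). The `sys._is_gil_enabled()` check only exists on Python 3.13 and later, so the code below treats it as optional.

```python
import sys

# Time slice a thread gets before CPython requests a GIL handoff.
interval = sys.getswitchinterval()
print(f"Switch interval: {interval * 1000:.1f} ms")

# Python 3.13+ can report whether the GIL is active in this interpreter.
# On older versions the attribute is missing, and the GIL is always on.
gil_check = getattr(sys, "_is_gil_enabled", None)
gil_enabled = gil_check() if gil_check is not None else True
print(f"GIL enabled: {gil_enabled}")
```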

Why Processes Work Better

ProcessPoolExecutor sidesteps the GIL by using separate processes instead of threads.

That matters because each process gets:

  • its own Python interpreter
  • its own memory space
  • its own GIL

So on a 4-core machine, 4 chunks can genuinely run at the same time. That is real parallelism, not just concurrency.

Threads share one interpreter. Processes each bring their own.

The mental model becomes:

process 1 -> GIL 1 -> chunk 1
process 2 -> GIL 2 -> chunk 2
process 3 -> GIL 3 -> chunk 3
process 4 -> GIL 4 -> chunk 4

That is why ProcessPoolExecutor can produce a real speedup for this benchmark.
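A quick way to check this model on your own machine is to ask each worker for its process ID. This is a minimal sketch; `worker_pid` and `pids_with` are hypothetical helpers, and how many distinct worker IDs the process pool reports depends on how it happens to schedule the tasks.

```python
import os
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def worker_pid(_chunk):
    """Return the ID of the process that ran this task."""
    return os.getpid()

def pids_with(executor_cls, n_tasks=4, workers=4):
    """Run n_tasks trivial tasks and collect the distinct process IDs used."""
    with executor_cls(max_workers=workers) as ex:
        return set(ex.map(worker_pid, range(n_tasks)))

if __name__ == "__main__":
    print("main process:", os.getpid())
    print("thread pool :", pids_with(ThreadPoolExecutor))   # only the main PID
    print("process pool:", pids_with(ProcessPoolExecutor))  # worker PIDs, never the main PID
```

Threads all report the main process ID because they live inside it. Pool workers report their own IDs because each one is a separate interpreter.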

Why the Speedup Is Real, But Not Perfect

The process version does not usually give you a perfect 8x win. Two limits matter.

1. Core Count Sets the Ceiling

If you have 8 tasks but only 4 cores, the work has to happen in two waves.

So the best possible speedup is closer to 4x than 8x.
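The arithmetic behind that ceiling can be sketched in a few lines, assuming equal-sized chunks, one worker per core, and zero overhead (`ideal_speedup` is an illustrative helper, not a library function):

```python
import math

def ideal_speedup(n_tasks, n_cores, seconds_per_task=1.0):
    """Best-case wall time and speedup when tasks run in waves of n_cores."""
    waves = math.ceil(n_tasks / n_cores)
    parallel_time = waves * seconds_per_task
    sequential_time = n_tasks * seconds_per_task
    return parallel_time, sequential_time / parallel_time

print(ideal_speedup(8, 4))  # (2.0, 4.0): two waves of 1s each, 4x faster
print(ideal_speedup(8, 3))  # three waves, about 2.67x: the last wave is only partly full
print(ideal_speedup(8, 8))  # (1.0, 8.0): 8x needs 8 free cores
```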

2. Moving Data Between Processes Costs Time

Processes do not share Python objects directly the way threads do.

That means inputs and outputs have to be:

  • serialized with pickle
  • sent to worker processes
  • sent back again

Starting processes is also heavier than starting threads.

That overhead is why process pools shine on big enough tasks, but can disappoint on tiny ones.

Process pools buy real parallelism, but you pay for it in startup cost and data movement.
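To get a feel for the data-movement cost, you can measure the pickle round trip on its own. This sketch only approximates what a process pool does (it skips the actual transfer between processes), and the payload is made up for illustration.

```python
import pickle
import time

def pickle_round_trip(obj):
    """Serialize and deserialize obj; return (copy, size_in_bytes, seconds)."""
    start = time.perf_counter()
    blob = pickle.dumps(obj)
    copy = pickle.loads(blob)
    return copy, len(blob), time.perf_counter() - start

if __name__ == "__main__":
    payload = list(range(1_000_000))  # hypothetical chunk of input data
    copy, size, seconds = pickle_round_trip(payload)
    print(f"{size / 1e6:.1f} MB pickled, round trip {seconds * 1000:.1f} ms")
```

If the round trip takes about as long as the task itself, a process pool has little room to help.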

How to Recognize This Bug in Real Code

Usually the first clue is simple: adding threads does nothing.

If you increase max_workers and the runtime barely changes, that is a strong sign the slow part is not waiting on I/O. It is busy computing.

A few practical signs:

  • One CPU core is pinned near 100%.
  • The other cores are mostly quiet.
  • Profiling shows time spent in your Python transform function, not in network or disk calls.
  • More threads add overhead, but not speed.
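One cheap check, sketched below, is to compare CPU time with wall-clock time for the suspect function. `time.process_time()` counts CPU time used by the current process and excludes time spent sleeping, so a ratio near 1 suggests computing and a ratio near 0 suggests waiting. `cpu_ratio` is a hypothetical helper, and `compute` and `wait` are stand-ins for your own code.

```python
import time

def cpu_ratio(func, *args):
    """Return CPU seconds divided by wall-clock seconds for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0

def compute(n=2_000_000):
    """Stand-in for CPU-bound work: a pure Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def wait(seconds=0.2):
    """Stand-in for I/O-bound work: the process just sleeps."""
    time.sleep(seconds)

if __name__ == "__main__":
    print(f"compute: {cpu_ratio(compute):.2f}")  # expect a value near 1
    print(f"wait:    {cpu_ratio(wait):.2f}")     # expect a value near 0
```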

If more threads do not help, stop asking “how do I parallelize this?” and start asking “what kind of work is this?”

Threads Are Still Useful

This is not an anti-thread post. Threads are great when the program spends most of its time waiting.

For example, when it is waiting on:

  • an API
  • a database
  • the filesystem
  • the network

In those cases, one thread can wait while another thread makes progress. The GIL is much less painful because CPython releases it around blocking I/O calls, so a waiting thread does not hold up the others.

Same executor pattern, opposite result: threads are great for waiting, but poor for pure Python computing.

That is why this version would be misleading:

def transform(chunk):
    time.sleep(1)
    return chunk

If you benchmark that with threads, it will look great. But you did not test CPU-bound work. You tested waiting.

And threads are excellent at waiting.
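To see that opposite result with the same executor pattern, here is a sketch of the sleep-based version, using a shorter sleep (0.1 seconds instead of 1) so it finishes quickly. `fake_io` is a stand-in for a real network or disk call, and `timed` is a small helper made up for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(chunk):
    """Stand-in for an I/O call: the thread just waits, releasing the GIL."""
    time.sleep(0.1)
    return chunk

def timed(label, run):
    """Call run(), print how long it took, and return (results, seconds)."""
    start = time.perf_counter()
    results = run()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return results, elapsed

chunks = list(range(8))

def run_sequential():
    return [fake_io(c) for c in chunks]

def run_threaded():
    with ThreadPoolExecutor(max_workers=8) as ex:
        return list(ex.map(fake_io, chunks))

seq_results, seq_time = timed("Sequential", run_sequential)  # about 8 x 0.1s
thr_results, thr_time = timed("Threads   ", run_threaded)    # about 0.1s: the waits overlap
```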

The Rule of Thumb

When choosing between threads and processes, ask what your program is mostly doing.

If it is mostly waiting:

network, disk, database, API, sleep

Use threads or async I/O.

If it is mostly computing:

parsing, compression, image transforms, pure Python loops, feature extraction

Use processes, native extensions, vectorized libraries, or a distributed compute system.

The shortest version is:

Threads for waiting. Processes for working.

Or, as a two-line checklist:

  • CPU-bound: usually use processes.
  • I/O-bound: often use threads.
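In code, the rule of thumb can be as small as one switch. This is a sketch with a made-up `pick_executor` helper. Deciding whether a workload counts as "io" or "cpu" is still up to you and your profiler.

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def pick_executor(kind):
    """Map a workload kind ('io' or 'cpu') to an executor class."""
    if kind == "io":
        return ThreadPoolExecutor    # mostly waiting: threads overlap the waits
    if kind == "cpu":
        return ProcessPoolExecutor   # mostly pure Python computing: one GIL per process
    raise ValueError(f"unknown workload kind: {kind!r}")

def slow_lookup(x):
    # Hypothetical stand-in for an I/O-bound task; replace with your own function.
    return x * 2

if __name__ == "__main__":
    with pick_executor("io")(max_workers=4) as ex:
        print(list(ex.map(slow_lookup, range(4))))  # [0, 2, 4, 6]
```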

One More Caveat

Some CPU-heavy Python libraries do not behave like the toy benchmark above.

Libraries like NumPy, Polars, PyTorch, TensorFlow, and many compression or image-processing libraries often push the expensive work into native code. That native code may release the GIL or use its own thread pools.

So the real question is not just “is my task CPU-heavy?”

It is:

Is my CPU-heavy work executing as Python bytecode, or inside native code that can run outside the GIL?

For pure Python loops, the GIL matters a lot. For native libraries, benchmark before assuming.
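The standard library has an easy example of this. The `hashlib` documentation states that the GIL is released while hashing data larger than 2047 bytes, so threads can overlap that work. The sketch below uses `hashlib` as a stand-in for the libraries named above. Buffer sizes and worker counts are arbitrary, and whether you see a speedup depends on your machine, so time it yourself.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

def digest(buf):
    """CPU-heavy, but the hashing runs in native code that releases the GIL."""
    return hashlib.sha256(buf).hexdigest()

def run(buffers, workers):
    """Hash all buffers with the given number of threads; return (digests, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as ex:
        digests = list(ex.map(digest, buffers))
    return digests, time.perf_counter() - start

if __name__ == "__main__":
    buffers = [bytes([i]) * 16_000_000 for i in range(4)]  # four 16 MB buffers
    one, t1 = run(buffers, workers=1)
    four, t4 = run(buffers, workers=4)
    assert one == four  # same answers either way
    print(f"1 thread: {t1:.2f}s, 4 threads: {t4:.2f}s")
```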

Final Takeaway

Adding threads does not automatically make Python CPU work run in parallel.

If your workload is pure Python computation, ThreadPoolExecutor can leave you staring at one busy core and seven disappointed expectations. ProcessPoolExecutor works better because each worker process gets its own interpreter and its own GIL.

That is the performance lesson:

  • Use threads when the program is waiting.
  • Use processes when the program is working.