A simple, visual guide to CPU-bound work, the GIL, and why ProcessPoolExecutor sometimes beats ThreadPoolExecutor by a lot.

You have a batch job that takes 8 seconds. You split it into 8 pieces. You throw 8 threads at it. It still takes 8 seconds.

That feels wrong the first time you see it. You reached for parallelism, paid the complexity cost, and got basically nothing back. Sometimes you even make it slower.

This is one of the most common Python performance footguns, and it happens because three ideas collide in exactly the wrong way:

  • CPU-bound vs I/O-bound work
  • threads vs processes
  • the GIL

The good news is that once you see the pattern, you start spotting it everywhere. And the fix is often just one small change: stop using threads for CPU-bound Python work, and use processes instead.

Let’s make that visible with a tiny experiment.

Imagine a batch job that transforms 8 chunks of data. Each chunk takes about 1 second of pure Python CPU work. Run them one after another, and the total time is about 8 seconds.

So far, so normal.

Then you try to speed it up with a ThreadPoolExecutor.

Key Definitions

Before the experiment, here are a few quick anchors:

  • CPU-bound: the program is slow because it is busy computing.
  • I/O-bound: the program is slow because it is waiting on disk, network, APIs, or a database.
  • Thread: a worker inside a process; threads share memory.
  • Process: a separate Python interpreter with its own memory space.
  • GIL: in CPython, the Global Interpreter Lock allows only one thread at a time to execute Python bytecode.

That is enough vocabulary to make the rest of the post click.

Setup

Now we need a small benchmark that you can actually run.

The important thing is that transform() must be real CPU work, not time.sleep(). If you use sleep, you accidentally simulate waiting, which makes threads look good for the wrong reason.

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def transform(chunk):
    total = 0
    for i in range(10_000_000):
        total += i * i
    return total + chunk

chunks = list(range(8))

On your machine, you may need to tune the loop count so each chunk takes about 1 second.
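One way to pick that loop count without guessing is to time a short probe run and scale it up. The sketch below is only an approximation: the helper names `transform_with_iterations` and `calibrate_iterations` are made up for this example, and linear scaling assumes the loop cost per iteration stays roughly constant.

```python
import time


def transform_with_iterations(iterations):
    """Same arithmetic loop as transform(), with the loop count as a parameter."""
    total = 0
    for i in range(iterations):
        total += i * i
    return total


def calibrate_iterations(target_seconds=1.0, probe_iterations=200_000):
    """Time a short probe run and scale it linearly to the target duration."""
    start = time.perf_counter()
    transform_with_iterations(probe_iterations)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    # Linear scaling is an estimate; confirm it with one real run of the result.
    return max(1, int(probe_iterations * target_seconds / elapsed))


if __name__ == "__main__":
    print(f"Suggested loop count: {calibrate_iterations():_}")
```

Plug the suggested number into the `range(...)` call in transform() and check that one chunk really does land near 1 second.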

Runnable Benchmark

Here are the three versions side by side in one runnable script:

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def transform(chunk):
    total = 0
    for i in range(10_000_000):
        total += i * i
    return total + chunk

if __name__ == "__main__":
    chunks = list(range(8))

    start = time.perf_counter()
    results = [transform(chunk) for chunk in chunks]
    print(f"Sequential: {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=8) as ex:
        results = list(ex.map(transform, chunks))
    print(f"Threads:    {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    # 4 workers matches a 4-core machine; omit max_workers to use the machine's CPU count.
    with ProcessPoolExecutor(max_workers=4) as ex:
        results = list(ex.map(transform, chunks))
    print(f"Processes:  {time.perf_counter() - start:.2f}s")

On a typical machine, the threaded version is the surprise: it often stays close to the sequential runtime, while the process-based version gets a real speedup.

That is the clue that this job is CPU-bound, not I/O-bound.

Why Threads Fail Here

The important detail is that this work is CPU-bound. Each worker is actively computing in Python, not waiting on the network, disk, or a database.

That runs straight into the GIL.

In CPython, the GIL allows only one thread at a time to execute Python bytecode. So even if you create 8 threads, they do not all crunch Python code at the same instant. They mostly take turns.

That means the threaded version behaves more like this:

one Python process
one GIL

thread 1 -> runs
thread 2 -> waits
thread 3 -> waits
thread 4 -> waits
...

the lock gets handed around

You still pay extra overhead for scheduling, context switching, futures, executor bookkeeping, and contention around shared interpreter state. So you end up with almost the same total runtime as the sequential version, and sometimes slightly worse.
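You can watch the turn-taking happen. In this small sketch, each thread appends its own index to a shared list, and then we count how often consecutive entries come from different threads. The function names are invented for the example, and it assumes a standard GIL-enabled CPython build. There, the handoff count is usually tiny compared to the number of steps, because a thread runs in a burst until the interpreter asks it to give up the lock.

```python
import sys
import threading


def record_turns(n_threads=2, steps=200_000):
    """Have each thread append its index to a shared list as it works."""
    log = []

    def worker(index):
        for _ in range(steps):
            log.append(index)  # list.append is atomic under the GIL

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


def count_handoffs(log):
    """Count places where consecutive entries come from different threads."""
    return sum(1 for a, b in zip(log, log[1:]) if a != b)


if __name__ == "__main__":
    log = record_turns()
    # The interval at which CPython asks the running thread to release the GIL.
    print(f"Switch interval: {sys.getswitchinterval()}s")
    print(f"{len(log)} steps recorded, {count_handoffs(log)} handoffs between threads")
```

The exact numbers vary from run to run, so treat the output as a picture of the behavior, not a measurement.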

A simple way to picture the three approaches:

  • Sequential: one core busy, one chunk after another.
  • Threads: still mostly one core busy, because only one thread can hold the GIL at a time.
  • Processes: multiple cores busy at once, because each process has its own interpreter and its own GIL.

[Figure: CPU core usage comparison. Sequential and threaded CPU-bound Python work mostly use one core, while process-based work uses multiple cores. Rows: Core 1 to Core 4. Time axis: 0s to 8s. Legend: computing (CPU), spawn / pickle (IPC), idle core.]

Threaded view (with ThreadPoolExecutor(max_workers=8) as ex: ...): 8.0s total time, 1.0x speedup, 1 of 4 cores used. Eight threads, but the GIL lets only one execute Python bytecode at a time. They take turns on a single core.

In other words, the threads are real, but for pure Python CPU work, they still have to queue up for the interpreter.

Why Processes Work Better

ProcessPoolExecutor sidesteps the GIL by using separate processes instead of threads.

That matters because each process gets:

  • its own Python interpreter
  • its own memory space
  • its own GIL

So on a 4-core machine, 4 chunks can genuinely run at the same time. That is real parallelism, not just concurrency.

Threads share one interpreter. Processes each bring their own.

The mental model becomes:

process 1 -> GIL 1 -> chunk 1
process 2 -> GIL 2 -> chunk 2
process 3 -> GIL 3 -> chunk 3
process 4 -> GIL 4 -> chunk 4

That is why ProcessPoolExecutor can produce a real speedup for this benchmark.
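A quick way to confirm that the workers really are separate processes is to ask each one for its process ID. This is a minimal sketch with made-up helper names (`worker_pid`, `pids_seen`). Because the tasks are so short, the process pool may not use all 4 workers, so expect one or more worker PIDs rather than exactly four.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def worker_pid(_):
    """Return the ID of the process that ran this call."""
    return os.getpid()


def pids_seen(executor_cls, n_tasks=4):
    """Run a few trivial tasks and collect the distinct process IDs."""
    with executor_cls(max_workers=4) as ex:
        return set(ex.map(worker_pid, range(n_tasks)))


if __name__ == "__main__":
    print(f"Parent PID:   {os.getpid()}")
    print(f"Thread pool:  {pids_seen(ThreadPoolExecutor)}")   # the parent PID only
    print(f"Process pool: {pids_seen(ProcessPoolExecutor)}")  # worker PIDs, not the parent
```

Threads report the parent's PID because they live inside it. Process workers report their own.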

Why the Speedup Is Real, But Not Perfect

The process version does not usually give you a perfect 8x win. Two limits matter.

1. Core Count Sets the Ceiling

If you have 8 tasks but only 4 cores, the work has to happen in two waves.

So the best possible speedup is closer to 4x than 8x.
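The "two waves" arithmetic can be written down directly. This sketch assumes every task takes the same time and ignores all process overhead, so it gives an upper bound, not a prediction. The function name `ideal_speedup` is made up for the example.

```python
import math


def ideal_speedup(n_tasks, n_cores, task_seconds=1.0):
    """Best-case speedup when equal-length tasks run in waves of n_cores."""
    waves = math.ceil(n_tasks / n_cores)   # 8 tasks on 4 cores -> 2 waves
    sequential = n_tasks * task_seconds    # one task after another
    parallel = waves * task_seconds        # one task-length per wave
    return sequential / parallel


if __name__ == "__main__":
    for cores in (2, 3, 4, 8):
        print(f"8 tasks on {cores} cores: up to {ideal_speedup(8, cores):.2f}x")
```

Note the 3-core case: 8 tasks need three waves, and the last wave leaves one core idle, so the ceiling drops below 3x.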

2. Moving Data Between Processes Costs Time

Processes do not share Python objects directly the way threads do.

That means inputs and outputs have to be:

  • serialized with pickle
  • sent to worker processes
  • sent back again

Starting processes is also heavier than starting threads.

That overhead is why process pools shine on big enough tasks, but can disappoint on tiny ones.

Process pools buy real parallelism, but you pay for it in startup cost and data movement.
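To get a feel for the data-movement cost, you can time the pickle round trip on its own. This sketch measures only serialization and deserialization in a single process. It does not include sending bytes to a worker, so the real cost in a pool is somewhat higher. The helper name `pickle_round_trip` is made up for the example.

```python
import pickle
import time


def pickle_round_trip(payload):
    """Serialize and deserialize a payload, returning (copy, bytes, seconds)."""
    start = time.perf_counter()
    blob = pickle.dumps(payload)
    copy = pickle.loads(blob)
    return copy, len(blob), time.perf_counter() - start


if __name__ == "__main__":
    for size in (1_000, 100_000, 1_000_000):
        payload = list(range(size))
        _, n_bytes, seconds = pickle_round_trip(payload)
        print(f"{size:>9,} ints -> {n_bytes:>9,} bytes, {seconds * 1000:.1f} ms")
```

If that round trip takes about as long as the computation itself, a process pool is unlikely to pay off for that task size.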

Process Pool Details That Matter

Once you move from a toy benchmark to real data, two details matter a lot: batching and pickling.

Use Chunksize for Many Small Records

If you have a large list of small records, sending one record at a time to worker processes can waste a lot of time on scheduling and serialization overhead.

In that case, chunksize tells the executor to send work to each process in batches:

with ProcessPoolExecutor() as ex:
    results = list(ex.map(transform, records, chunksize=1000))

The right value depends on the size of each record and how expensive transform() is. If each item takes meaningful CPU time, a smaller chunksize can be fine. If each item is tiny, batching can make a big difference.

The goal is to avoid paying process-pool overhead for every single record.
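Here is a rough way to see the effect on your own machine. The `tiny` function and `timed_map` helper are made up for this sketch, and the per-record work is deliberately trivial so that overhead dominates. Each timing includes pool startup, so compare the runs against each other instead of reading them as absolute numbers.

```python
import time
from concurrent.futures import ProcessPoolExecutor


def tiny(record):
    """Deliberately cheap per-record work, so overhead dominates."""
    return record * record + 1


def timed_map(records, chunksize):
    """Map tiny() over records in a process pool and time the whole run."""
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=2) as ex:
        results = list(ex.map(tiny, records, chunksize=chunksize))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    records = list(range(20_000))
    for chunksize in (1, 100, 1000):
        _, seconds = timed_map(records, chunksize)
        print(f"chunksize={chunksize:<5} {seconds:.2f}s")
```

With work this small, a plain list comprehension would likely beat all three runs, which is the same lesson from the other direction.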

Use a Top-Level Function

Process pools need to serialize the function and data sent to worker processes. In practice, that means the worker function should be importable and picklable.

This is good:

def transform(record):
    ...

This is fragile or broken with process pools:

transform = lambda record: ...

Top-level named functions are easier for worker processes to import. Lambdas, nested functions, and closures often fail because they cannot be pickled cleanly.

That is also why the benchmark uses:

if __name__ == "__main__":
    ...

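That guard keeps the pool setup from running again when a worker process imports the script. Under the spawn start method (the default on Windows and macOS), each worker imports the main module as it starts up, and without the guard it would re-run the whole script.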
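Before handing a callable to a process pool, you can check whether pickle will accept it. This is a small sketch with a made-up helper, `can_pickle`. When run as a script, the top-level function should pass while the lambda and the nested function should fail.

```python
import pickle


def transform(record):
    """A top-level function: pickle can refer to it by module and name."""
    return record * 2


def make_nested():
    """Return a function defined inside another function."""
    def nested(record):
        return record * 2
    return nested


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        # The exact exception type varies by object and Python version.
        return False


if __name__ == "__main__":
    print("top-level function:", can_pickle(transform))
    print("lambda:            ", can_pickle(lambda record: record * 2))
    print("nested function:   ", can_pickle(make_nested()))
```

Pickle stores a function as a reference to its module and name, not as code, which is why anything without an importable name is a problem.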

How to Recognize This Bug in Real Code

Usually the first clue is simple: adding threads does nothing.

If you increase max_workers and the runtime barely changes, that is a strong sign the slow part is not waiting on I/O. It is busy computing.

A few practical signs:

  • One CPU core is pinned near 100%.
  • The other cores are mostly quiet.
  • Profiling shows time spent in your Python transform function, not in network or disk calls.
  • More threads add overhead, but not speed.
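If you do not have a profiler handy, comparing CPU time to wall-clock time gives a quick first read. In this sketch, `cpu_to_wall_ratio` is a made-up helper: a ratio near 1 suggests the code is busy computing, and a ratio near 0 suggests it is mostly waiting. It only counts CPU time in the current process, so it will not see work done in child processes.

```python
import time


def cpu_to_wall_ratio(fn):
    """Run fn() and return CPU seconds used divided by wall-clock seconds."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of this process, excludes sleep
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0


def busy():
    """Pure Python computation."""
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


def waiting():
    """Stands in for a network or disk wait."""
    time.sleep(0.2)


if __name__ == "__main__":
    print(f"busy:    {cpu_to_wall_ratio(busy):.2f}")
    print(f"waiting: {cpu_to_wall_ratio(waiting):.2f}")
```

Wrap your real transform in the same helper to see which side it lands on.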

If more threads do not help, stop asking “how do I parallelize this?” and start asking “what kind of work is this?”

Threads Are Still Useful

This is not an anti-thread post. Threads are great when the program spends most of its time waiting.

For example:

  • an API
  • a database
  • the filesystem
  • the network

In those cases, one thread can wait while another thread makes progress. The GIL is much less painful because blocking I/O calls typically release it while they wait.

Same executor pattern, opposite result: threads are great for waiting, but poor for pure Python computing.

That is why this version would be misleading:

def transform(chunk):
    time.sleep(1)
    return chunk

If you benchmark that with threads, it will look great. But you did not test CPU-bound work. You tested waiting.

And threads are excellent at waiting.
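To see just how good, here is the sleep-based version run through a thread pool. The sketch uses a shorter 0.1 second sleep so it finishes quickly, and the names `fake_io` and `timed_threads` are made up for the example. Eight waits run one after another would take about 0.8 seconds. With eight threads, the total should land much closer to a single wait.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(chunk):
    """Simulate waiting on I/O; sleeping releases the GIL."""
    time.sleep(0.1)
    return chunk


def timed_threads(n_tasks=8):
    """Run n_tasks simulated waits in a thread pool and time the batch."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_tasks) as ex:
        results = list(ex.map(fake_io, range(n_tasks)))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, seconds = timed_threads()
    print(f"{len(results)} waits finished in {seconds:.2f}s")
```

That is a real win, just not a CPU-bound one.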

The Rule of Thumb

When choosing between threads and processes, ask what your program is mostly doing.

If it is mostly waiting:

network, disk, database, API, sleep

Use threads or async I/O.

If it is mostly computing:

parsing, compression, image transforms, pure Python loops, feature extraction

Use processes, native extensions, vectorized libraries, or a distributed compute system.

The shortest version is:

Threads for waiting. Processes for working.

Or even shorter:

  • CPU-bound: usually use processes.
  • I/O-bound: often use threads.

One More Caveat

Some CPU-heavy Python libraries do not behave like the toy benchmark above.

Libraries like NumPy, Polars, PyTorch, TensorFlow, and many compression or image-processing libraries often push the expensive work into native code. That native code may release the GIL or use its own thread pools.

So the real question is not just “is my task CPU-heavy?”

It is:

Is my CPU-heavy work executing as Python bytecode, or inside native code that can run outside the GIL?

For pure Python loops, the GIL matters a lot. For native libraries, benchmark before assuming.
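Here is one way to run that benchmark with a standard-library example. CPython's hashlib documentation says the GIL is released while hashing larger inputs, so threads may overlap here even though the work is CPU-heavy. The buffer sizes and function names are choices made for this sketch, and the result depends on your core count and Python build.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor


def digest(buffer):
    """Hash one buffer; the heavy work happens in native code."""
    return hashlib.sha256(buffer).hexdigest()


def run_sequential(buffers):
    return [digest(b) for b in buffers]


def run_threaded(buffers):
    with ThreadPoolExecutor(max_workers=len(buffers)) as ex:
        return list(ex.map(digest, buffers))


if __name__ == "__main__":
    buffers = [bytes([i]) * 32_000_000 for i in range(4)]  # 4 buffers of 32 MB

    start = time.perf_counter()
    run_sequential(buffers)
    print(f"Sequential: {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    run_threaded(buffers)
    print(f"Threads:    {time.perf_counter() - start:.2f}s")
```

If the threaded run is clearly faster, the native code is running outside the GIL and threads are a reasonable choice for that workload.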

Python Concurrency & The GIL: A First Principles Guide

1. The Hardware Foundation: Cores and Processes

To understand code execution, imagine each CPU core is a Chef in a restaurant, and your computer’s Memory (RAM) is the Counter Space and Ingredients.

  • A Process: When you run a Python script, the Operating System creates a “Process”. This is the chef actively cooking a recipe. The chef is given a dedicated, isolated counter space (memory) that no one else can touch.
  • Multiprocessing: Modern CPUs have multiple cores (multiple chefs). If you use the multiprocessing library, you are creating brand new, separate processes.
  • The Analogy: You hire 4 chefs and give them 4 completely separate counter spaces with their own ingredients.
  • The Result: They can work at the exact same time (true parallelism).
  • The Catch: Because they don’t share counters, your memory (RAM) usage roughly multiplies by 4, and passing data between them is slow.

2. Enter Threads: Multithreading

To avoid eating up all your RAM with separate processes, computer scientists invented Threads. Threads live inside a single process.

  • The Analogy: You hire 4 chefs, but you make them all stand around the exact same counter space and share the exact same ingredients.
  • The Result: It is extremely lightweight on memory, and sharing data (like a visited set in a web crawler) is instant.
  • The Catch: Because they share the same space, they might try to grab the same ingredient at the exact same time (a Race Condition), which is why we must use Locks (making them take turns).

3. The Python Plot Twist: The GIL

This is where standard Python (CPython) acts differently than languages like Java or C++. Python has a built-in mechanism called the Global Interpreter Lock (GIL).

  • The Analogy: The GIL is a strict kitchen rule that states: No matter how many chefs are sharing this counter space, there is only ONE kitchen knife. You cannot chop ingredients unless you are holding the knife.
  • The Result: Even if you have 8 threads running on an 8-core CPU, only one thread can execute Python bytecode at any given instant. They must pass the “lock” (the knife) back and forth.

4. The Golden Rule: When to use what in Python?

Because of the GIL, you must choose your concurrency model based on the type of task:

Scenario A: CPU-Bound Tasks (Heavy Math)

  • Examples: Image processing, resizing pictures, training AI, calculating primes (when the heavy lifting happens in pure Python).
  • What happens: The chefs are constantly chopping. If you use threads, they waste time fighting over the one knife. Speed stays the same or gets worse.
  • The Solution: Use Multiprocessing. Give each chef their own kitchen and their own knife.

Scenario B: I/O-Bound Tasks (Waiting Around)

  • Examples: Web crawling, downloading files, database queries, API calls.
  • What happens: The chefs are mostly just putting a cake in the oven and waiting for it to bake. When Chef 1 puts their cake in the oven, they drop the knife. Chef 2 immediately picks it up and preps their cake.
  • The Solution: Use Multithreading. Because the threads spend most of their time waiting on the internet or a hard drive, the “one knife” rule doesn’t slow them down. It also uses far less memory than separate processes.

Final Takeaway

Adding threads does not automatically make Python CPU work run in parallel.

If your workload is pure Python computation, ThreadPoolExecutor can leave you staring at one busy core and seven disappointed expectations. ProcessPoolExecutor works better because each worker process gets its own interpreter and its own GIL.

That is the performance lesson:

  • Use threads when the program is waiting.
  • Use processes when the program is working.