Global Interpreter Lock in Python
Continuing with the theme from last week, I’d like to dive deeper into the main defining features of the Python language. Since I like to get very technical, I would love to expand upon one of its most controversial features: the Global Interpreter Lock (GIL).
In CPython the GIL is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once. The GIL prevents race conditions and ensures thread safety. In short, this mutex is necessary mainly because CPython’s internal memory management is not thread-safe.
In hindsight, the GIL is not ideal, since it prevents multithreaded CPython programs from taking full advantage of multiprocessor systems in certain situations. Luckily, many potentially blocking or long-running operations, such as I/O, image processing, and NumPy number crunching, happen outside the GIL. Therefore it is only in multithreaded programs that spend a lot of time inside the GIL, interpreting CPython bytecode, that the GIL becomes a bottleneck.
Unfortunately, since the GIL exists, other features have grown to depend on the guarantees that it enforces. This makes it hard to remove the GIL without breaking many official and unofficial Python packages and modules.
(shortened excerpt from python.org)
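To make the excerpt’s point about I/O concrete, here is a minimal sketch: four threads that block in time.sleep finish in roughly one second of wall-clock time instead of four, because CPython releases the GIL around blocking calls.

import time
from concurrent.futures import ThreadPoolExecutor

def blocking_io():
    # time.sleep releases the GIL while it waits,
    # so the other threads keep running in the meantime
    time.sleep(1)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    for _ in range(4):
        executor.submit(blocking_io)
print(f"4 blocking calls took {time.perf_counter() - start:.2f}s")  # ~1s, not ~4s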
I like to talk about this feature, since it is a great example of a “quick fix”† that stuck around for decades. It is also, arguably, one of the reasons Python is as popular as it is today. Not needing to worry (or even know) about thread safety makes it much easier to pick up and start using…
…but it also creates problems and gets in your way once you know what you are doing. Let’s get practical.
from concurrent.futures import ThreadPoolExecutor

# the number of concurrent workers we spawn
THREADS = 4

# normal global variable
counter = 0

def worker():
    global counter
    # increment the global variable 100 million times
    for _ in range(10**8):
        counter += 1

# start the requested number of workers and wait for them to finish
with ThreadPoolExecutor(max_workers=THREADS) as executor:
    for _ in range(THREADS):
        executor.submit(worker)
    executor.shutdown(wait=True)

print(counter)
Here is a simple Python program that increments a global variable from 4 different worker threads concurrently - without any explicit synchronization! If you grew up with C, you wouldn’t expect to get a correct answer from this mess, right?
Well, Python manages it just fine - but it takes a while:
(venv3.13) ➜ time python gil_demo.py
400000000
python3 gil_demo.py 7.37s user 0.05s system 100% cpu 7.442 total
Let’s unpack what we see here:
- We got the correct answer, so there has to be some locking going on in the background (we already know what that is)
- The program took 7.4s to run on my machine and used 100% CPU
- When the program was running, it created 4 threads, each hovering at around 25% CPU usage - we are being held back and unable to use the full potential of our hardware (the snippet below shows the scheduling behind this)
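Those per-thread numbers follow from how CPython passes the GIL around: by default a thread may hold it for about 5 ms before it is asked to yield. You can inspect and tune that switch interval from Python itself - a minimal sketch:

import sys

# how long a thread may hold the GIL before CPython asks it to yield
print(sys.getswitchinterval())  # 0.005 seconds by default

# shortening it makes switching more responsive, at the cost of extra overhead
sys.setswitchinterval(0.001)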
What do you expect will happen when we simply start incrementing a local variable inside each worker instead of a global one?
def worker():
    counter = 0
    for _ in range(10**8):
        counter += 1
The answer might surprise you:
(venv3.13) ➜ time python gil_demo.py
0
python3 gil_demo.py 5.37s user 0.04s system 99% cpu 5.426 total
Now we get 0, since we no longer touch the global variable, and the program took only 5.4 seconds - but that’s not 4 times faster, is it? That’s because the GIL locks the whole interpreter, not individual objects: threads share memory regardless of whether they actually use it to communicate with each other, so only one thread can execute bytecode at a time.
As I mentioned last week, there is an experimental free-threaded mode in Python 3.13. You can install this interpreter in a virtual environment using uv venv --python cpython-3.13.0+freethreaded-macos-aarch64-none venv3.13t.
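Before benchmarking anything, it is worth checking which build you actually got. CPython 3.13 exposes a private helper for this - note that it is provisional and may change in later releases:

import sys

# private, provisional API in CPython 3.13:
# True on regular builds, False when the free-threaded
# interpreter runs with the GIL actually disabled
print(sys._is_gil_enabled())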
As you might expect, running the same code (incrementing a local variable) will yield
(venv3.13t) ➜ time python gil_demo.py
0
python3 gil_demo.py 6.72s user 0.02s system 382% cpu 1.766 total
our expected 4x speed increase and much higher CPU usage. The program took only 1.7 seconds to execute because it is no longer governed by the GIL - but that also means that if you run the original code with the global variable, the unsynchronized increments will race and the final count will come out wrong. You could try adding a lock to the original code, but the locking overhead will dominate the cost of the increment itself, so everything will take much, much longer.
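For completeness, here is a sketch of that lock-based variant - shown only to illustrate the point, not as something you would want to run in a loop this tight:

import threading
from concurrent.futures import ThreadPoolExecutor

THREADS = 4
counter = 0
counter_lock = threading.Lock()

def worker():
    global counter
    for _ in range(10**8):
        # correct even without the GIL, but acquiring and releasing
        # the lock now dominates the cost of the increment itself
        with counter_lock:
            counter += 1

with ThreadPoolExecutor(max_workers=THREADS) as executor:
    for _ in range(THREADS):
        executor.submit(worker)

print(counter)  # 400000000, but only after a very long wait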
This was just a small example for demonstration purposes, but I hope it highlights how this feature works and why it is important. Python is slowly entering a transition period, but it is hard to say whether GIL-less operation will ever become the default. For now, if you need true parallelism for CPU-bound tasks, look into the multiprocessing library.
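As a rough sketch of what that could look like for our toy example - processes do not share memory, so each worker counts locally and returns its result (ProcessPoolExecutor wraps multiprocessing behind the same executor API we used above):

from concurrent.futures import ProcessPoolExecutor

WORKERS = 4

def worker() -> int:
    # each process has its own interpreter and its own GIL,
    # so these loops genuinely run in parallel
    counter = 0
    for _ in range(10**8):
        counter += 1
    return counter

if __name__ == "__main__":  # required on platforms that spawn worker processes
    with ProcessPoolExecutor(max_workers=WORKERS) as executor:
        futures = [executor.submit(worker) for _ in range(WORKERS)]
        print(sum(f.result() for f in futures))  # 400000000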
Most of the code we write day-to-day is shielded from memory management, which is usually abstracted away or buried inside pre-compiled binaries. However, I have personally run into GIL-related limitations multiple times when using heavy, Python-native libraries and custom number crunching. So if you ever run into a Python program that is slow “because it is just Python”, look into the way it handles threads and examine whether any heavy number crunching could be moved into a separate process.
†I am simplifying a lot here; the GIL solves multiple very difficult CS problems in Python, and you can read more about the requirements here. Also, Python was originally written when multi-processor computing wasn’t a thing yet.