Published on 2025-06-10
Discussions: /r/golang.
At work, a good colleague of mine opened a PR titled: 'fix data race'. Ok, I thought, let's see. They probably forgot a mutex or an atomic. Or perhaps they returned a pointer to an object when they intended to return a copy of the object. Easy to miss.
Then I was completely puzzled.
The diff for the fix was very small, just a few lines, and contained neither mutexes, nor pointers, nor goroutines, nor any concurrency construct for that matter. How could it be?
I have managed to reproduce the race in a self-contained Go program resembling the real production code, and the diff for the fix is basically the same as the real one. It's a web server with a middleware to do rate limiting. The actual rate limiting code is omitted because it does not matter:
package main

import (
    "fmt"
    "net/http"
    "strings"
)

func NewMiddleware(handler http.Handler, rateLimitEnabled bool) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if strings.HasPrefix(r.URL.Path, "/admin") {
            rateLimitEnabled = false
        }

        if rateLimitEnabled {
            fmt.Printf("path=%s rate_limit_enabled=yes\n", r.URL.Path)
            // Rate limiting logic here ...
        } else {
            fmt.Printf("path=%s rate_limit_enabled=no\n", r.URL.Path)
        }

        handler.ServeHTTP(w, r)
    })
}

func handle(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("hello!\n"))
}

func main() {
    handler := http.HandlerFunc(handle)
    middleware := NewMiddleware(handler, true)
    http.Handle("/", middleware)
    http.ListenAndServe(":3001", nil)
}
It's very typical Go code, I would say. The only interesting thing going on here is that we never do rate limiting for the admin section of the site. That's handy if a user is abusing the site and the admin has to go disable their account as fast as possible in the admin section.
The intent behind the rateLimitEnabled parameter was likely to have it 'off' in development mode and 'on' in production, based on some environment variable read in main (also omitted here).
Can you spot the data race? Feel free to pause for a second. I stared at code very similar to this for a good 10 minutes before I could even begin to form hypotheses about the data race, despite knowing from the get-go that there was one and having the fix in front of me.
Let's observe the behavior with a few HTTP requests:
$ curl http://localhost:3001/
$ curl http://localhost:3001/admin
$ curl http://localhost:3001/
$ curl http://localhost:3001/
We see these server logs:
path=/ rate_limit_enabled=yes
path=/admin rate_limit_enabled=no
path=/ rate_limit_enabled=no
path=/ rate_limit_enabled=no
The actual output could vary from machine to machine due to the data race; this is what I observed on my machine. Reading the Go memory model, another legal behavior in this case would be for the program to abort immediately. No symptoms at all is not possible: the only question is when the race will manifest. Under lots of concurrent HTTP requests, it might not happen right after the first request. See the 'Conclusion and recommendations' section for more information.
The third and fourth log are definitely wrong. We would have expected:
path=/ rate_limit_enabled=yes
path=/admin rate_limit_enabled=no
path=/ rate_limit_enabled=yes
path=/ rate_limit_enabled=yes
The non-admin section of the site should be rate limited, always. But it's apparently not, starting from the second request. Trying to access the admin section disables rate limiting for everyone, until the next server restart! So this data race just became a security vulnerability as well!
In the real code at work it was actually not a security issue, since the parameter in question governed something related to metrics about rate limiting, not rate limiting itself, so the consequence was only that some metrics were wrong. Which is how the bug was initially spotted.
The diff for the fix looks like this:
diff --git a/http-race.go b/http-race.go
index deff273..6c73b7e 100644
--- a/http-race.go
+++ b/http-race.go
@@ -6,8 +6,10 @@ import (
"strings"
)
-func NewMiddleware(handler http.Handler, rateLimitEnabled bool) http.Handler {
+func NewMiddleware(handler http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+ rateLimitEnabled := true
+
if strings.HasPrefix(r.URL.Path, "/admin") {
rateLimitEnabled = false
}
@@ -29,7 +31,7 @@ func handle(w http.ResponseWriter, r *http.Request) {
func main() {
handler := http.HandlerFunc(handle)
- middleware := NewMiddleware(handler, true)
+ middleware := NewMiddleware(handler)
http.Handle("/", middleware)
http.ListenAndServe(":3001", nil)
Just transforming a function argument into a local variable. No other change. How can it be?
We can confirm that the behavior is now correct:
$ curl http://localhost:3001/
$ curl http://localhost:3001/admin
$ curl http://localhost:3001/
$ curl http://localhost:3001/
Server logs:
path=/ rate_limit_enabled=yes
path=/admin rate_limit_enabled=no
path=/ rate_limit_enabled=yes
path=/ rate_limit_enabled=yes
As expected.
Go, like many languages, has closures: functions that implicitly capture their environment as needed so that surrounding variables can be accessed within the function scope.
The NewMiddleware function returns a closure (the middleware) that implicitly captures the rateLimitEnabled variable. The intent in the original code was to treat the function argument, which is passed by value (as in most programming languages), as essentially an automatic local variable: when mutated, nothing would be visible outside the scope of the function, as if we had just used a local variable.
But... it's neither a plain local variable nor a plain function argument: it's a variable that exists outside of the closure and is captured by it. So when the closure mutates it, the mutation is visible to the outside. We can confirm our hypothesis by toggling a knob to ask the compiler to log this:
$ go build -gcflags='-d closure=1' http-race.go
./http-race.go:9:26: heap closure, captured vars = [rateLimitEnabled handler]
[...]
We can plainly see that rateLimitEnabled is being captured.
The Go compiler conceptually transforms the closure into code like this:
type ClosureEnv struct {
    rateLimitEnabled *bool
    handler          *http.Handler
}

func rateLimitMiddleware(w http.ResponseWriter, r *http.Request, env *ClosureEnv) {
    if strings.HasPrefix(r.URL.Path, "/admin") {
        *env.rateLimitEnabled = false
    }

    if *env.rateLimitEnabled {
        fmt.Printf("path=%s rate_limit_enabled=yes\n", r.URL.Path)
        // Rate limiting logic here ...
    } else {
        fmt.Printf("path=%s rate_limit_enabled=no\n", r.URL.Path)
    }

    (*env.handler).ServeHTTP(w, r)
}
Note that ClosureEnv holds a pointer to rateLimitEnabled. If it were not a pointer, the closure could not modify the outer value. That's why closures capturing their environment lead, in the general case, to heap allocations: the captured variables must live at least as long as the closure.
Ok, it's a logic bug. Not yet a data race, right? (We could debate whether all data races are also logic bugs. But let's move on).
Well, when is this closure called? When handling every single incoming HTTP request, concurrently.
It's as if we spawned many goroutines which all called this function concurrently, and said function reads and writes an outer variable without any synchronization. So it is indeed a data race.
We can confirm with the compiler that no variable is captured by accident now that the patch is applied:
$ go build -gcflags='-d closure=1' http-race.go
./http-race.go:9:26: heap closure, captured vars = [handler]
This was quite a subtle data race that took me time to spot and understand. The Go race detector did not notice it, even when I threw lots of concurrent requests at the server. LLMs, when asked to analyze the code, did not spot it either.
The good news is: the Go memory model gives us some guarantees even for programs with data races. Contrary to C or C++, where a data race means the wild west of undefined behavior, in Go it only means that we may read/write the wrong value, when reading/writing values of machine-word size or smaller (which is the case for booleans). We even find quite a strong guarantee in the 'Implementation Restrictions for Programs Containing Data Races' section:
each read must observe a value written by a preceding or concurrent write.
However, the official documentation also warns us that reading more than one machine word at once may have dire consequences:
This means that races on multiword data structures can lead to inconsistent values not corresponding to a single write. When the values depend on the consistency of internal (pointer, length) or (pointer, type) pairs, as can be the case for interface values, maps, slices, and strings in most Go implementations, such races can in turn lead to arbitrary memory corruption.
So how can we prevent this kind of data race from occurring? How can we adapt our code style, if tools do not spot it? You may have been bitten in the past by logic bugs with Go closures, when the wrong variable is captured by accident. The typical recommendation in these cases is: do not capture; pass the variables you need inside the closure as function arguments to it (i.e., pass explicitly by value instead of implicitly by reference). That's probably a good idea, but it's also so easy to forget to do.
The root cause is that the environment is captured implicitly. Back when I was writing C++, I actually liked the lambda syntax because it started with a capture list. Every capture was explicit! It was slightly verbose, but as we have seen, it serves to avoid real production bugs. For example:
int a = 1, b = 2, c = 3;
auto fn = [a, b, c]() { return a + b + c; };
int res = fn();
After writing quite a lot of code in C, Zig and Odin, which all do not have support for closures, I actually do not miss them. I even think it might have been a mistake to have them in most languages. Every single time I have to deal with code with closures, it is always harder to understand and debug than code without them. It can even lead to performance issues due to hidden memory allocations, and makes the compiler quite a bit more complex. The code that gives me the most headaches is functions returning functions returning functions... And some of these functions in the chain capture their environment implicitly... Urgh.
So here's my personal, very subjective recommendation when writing code in any language, including Go: avoid callback-based APIs when you can, and prefer APIs like io_uring, or polling functions like poll/epoll/kqueue. In other words, let the caller pull data, instead of pushing data to them (by calling their callback with the new data at some undetermined future point).

And here's my personal, very subjective recommendation for future programming language designers: if your language has closures, make every capture explicit, because tooling alone does not catch these bugs (golangci-lint finds the bug neither in our reproducer nor in the real production service).

Here is a reproducer program with the same issue, but this time the Go race detector finds the data race:
package main

import (
    "fmt"
    "sync"
    "time"
)

func NewMiddleware(rateLimitEnabled bool) func() {
    return func() {
        if time.Now().Nanosecond()%2 == 0 {
            rateLimitEnabled = false
        }

        if rateLimitEnabled {
            fmt.Printf("rate_limit_enabled=yes\n")
            // Rate limiting logic here ...
        } else {
            fmt.Printf("rate_limit_enabled=no\n")
        }
    }
}

func main() {
    middleware := NewMiddleware(true)

    count := 100
    wg := sync.WaitGroup{}
    wg.Add(count)

    for range count {
        go func() {
            middleware()
            wg.Done()
        }()
    }
    wg.Wait()
}
Here's the output:
$ go run -race race_reproducer.go
rate_limit_enabled=no
==================
WARNING: DATA RACE
Read at 0x00c00001216f by goroutine 8:
main.main.NewMiddleware.func2()
/home/pg/scratch/http-race/reproducer/race_reproducer.go:15 +0x52
main.main.func1()
/home/pg/scratch/http-race/reproducer/race_reproducer.go:32 +0x33
Previous write at 0x00c00001216f by goroutine 7:
main.main.NewMiddleware.func2()
/home/pg/scratch/http-race/reproducer/race_reproducer.go:12 +0x45
main.main.func1()
/home/pg/scratch/http-race/reproducer/race_reproducer.go:32 +0x33
Goroutine 8 (running) created at:
main.main()
/home/pg/scratch/http-race/reproducer/race_reproducer.go:31 +0xe4
Goroutine 7 (running) created at:
main.main()
/home/pg/scratch/http-race/reproducer/race_reproducer.go:31 +0xe4
==================
It's not completely clear to me why the Go race detector finds the race in this program but not in the original one: since each HTTP request is handled in its own goroutine, both programs should be analogous. Maybe there was not enough concurrent HTTP traffic?
For completeness, here is the full fixed program:

package main

import (
    "fmt"
    "net/http"
    "strings"
)

func NewMiddleware(handler http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        rateLimitEnabled := true

        if strings.HasPrefix(r.URL.Path, "/admin") {
            rateLimitEnabled = false
        }

        if rateLimitEnabled {
            fmt.Printf("path=%s rate_limit_enabled=yes\n", r.URL.Path)
            // Rate limiting logic here ...
        } else {
            fmt.Printf("path=%s rate_limit_enabled=no\n", r.URL.Path)
        }

        handler.ServeHTTP(w, r)
    })
}

func handle(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("hello!\n"))
}

func main() {
    handler := http.HandlerFunc(handle)
    middleware := NewMiddleware(handler)
    http.Handle("/", middleware)
    http.ListenAndServe(":3001", nil)
}
This blog is open-source! If you find a problem, please open a GitHub issue. The content of this blog as well as the code snippets are under the BSD-3 License, which I also usually use for all my personal projects. It's basically free for every use, but you have to mention me as the original author.