Fixing Java Thread & Database Connection Leaks For Fun & Profit
My day career is software engineering. Sometimes folks ask me to talk about some interesting problem or challenge I once worked on, and share how I approached it. I've done this orally, improvised, a few times. But I've learned to prefer to do things "only once" where possible, and then cache and re-use the results. And to bias to writing when I can, too. Therefore in that spirit I'm going to write down here, once and for all, my retelling of a few engineering challenges I overcame. Along with any lessons learned.
I'll cite 2 distinct issues (ie. bugs) I've encountered in the past, both in the same large legacy Java e-commerce system, though at diff times and in different parts of it. Though they involved Java code the lessons carry over perfectly to other languages (like my current main "reach for first" lang and geeky passion: Golang.) And indeed they are relevant to all software engineering roles, in my opinion.
I had been studying production logs with the goal of eliminating all our company's remaining known (and addressable!) bugs, one particular month, and, as part of that sprint, I recognized the team had grown to accept as "normal" a certain pattern of older stack traces related to threads involved in our site's purchase flow. It was in the part of our site UX where users booked tickets and it generated most of our biz income. I also determined we had a slow leak in prod, but that it didn't appear to happen in dev, or in our release candidate tests. I knew there was a chance these two distinct observations *might* be causally associated therefore I dug deeper.
The leaks were of object graphs. Imagine one object -- or a small set of structures -- sitting at the "top" of a reference graph, itself being proximally responsible for causing the entire *rest* of the memory in that interconnected graph to leak. From the names of object types in the graph it looked like it had to do with purchases and especially with the dedicated background threads which did "back office fulfillment & follow-up" tasks for them. Studying the related traces they all involved an exception thrown after an (in-house, app) Thread subclass was instantiated but *before* we start()-ed it. And the leaking graphs were all rooted on these same exact Thread subclassess. Something in common: therefore, I had made possible progress up the "chain of events" toward root causation, it seemed!
Essentially we were leaking Thread instances that were never started, and these leaked a small chunk of objects each time. Why?
To drill-in I wrote some bare minimum reproducible test cases which were work-a-likes to the actual buggy code. Ran them and studied via debugger and profilers. What happened was that in Sun Microsystem's orig impl of their own source code for java.lang.Thread *if* you instantiated a subclass but never started it... it would leak!!!
Under the hood, it would get added to a default ThreadGroup and then, in that case, was never removed from it. So the JVM's own GC algorithm treated it as "alive" and reachable and thus NOT eligible to be reclaimed. It was *counter-intuitive* -- heck, for any other object you released all refs leading to it that object *would* definitely, eventually, get reclaimed -- therefore it bit us and many other Java devs. Sun realized their mistake and, eventually, fixed this quirk in later versions. In the meantime I was charged with fixing it for our own app, and via temporary workarounds if needed.
My strategy was three-fold:
(1) I patched the entire codebase to do more "bulletproof" exception handling around the creation and starting of any app Thread subclasses -- result was the leak WAS eliminated in prod, and there was a somewhat happier percentage of endusers doing bookings, folks who were no longer being bitten by dropped tasks needed to fulfill their bookings.
(2) I documented the quirk for the team and wrote docs for best practices going forward.
(3) It also became clear it was unwise to let most devs do "ad hoc" launching of threads -- wiser to instead have thread pools of fixed size, and to put them behind interfaces that allowed an ordinary app dev to only pass in a requested task. They themselves should not be starting any threads directly, or managing their lifecycle.
That left one nagging question to follow-up with, if possible. To wit: *why* did it not reproduce in dev?
Reason: there were *certain* edge cases which *only* happened in prod. I identified what was diff about these buggy booking attempt cases, so I added testing to exercise them too.
The last takeaway was the biggest: the biz had NOT been doing load testing of the booking flow because they were afraid each test run would cause false bookings of real travel inventory (eg. real hotel room bookings) and cause higher costs and complaints downstream -- reasonable. But I pointed out the trade-off: that we would be asking for these kind of sneaky evil glitches to happen again, repeatedly, *unless* we tested booking *thoroughly.* Unless we added to our requirements the ability to *fully* test, end-to-end, our integrations with external providers of travel inventory (even if that meant adding special indicia flags or test-only credit cards etc.)
Database team came to me and said there was an old legacy quirk -- a little mystery! -- with database connections they wanted me to investigate. It was an old issue and seemed harmless & stable, originally, but over the last month had gotten worse -- it wasnt hurting prod yet but they preferred to solve it before it got worse. I agreed!
What they saw from their side was an issue with apparently unbounded growth of the number of database connections. Originating from our Java app server tier to their beefy Oracle server cluster for site-wide OLTP. As you might imagine: a subsystem critical to the biz! And "unbounded growth" suggested a resource leak (on both sides) which might eventually impair the servers.
It was observed to happen to all the Oracle servers. But only a subset of the app server JVM processess, not all. And only seen in prod -- never in the QA or staging envs, and nobody knew how to reproduce it in dev even under artificial "hacked-up" conditions. It was Heisenbug-ish!
Though all these points of evidence were themselves helpful because they were clues to shape the solution space. At this early of a stage in an investigation you appreciate *any* kind of “shrinking constraint” effect you can on the "volume" in which the culprit might be hiding.
Oh and the growth happened in batches, reliably, of once per day. Precisely. I knew that fact could help me home-in, too, on a proximate upstream cause.
But first I studied logs -- nothing jumped out, other than we did have some periodic housekeeping tasks that happened in each JVM about once a day: I took note for a closer look.
Second I felt a deep dive into our code which literally handled the database connections was warranted. Either to find (hopefully!) the "smoking gun" root cause, or, merely to rule out more -- but either way to try building a more crystal-clear mental model in my head of ALL the *relevant* parts of the problem/solution space.
We used connection pool code so that our app could re-use Java database connections across different SQL queries. For the sake of efficiency: of *not* having to redo the conn estab and auth/auth steps, as well as to benefit potentially from any caching or cursor optimizations under the hood. We used a third-party FOSS lib for it. We linked to a specific version-numbered jar of the compiled binary, but also kept in-house a copy of its corresponding source jar -- it immediately proved wise.
I studied its source.
I quickly noted it had an API which let apps call it to indicate they wished it to do "housekeeping" -- to close/release all idle connections and then once this step was complete to rebuild the pool with a freshly initiated set. Goal being as precaution to counter-act leaks which could theoretically happen on either side of the connection. A catch-all "shotgun" solution, likely, to prior problems saw in the wild.
Anyway it caught my eye. Why? Because in this pool code there were only 2 times ever a db connection was built. At pool init time (during our app init, when our JVMs procs started). Or... when this housekeeping API was called. So if anything *were* to cause them to leak it would be either one or both of these code paths -- that is it! I was now much closer to a solution to the mystery but still had a ways to go...
Next I found out our app code did call the housekeeping code. (By the way, here is where my memory of the exact event sequence, at runtime, in prod, is fuzzy, but its precise details are irrelevant for this retelling.) Anyway... I confirmed we did call this housekeeper *once* per day. And we did it with the justification of preventing leaks.
However... in dev, during testing, we normally never let JVM procs run for more than a day. I changed that quickly -- letting test specimens run for days to weeks, and even explicitly inducing calls to the housekeeper. No dice. They never leaked! Not under those conditions. However since I confirmed the once-daily association that was still a useful clue for explaining prod's behavior.
Fault lay elsewhere?
I next focused on the pool's initialization code path. Called only once per JVM process, during our app's startup.
By this point in my career as a Java programmer and also as a "threading and concurrency" geek I had lots of prior experience with Java's threading lifecycle and concurrent access gotchas and best practices. And with phenomenon like deadlocks and so-called "data race" conditions.
Cut to chase: while studying the pool's initialization code path I realized I made a discovery... I spotted a "race" in it!!!
This was a HUGE finding. Why?
Because portions of our legacy app codebase relied on the "lazy init" (anti-)pattern it meant that at runtime the exact relative order of calls may not be deterministic. In other words what might be "init event A then B" in dev might be "init event B then A" in prod. This pattern was *sometimes* unavoidable and so we had other mechanisms to compensate, when a stable predictable order was required. Also, the JVM startup phase of our app lifecycle was an unusual period of time. How so? In many configurations the identity of the thread who first called "init()" of a particular subsyatem (like logging, or our database conn pool layer) was stable. In other JVM configurations it was not. Also we had edge cases where sometimes during the app startup phase our app code allowed 2+ threads to descend into the pool's init API. For example, one call descent due to our "normal path" call to it -- which was a case of "eager" init. However, in some configurations it was possible for a non-typical thread to trigger a "lazy init" call down into the pool's init code.
Now add the race to the mix. And the pool's init API was not thread-safe! Rather, they relied on trusting the app-side caller to only call pool init from one thread, concurrently. If 2+ threads decended into it, in an overlapping way, bad things may happen.
The one bad thing which *did* happen was this: when 2+ threads raced through the pool's init meant that the pool state could go into "split brain" mode. Instead of having only 1 common pool of db conns -- which would have been bugfree and "by design" -- it would have 2 pools!
Basically two distinct collections: think of each as a set, or list. I'll refer to them as the normal pool and the zombie pool. The Normal pool held connections which would be used, repeatedly, by the app for SQL queries. The Zombie pool would never be drawn from to do queries. But every connection, in both pools, would be kept alive (via regular no-op heartbeats along them, as needed, to prevent teardown.)
Now add the pool's housekeeper API:
* IF the init "race" happened
* AND the split brain pools happened
* AND housekeeper was asked to run
* THEN a Bad Thing would happen!
The Bad Thing was that the housekeeper would create new connections and put refs to them, in effect, in *both* the Normal pool and the Zombie pool. But conns only in the Zombie pool would NOT be closed, or removed from that pool, ever. Therefore if the pool's target size was 20 connections then each time housekeeper ran it would grow the total number of connections by 20. So 20 would become 40, would become 60, and so on. Without bound. And this happened once per day.
Assume our Java app tier had about 1 JVM process per server. With a max heap size tuned to fit host's physical memory, and about 60 to 120 threads (because host had say 16 cores and many tasks were IO-bound), and a default minimum pool size of 20 db conns to satisfy their task needs. And say about 300 JVM processes total, in prod app tier. So a miminum of 6000 connections to the database tier.
Here's where it gets more complex. I'm going to wave my hands and leave out some detail but the takeaway is that our Oracle database tier expected a normal baseline of around 6000 active connections at once, ensured by these app-side pools. And expected them to get re-estab daily, to help mitigate leaks. Instead our Java app tier was itself *causing* a leak, and of about an additional 6000 connections per day! Luckily, the connections which were (only) in the Zombie pool were NOT being handed out for use by SQL queries. So they were kept mostly idle. However, they did consume resources on both endpoint servers. Left to run long enough it could starve legit connections of memory or CPU. Though both Linux and Oracle were optimized well (eg. the VM subsystem, paging, copy-on-write logic, kqueue-style techniques, etc) there would still be a marginal cost imposed (per conn) and therefore a potential risk of a crash, somewhere.
Jump to the end: though I could *never* reproduce the problem outside of prod I had "seen enough" to be confident I knew -- beyond all reasonable doubt -- all the guilty elements involved. And therefore also the fix!
The bare minimum fix (and therefore the least risky one, all other things being equal) would be for me to patch the source, in-house, of the connection pool lib to eliminate its race vulnerability. It took a change of only a few lines! Then I built a binary jar from it, as a drop-in replacement, and retested our app as if it were a release candidate. All looked good! Gave a briefing to my boss, dev team and the database and biz/product folks. We all were comfortable I did likely have the culprit identified and agreed too that my fix would be safe. Then deployed it to prod like any other release. All... was good.
The connection leak was gone! Days and weeks went by and it never came back...
It was the first time I had encountered one of the infamous Heisenbugs in my day job, in a prod system. And involving an utterly mission critical system -- our OLTP databases. And... I fixed it! Cleanly. All based on nothing more than "a theory" after having done the careful (almost pedantic) analytical work to build up a solid mental model in my head of all the relevant mechanisms.
In the end, honestly, we did take a *little* bit of a risk in deploying the fix candidate we did. However, we knew there was risk either way: the default course -- if we did nothing at all -- would have led to a crash! Or we would have had to swap out our connection pool lib entirely for something new, and all the "unknown unknowns" implied there. (ie. swapped with one which turned out even worse.)
The gamble paid off, yes, but it was based on sound logic and engineering practice, in my opinion.
I may revise this piece in the future to add my thoughts on my own longterm takeaways and the changes to our team's dev practices afterward, just as I did with Issue One.
However instead I'll end it with a small but possibly funny note...
Less than 6 months later our business merged with a major competitor, and ultimately *both* of our legacy codebases were abandoned and rewritten from scratch. Including *all* our improvements to our Java threading and database connection lifecycle. I only *hope* we started with a better foundation for the future. And that would have been mainly due to the combined engineering team's well-earned wisdom over how to manage complexity and speed-to-market. And their inevitable trade-offs both with outage risks and source code quality.
1. "It works on my box!"
2. "Uh, prod is now on fire."
3. goto 1? // TODO fixme