This is a really quick post to document a blindingly simple technique that far too many people are unaware of, or are possibly even unjustifiably afraid of. I want to demonstrate how, with a few sleepy braincells and a handful of standard tools running on Debian or Ubuntu Linux, you can answer pretty much any question about a running system without resorting to Stack Overflow or similar.
The process below should have taken a few minutes but due to writing it up it will probably take me closer to half an hour. Practice makes um.. less imperfect!
Our chosen weapons:
- apt-get
- cscope
- Silver Searcher (ag) or grep
- mkdir!
- A passing understanding of C (note: expertise is optional!)
A Test Case
Tonight I arrived home still pondering why I had to write a workaround this morning, due to OpenLDAP’s slapd hanging up my client connection when I sent it more than 1000 in-flight search requests without starting to read responses.
I had only a syslog message, “deferring operation: pending operations”, and a mailing list post from the project’s chief architect as my guide. While I am content that the workaround I implemented is fairly robust (it simply limits the number of in-flight requests), I would sleep much better at night knowing exactly why my workaround erm.. works at all in the first place. Was 1000 a magic number, or was it simply a coincidental trick of default network buffer sizes I know nothing about?
It works, but I worry, because the text of the mailing list post suggests that OpenLDAP may be hanging up on us based on the _size_ of the records queued to be returned to us. That sets me on edge: in future my code might send some unknown query pattern that defeats my simple heuristic of limiting the _count_ of records queued for return (to 500), and end users, or worse, my boss, might be served HTTP 500s by the application I am being paid to build. I really don’t like that.
So without further ado, let’s jump straight in.
Praise Be Debian
Debian’s packaging tools can be a pain in the ass at times, but one lesser-loved feature that we’ll need is built right in, and can be run on pretty much any Debian/Ubuntu box as a regular user (as long as deb-src lines are enabled in the APT sources): that feature is apt-get source, a one-step command for fetching the source code for pretty much any component you’re ever likely to come into contact with.
If you ever see the old bearded UNIX nerd in the corner of a Windows-only team beavering away in a PuTTY session to a Linux VM running on their machine, this is probably one of the reasons why.
So let’s go ahead and grab the sources for the relevant OpenLDAP daemon producing the syslog entry.
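The fetch step looks roughly like the sketch below. The package name "slapd" and the scratch directory are my assumptions, not something the syslog line tells us; apt-get source accepts a binary package name and resolves it to the matching source package.

```shell
# Scratch directory for the unpacked tree (the location is an arbitrary choice).
workdir="${TMPDIR:-/tmp}/openldap-src"
mkdir -p "$workdir"
cd "$workdir"

# Assumption: the daemon ships in the binary package "slapd". apt-get maps
# that name to its source package and unpacks it here. No root needed.
apt-get source slapd || echo "fetch failed: check deb-src lines and network" >&2

# Whatever was unpacked shows up as a subdirectory.
ls
```

Working in a throwaway directory keeps the tree easy to delete once the question is answered.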
Use The Source, Luke
Now what? Well, time to try a search for that syslog string. Silver Searcher and GNU grep both support a -w option, meaning “word boundaries”. It prevents a partial match, and proves convenient any time you’re searching for plain English in a large project.
The only advantage Silver Searcher has over grep for this task is slightly prettier output by default.
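Here is a small sketch of what -w buys us. It uses a throwaway stand-in file with made-up contents (not OpenLDAP source) so it can be tried anywhere. Note that it searches for a fragment of the syslog line rather than the whole thing, since log lines are often assembled from a format string plus arguments and the full text may not appear verbatim in the code.

```shell
# Throwaway directory with a stand-in file (hypothetical content).
demo=$(mktemp -d)
printf '%s\n' \
  'log("deferring operation: %s", why);' \
  'log("predeferring operational attrs");' > "$demo/sample.c"

# -r: recurse, -n: show line numbers, -w: match whole words only.
# Only the first line matches; the second contains the text mid-word.
matches=$(grep -rnw 'deferring operation' "$demo")
echo "$matches"

# In the real source tree the equivalent would be one of:
#   grep -rnw 'deferring operation' .
#   ag -w 'deferring operation'
```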
Boom
Now we know where the message comes from. Let’s pop into that file with vim..
vim servers/slapd/connection.c +1711
Scrolling up a bit, we see the function we’ve landed in is named connection_input(). We don’t NEED to read the entire function, it’s enough to guess from the name that this is the central point where OpenLDAP is processing an already-decoded message.
Scrolling back closer to our original starting point (line 1711), we see a really helpful clue:
This code is pretty beautiful, and this is actually a really easy case to follow. I could diverge from the point of this post and say this function is WHY short one-word variable names improve comprehension massively, but let’s not bother with that holy war..
In the excerpt we see some complex logic that is stashing away a string into the ‘defer’ variable. Again, we do not need to care about that logic; it’s enough to notice that the subsequent if(){} branches based on whether or not ‘defer’ has a value.
Scrolling into the guts of this block, we find the money shot:
The function returns an error code (-1) if the number of pending operations exceeds some maximum (“max”), and the maximum is being calculated dynamically based on.. I have no idea. So let’s find out!
To The cscope, Robin!
Armed with a couple of weird-sounding variable names contributing to that max variable, we’d like to figure out what “conn->c_dn.bv_len” means, and where this “slap_conn_max_pending” variable gets its value.
Before jumping into cscope to verify, “conn->c_dn” sounds a lot like “the distinguished name associated with the connection”, which in LDAP speak means “the authenticated user”.
So I’m guessing the maximum number of pending operations varies depending on whether we have authenticated or not. Let’s save some time and assume this guess is accurate (I’m pretty sure it is, the code is clear enough), and focus on that “slap_conn_max_pending” variable.
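To pin down what I think the code is doing, here is a toy model of that guess. It is a sketch of my reading, not OpenLDAP code: the function name, the variable names and both limit values are made up for illustration.

```shell
# Hypothetical stand-ins for the two limits; the values are invented.
max_pending=10        # limit for anonymous connections
max_pending_auth=50   # limit for authenticated (bound) connections

# queue_request BOUND_DN PENDING
# Prints -1 (hang up) once the pending count exceeds the limit, else 0.
queue_request() {
    bound_dn=$1
    pending=$2
    # A non-empty DN means the client has authenticated.
    if [ -n "$bound_dn" ]; then max=$max_pending_auth; else max=$max_pending; fi
    if [ "$pending" -gt "$max" ]; then echo -1; else echo 0; fi
}

queue_request "" 11                          # anonymous, over the limit
queue_request "cn=app,dc=example,dc=com" 11  # authenticated, still fine
```

If the guess holds, the same number of queued requests can be fatal for an anonymous connection and harmless for an authenticated one.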
cscope -Rks .
[Use down arrow key to select ‘find this global definition’]
Resulting in..
Ugh, not much juice. Exit out of vim (:q!) and search again for SLAP_CONN_MAX_PENDING_DEFAULT.
Much better! That suggests ((1<<18)-1) (or 262,143) is the default maximum number of pending operations. That’s way higher than what I’m seeing (hangups after 1000 or so ops), so let’s keep digging to see what else changes that variable.
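If bit shifts are not your thing, the shell can check the arithmetic. Shell arithmetic uses the same shift operator as C, so the expression carries over unchanged.

```shell
# 1 shifted left by 18 bits is 2^18 = 262144; subtracting 1 gives the value.
limit=$(( (1 << 18) - 1 ))
echo "$limit"   # prints 262143
```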
This time instead of “find this global definition”, use “find this C symbol” on “slap_conn_max_pending”.
Aha! It comes from the config file! Or in modern OpenLDAP, the ‘config schema’. Our LDAP server is too old to be using config schema, but now we have enough information to figure out what is going on.
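The two cscope queries used above can also be run without the interactive menu, which is handy for scripting or pasting into a chat. The sketch below runs them against a tiny stand-in tree with made-up names (not OpenLDAP code); in the real tree you would skip the setup and query slap_conn_max_pending instead.

```shell
# Stand-in source tree (hypothetical names).
demo=$(mktemp -d)
cat > "$demo/limits.c" <<'EOF'
#define DEMO_MAX_PENDING_DEFAULT 7
int demo_max_pending = DEMO_MAX_PENDING_DEFAULT;
int over_limit(int pending) { return pending > demo_max_pending; }
EOF
cd "$demo"

if command -v cscope >/dev/null 2>&1; then
    # -R recurse, -k ignore /usr/include, -L run one query and print plain lines.
    # -1 NAME is "find this global definition"; -0 NAME is "find this C symbol".
    cscope -Rk -L -1 demo_max_pending
    cscope -Rk -L -0 demo_max_pending
else
    echo "cscope is not installed (apt-get install cscope)" >&2
fi
```

The field numbers follow the order of the interactive menu, counting from zero.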
Lightbulbs
A quick google for “slapd.conf” brings us to the slapd.conf man page, and a quick find-in-page for the word “pending” leads us to this:
Boom! My workaround will indeed function as desired, since it keeps the pending operation count at or below 500, or as we now know, half the maximum as found in the old-style OpenLDAP configuration file.
Based on knowing my app received hangups after 1038 pending ops, I now know our OpenLDAP config is very unlikely to be tweaking the setting from the default – the discrepancy can easily be attributed to network latency.
Next Time!
Next time we might not be so lucky. It turned out in this case, a diligent reading of the docs would have turned up a relevant setting, but that depends entirely on those docs existing, and our ability to discover them. It is a very common occurrence that no such docs exist, or if they do, they are buried in source code comments far away out of sight.
Here we traded an unpredictable session of head-scratching and Googling for an answer that might not exist, for a very predictable couple of minutes spent in a terminal answering a question for ourselves – my, such power! And look, we didn’t even really have to understand C!
Admittedly this turned out to be an easier example than I expected when I picked it. Regardless, you should try to practice this method wherever possible for answering your questions before resorting to Google or Stack Overflow, and see how deep the rabbit hole goes.. :)