For $work I recently came across a fun situation involving TCP/IP sockets that I hadn’t encountered before. In summary, an upstream service would become upset if our service accepted a connection but disconnected it before servicing any request, which appeared to be happening quite frequently while rolling out upgrades.
For reasons beyond our team’s remit, it was impossible to teach the upstream service not to contact us during the upgrade, and due to the nature of the application, it was impossible to use the ancient UNIX trick of handing the old program’s file descriptors over to the new program, since the new program was almost certainly started on a different host.
The goal was to ensure that the upstream service received either a “connection
refused” error or a connected socket that would service its commands. That
left investigating why the erroneous behaviour was occurring, which almost
immediately led to a surprising discovery regarding BSD sockets: it is
impossible to shut down a TCP listener without a race between a
program’s last accept() and its call to close() on the
listener socket.
As a quick refresher, TCP servers are usually implemented something like this:
s = socket()                 # Create the socket.
s.bind(('0.0.0.0', 80))      # Set address.
s.listen(5)                  # Set backlog size and enter LISTEN state.

while not stopped:
    c, _ = s.accept()        # Wait for and accept a new client.
    start_client(c)          # Start a task to handle the client.

s.close()
Behind the scenes, the operating system works asynchronously to translate between network frames and kernel-side socket state:

(Picture courtesy of LWN)
Our problem occurs because between accept() and
close(), the kernel is still busily accepting new connections on
our program’s behalf, and if our program isn’t currently blocked in
accept() and ready to receive the new connection, it places them
into a queue that userspace has almost no control over: the
socket’s backlog.
If the kernel accepts a new connection and places it in our backlog between our
last accept() and the time we call close(), it is now
in possession of an established connection that userspace knows nothing about,
while also in possession of an instruction from userspace to cease accepting
connections. In short, it has to drop the connection on the floor.
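This is easy to demonstrate. In the sketch below (plain CPython, assuming Linux behaviour on loopback), a client’s handshake completes into a listener’s backlog without the server ever calling accept(), and closing the listener drops it:

```python
import socket
import time

# A listener with a backlog of 5 that never calls accept().
listener = socket.socket()
listener.bind(('127.0.0.1', 0))
listener.listen(5)

# The kernel completes this handshake on the server's behalf and
# parks the established connection in the listener's backlog.
client = socket.socket()
client.connect(listener.getsockname())

# Closing the listener drops the backlogged connection on the floor;
# Linux answers the client's subsequent traffic with RST.
listener.close()
time.sleep(0.2)

try:
    client.send(b'hello?')
    print(client.recv(1024))
except OSError as exc:
    print('dropped:', exc)  # e.g. [Errno 104] Connection reset by peer
```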
There is no standard way to prevent this race from occurring. In a sensible design, traffic would not be directed at a listening port while it is being torn down; however, we don’t live in a sensible world.
After some chats on Freenode’s #posix channel, it became clear
the only solution was to firewall the port during shut down, allowing time for
the kernel to empty the backlog while preventing it from filling again. This
approach sucked, not least because it involved changing system-global firewall
state, but also since our application ran beneath a distributed job scheduler
that started our server as non-root.
Linux and BPF to the rescue
It is rare for decades of undisciplined tinkering with Linux esoterica to pay off, but this was such an occasion. On BSD, Berkeley Packet Filter is implemented as a root-only device that attaches to entire network interfaces. On Linux it is instead exposed as a socket option, usually attached to AF_PACKET or raw sockets, but it is a little-known fact that such filters can also be attached to AF_INET sockets, and better yet, doing so does not require root. Essentially, Linux allows non-root programs to configure their own little private firewall.
Creating a filter
BPF filters are passed to the kernel as an array of structures containing 4 integers:
| Field | Description |
|---|---|
| code | Bitfield containing the opcode, source operand, and instruction class |
| jt | Positive jump offset if the condition is true; 0 means program counter + 1, and so on |
| jf | Positive jump offset if the condition is false |
| k | Extra value used by some instructions |
While it’s possible to build this array by hand or using bpf_asm,
it’s far more convenient to ask tcpdump to dump the compile result
for a filter expression:
# tcpdump -d 'proto ipv6'
(000) ldh [12]
(001) jeq #0x800 jt 2 jf 4
(002) ldb [23]
...
(010) ret #262144
(011) ret #0
# tcpdump -dd 'proto ipv6'
{ 0x28, 0, 0, 0x0000000c },
{ 0x15, 0, 2, 0x00000800 },
{ 0x30, 0, 0, 0x00000017 },
...
{ 0x6, 0, 0, 0x00040000 },
{ 0x6, 0, 0, 0x00000000 },
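Each four-integer tuple tcpdump -dd emits maps directly onto that structure: a 16-bit code, two 8-bit jump offsets, and a 32-bit k, for 8 bytes per instruction. A sketch of packing them from Python using the struct module (the single-instruction program here is illustrative):

```python
import struct

# struct sock_filter: u16 code, u8 jt, u8 jf, u32 k (8 bytes total).
SOCK_FILTER = struct.Struct('HBBI')

def pack_insns(insns):
    """Pack (code, jt, jf, k) tuples into the flat sock_filter array
    the kernel expects."""
    return b''.join(SOCK_FILTER.pack(*insn) for insn in insns)

# An unconditional "accept everything" program, for illustration.
prog = pack_insns([(0x06, 0, 0, 0x00040000)])
```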
One caveat is that unlike a firewall rule, BPF filters attached to an
AF_INET/SOCK_STREAM socket do not have complete
visibility of the incoming network frame, a fact that does not appear to be
documented anywhere except the kernel source. In this case, only the TCP
header and data (if any) are visible within the filter. While this is
sufficient for our needs, it makes it slightly more difficult to use
tcpdump as a compiler, since the programs it outputs expect offset
zero of the filter buffer to point to an Ethernet frame.
This is easily worked around by referring to the TCP header via the
ether array, which always points to offset 0. For example, to
match destination port 8080 (the 2 bytes at offset 2
of the TCP header), we write ether[2:2] == 8080.
Dropping SYN frames
By attaching a filter that drops incoming frames with the SYN bit
set, we can ensure the kernel will accept no new incoming connection handshakes
that will end up in the socket backlog. TCP flags are stored in byte 13 of the
TCP header, so to allow everything except SYN frames we write:
# tcpdump -dd 'ether[13] != 0x02'
{ 0x30, 0, 0, 0x0000000d },
{ 0x15, 0, 1, 0x00000002 },
{ 0x6, 0, 0, 0x00000000 },
{ 0x6, 0, 0, 0x00040000 },
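Installing this program uses the SO_ATTACH_FILTER socket option, which takes a struct sock_fprog pointing at the instruction array. A minimal sketch in Python; the constant and helper are our own, since Python’s socket module does not export SO_ATTACH_FILTER (its value, 26, comes from &lt;asm-generic/socket.h&gt; and is Linux-specific):

```python
import ctypes
import socket
import struct

SO_ATTACH_FILTER = 26  # <asm-generic/socket.h>; not exported by Python.

def attach_filter(sock, insns):
    """Attach a classic BPF program given as (code, jt, jf, k) tuples."""
    blob = b''.join(struct.pack('HBBI', *insn) for insn in insns)
    buf = ctypes.create_string_buffer(blob, len(blob))
    # struct sock_fprog { unsigned short len; struct sock_filter *filter; };
    # native struct packing inserts the padding before the pointer for us.
    fprog = struct.pack('HL', len(insns), ctypes.addressof(buf))
    sock.setsockopt(socket.SOL_SOCKET, SO_ATTACH_FILTER, fprog)

# The tcpdump -dd output above: drop pure-SYN segments, accept the rest.
DROP_SYN = [
    (0x30, 0, 0, 0x0000000d),  # ldb [13]    ; load the TCP flags byte
    (0x15, 0, 1, 0x00000002),  # jeq #0x02   ; flags == SYN, nothing else?
    (0x06, 0, 0, 0x00000000),  # ret #0      ; yes: drop the frame
    (0x06, 0, 0, 0x00040000),  # ret #262144 ; no: accept it
]

listener = socket.socket()
listener.bind(('127.0.0.1', 0))
listener.listen(5)
attach_filter(listener, DROP_SYN)  # no root required
```

Note that buf only needs to outlive the setsockopt() call, as the kernel copies the program during the call.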
Cleaning up
After installing the filter, we must briefly wait for outstanding
SYN+ACK, ACK exchanges to complete, then drain the listening
socket’s backlog by polling for it to become unreadable and calling
accept() as necessary to ensure all established connections have
been seen.
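The drain might look something like this sketch, where handle_client stands in for whatever the server normally does with a new connection, and the settle timeout allows time for outstanding handshakes to complete:

```python
import select

def drain_backlog(listener, handle_client, settle=0.5):
    """With the SYN-dropping filter installed, accept until the listener
    stops polling readable, i.e. the backlog is verifiably empty."""
    while True:
        readable, _, _ = select.select([listener], [], [], settle)
        if not readable:
            break  # Nothing arrived within the settle window: drained.
        conn, _ = listener.accept()
        handle_client(conn)
    listener.close()
```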
Notice that while we avoided a race with the established connection backlog, it
does not seem possible to block remaining SYN+ACK, ACK frames
using BPF, or ask the kernel how many incomplete connection handshakes exist.
We have traded one unpredictable race for another that is much easier to
manage, particularly within the confines of a datacenter where latencies are
rarely excessive.
Clients connecting to the listener while the filter is installed will retry
sending their SYN until the listener is closed, which will cause
its filter to be destroyed, and thus allow the kernel’s default behaviour of
responding with a RST since a listening socket no longer exists on
that port. The net effect is that a client receives either an established
connection we know about, or it receives a “connection refused” error, perhaps
after a short delay.
Proof of concept
client.py and server.py are a minimal example of the problem, and how the BPF filter solves it.
The server runs in a loop accepting connections until a file named
stop exists. If the file contains the word graceful,
then instead of simply calling close(), it installs a filter
before sitting in a loop until the backlog is verifiably empty.
The client runs in a loop connecting to the server as often as possible, stopping only when connections are being refused, and printing a message any time an error occurs on an established connection.
Stopping the server ungracefully, we see the client prints:
$ python client.py
1: [Errno 104] Connection reset by peer
2: [Errno 104] Connection reset by peer
4: [Errno 104] Connection reset by peer
5: [Errno 104] Connection reset by peer
$
However on stopping it gracefully, we see it completes without error:
$ python client.py
$
On the server side when gracefully stopping, numerous connections are handled that would otherwise have been dropped on the floor:
$ python server.py
backlog!
backlog!
backlog!
backlog!
backlog!
backlog!
$
[Note: to avoid another race, you must use echo graceful > stop.tmp;
mv stop.tmp stop to stop server.py gracefully.]