In this series of posts I'd like to talk about some of the real-world, at-scale problems that I and others have encountered during our time building and running Internet infrastructures. My main motivation for starting this series is that, particularly with the current containerization explosion, I find myself reading a lot of trivial introductions to relatively narrow and shallow technologies and much less about what it's really like to run a reasonably-sized Internet infrastructure. Hopefully I can encourage others to share their deeper tales too.
I've always been a huge fan of the Cricket book and suggest you go away and read all of it, if you do anything at all related to producing services on the Internet. Fair warning: you'll never be able to bear hearing the misuse of "propagation" again.
That said, I can understand that you might not have the time. For the purposes of this tale, all you need to know is that when your client device or machine wants to know the IP address of some endpoint, it issues a DNS request for it to its configured DNS server. Through a relay of sorts, what's returned is one or more records containing either suitable IP addresses for you to make your connection to, or one or more "pointers" in the form of CNAME records that tell you to go look up another name and repeat the whole process again (it is reasonably analogous to following HTTP 30x redirects around).
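The record-chasing above can be sketched with a toy resolver over a hard-coded record table (the names and addresses here are invented, taken from the example.com and RFC 5737 documentation ranges):

```python
# Toy illustration of CNAME chasing over a hard-coded record table.
RECORDS = {
    "www.example.com": ("CNAME", "lb.example.com"),
    "lb.example.com": ("A", ["192.0.2.10", "192.0.2.11"]),
}

def resolve(name, max_hops=8):
    """Follow CNAMEs until we land on A records, much like following
    HTTP 30x redirects until we finally get a 200."""
    for _ in range(max_hops):
        rtype, rdata = RECORDS[name]
        if rtype == "A":
            return rdata
        name = rdata  # CNAME: repeat the whole process with the new name
    raise RuntimeError("CNAME chain too long")

print(resolve("www.example.com"))  # ['192.0.2.10', '192.0.2.11']
```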
Whilst I say "DNS server", there are (again, for the purposes of this tale) actually three modes in which a DNS server can function: recursive, caching and authoritative.
Recursive nameservers are the servers whose job it is to go out there on behalf of your machine and talk to whomever else is necessary to satisfy your query. In an idealized world, with no prior lookup of a given name, it'll first go to the root servers to determine somebody to ask about, say, the .com domain, and then ask one of those servers where to go to enquire about the facebook.com domain, and so on all the way to www.facebook.com. As you can see, authority can be divided and delegated at subdomain boundaries.
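A minimal sketch of that walk down the delegation tree, with each hop reduced to a single referral (the server labels are simplified and the final addresses come from the RFC 5737 documentation range, not any real zone):

```python
# Toy model of iterative resolution: each nameserver knows only the next
# delegation, so the recursive resolver follows referrals downwards.
REFERRALS = {
    "root": "gtld-server",         # root servers refer us on for .com
    "gtld-server": "facebook-ns",  # .com servers refer us on for facebook.com
}
ANSWERS = {
    "facebook-ns": ["203.0.113.10", "203.0.113.11"],  # authoritative answer
}

def iterate_lookup():
    server = "root"
    while server in REFERRALS:  # follow referrals down the delegation tree
        server = REFERRALS[server]
    return ANSWERS[server]

print(iterate_lookup())  # ['203.0.113.10', '203.0.113.11']
```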
CNAME and A records are collectively called Resource Records, or RRs. There are many other types of RR besides these two, although they are the most common. Like I said, go read the Cricket book.
All of this aside, what our client ultimately receives is one or more A records, each containing an IP address. We obviously only want to connect to one of these IPs. What to do?
RFC 3484 defines what should be happening with most modern resolvers created in the past decade. What this says, in a nutshell, is that authoritative DNS servers are free to order the RRs in their response as they see fit. It is very common for authoritative nameservers to round-robin their responses in the hope of distributing load across many endpoints. Some may also attempt to infer "locality" (be it physical or latency-based) from the source of the query.
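Authoritative round-robin is easy to sketch: rotate the RR set one position per response, so successive clients see a different "top" record (the addresses are illustrative):

```python
from collections import deque

# Minimal sketch of authoritative round-robin: each response starts one
# entry later than the previous one.
class RoundRobinZone:
    def __init__(self, addrs):
        self.addrs = deque(addrs)

    def answer(self):
        response = list(self.addrs)
        self.addrs.rotate(-1)  # next query starts one entry later
        return response

zone = RoundRobinZone(["192.0.2.1", "192.0.2.2", "192.0.2.3"])
print(zone.answer())  # ['192.0.2.1', '192.0.2.2', '192.0.2.3']
print(zone.answer())  # ['192.0.2.2', '192.0.2.3', '192.0.2.1']
```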
As for the clients, they are also free to apply a number of optimizations when choosing which IP to select. On a modern Linux, take a look at /etc/gai.conf, the configuration for the getaddrinfo(3) library call your application will ultimately be using; the sort implementation can be found in the glibc(7) sources under /resolv. In the most common case of public Internet IPv4 addresses, you'll likely get back an RR set as ordered by the authoritative server, take the "top" one and move on. The next time you have occasion to do resolution again, you're likely to get a different ordering and a different "top" IP, and therefore to spread your load over the endpoints as intended.
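You can watch this from an application's point of view through Python's thin wrapper around getaddrinfo(3); here I query the loopback name so the example works offline (the exact ordering and address family you see depend on your system's gai.conf and network configuration):

```python
import socket

# getaddrinfo(3) is the routine applications ultimately call; the list it
# returns has already been sorted per the RFC 3484 / /etc/gai.conf rules.
results = socket.getaddrinfo("localhost", 80, proto=socket.IPPROTO_TCP)
family, socktype, proto, canonname, sockaddr = results[0]
print(sockaddr[0])  # the "top" address: the one we'd connect to first
```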
We shouldn't forget caching DNS servers. Unless you are in complete control of your resolution chain, it is very likely that your ISP is operating a caching DNS server to reduce traffic and decrease response times for oft-queried RRs: think www.facebook.com and the MX record for gmail.com. Caching DNS servers are required to respect the TTL of the record, as issued by the authoritative server, and to expunge the record (forcing a look-up) when that TTL expires.
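That caching behaviour amounts to a map from name to (expiry, RR set). A minimal sketch, with the clock passed in explicitly so the expiry logic is easy to see:

```python
import time

# Minimal caching-resolver sketch: honour the authoritative TTL and
# expunge the record (forcing a fresh upstream look-up) once it expires.
class TTLCache:
    def __init__(self):
        self.store = {}  # name -> (expiry_timestamp, rrset)

    def put(self, name, rrset, ttl, now=None):
        now = time.time() if now is None else now
        self.store[name] = (now + ttl, rrset)

    def get(self, name, now=None):
        now = time.time() if now is None else now
        entry = self.store.get(name)
        if entry and entry[0] > now:
            return entry[1]          # still fresh: serve from cache
        self.store.pop(name, None)   # expired: expunge, caller re-queries
        return None

cache = TTLCache()
cache.put("www.example.com", ["192.0.2.1"], ttl=300, now=0)
print(cache.get("www.example.com", now=299))  # ['192.0.2.1']
print(cache.get("www.example.com", now=301))  # None: TTL has expired
```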
For the purposes of argument, the infrastructure from which this tale comes had 100 public-facing HAProxy instances and each of those instances could accept 1000 concurrent connections, giving 100,000 possible concurrent connections (in this particular instance the concurrency numbers were far higher, but these are nice numbers to illustrate a point).
Looking at the connection numbers for each individual load balancer, we discovered something very surprising: those whose IPs sorted at the numeric extremes, i.e. the first and last IPs of the set, took a statistically significantly greater percentage of the traffic than those in the middle of the set. This isn't the way things are meant to happen.
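The skew is easy to reproduce in simulation. Here most clients honour the round-robin-rotated ordering, while an invented fraction of misbehaving resolvers numerically sort the RR set (ascending or descending) before taking the top entry; the two extreme IPs end up dominating:

```python
import random
from collections import Counter, deque

def simulate_bias(num_clients, addrs, bad_fraction=0.2, seed=1):
    rng = random.Random(seed)
    rrset = deque(addrs)
    picks = Counter()
    ip_key = lambda a: tuple(map(int, a.split(".")))  # numeric, not lexical
    for _ in range(num_clients):
        response = list(rrset)
        if rng.random() < bad_fraction:
            # misbehaving resolver: sorts numerically, ascending or descending
            response.sort(key=ip_key, reverse=rng.random() < 0.5)
        picks[response[0]] += 1  # client connects to the "top" entry
        rrset.rotate(-1)         # authoritative server rotates for next query
    return picks

addrs = ["10.0.0.%d" % i for i in range(1, 11)]
picks = simulate_bias(100_000, addrs)
print(picks.most_common(2))  # the lowest and highest IPs dominate
```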
A pretty obvious explanation for this is that somewhere in the DNS resolution chain some deviancy from the RFCs is occurring. I'd love to know if anybody has any evidence as to where, as there can't be too many reimplementations of libresolv out there in the wild. The most likely suspects are cheap embedded devices such as routers and Internet gateways, although Microsoft Windows has a fairly well-known misbehaviour too.
After getting this far, "so what?" you might ask.
In the first instance, it is obvious that the idealized homogeneous tier, certainly the load-balancer tier in this fable, is a fallacy. All load balancers are not created equal. More correctly, despite being functionally equal, they do not participate in the system equally, due to unintended external effects.
Performing an act such as a rolling shutdown, upgrade and restart of HAProxy in this example will cause disconnections for disproportionately more of your clients in 2 out of 100 cases. This in turn means more reconnections, re-authorizations, etc. It is easy to conceive of a seemingly well-tested procedure actually resulting in an unintended thundering herd problem.
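A back-of-envelope illustration with the toy numbers from earlier, assuming (and these figures are invented for the sake of the sum) that each of the two edge balancers carries 5% of traffic instead of the expected 1%:

```python
# With 100 balancers and 100,000 connections, an even spread means each
# restart disconnects 1,000 clients. A "hot" edge balancer carrying an
# assumed 5% share disconnects five times as many clients at once.
total_connections = 100_000
even_share = total_connections / 100      # 1,000 disconnects expected
hot_share = total_connections * 5 // 100  # 5,000 disconnects actual
print(hot_share / even_share)  # 5.0x the expected reconnection spike
```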
It is also worth considering that something as core and fundamental as DNS resolution still only functions thanks to Postel's law, and that assuming correct behaviour from all actors in a system can lead to incorrect conclusions. Incidentally, I think the answer to this is to build telemetry-first infrastructures, but that is another post for another time.
Finally, I think it is worth being mindful that any blog post or article on Internet scale, particularly one that is entirely dogmatic, is likely avoiding nasty implementation details like these, whether for clarity, out of ignorance, or from a willingness to present a suitably "clean" and satisfactory view of the world: nobody likes if-ladders, after all.