Why can’t I reliably send or receive data over TCP after moving servers?

I am currently dealing with a networking problem on high-latency (100–400 ms) Internet links. I run a Minecraft network, and I recently moved it to a different datacenter to get a server with a better CPU and more RAM. The users of this server are spread all over the world. Before the switch, the server was in Montreal: users in Europe got ~100–200 ms latency and those in Australia got ~200–300 ms. After the switch, the server is in Germany: users in North America get ~100–200 ms latency, and users in South America and Australia get ~200–400 ms. Overall latency is quite similar, but who gets great latency and who gets merely tolerable latency has changed (note that Minecraft isn’t very latency-sensitive in general, especially compared to most video games). There also isn’t any significant packet loss, as measured with the mtr and ping tools. Furthermore, the software on both servers is almost identical: both run Debian 10, and I tar’d all software and configuration that isn’t in the APT repositories and copied it over, while reinstalling exactly the same packages via apt. The software configuration should therefore be essentially identical.

Yet many users are having connection trouble. It seems to occur only around 6:00 PM (± a few hours) Eastern US time, and it manifests as the throughput of all TCP connections becoming atrociously low. Through an SSH+SOCKS proxy it took minutes to load a normal web page (Gmail), and in-game it often takes minutes for even a simple chat message to get through if a few MB of world data are being transferred at the same time. The effective latency of a TCP connection (e.g. the time it takes a chat message to get through it) increases massively and unreasonably as soon as any bulk data is put on that connection. A plain SSH terminal session is basically fine, and the game is mostly fine when not much is going on, but as soon as anything of significant size is sent over TCP during the aforementioned time window, throughput breaks down and latency over TCP (but not via ping) becomes unreasonable, up to multiple minutes in the worst cases.

When this issue first appeared there was significant packet loss (~25%), which I initially blamed, but that packet loss is no longer occurring (according to ping, etc.) and yet the issue remains. The packet loss, but not the problematic symptoms, disappeared after I made a general report about packet loss to the new host, but before I could give them the more detailed MTR data they requested in their response. My impression is that the host did not change anything, but who knows, really.

As such, at this point, I suspect that the relevant difference between the servers is that the old host (OVH) does some sort of tuning of their OS images (something that I know to be the case), and that the new host (Hetzner) does not.

I suspect that this tuning has something to do with TCP window size, but when I attempted to manipulate the relevant settings, they didn’t seem to do what they’re supposed to. Specifically, when I set the various buffer-size settings I found listed on the Internet (e.g. net.ipv4.tcp_rmem/tcp_wmem and net.core.rmem_max/wmem_max) via sysctl, the window size that iperf selects (or the maximum it is allowed to select when using the -w option) seems to take a random value with no discernible relation to the values I set, rather than behaving as I expect, where its maximum is simply whatever I set via sysctl. Note that iperf -s misbehaves even before a client connects to it, so failing to make the same changes on the client is not a plausible explanation.
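For reference, the kind of settings I was trying looked like the fragment below (the values are examples copied from various tuning guides, not a recommendation, and the file name is arbitrary):

```
# /etc/sysctl.d/90-tcp-tuning.conf (example)
# Maximum socket buffer sizes, in bytes.
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# min / default / max buffer sizes used by TCP autotuning, in bytes.
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
```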

As such, I’m wondering two things:

1) How can I fix my server so that the latency of TCP connections is similar to the actual latency on the link, even at peak times and under moderate load (a few Mbps)?

2) How can I reliably and predictably change the TCP window size for all applications? (Or, equivalently: what is going on with the sysctl settings applying in seemingly random ways? What pattern am I missing?)
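(For context on why I care about window size: to keep a link full, TCP’s window must cover the path’s bandwidth-delay product. A quick back-of-the-envelope calculation, with illustrative numbers rather than measurements:)

```python
# Bandwidth-delay product: how many bytes must be "in flight"
# to keep a path busy. Numbers below are illustrative.
def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bytes in flight needed to saturate a path."""
    return bandwidth_bps * rtt_seconds / 8

# A few Mbit/s at 300 ms RTT:
print(bdp_bytes(5e6, 0.3))  # 187500.0 bytes, i.e. ~183 KiB of window
```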

Answer

The Internet is constantly changing, offers no service guarantees, and your traffic crosses many third-party transit networks. You can’t solve this by looking only at your server’s IP stack; you need to complete a root-cause investigation.

Survey what class of connection your users have. More bandwidth sometimes helps when bandwidth is the constraint. Also have them measure latency to other Internet destinations; extremely high latency to a Google POP would be suspect, given Google’s obsession with speed.

Keep collecting mtr reports, both client to server and server to client. Test across multiple ISPs representative of your user base. Look for loss and very high latency, and identify which ISP owns the problem hops.
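A sketch of such a run (the hostname is a placeholder, and 25565 is Minecraft’s default port; adjust to your setup):

```shell
# 100 TCP probes to the game port; intermediate networks often treat
# TCP differently from ICMP, so probe with the real protocol.
# mc.example.com is a placeholder for your server's hostname.
mtr --report --report-cycles 100 --tcp --port 25565 mc.example.com
```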

Take packet captures of the application’s traffic. Again, both client to server and server to client. Analyze the TCP streams with Wireshark and look for problems such as retransmissions, duplicate ACKs, and zero-window stalls. I don’t think there’s a complete Minecraft dissector, but one is not strictly needed to diagnose TCP/IP performance.
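For example, a capture on the server side might look like this (the interface name and port are assumptions; adjust to your setup):

```shell
# Capture full packets on the game port into a file for Wireshark.
# eth0 and port 25565 (Minecraft's default) are assumptions.
tcpdump -i eth0 -s 0 -w minecraft.pcap 'tcp port 25565'
```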

Launch more servers on different providers in different parts of the world, and test their performance characteristics to see whether the problem is more server side or client side. Consider running multiple servers (or proxies?) in different regions of the world permanently, if that is what it takes for acceptable performance.

Attribution
Source: Link, Question Author: john01dav, Answer Author: John Mahowald