Matteo Croce personal homepage

Low Quality of Service: How My Router Broke Git

QoS stands for Quality of Service. Improvement is not implied.

Symptoms

I’ve been a happy OpenWrt user for 20 years, but due to a hardware failure I had to temporarily switch back to my ISP router.
After the swap, every git pull started hanging. No errors, just a timeout after a few minutes.

The Investigation

The usual suspects were checked first:

I didn’t find any issues with general connectivity, so I focused on Git and noticed that some repositories were working while others weren’t. Initially I thought that it could be a server-side issue, but the failing repositories were spread between GitHub and GitLab, so it was unlikely that both services had simultaneous issues.

On closer inspection, I noticed a pattern: all the failing repositories were ones I contributed to, and they were using SSH URLs. The same repositories worked fine over HTTPS.

I tried to connect to GitHub with ssh and I got the usual message about no shell access:

$ ssh git@github.com
PTY allocation request failed on channel 0
Hi teknoraver! You've successfully authenticated, but GitHub does not provide shell access.
Connection to github.com closed.

So the GitHub server was reachable and the SSH connection working. I tried connecting to other servers via SSH; getting a shell worked fine, so this wasn’t a general SSH issue.
But as soon as I tried to run a remote command or transfer a file with scp, I got the same hanging behavior.

To recap:

Packet Capture Analysis

With no obvious explanation for why SSH login worked but remote command execution didn’t, I captured the traffic with Wireshark from both an interactive SSH session and a non-interactive one. At first sight the packet captures looked the same: the TCP handshake completed, key exchange succeeded, and the SSH session was established in both cases. The only difference was that the non-interactive session stopped working, while the interactive one continued to work normally. I focused on the differences between the two sessions, and that’s when I noticed something interesting: in the non-interactive session the packets from client to server changed the DSCP value in the IP header, while in the interactive session the DSCP value remained constant.

I started ssh with verbose logging to see if this was an intended behavior of OpenSSH, and indeed it was: OpenSSH sets different QoS markings for interactive vs non-interactive sessions. The relevant lines from the -vvv output were:

$ ssh -vvv user@host |& grep IP_TOS
debug3: set_sock_tos: set socket 3 IP_TOS 0x48
debug3: set_sock_tos: set socket 3 IP_TOS 0x48

$ ssh -vvv user@host id |& grep IP_TOS
debug3: set_sock_tos: set socket 3 IP_TOS 0x48
debug3: set_sock_tos: set socket 3 IP_TOS 0x20

Decoding the values (DSCP is the upper 6 bits of the ToS byte):

OpenSSH sets the TOS byte to 0x48 (AF21) for interactive sessions, and 0x20 (CS1) for non-interactive sessions. Fair enough, interactive sessions are latency-sensitive, while non-interactive sessions are throughput-oriented. But could it be the root cause of the problem? I had no reason to suspect it at first, but I wanted to check, so I forced OpenSSH to use a specific QoS marking for all sessions by adding -o IPQoS=af21 to the command line. Surprisingly, the non-interactive session worked perfectly with a consistent TOS value of 0x48.

# Default, no IPQoS override
$ ssh -vvv user@host uptime 2>&1 | grep IP_TOS
debug3: set_sock_tos: set socket 3 IP_TOS 0x48
debug3: set_sock_tos: set socket 3 IP_TOS 0x20   # hang

# With IPQoS=af21
$ ssh -vvv -o IPQoS=af21 user@host uptime 2>&1 | grep IP_TOS
debug3: set_sock_tos: set socket 3 IP_TOS 0x48
debug3: set_sock_tos: set socket 3 IP_TOS 0x48   # works

# With IPQoS=cs1
$ ssh -vvv -o IPQoS=cs1 user@host uptime 2>&1 | grep IP_TOS
debug3: set_sock_tos: set socket 3 IP_TOS 0x20
debug3: set_sock_tos: set socket 3 IP_TOS 0x20   # works

Two setsockopt(IP_TOS) calls are made by OpenSSH: the first during socket setup, the second after the SSH channel type has been negotiated. With no override, the TOS byte changes between those two calls for non-interactive sessions, which is exactly when the session dies.

Root Cause (my guess): Broken Flow Hashing in the Router

The new router, an ISP-supplied Vodafone Station HHG2500, was performing some form of traffic acceleration, likely offloading established TCP flows to hardware fast-path (common in ISP-provided and consumer gear). The flow classifier was hashing the 5-tuple plus the DSCP/ToS byte to identify flows.

When the DSCP value changed mid-connection (perfectly legal, DSCP is a per-hop hint, and nothing prevents it from changing mid-connection), the router’s flow table lookup failed to match the existing entry. The router’s fast-path dropped the packets (or failed to forward them), while the slow-path either never picked them up or silently discarded them. The result: a one-way blackhole after the SSH channel negotiation.

DSCP is not a flow identifier. The canonical 5-tuple for flow tracking is {src IP, dst IP, src port, dst port, protocol}. Including the ToS/DSCP byte in the hash breaks the assumption that a connection remains the same connection if ToS changes, which it legitimately can.

A correct hardware offload implementation would either exclude DSCP from the flow hash entirely, or re-classify the flow on DSCP change rather than dropping it.

The Fix

The simplest workaround is to override OpenSSH’s default QoS behavior and force a single, consistent DSCP marking for all session types. I added the following to /etc/ssh/ssh_config.d/99-dscp.conf:

IPQoS af21

af21 (DSCP decimal 18) corresponds to Assured Forwarding class 2, low drop precedence. It’s the value OpenSSH already uses for interactive sessions, so this effectively forces non-interactive sessions to use the same marking, preventing any mid-connection DSCP transition. Not the ideal solution, but it’s better than forcing all sessions to use cs1 (DSCP decimal 8), which is a lower priority class and might lead to worse performance for interactive sessions.

Conclusion

The previous router (OpenWrt-based) used a straightforward kernel nftables firewall, and nf flowtable as traffic accelerator which tracks flows on the standard 5-tuple. It doesn’t consider the DSCP bits when classifying flows, so the DSCP changes are invisible to it, as they should be.

Twenty years on OpenWrt-based hardware, zero issues of this class. That says a lot about the gap between open-source networking and whatever ships in commodity router firmware.

#Ssh #Networking #Linux #Ip #Debugging #Qos