Implementing SQM for VPN endpoints
In riseup's deployment of the LEAP VPN, we have been seeing some load issues. These are documented on the riseup side at https://0xacab.org/tubers/tcp/issues/12272 and on the LEAP side at leap/bitmask-vpn#80
The big issue with most endpoints currently is that bandwidth is limited because openvpn is single-threaded: each process runs on one CPU core, so we're CPU-bound. Depending on CPU speed and AES-NI support, we're limited to between 100Mbit and 200Mbit. You can see the plateaus on the graphs, for example gaei (Montreal) at ~110Mbit:
More at https://we.riseup.net/riseup+tech/vpn-graphs-part-2#bandwidth
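To confirm the single-core bottleneck on a given gateway, something like the following would do (a sketch; pidstat and mpstat come from the sysstat package, and the process name may differ in our deployment):
# per-process CPU usage, 5-second samples; look for openvpn pegged near 100% of one core
pidstat -u -p $(pgrep -d, openvpn) 5
# per-core view: one core saturated while the others sit idle confirms the bottleneck
mpstat -P ALL 5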
This saturation is causing packet loss and variable/high latency, but we don't currently have a good way to measure how much. Something like smokeping running through a VPN client would be a good way to quantify it.
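Until something like smokeping is set up, even a crude probe from a client gives a rough number. A sketch, assuming the client's tunnel device is tun0 and that 10.41.0.1 (hypothetical) is the gateway's tunnel-side address:
# 100 pings through the tunnel; the summary line reports loss and min/avg/max/mdev latency
ping -I tun0 -c 100 10.41.0.1
# mtr gives per-hop loss/latency in a single report
mtr --report --report-cycles 100 10.41.0.1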
One thing we noticed is that the number of connections being tracked by ip_conntrack is really high, probably indicating P2P traffic like torrents.
https://we.riseup.net/riseup+tech/vpn-conntrack
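The current and maximum table sizes can be read straight out of /proc, and conntrack-tools gives a rough view of who the heavy sources are (the grep counts both directions of each entry, so treat it as an indicator, not an exact count):
# connections currently tracked vs. the table limit
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# rough top talkers by source IP
conntrack -L 2>/dev/null | grep -o 'src=[0-9.]*' | sort | uniq -c | sort -rn | head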
This problem of p2p/large users saturating a connection and ruining things for everyone is common on home/office ISP gateways, and it has been solved there with something called "Smart Queue Management" (SQM):
https://www.bufferbloat.net/projects/cerowrt/wiki/Smart_Queue_Management/
When a network link becomes saturated, SQM ensures that all users/protocols/IPs get a fair share of the network, so that no single user/protocol/IP can starve the others of bandwidth or increase their latency.
SQM is a combination of a few things: better scheduling when things are congested, dropping packets that have waited too long in the queue to manage queue length/latency, shaping/rate limiting, and prioritization. However, it can only control the packets that pass through the system; if the problem is upstream, it can't "push on the rope". What we can do is deliberately ensure that the point we control is the bottleneck. If our connection to the internet is allocated a particular amount of bandwidth, we can rate limit our side to just below that level, ensuring any congestion happens in our own queues, where we have control, rather than in upstream routers that might introduce delays or apply their own priorities.
Current best practice for SQM is to use CAKE
https://www.bufferbloat.net/projects/codel/wiki/Cake/
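The "shape just below the link rate" idea can be seen with a single tc command. For example, on a link provisioned at 200Mbit outbound (numbers illustrative; requires the sch_cake module, mainline since kernel 4.19):
# replace the default qdisc on eth0 with CAKE, shaped slightly below the provisioned rate
tc qdisc replace dev eth0 root cake bandwidth 190mbit besteffort
# verify and watch the drop/delay statistics
tc -s qdisc show dev eth0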
SQM has been nicely implemented in openwrt; it's easy to turn on in home routers and makes your ISP connection resilient against abuse. The scripts that do this are
https://github.com/tohojo/sqm-scripts
and there is a "luci-app-sqm" openwrt package that adds the functionality to the web UI. In addition to openwrt, sqm-scripts supports a generic linux target. It's not packaged in debian, but it only consists of a few shell scripts, config files, and automation to bring things up and down automatically, so integrating it into our setup would be pretty easy.
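On the generic linux target there is no UCI or web UI; configuration is plain shell-variable files under /etc/sqm, one per interface, activated by the init/systemd hooks that ship with the repo. A sketch of what an endpoint's file might look like (filename and values illustrative; variable names as used in the sqm-scripts repo, worth double-checking against its docs):
# /etc/sqm/eth0.iface.conf -- rates are in kbit/s
DOWNLINK=180000          # shape ingress just under the observed plateau
UPLINK=180000            # shape egress the same way
SCRIPT=piece_of_cake.qos # simple single-tier CAKE setup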
On my home router I use SQM with CAKE and the "piece_of_cake.qos" script, with bandwidth limits of 120Mbit down and 5.9Mbit up set in the web UI (my cable modem service is 125Mbit down, 6Mbit up). This results in an SQM config file of:
root@router:~# cat /etc/config/sqm
config queue 'eth1'
        option interface 'eth1'
        option qdisc_advanced '0'
        option linklayer 'none'
        option download '120000'
        option upload '5900'
        option debug_logging '0'
        option verbosity '2'
        option enabled '1'
        option qdisc 'cake'
        option script 'piece_of_cake.qos'
and tc rules of:
root@router:~# tc -d qdisc
qdisc noqueue 0: dev lo root refcnt 2
qdisc fq_codel 0: dev eth0 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 4Mb ecn
qdisc cake 8019: dev eth1 root refcnt 2 bandwidth 5900Kbit besteffort triple-isolate nonat nowash no-ack-filter split-gso rtt 100.0ms raw overhead 0
qdisc ingress ffff: dev eth1 parent ffff:fff1 ----------------
qdisc noqueue 0: dev br-lan root refcnt 2
qdisc cake 801a: dev ifb4eth1 root refcnt 2 bandwidth 120Mbit besteffort triple-isolate nonat wash no-ack-filter split-gso rtt 100.0ms raw overhead 0
Breaking this down:
- eth1 is my outbound "wan" device
- eth0 is my internal "lan" network
- ifb4eth1 is an "intermediate functional block" pseudodevice that sqm-scripts creates to handle the download direction. tc can only shape traffic a device transmits, so packets arriving on eth1 are redirected to ifb4eth1 (that's what the "qdisc ingress" line above does), where CAKE can queue them as if they were egress. At that point the traffic hasn't yet been de-NAT'd back to individual lan hosts; CAKE can look through NAT via conntrack with its "nat" option, but sqm-scripts leaves the "nonat" default in place. A hand-rolled sketch of this redirection is shown after this list.
- "triple-isolate" means: "Flows are defined by the 5-tuple, and fairness is applied over source and destination addresses intelligently (ie. not merely by host-pairs), and also over individual flows." The 5-tuple is: src IP, dst IP, src port, dst port, protocol number. Each unique combination of those is considered a separate flow, and fairness is applied at several levels to prevent any one host or flow from using more than its fair share.
- The eth1 (wan) interface is limited to transmitting 5900Kbit because that's what we set for upload in the config.
- The ifb4eth1 interface is limited to transmitting 120Mbit. Since everything arriving on eth1 (wan) is redirected through it, this is the download limit we set in the config.
- "wash" tells CAKE to strip any existing diffserv markings that might have been applied upstream, before we add our own.
We could set up the tc rules ourselves, but the sqm-scripts package makes things much easier with its automation for safely bringing things up and down, so that is probably the way to go.
We'll need to adjust the rates based on the plateaus we're seeing for each VM type and measure/adjust accordingly. betterspeedtest.sh from https://github.com/richb-hanover/OpenWrtScripts is a good way to test; we'll probably need a way to run it from a VM in the same data center as each node to ensure we're testing the node itself (and not all the hops in between).
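Usage is roughly as follows (flags per the script's README; netperf.example.net stands in for whatever netperf server we run near each node):
# measure download then upload throughput while pinging, to see latency under load
sh betterspeedtest.sh -H netperf.example.net -t 60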