diff --git a/proposals/005-aggregate-tunnel-metrics.md b/proposals/005-aggregate-tunnel-metrics.md
new file mode 100644
index 0000000000000000000000000000000000000000..8a74ab15472578d70d37adf86186107669461083
--- /dev/null
+++ b/proposals/005-aggregate-tunnel-metrics.md
@@ -0,0 +1,749 @@
+# Aggregate Tunnel Metrics
+
+* Author: bassosimone
+* Reviewers: cyberta
+* Status: under discussion
+
+This specification describes how to format and submit aggregate
+tunnel metrics to the OONI collector. Additionally, we include
+sections with design and implementation considerations.
+
+## Index
+
+1. [Overview](#overview)
+2. [Definitions](#definitions)
+3. [Use Cases](#use-cases)
+4. [Goals](#goals)
+5. [Non-Goals](#non-goals)
+6. [Related Work](#related-work)
+7. [Threat Model](#threat-model)
+8. [Tunnel Lifecycle](#tunnel-lifecycle)
+9. [Data Format (Envelope)](#data-format-envelope)
+10. [Data Format (Test Keys)](#data-format-test-keys)
+11. [Implementation Requirements](#implementation-requirements)
+12. [Privacy Considerations](#privacy-considerations)
+13. [Security Considerations](#security-considerations)
+14. [LEAP Implementation Details](#leap-implementation-details)
+
+
+## Overview
+
+The objective of this specification is to allow a VPN provider
+to report aggregate tunnel metrics to the OONI collector. We submit
+to the OONI collector for archival reasons. The submitted metrics
+range from availability to tunnel performance metrics.
+
+The aim is to allow researchers to evaluate the viability of
+using a given protocol in a specific country and ASN. This
+information will be provided with varying levels of granularity
+depending on the VPN provider's needs.
+
+
+## Definitions
+
+**AS**: autonomous system.
+
+**ASN**: autonomous system number.
+
+**CC**: country code.
+
+**bootstrap**: the process with which the VPN fetches all the
+required information to create tunnels.
+
+**tunnel**: communication between two network endpoints
+encapsulating TCP/IP packets.
+
+**protocol**: the protocol stack used by the tunnel to encapsulate
+packets that the adversary can observe.
+
+**(tunnel) endpoint**: either a bridge or a gateway.
+
+**measurement**: JSON document submitted to the OONI collector.
+
+
+## Use Cases
+
+In all use cases, a VPN provider wants to make *statements* about
+the availability and performance of accessing *something* they manage
+from a set of clients located in a given ASN and CC in a given time
+period. What sets use cases apart is the *something* about which
+they are making the statement.
+
+More specifically, we cover these use cases:
+
+**endpoint**: the statement is about a specific tunnel endpoint. For
+example, in the case of RiseupVPN, an endpoint could be:
+
+- hostname: `vpn02-par.riseup.net`
+- address: `51.159.197.108`
+- port: `443`
+- protocol: `openvpn+obfs4`
+- asn: `12876`
+- cc: `FR`
+
+which is as detailed and transparent as possible.
+
+**endpoint_pool**: the statement is about a homogeneous pool of
+endpoints. RiseupVPN publishes its endpoints, so this use case does
+not make sense for them. However, another VPN provider may want
+to disclose less information than in the `endpoint` use case and
+say something like:
+
+- protocol: `openvpn+obfs4`
+- cc: `FR`
+
+which does not give away the exact endpoints, but still allows
+making statements regarding the availability and performance of
+using `openvpn+obfs4` endpoints located in France.
+
+**global**: the statement is about the whole set of endpoints
+and only includes information about the protocol:
+
+- protocol: `openvpn+obfs4`
+
+Each statement is a single *measurement* submitted to the OONI
+collector for archival reasons. The `scope` field inside the
+measurement `test_keys` identifies the specific use case.
+
+In summary, all use cases allow a researcher to evaluate the
+availability and performance of tunnels using specific protocols,
+but the granularity of the information disclosed varies.
+
+
+## Goals
+
+1. *Extensible spec*: we aim to specify a baseline spec that is
+simple to implement for LEAP *right now*, while, at the same time,
+allowing us to extend it to submit more information later on.
+
+2. *Anonymity set*: the spec allows incrementally including more
+metrics as the user population grows.
+
+3. *~Standards compliance*: where possible, reuse concepts and
+terminology from existing standards and related work done in
+this space, including Network Error Logging (NEL), Ain Ghazal's
+"tunnel telemetry", DPIDetector, and the OONI specifications.
+
+
+## Non-Goals
+
+1. Specifying how a VPN client should submit tunnel telemetry
+data to the VPN provider itself. This scope has already been explored
+by Ain Ghazal under the codename of "tunnel telemetry". We aim
+to leverage the work done by Ain Ghazal with respect to NEL
+error reporting. Yet, we believe that the mechanism with which
+a VPN client reports to its own infrastructure should
+directly use NEL without being constrained by the specific
+details of the OONI measurement format.
+
+2. Specifying how a VPN client should directly submit individual tunnel
+establishment measurements to the OONI collector. Adding this
+requirement would significantly complicate the design and the job
+of the data analyst, because the same format would need to accommodate
+both aggregate and individual measurements. That said, scoping
+this use case seems a matter for continued work on the Ain
+Ghazal proposal on "tunnel telemetry".
+
+3. Measuring the VPN bootstrap phase. This phase is not
+tunnel-specific, rather it is VPN-specific. We believe we
+should not conflate this information into the same
+measurements. We could write a bootstrap-related spec. Also,
+the aggregation and reporting requirements of the
+bootstrap may differ from those for tunnels.
+
+
+## Related Work
+
+1. [OONI spec](https://github.com/ooni/spec);
+
+2. [Tunnel Telemetry repository](https://github.com/ainghazal/tunnel-telemetry);
+
+3. [Tunnel Telemetry proposal](https://github.com/ooni/spec/pull/274);
+
+4. [DPIDetector](https://dpidetector.org/);
+
+5. [Network Error Logging](https://www.w3.org/TR/network-error-logging/).
+
+
+## Threat Model
+
+A tunnel comes to life with the establishment of a connection
+between a client and an endpoint. The client establishes the tunnel
+by initiating the communication.
+
+While the adversary will only be able to observe the most external
+protocol layer (e.g., `obfs4+kcp`), we assume it can also use
+other identification techniques, including conversation analysis,
+statistical analysis on packet sizes, and active probing.
+
+Additionally, the adversary can interfere with the communication
+after it has been established. For example, it can mess with the
+TCP state by injecting RST segments, or it can throttle the available
+bandwidth down to ~zero, explicitly or through routing.
+
+For these reasons:
+
+1. we include the full protocol stack in the definition of the
+protocol (i.e., `openvpn+obfs4+kcp` rather than `obfs4+kcp`),
+to reflect the fact that the packet dynamics differ depending on
+the inner protocols of the protocol stack;
+
+2. we include the possibility of reporting performance and
+latency metrics, to allow providers to make statements about
+the overall quality of an established tunnel.
+
+Additionally, as explained in the [Use Cases](#use-cases) section, this
+specification allows VPN providers to selectively choose how much
+information to disclose about their VPN architecture, thus ensuring they
+are in control of how much information to expose to the adversary.
+
+In the interest of facilitating research, VPN providers MAY choose to
+publicly disclose more information about *some* endpoints, while
+being more secretive about other endpoints.
+
+
+## Tunnel Lifecycle
+
+We model the tunnel lifecycle using this state machine:
+
+```
+                                                  .--> <<active_measurement>>
+.---------.                      .-------------. /           /
+| Initial | --> <<creation>> --> | Established |<-----------'
+`---------'                      `-------------'
+                                        |
+                                        V
+                                 .-------------.
+                                 |    Final    |
+                                 `-------------'
+
+```
+
+In the `Initial` state the tunnel has not been created yet. The
+`creation` operation creates the tunnel and transitions the state
+to `Established`. While in `Established`, the VPN app may run
+active measurements to assess the tunnel performance. For example,
+the app may `ping` the VPN gateway or well-known addresses,
+or it could run network-performance tests, such as NDT. The
+tunnel enters the `Final` state when it is closed.
+
+We define the following *operations*:
+
+**creation**: this operation creates the tunnel. It SHOULD NOT include
+DNS lookup or other operations required to fetch the IP address and the
+required certificates. For example, if we are using `openvpn+obfs4`, the
+`creation` operation SHOULD be about performing the TCP three-way
+handshake, the OBFS4 handshake, and the OpenVPN handshake.
+
+**tunnel_ping**: this operation is the active measurement using a
+`ping`-like tool to ping well-known addresses.
+
+**tunnel_ndt_download**: this operation is the active measurement using
+NDT to measure the download speed over the tunnel.
+
+**tunnel_ndt_upload**: like `tunnel_ndt_download`, but for the upload.
+
+Note that both `tunnel_ping` and `tunnel_ndt_{down,up}load` are optional and
+MAY occur multiple times during the tunnel lifecycle.
+
+For each `tunnel_xx` operation, we also define the equivalent
+`baseline_xx` operation optionally performed before creating the
+tunnel to establish a baseline.
+
+This specification allows a VPN provider to submit aggregate
+reports about the `creation`, `tunnel_ping`, etc.,
+operations. A future version of this specification may extend
+the set of operations to include more active measurements,
+or to include additional information about tunnels, such as
+the aggregate average duration of tunnels, the number of bytes
+transmitted and received, etc.
+
+We use the `phase` keyword (borrowed from NEL terminology) to indicate
+the *operation* associated with specific statements.
+
+
+## Data Format (Envelope)
+
+The metaphor used by OONI measurements is that there is
+a `test_name` describing a specific network testing methodology
+where we make a statement about a specific resource (the
+`input` field). This happens in the context of a given ASN
+and CC. Each specific experiment type (i.e., `test_name`) has
+its own specific data format, which is described by the
+experiment-specific `test_keys` field.
+
+Accordingly, we define the `aggregate_tunnel_metrics` experiment
+name and sketch out the overall envelope as follows:
+
+```JavaScript
+{
+  "annotations": {
+    "upstream_collector": "riseup-par-01"
+  },
+  "data_format_version": "0.2.0",
+  "input": "openvpn+obfs4+kcp://riseup.net/",
+  "measurement_start_time": "2024-10-29 00:00:00",
+  "probe_asn": "AS1234",
+  "probe_cc": "IT",
+  "test_keys": { /* ... */ },
+  "test_name": "aggregate_tunnel_metrics",
+  "test_runtime": 0.0,
+  "test_start_time": "2024-10-29 00:00:00",
+  "test_version": "0.1.0"
+}
+```
+
+Here is the justification for setting the fields as such:
+
+- `upstream_collector` (`string`): the name of the upstream
+collector that collected and aggregated metrics before
+submitting to the OONI collector. This is useful to know
+which entity collected the data and submitted the aggregate.
+
+- `data_format_version` (`string`): the version of the data
+format used by OONI, which must be exactly equal to `0.2.0`.
+
+- `input` (`URL`): the input URL format is consistent with
+the OONI `openvpn` experiment and is discussed in more detail below.
+
+- `measurement_start_time` (`Date`): the UTC (without
+explicit indication!) moment in which this aggregate was produced.
+
+- `probe_asn` (`^AS[0-9]+$`): the ASN of the set of
+probes that this measurement is about.
+
+- `probe_cc` (`^[A-Z]{2}$`): the country code of the set
+of probes that this measurement is about.
+
+- `test_name` (`string`): must be `aggregate_tunnel_metrics`.
+
+- `test_runtime` (`float64`): runtime of this test in
+seconds, which is set to zero because there is no
+real runtime here.
+
+- `test_start_time` (`Date`): set equal to `measurement_start_time`,
+since there is no real test here, just an aggregate.
+
+- `test_version` (`^[0-9]+\.[0-9]+\.[0-9]+$`): the version of the
+test, which will evolve as we evolve this specification.
+
+Regarding the `input`, its main purpose is to allow searching
+for measurements through the OONI API. Whether the aggregate
+tunnel metrics will be exposed by the OONI API is an orthogonal
+topic, which requires coordination with the OONI team.
+
+The `input` format is the same as in the `openvpn` experiment,
+with some additions that are specific to this spec:
+
+```
+{protocol}://{provider}/?{query_string}
+```
+
+More specifically:
+
+- `{protocol}` (`string`): the VPN protocol stack being used,
+for example `openvpn`, `openvpn+obfs4`, etc.
+
+- `{provider}` (`string`): the entity that manages the endpoints, for
+example `riseup.net`.
+
+- the `{query_string}` contains the following parameters (we use the
+`<type>|undefined` syntax to mark optional fields):
+
+    - `address` (`string|undefined`): the endpoint IPv4/IPv6 address;
+
+    - `asn` (`string|undefined`): the endpoint ASN;
+
+    - `cc` (`string|undefined`): the endpoint country code;
+
+    - `hostname` (`string|undefined`): the endpoint hostname;
+
+    - `port` (`string|undefined`): the endpoint port.
+
+For example:
+
+```
+openvpn+obfs4://riseup.net/?address=51.159.197.108&asn=AS12876&port=443
+```
+
+The above URL describes a currently existing RiseupVPN endpoint.
+
+
+## Data Format (Test Keys)
+
+The `test_keys` format is specific to this experiment. Here is what they
+look like in JSON format (we have added comments to explain what each
+part means):
+
+```JavaScript
+{
+  // for this provider
+  "provider": "riseup.net",
+
+  // with this `endpoint` scope
+  "scope": "endpoint",
+  "endpoint_hostname": "vpn02-par.riseup.net",
+  "endpoint_address": "51.159.197.108",
+  "endpoint_port": 443,
+  "protocol": "openvpn+obfs4",
+  "asn": "AS12876",
+  "cc": "FR",
+
+  // alternatively, with this `endpoint_pool` scope
+  "scope": "endpoint_pool",
+  "protocol": "openvpn+obfs4",
+  "cc": "DE",
+
+  // alternatively, with this `global` scope
+  "scope": "global",
+  "protocol": "openvpn+obfs4",
+
+  // in this time window
+  "time_window": {
+    "from": "2024-10-29T00:00:00Z",
+    "to": "2024-10-30T00:00:00Z"
+  },
+
+  // we make the following statements
+  "bodies": [
+
+    {
+      // during the tunnel creation phase
+      "phase": "creation",
+
+      // with this sample size
+      "sample_size": 200,
+
+      // we make a statement about network errors
+      "type": "network-error",
+
+      // and the statement is that we fail 66%
+      // of the time with tcp.timed_out
+      "failure_ratio": 0.66,
+      "error": "tcp.timed_out"
+    },
+
+    {
+      // during the tunnel_ping phase
+      "phase": "tunnel_ping",
+
+      // targeting the 8.8.8.8 IP address
+      "target_address": "8.8.8.8",
+
+      // with this sample size
+      "sample_size": 500,
+
+      // we make a statement about the latency distribution
+      "type": "ping",
+
+      // and the statement is that we see the following
+      // latency distribution in milliseconds.
+      "latency_ms": {
+        "p25": 100,
+        "p50": 150,
+        "p75": 200,
+        "p99": 1100
+      }
+    },
+
+    {
+      // during the tunnel_ndt_download phase
+      "phase": "tunnel_ndt_download",
+
+      // targeting this NDT server
+      "target_hostname": "ndt-mlab2-mil07.mlab-oti.measurement-lab.org",
+      "target_address": "162.213.100.88",
+      "target_port": 443,
+
+      // with this sample size
+      "sample_size": 500,
+
+      // we make a statement about latency and download speed
+      "type": "ndt_download",
+
+      // and the statement is that we see the following
+      // latency distribution in milliseconds and the
+      // following download speed distribution in Mbit/s.
+      "latency_ms": {
+        "p25": 100,
+        "p50": 150,
+        "p75": 200,
+        "p99": 1100
+      },
+      "speed_mbits": {
+        "p25": 4,
+        "p50": 7,
+        "p75": 11,
+        "p99": 200
+      }
+    },
+
+    // ...
+  ]
+}
+```
+
+More formally, this is the meaning of the fields (where we indicate
+optional fields using the `<type>|undefined` syntax):
+
+- `provider` (`string`): the provider of the tunnel service, using
+the same syntax as defined for the `input` field.
+
+- `scope` (`enum`): the scope, as defined above.
+
+- `endpoint_hostname` (`string|undefined`): the endpoint hostname.
+
+- `endpoint_address` (`string|undefined`): the endpoint IPv4/IPv6 address.
+
+- `endpoint_port` (`uint16|undefined`): the endpoint port.
+
+- `asn` (`^AS[0-9]+$|undefined`): the ASN of the endpoint or endpoint pool.
+
+- `cc` (`^[A-Z]{2}$|undefined`): the country code of the endpoint or endpoint pool.
+
+- `protocol` (`enum`): the protocol as defined above.
+
+- `time_window` (`object`): the time window in which the
+statements we are making are valid.
+
+- `bodies` (`array`): an array of NEL-like objects.
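
As an illustration of how the `scope` field constrains the other top-level fields listed above, here is a minimal Python sketch. It is not part of the specification: the `build_test_keys` helper and the `SCOPE_FIELDS` table are hypothetical names, and the choice to silently drop fields that do not belong to the selected scope is an assumption made for the example.

```python
from typing import Any, Dict, Iterable

# Assumption: which optional fields each `scope` may carry, inferred from
# the use cases (endpoint, endpoint_pool, global) described in this spec.
SCOPE_FIELDS = {
    "endpoint": ("endpoint_hostname", "endpoint_address", "endpoint_port", "asn", "cc"),
    "endpoint_pool": ("asn", "cc"),
    "global": (),
}


def build_test_keys(
    provider: str,
    protocol: str,
    scope: str,
    time_window: Dict[str, str],
    bodies: Iterable[Dict[str, Any]],
    **endpoint_info: Any,
) -> Dict[str, Any]:
    """Build a `test_keys` dict that only discloses what the scope allows."""
    if scope not in SCOPE_FIELDS:
        raise ValueError(f"unknown scope: {scope!r}")
    keys: Dict[str, Any] = {"provider": provider, "scope": scope, "protocol": protocol}
    # Copy only the fields allowed for this scope, skipping missing ones.
    for name in SCOPE_FIELDS[scope]:
        value = endpoint_info.get(name)
        if value is not None:
            keys[name] = value
    keys["time_window"] = dict(time_window)
    keys["bodies"] = list(bodies)
    return keys
```

With this sketch, a provider using the `endpoint_pool` scope could pass the full endpoint description and still end up disclosing only `protocol` and `cc`, which matches the intent of the less detailed use cases.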
+
+In turn, the common structure of each NEL-like object is the following:
+
+```JSON
+{
+  "phase": "",
+  "sample_size": 0,
+  "type": ""
+}
+```
+
+where:
+
+- `phase` (`enum`): is the operation as defined above.
+
+- `sample_size` (`int53|undefined`): the number of samples
+that are being considered for making this statement, appropriately
+rounded, or directly omitted to preserve privacy. We RECOMMEND
+rounding to the nearest multiple of 100 and omitting it below 1000.
+
+- `type` (`enum`): the type of statement that is being made.
+
+The `network-error` NEL-like object looks like this:
+
+```JavaScript
+{
+  // ... common NEL-like fields ...
+  "type": "network-error",
+  "failure_ratio": 0.0,
+  "error": ""
+}
+```
+
+where:
+
+- `failure_ratio` (`float64`): the ratio of the number of
+failures over the population sample size.
+
+- `error` (`enum|undefined`): the network error as defined by NEL
+or `undefined` if we don't know or don't want to report.
+
+Regarding network errors, note that the `creation` phase
+SHOULD NOT include DNS operations or other operations required
+to obtain information useful for creating the tunnel. Rather,
+this information is part of the VPN bootstrap process and
+is out of the scope of this document.
+
+The `ping` NEL-like object looks like this:
+
+```JavaScript
+{
+  // ... common NEL-like fields ...
+  "type": "ping",
+  "target_address": "",
+  "latency_ms": {}
+}
+```
+
+where:
+
+- `target_address` (`IPAddr`): the target IP address.
+
+- `latency_ms` (`object`): the latency distribution in
+milliseconds containing the latency percentiles indicated
+using `pXX` (e.g., `p50` is the median).
+
+The `ndt_download` NEL-like object looks like this:
+
+```JavaScript
+{
+  // ... common NEL-like fields ...
+  "type": "ndt_download",
+  "target_hostname": "",
+  "target_address": "",
+  "target_port": 0,
+  "latency_ms": {},
+  "speed_mbits": {}
+}
+```
+
+where:
+
+- `target_hostname` (`string`): the target hostname.
+
+- `target_address` (`IPAddr`): the target IP address.
+
+- `target_port` (`uint16`): the target port.
+
+- `latency_ms` (`object`): is exactly like in `ping`.
+
+- `speed_mbits` (`object`): is like `latency_ms` but for
+the download speed expressed in Mbit/s.
+
+Note that a `phase` is not restricted to use a specific NEL-like
+object `type`. For example:
+
+```JSON
+{
+  "phase": "tunnel_ndt_download",
+  "target_hostname": "ndt-mlab2-mil07.mlab-oti.measurement-lab.org",
+  "sample_size": 500,
+  "type": "network-error",
+  "failure_ratio": 0.66,
+  "error": "dns.name_not_resolved"
+}
+```
+
+The previous JSON snippet contains a statement that in 66% of the
+cases, out of ~500 samples, the DNS lookup failed.
+
+
+## Implementation Requirements
+
+This specification assumes that there is a *collector* for NEL-like
+reports submitted by VPN apps. Defining how this happens is out of
+the scope of this document, but the "tunnel telemetry" spec is a good
+starting point. The privacy implications of submitting aggregated
+measurements are in scope and are discussed in a dedicated section below.
+
+As far as this specification is concerned, it is also important to note
+that VPN apps SHOULD probably be allowed to cache unsent reports for up
+to one week. This is to ensure that reports are not lost in case of
+heavy censorship or just widespread internet failure.
+
+The collector will be responsible for storing the incoming reports
+into a spool directory organised in daily buckets. After putting the
+reports into the spool, the job of the collector is done.
+
+A separate component, the *submitter*, will periodically process
+the spool directory, aggregating existing reports, deleting the
+already-processed reports, and sending the aggregated reports to
+the OONI collector using the data format defined in this document.
+
+The initial aggregation period is set to one week, anticipating
+that, at the outset, there will be a low number of reports. We will
+revise this decision based on actual numbers.
+
+In principle, determining whether an existing report has already
+been processed is a simple matter. It suffices to delete the files
+that have already been processed and "close" old buckets. Yet,
+since the VPN app is allowed to cache reports up to one week, it
+is possible that the submitter will receive reports for already
+closed buckets. Additionally, malfunction in the VPN app may cause
+reports to be submitted multiple times. A future version of this
+spec will articulate how to solve this problem.
+
+Summarising, this discussion leads us to the following architecture:
+
+```
+.---------.                 .-----------.
+| VPN App | --> <<push>> -> | collector | --> {spool}
+`---------'                 `-----------'
+
+                        .-----------.                    .------.
+{spool} --> <<pop>> --> | submitter | --> <<submit>> --> | OONI |
+                        `-----------'                    `------'
+```
+
+where it is intended that the semantics of `<<pop>>` includes both
+processing and removing a report from the spool.
+
+
+## Privacy Considerations
+
+Users of VPN apps that submit NEL-like reports that end up being
+aggregated and resubmitted to the OONI collector MUST be asked for
+their informed consent. The informed consent SHOULD clearly
+specify the purpose of the data collection (i.e., collecting
+data for evaluating the effectiveness of specific protocol stacks
+in creating usable VPN tunnels). Additionally, users MUST be
+able to opt out of the process at any time.
+
+Additionally, the aggregation period and the amount of information
+disclosed in the aggregated measurements submitted to OONI MUST
+take into account the anonymity set.
+
+
+## Security Considerations
+
+In principle, it is not possible to absolutely trust measurements
+submitted by unknown parties. The attack we want to defend
+against is the injection of bogus aggregate measurements, which has
+more impact than the injection of bogus individual OONI measurements,
+since less information needs to be submitted to the OONI collector
+to have a significant impact.
+
+OONI is aware of the issue posed by the injection of bogus
+measurements, and they are considering implementing an anonymous
+probe ID mechanism to mitigate this issue.
+
+A future version of this specification will consider integrating
+this functionality into the submitter, to facilitate OONI's job
+of identifying reliable data sources.
+
+The related problem of how to evaluate the reliability of the VPN app
+instances is out of the scope of this document.
+
+
+## LEAP Implementation Details
+
+(This section is non-normative and describes the architecture
+of LEAP with respect to data collection and submission. It
+mainly serves the purpose of explaining the original context
+in which we implemented this specification in production.)
+
+LEAP is currently collecting logs from docker-compose-based field
+testing clients. There is a logs processing pipeline that transforms
+these textual logs into CSV files, shown in a dashboard.
+
+The initial implementation of this specification could be as
+simple as a script that processes the CSV files, transforms
+their rows into aggregated measurements, and submits them to the
+OONI collector. This statement about simplicity is grounded
+in the understanding that the existing logs processing pipeline
+is already able to deduplicate incoming field testing reports.
+
+To make the system compatible with the design described above, we
+will also need to modify the logs pipeline to put the CSV files
+into a spool directory organised in daily buckets.
+
+This leads us to the following systems architecture:
+
+```
+
+.-----------.                  .---------------.
+| FT Client | --> <<push>> --> | Logs pipeline | --> {spool}
+`-----------'                  `---------------'
+
+                        .-----------.                    .------.
+{spool} --> <<pop>> --> | submitter | --> <<submit>> --> | OONI |
+                        `-----------'                    `------'
+```
+
+where `FT Client` is the field testing client, and it is intended
+that the `<<pop>>` operation includes both processing and removing
+a given CSV file from the corresponding bucket.
+
+A future version of this specification will address the problem of
+extending this architecture to account for the submission of NEL-like
+reports from the Bitmask-VPN app.
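
The aggregation step of the script described in this section could be sketched as follows. This is a minimal, non-normative Python illustration: the row layout (dicts with hypothetical `phase` and `error` keys, where an empty `error` means success) is an assumption about the CSV files and is not defined by this document, and the function names are invented for the example. The sample-size handling follows the rounding recommended for `sample_size` in the test keys section.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Optional


def round_sample_size(n: int) -> Optional[int]:
    """Round to the nearest multiple of 100, or return None below 1000."""
    if n < 1000:
        return None  # omitted to preserve privacy
    return (n + 50) // 100 * 100


def aggregate_creation_failures(rows: Iterable[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Turn per-attempt rows into `network-error` bodies for the creation phase.

    Assumption: each row has a `phase` and an `error` key; an empty
    `error` string marks a successful tunnel creation.
    """
    creation = [row for row in rows if row["phase"] == "creation"]
    total = len(creation)
    if total == 0:
        return []
    errors = Counter(row["error"] for row in creation if row["error"])
    size = round_sample_size(total)
    bodies: List[Dict[str, Any]] = []
    for error, count in sorted(errors.items()):
        body: Dict[str, Any] = {
            "phase": "creation",
            "type": "network-error",
            "failure_ratio": round(count / total, 2),
            "error": error,
        }
        if size is not None:
            body["sample_size"] = size
        bodies.append(body)
    return bodies
```

A real submitter would additionally group rows by probe ASN and CC, wrap the resulting bodies into the measurement envelope, and remove the processed CSV file from its bucket; those steps are left out of this sketch.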