Simone Basso · 9b27be08 · 7a865859 · bef1c3a7 · be58cacf · 748ada62
--- a/proposals/005-aggregate-tunnel-metrics.md 0 → 100644

+++ b/proposals/005-aggregate-tunnel-metrics.md 0 → 100644

+# Aggregate Tunnel Metrics
+
+* Author: bassosimone
+* Reviewers: cyberta
+* Status: under discussion
+
+This specification describes how to format and submit aggregate
+tunnel metrics to the OONI collector. Additionally, we include
+sections with design and implementation considerations.
+
+## Index
+
+1. [Overview](#overview)
+2. [Definitions](#definitions)
+3. [Use Cases](#use-cases)
+4. [Goals](#goals)
+5. [Non-Goals](#non-goals)
+6. [Related Work](#related-work)
+7. [Threat Model](#threat-model)
+8. [Tunnel Lifecycle](#tunnel-lifecycle)
+9. [Data Format (Envelope)](#data-format-envelope)
+10. [Data Format (Test Keys)](#data-format-test-keys)
+11. [Implementation Requirements](#implementation-requirements)
+12. [Privacy Considerations](#privacy-considerations)
+13. [Security Considerations](#security-considerations)
+14. [LEAP Implementation Details](#leap-implementation-details)
+
+
+## Overview
+
+The objective of this specification is to allow a VPN provider
+to report aggregate tunnel metrics to the OONI collector. We submit
+to the OONI collector for archival reasons. The submitted metrics
+range from availability to tunnel performance metrics.
+
+The aim is to allow researchers to evaluate the viability of
+using a given protocol in a specific country and ASN. This
+information will be provided with varying levels of granularity
+depending on the VPN provider's needs.
+
+
+## Definitions
+
+**AS**: autonomous system.
+
+**ASN**: autonomous system number.
+
+**CC**: country code.
+
+**bootstrap**: the process by which the VPN fetches all the
+information required to create tunnels.
+
+**tunnel**: communication between two network endpoints
+encapsulating TCP/IP packets.
+
+**protocol**: the protocol stack used by the tunnel to encapsulate
+packets that the adversary can observe.
+
+**(tunnel) endpoint**: either a bridge or a gateway.
+
+**measurement**: JSON document submitted to the OONI collector.
+
+
+## Use Cases
+
+In all use cases, a VPN provider wants to make *statements* about
+the availability and performance of accessing *something* they manage
+from a set of clients located in a given ASN and CC in a given time
+period. What sets use cases apart is the *something* about which
+they are making the statement.
+
+More specifically, we cover these use cases:
+
+**endpoint**: the statement is about a specific tunnel endpoint. For
+example, in the case of RiseupVPN, an endpoint could be:
+
+- hostname: `vpn02-par.riseup.net`
+- address: `51.159.197.108`
+- port: `443`
+- protocol: `openvpn+obfs4`
+- asn: `12876`
+- cc: `FR`
+
+which is as detailed and transparent as possible.
+
+**endpoint_pool**: the statement is about a homogeneous pool of
+endpoints. RiseupVPN publishes its endpoints, so this use case does
+not make sense for them. However, another VPN provider may want
+to disclose less information than in the `endpoint` use case and
+say something like:
+
+- protocol: `openvpn+obfs4`
+- cc: `FR`
+
+which does not give away the exact endpoints, but still allows
+making statements regarding the availability and performance of
+using `openvpn+obfs4` endpoints located in France.
+
+**global**: the statement is about the whole set of endpoints
+and only includes information about the protocol:
+
+- protocol: `openvpn+obfs4`
+
+Each statement is a single *measurement* submitted to the OONI
+collector for archival reasons. The `scope` field inside the
+measurement `test_keys` identifies the specific use case.
+
+In summary, all use cases allow a researcher to evaluate the
+availability and performance of tunnels using specific protocols,
+but the granularity of the information disclosed varies.
+
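The three scopes can be read as progressively narrower views of the same endpoint description. The following is a minimal sketch of that idea. The `SCOPE_FIELDS` mapping and the `disclose` helper are hypothetical names introduced here for illustration, and the per-scope field lists simply mirror the examples given in this section.

```python
# Fields disclosed by each scope, following the examples in this section.
# This mapping is an illustrative assumption, not a normative list.
SCOPE_FIELDS = {
    "endpoint": ("hostname", "address", "port", "protocol", "asn", "cc"),
    "endpoint_pool": ("protocol", "cc"),
    "global": ("protocol",),
}


def disclose(endpoint: dict, scope: str) -> dict:
    """Return only the endpoint fields that the given scope discloses."""
    if scope not in SCOPE_FIELDS:
        raise ValueError(f"unknown scope: {scope}")
    return {key: endpoint[key] for key in SCOPE_FIELDS[scope] if key in endpoint}


# The RiseupVPN endpoint used as an example in this section.
endpoint = {
    "hostname": "vpn02-par.riseup.net",
    "address": "51.159.197.108",
    "port": 443,
    "protocol": "openvpn+obfs4",
    "asn": "12876",
    "cc": "FR",
}
pool_view = disclose(endpoint, "endpoint_pool")
global_view = disclose(endpoint, "global")
```

Here `pool_view` keeps only the protocol and the country code, while `global_view` keeps only the protocol.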
+
+## Goals
+
+1. *Extensible spec*: we aim to specify a baseline spec that is
+simple to implement for LEAP *right now*, while, at the same time,
+allowing us to extend it to submit more information later on.
+
+2. *Anonymity set*: the spec allows incrementally including more
+metrics as the user population grows.
+
+3. *~Standards compliance*: where possible, reuse concepts and
+terminology from existing standards and related work done in
+this space, including Network Error Logging (NEL), Ain Ghazal's
+"tunnel telemetry", DPIDetector, and the OONI specifications.
+
+
+## Non-Goals
+
+1. Specifying how a VPN client should submit tunnel telemetry
+data to the VPN provider itself. This scope has already been explored
+by Ain Ghazal under the codename of "tunnel telemetry". We aim
+to leverage the work done by Ain Ghazal with respect to NEL
+error reporting. Yet, we believe that the mechanism with which
+a VPN client reports to its own infrastructure should
+directly use NEL without being constrained by the specific
+details of the OONI measurement format.
+
+2. Specifying how a VPN client should directly submit individual tunnel
+establishment measurements to the OONI collector. Adding this
+requirement would significantly complicate the design and the job
+of the data analyst, because the same format would need to accommodate
+both aggregate and individual measurements. That said, this use case
+seems best scoped by continuing to work on Ain Ghazal's
+"tunnel telemetry" proposal.
+
+3. Measuring the VPN bootstrap phase. This phase is not
+tunnel-specific, rather it is VPN-specific. We believe we
+should not conflate this information into the same
+measurements. We could write a bootstrap-related spec. Also,
+the aggregation and reporting requirements of the
+bootstrap may differ from those for tunnels.
+
+
+## Related Work
+
+1. [OONI spec](https://github.com/ooni/spec);
+
+2. [Tunnel Telemetry repository](https://github.com/ainghazal/tunnel-telemetry);
+
+3. [Tunnel Telemetry proposal](https://github.com/ooni/spec/pull/274);
+
+4. [DPIDetector](https://dpidetector.org/);
+
+5. [Network Error Logging](https://www.w3.org/TR/network-error-logging/).
+
+
+## Threat Model
+
+A tunnel comes to life with the establishment of a connection
+between a client and an endpoint. The client establishes the tunnel
+by initiating the communication.
+
+While the adversary will only be able to observe the most external
+protocol layer (e.g., `obfs4+kcp`), we assume it can also use
+other identification techniques, including conversation analysis,
+statistical analysis on packet sizes, and active probing.
+
+Additionally, the adversary can interfere with the communication
+after it has been established. For example, it can mess with the
+TCP state by injecting RST segments, or it can throttle the available
+bandwidth down to ~zero, explicitly or through routing.
+
+For these reasons:
+
+1. we include the full protocol stack into the definition of the
+protocol (i.e., `openvpn+obfs4+kcp` rather than `obfs4+kcp`),
+to reflect the fact that the packet dynamics differ depending on
+the inner protocols of the protocol stack;
+
+2. we include the possibility of reporting performance and
+latency metrics, to allow providers to make statements about
+the overall quality of an established tunnel.
+
+Additionally, as explained in the Use Cases section, this specification
+allows VPN providers to selectively choose how much information
+to disclose about their VPN architecture, thus ensuring they are
+in control of how much information to expose to the adversary.
+
+In the interest of facilitating research, VPN providers MAY choose to
+publicly disclose more information about *some* endpoints, while
+being more secretive about other endpoints.
+
+
+## Tunnel Lifecycle
+
+We model the tunnel lifecycle using this state machine:
+
+```
+                                                   .--> <<active_measurement>>
+.---------.                      .-------------.  /          /
+| Initial | --> <<creation>> --> | Established |<-----------'
+`---------'                      `-------------'
+                                        |
+                                        V
+                                 .-------------.
+                                 |   Final     |
+                                 `-------------'
+
+```
+
+In the `Initial` state the tunnel has not been created yet. The
+`creation` operation creates the tunnel and transitions the state
+to `Established`. While in `Established`, the VPN app may run
+active measurements to assess the tunnel performance. For example,
+the app may `ping` the VPN gateway or well-known addresses,
+or it could run network-performance tests, such as NDT. The
+tunnel enters into a `Final` state when it is closed.
+
+We define the following *operations*:
+
+**creation**: this operation creates the tunnel. It SHOULD NOT include
+DNS lookup or other operations required to fetch the IP address and the
+required certificates. For example, if we are using `openvpn+obfs4`, the
+`creation` operation SHOULD consist of performing the TCP three-way
+handshake, the obfs4 handshake, and the OpenVPN handshake.
+
+**tunnel_ping**: this operation is the active measurement using a
+`ping`-like tool to ping well-known addresses.
+
+**tunnel_ndt_download**: this operation is the active measurement using
+NDT to measure the download speed over the tunnel.
+
+**tunnel_ndt_upload**: like `tunnel_ndt_download`, but for the upload.
+
+Note that both `tunnel_ping` and `tunnel_ndt_{down,up}load` are optional and
+MAY occur multiple times during the tunnel lifecycle.
+
+For each `tunnel_xx` operation, we also define the equivalent
+`baseline_xx` operation optionally performed before creating the
+tunnel to establish a baseline.
+
+This specification allows a VPN provider to submit aggregate
+reports about the `creation`, `tunnel_ping`, etc.,
+operations. A future version of this specification may extend
+the set of operations to include more active measurements,
+or to include additional information about tunnels, such as
+the aggregate average duration of tunnels, the number of bytes
+transmitted and received, etc.
+
+We use the `phase` keyword (borrowed from NEL terminology) to indicate
+the *operation* associated with specific statements.
+
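To make the state machine and the naming of operations concrete, here is a minimal sketch. The `State`, `Tunnel`, and `baseline_phase` names are hypothetical and only illustrate the transitions drawn in the diagram. The sketch does not model `baseline_xx` measurements, which happen before the tunnel exists.

```python
from enum import Enum


class State(Enum):
    INITIAL = "Initial"
    ESTABLISHED = "Established"
    FINAL = "Final"


# Active measurements that may run while the tunnel is Established.
TUNNEL_PHASES = ("tunnel_ping", "tunnel_ndt_download", "tunnel_ndt_upload")


def baseline_phase(tunnel_phase: str) -> str:
    """Map a `tunnel_xx` operation to its `baseline_xx` equivalent."""
    if tunnel_phase not in TUNNEL_PHASES:
        raise ValueError(f"not a tunnel_xx operation: {tunnel_phase}")
    return "baseline_" + tunnel_phase[len("tunnel_"):]


class Tunnel:
    """Tracks the lifecycle state and the phases observed so far."""

    def __init__(self) -> None:
        self.state = State.INITIAL
        self.phases = []

    def create(self) -> None:
        if self.state is not State.INITIAL:
            raise RuntimeError("creation is only valid in the Initial state")
        self.state = State.ESTABLISHED
        self.phases.append("creation")

    def measure(self, phase: str) -> None:
        if self.state is not State.ESTABLISHED:
            raise RuntimeError("active measurements require an Established tunnel")
        if phase not in TUNNEL_PHASES:
            raise ValueError(f"unknown active measurement: {phase}")
        self.phases.append(phase)

    def close(self) -> None:
        if self.state is not State.ESTABLISHED:
            raise RuntimeError("only an Established tunnel can be closed")
        self.state = State.FINAL


tunnel = Tunnel()
tunnel.create()
tunnel.measure("tunnel_ping")
tunnel.measure("tunnel_ping")  # active measurements may repeat
tunnel.close()
```

Each string recorded in `tunnel.phases` is a value that could later appear in the `phase` field of a statement.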
+
+## Data Format (Envelope)
+
+The metaphor used by OONI measurements is that there is
+a `test_name` describing a specific network testing methodology
+where we make a statement about a specific resource (the
+`input` field). This happens in the context of a given ASN
+and CC. Each specific experiment type (i.e., `test_name`) has
+its own specific data format, which is described by the
+experiment-specific `test_keys` field.
+
+Accordingly, we define the `aggregate_tunnel_metrics` experiment
+name and sketch out the overall envelope as follows:
+
+```JavaScript
+{
+  "annotations": {
+    "upstream_collector": "riseup-par-01"
+  },
+  "data_format_version": "0.2.0",
+  "input": "openvpn+obfs4+kcp://riseup.net/",
+  "measurement_start_time": "2024-10-29 00:00:00",
+  "probe_asn": "AS1234",
+  "probe_cc": "IT",
+  "test_keys": { /* ... */ },
+  "test_name": "aggregate_tunnel_metrics",
+  "test_runtime": 0.0,
+  "test_start_time": "2024-10-29 00:00:00",
+  "test_version": "0.1.0"
+}
+```
+
+Here is the justification for setting the fields as such:
+
+- `upstream_collector` (`string`): the name of the upstream
+collector that collected and aggregated metrics before
+submitting to the OONI collector. This is useful to know
+which entity collected the data and submitted the aggregate.
+
+- `data_format_version` (`string`): the version of the data
+format used by OONI, which must be exactly equal to `0.2.0`.
+
+- `input` (`URL`): the input URL format is consistent with
+the OONI `openvpn` experiment and is discussed in more detail below.
+
+- `measurement_start_time` (`Date`): the UTC (without
+explicit indication!) moment in which this aggregate was produced.
+
+- `probe_asn` (`^AS[0-9]+$`): the ASN of the set of
+probes that this measurement is about.
+
+- `probe_cc` (`^[A-Z]{2}$`): the country code of the set
+of probes that this measurement is about.
+
+- `test_name` (`string`): must be `aggregate_tunnel_metrics`.
+
+- `test_runtime` (`float64`): runtime of this test in
+seconds, which is set to zero because there is no
+real runtime here.
+
+- `test_start_time` is set equal to `measurement_start_time`,
+since there is not really a test here, this is just an aggregate.
+
+- `test_version` (`^[0-9]+\.[0-9]+\.[0-9]+$`): the version of the
+test, which will evolve as we evolve this specification.
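
The field rules above can be sketched as a small helper that wraps already-computed test keys into an envelope. The `make_envelope` function and its parameter names are hypothetical; the constants come from the list above. The sketch assumes that `produced_at` is a timezone-aware `datetime`.

```python
import re
from datetime import datetime, timezone


def make_envelope(input_url, probe_asn, probe_cc, test_keys, produced_at,
                  upstream_collector):
    """Build the measurement envelope around already-computed test keys."""
    if not re.fullmatch(r"AS[0-9]+", probe_asn):
        raise ValueError(f"invalid probe_asn: {probe_asn}")
    if not re.fullmatch(r"[A-Z]{2}", probe_cc):
        raise ValueError(f"invalid probe_cc: {probe_cc}")
    # Timestamps are in UTC, without an explicit timezone indication.
    stamp = produced_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return {
        "annotations": {"upstream_collector": upstream_collector},
        "data_format_version": "0.2.0",
        "input": input_url,
        "measurement_start_time": stamp,
        "probe_asn": probe_asn,
        "probe_cc": probe_cc,
        "test_keys": test_keys,
        "test_name": "aggregate_tunnel_metrics",
        "test_runtime": 0.0,  # there is no real runtime for an aggregate
        "test_start_time": stamp,  # same as measurement_start_time
        "test_version": "0.1.0",
    }


envelope = make_envelope(
    "openvpn+obfs4+kcp://riseup.net/",
    "AS1234",
    "IT",
    {},
    datetime(2024, 10, 29, tzinfo=timezone.utc),
    "riseup-par-01",
)
```

The resulting `envelope` has the same shape as the JSON example above, with an empty `test_keys` placeholder.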
+
+Regarding the `input`, its main purpose is to allow searching
+for measurements through the OONI API. Whether the aggregate
+tunnel metrics will be exposed by the OONI API is an orthogonal
+topic, which requires coordination with the OONI team.
+
+The `input` format is the same as the `openvpn` experiment,
+with some additions that are specific to this spec:
+
+```
+{protocol}://{provider}/?{query_string}
+```
+
+More specifically:
+
+- `{protocol}` (`string`): the VPN protocol stack being used,
+therefore, `openvpn`, `openvpn+obfs4`, etc.
+
+- `{provider}` (`string`): the entity that manages the endpoints, for
+example `riseup.net`.
+
+- the `{query_string}` contains the following parameters (we use the
+`<type>|undefined` syntax to mark optional fields):
+
+    - `address` (`string|undefined`): the endpoint IPv4/IPv6 address;
+
+    - `asn` (`string|undefined`): the endpoint ASN;
+
+    - `cc` (`string|undefined`): the endpoint country code;
+
+    - `hostname` (`string|undefined`): the endpoint hostname;
+
+    - `port` (`string|undefined`): the endpoint port;
+
+For example:
+
+```
+openvpn+obfs4://riseup.net/?address=51.159.197.108&asn=AS12876&port=443
+```
+
+The URL above describes a currently existing RiseupVPN endpoint.
+
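As an illustration, the `input` URL could be assembled as follows. This is a sketch: the `build_input` helper is a hypothetical name, and emitting the query parameters in a fixed alphabetical order is our assumption (it keeps the resulting strings stable and therefore easier to search for), not a requirement stated above.

```python
from urllib.parse import urlencode

# Query-string parameters defined by this specification, all optional.
QUERY_KEYS = ("address", "asn", "cc", "hostname", "port")


def build_input(protocol: str, provider: str, **params) -> str:
    """Build the `input` URL from the protocol, provider, and optional parameters."""
    unknown = sorted(set(params) - set(QUERY_KEYS))
    if unknown:
        raise ValueError(f"unsupported query parameters: {unknown}")
    pairs = [(key, str(params[key])) for key in QUERY_KEYS
             if params.get(key) is not None]
    url = f"{protocol}://{provider}/"
    # Omit the "?" entirely when there is no query string.
    return f"{url}?{urlencode(pairs)}" if pairs else url


endpoint_input = build_input(
    "openvpn+obfs4", "riseup.net",
    address="51.159.197.108", asn="AS12876", port=443,
)
global_input = build_input("openvpn+obfs4+kcp", "riseup.net")
```

With these arguments, `endpoint_input` reproduces the example URL above, and `global_input` yields a URL without a query string.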
+
+## Data Format (Test Keys)
+
+The `test_keys` format is specific to this experiment. Here is what they
+look like in JSON format (with comments added to explain what each
+part means):
+
+```JavaScript
+{
+  // for this provider
+  "provider": "riseup.net",
+
+  // with this `endpoint` scope
+  "scope": "endpoint",
+  "endpoint_hostname": "vpn02-par.riseup.net",
+  "endpoint_address": "51.159.197.108",
+  "endpoint_port": 443,
+  "protocol": "openvpn+obfs4",
+  "asn": "AS12876",
+  "cc": "FR",
+
+  // alternatively, with this `endpoint_pool` scope
+  "scope": "endpoint_pool",
+  "protocol": "openvpn+obfs4",
+  "cc": "FR",
+
+  // alternatively, with this `global` scope
+  "scope": "global",
+  "protocol": "openvpn+obfs4",
+
+  // in this time window
+  "time_window": {
+    "from": "2024-10-29T00:00:00Z",
+    "to": "2024-10-30T00:00:00Z"
+  },
+
+  // we make the following statements
+  "bodies": [
+
+    {
+      // during the tunnel creation phase
+      "phase": "creation",
+
+      // with this sample size
+      "sample_size": 200,
+
+      // we make a statement about network errors
+      "type": "network-error",
+
+      // and the statement is that we fail 66%
+      // of the times with tcp.timed_out
+      "failure_ratio": 0.66,
+      "error": "tcp.timed_out"
+    },
+
+    {
+      // during the tunnel_ping phase
+      "phase": "tunnel_ping",
+
+      // targeting the 8.8.8.8 IP address
+      "target_address": "8.8.8.8",
+
+      // with this sample size
+      "sample_size": 500,
+
+      // we make a statement about the latency distribution
+      "type": "ping",
+
+      // and the statement is that we see the following
+      // latency distribution in milliseconds.
+      "latency_ms": {
+        "p25": 100,
+        "p50": 150,
+        "p75": 200,
+        "p99": 1100
+      }
+    },
+
+    {
+      // during the tunnel_ndt_download phase
+      "phase": "tunnel_ndt_download",
+
+      // targeting this NDT server
+      "target_hostname": "ndt-mlab2-mil07.mlab-oti.measurement-lab.org",
+      "target_address": "162.213.100.88",
+      "target_port": 443,
+
+      // with this sample size
+      "sample_size": 500,
+
+      // we make a statement about the download performance
+      "type": "ndt_download",
+
+      // and the statement is that we see the following
+      // latency distribution in milliseconds and the
+      // following download speed distribution in Mbit/s.
+      "latency_ms": {
+        "p25": 100,
+        "p50": 150,
+        "p75": 200,
+        "p99": 1100
+      },
+      "speed_mbits": {
+        "p25": 4,
+        "p50": 7,
+        "p75": 11,
+        "p99": 200
+      }
+    },
+
+    // ...
+  ]
+}
+```
+
+More formally, this is the meaning of the fields (where we indicate
+optional fields using the `<type>|undefined` syntax):
+
+- `provider` (`string`): the provider of the tunnel service, using
+the same syntax as defined for the `input` field.
+
+- `scope` (`enum`): the scope, as defined above.
+
+- `endpoint_hostname` (`string|undefined`): the endpoint hostname.
+
+- `endpoint_address` (`string|undefined`): the endpoint IPv4/IPv6 address.
+
+- `endpoint_port` (`uint16|undefined`): the endpoint port.
+
+- `asn` (`^AS[0-9]+$|undefined`): the ASN of the endpoint or endpoint pool.
+
+- `cc` (`^[A-Z]{2}$|undefined`): the country code of the endpoint or endpoint pool.
+
+- `protocol` (`enum`): the protocol as defined above.
+
+- `time_window` (`object`): the time window in which the
+statements we are making are valid.
+
+- `bodies` (`array`): an array of NEL-like objects.
+
+In turn, the common structure of each NEL-like object is the following:
+
+```JSON
+{
+  "phase": "",
+  "sample_size": 0,
+  "type": ""
+}
+```
+
+where:
+
+- `phase` (`enum`): is the operation as defined above.
+
+- `sample_size` (`int53|undefined`): the number of samples
+that are being considered for making this statement, appropriately
+rounded, or omitted entirely to preserve privacy. We RECOMMEND
+rounding to the nearest multiple of 100 and omitting values below 1000.
+
+- `type` (`enum`): the type of statement that is being made.
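
The rounding recommendation for `sample_size` can be sketched as follows. The `disclosed_sample_size` name is hypothetical. We assume that the threshold applies to the raw count before rounding and that ties round up, since neither detail is spelled out above.

```python
from typing import Optional


def disclosed_sample_size(count: int) -> Optional[int]:
    """Return the rounded sample size, or None when it should be omitted."""
    if count < 1000:
        # Too few samples: omit the field to preserve privacy.
        return None
    # Round to the nearest multiple of 100 using integer arithmetic
    # (this avoids the round-half-to-even behaviour of round()).
    return (count + 50) // 100 * 100


examples = {n: disclosed_sample_size(n) for n in (999, 1049, 1050, 1234)}
```

A `None` result means that the `sample_size` key should be left out of the NEL-like object.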
+
+The `network-error` NEL-like object is like:
+
+```JavaScript
+{
+  // ... common NEL-like fields ...
+  "type": "network-error",
+  "failure_ratio": 0.0,
+  "error": ""
+}
+```
+
+where:
+
+- `failure_ratio` (`float64`): the ratio of the number of
+failures over the population sample size.
+
+- `error` (`enum|undefined`): the network error as defined by NEL
+or `undefined` if we don't know or don't want to report.
+
+Regarding network errors, note that the `creation` phase
+SHOULD NOT include DNS operations or other operations required
+to obtain information useful for creating the tunnel. Rather,
+this information is part of the VPN bootstrap process and
+is out of the scope of this document.
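
As a sketch of how `network-error` statements could be derived from individual outcomes, consider the following helper. The `network_error_bodies` name is hypothetical, and emitting one object per distinct error is our assumption. The sketch reports the raw sample size, so any privacy rounding still needs to be applied before submission.

```python
from collections import Counter


def network_error_bodies(phase, outcomes):
    """Build network-error objects from per-attempt outcomes.

    Each outcome is None on success or a NEL error string on failure.
    """
    total = len(outcomes)
    if total == 0:
        return []
    failures = Counter(error for error in outcomes if error is not None)
    return [
        {
            "phase": phase,
            "sample_size": total,  # raw count, not yet rounded
            "type": "network-error",
            "failure_ratio": count / total,
            "error": error,
        }
        for error, count in sorted(failures.items())
    ]


# 132 timeouts out of 200 creation attempts.
outcomes = ["tcp.timed_out"] * 132 + [None] * 68
bodies = network_error_bodies("creation", outcomes)
```

With this input, `bodies` contains a single statement about `tcp.timed_out` during the `creation` phase.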
+
+The `ping` NEL-like object is like:
+
+```JavaScript
+{
+  // ... common NEL-like fields ...
+  "type": "ping",
+  "target_address": "",
+  "latency_ms": {}
+}
+```
+
+where:
+
+- `target_address` (`IPAddr`): the target IP address.
+
+- `latency_ms` (`object`): the latency distribution in
+milliseconds containing the latency percentiles indicated
+using `pXX` (e.g., `p50` is the median).
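
A percentile object of this kind could be computed with the Python standard library as follows. This is a sketch: the `distribution` helper is a hypothetical name, and the choice of percentiles and of the "inclusive" interpolation method is our assumption, since the specification does not mandate how percentiles are estimated.

```python
import statistics


def distribution(samples, percentiles=(25, 50, 75, 99)):
    """Summarize samples as a mapping from `pXX` keys to percentile values."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    # With n=100, quantiles() returns 99 cut points; cuts[p - 1] is the
    # p-th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {f"p{p}": cuts[p - 1] for p in percentiles}


# Hypothetical latency samples: 1 ms, 2 ms, ..., 101 ms.
latency_ms = distribution(list(range(1, 102)))
```

The same helper can produce a `speed_mbits` object from download-speed samples.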
+
+The `ndt_download` NEL-like object is like:
+
+```JavaScript
+{
+  // ... common NEL-like fields ...
+  "type": "ndt_download",
+  "target_hostname": "",
+  "target_address": "",
+  "target_port": 0,
+  "latency_ms": {},
+  "speed_mbits": {}
+}
+```
+
+where:
+
+- `target_hostname` (`string`): the target hostname.
+
+- `target_address` (`IPAddr`): the target IP address.
+
+- `target_port` (`uint16`): the target port.
+
+- `latency_ms` (`object`): is exactly like in `ping`.
+
+- `speed_mbits` (`object`): is like `latency_ms` but for
+the download speed expressed in Mbit/s.
+
+Note that a `phase` is not restricted to use a specific NEL-like
+object `type`. For example:
+
+```JSON
+{
+  "phase": "tunnel_ndt_download",
+  "target_hostname": "ndt-mlab2-mil07.mlab-oti.measurement-lab.org",
+  "sample_size": 500,
+  "type": "network-error",
+  "failure_ratio": 0.66,
+  "error": "dns.name_not_resolved"
+}
+```
+
+The previous JSON snippet contains a statement that, in 66% of the
+cases, out of ~500 samples, the DNS lookup failed.
+
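A consumer of these objects may want to check that each one carries the fields required by its `type`, independently of its `phase`. The following is a minimal sketch. The `missing_fields` helper and the `TYPE_FIELDS` table are hypothetical, and we assume the type names used in the full example (`network-error`, `ping`, `ndt_download`).

```python
# Fields that every NEL-like object must carry (sample_size is optional).
COMMON_FIELDS = ("phase", "type")

# Additional required fields, keyed by statement type.
TYPE_FIELDS = {
    "network-error": ("failure_ratio",),
    "ping": ("target_address", "latency_ms"),
    "ndt_download": (
        "target_hostname",
        "target_address",
        "target_port",
        "latency_ms",
        "speed_mbits",
    ),
}


def missing_fields(body: dict) -> list:
    """Return the names of the required fields that the body lacks."""
    kind = body.get("type")
    if kind not in TYPE_FIELDS:
        raise ValueError(f"unknown statement type: {kind}")
    required = COMMON_FIELDS + TYPE_FIELDS[kind]
    return [name for name in required if name not in body]


# A network-error statement made during an NDT download phase.
dns_failure = {
    "phase": "tunnel_ndt_download",
    "target_hostname": "ndt-mlab2-mil07.mlab-oti.measurement-lab.org",
    "sample_size": 500,
    "type": "network-error",
    "failure_ratio": 0.66,
    "error": "dns.name_not_resolved",
}
incomplete_ping = {"phase": "tunnel_ping", "type": "ping"}
```

The check accepts `dns_failure` because the required fields depend only on the `type`, not on the `phase`.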
+
+## Implementation Requirements
+
+This specification assumes that there is a *collector* for NEL-like
+reports submitted by VPN apps. Defining how this happens is out of
+the scope of this document, but the "tunnel telemetry" spec is a good
+starting point. The privacy implications of submitting aggregated
+measurements are in scope and are discussed in a dedicated section below.
+
+As far as this specification is concerned, it is also important to note
+that VPN apps SHOULD be allowed to cache unsent reports for up
+to one week. This ensures that reports are not lost in case of
+heavy censorship or widespread internet failure.
+
+The collector will be responsible for storing the incoming reports
+into a spool directory organised in daily buckets. After putting the
+reports into the spool, the job of the collector is done.
+
+A separate component, the *submitter*, will periodically process
+the spool directory, aggregating existing reports, deleting the
+already-processed reports, and sending the aggregated reports to
+the OONI collector using the data format defined in this document.
+
+The initial aggregation period is set to one week, anticipating
+that, at the outset, there will be a low number of reports. We will
+revise this decision based on actual numbers.
+
+In principle, determining whether an existing report has already
+been processed is a simple matter. It suffices to delete the files
+that have already been processed and "close" old buckets. Yet,
+since the VPN app is allowed to cache reports up to one week, it
+is possible that the submitter will receive reports for already
+closed buckets. Additionally, malfunction in the VPN app may cause
+reports to be submitted multiple times. A future version of this
+spec will articulate how to solve this problem.
+
+Summarising, this discussion leads us to the following architecture:
+
+```
+.---------.                 .-----------.
+| VPN App | --> <<push>> -> | collector | --> {spool}
+`---------'                 `-----------'
+
+                        .-----------.                    .------.
+{spool} --> <<pop>> --> | submitter | --> <<submit>> --> | OONI |
+                        `-----------'                    `------'
+```
+
+where it is intended that the semantics of `<<pop>>` includes both
+processing and removing a report from the spool.
+
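The spool semantics can be sketched as follows. The on-disk layout (one directory per UTC day, one JSON file per report) and the `push` and `pop_bucket` names are assumptions made for illustration. The sketch does not address late or duplicate reports.

```python
import json
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path


def push(spool: Path, report: dict, received_at: datetime) -> Path:
    """Collector side: store a report into the daily bucket."""
    day = received_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    bucket = spool / day
    bucket.mkdir(parents=True, exist_ok=True)
    path = bucket / f"{uuid.uuid4().hex}.json"
    path.write_text(json.dumps(report))
    return path


def pop_bucket(spool: Path, day: str) -> list:
    """Submitter side: read and remove all the reports in a daily bucket."""
    bucket = spool / day
    if not bucket.is_dir():
        return []
    reports = []
    for path in sorted(bucket.glob("*.json")):
        reports.append(json.loads(path.read_text()))
        path.unlink()  # <<pop>> includes removing the report
    return reports


with tempfile.TemporaryDirectory() as tmp:
    spool = Path(tmp)
    when = datetime(2024, 10, 29, 12, 0, tzinfo=timezone.utc)
    push(spool, {"phase": "creation", "error": "tcp.timed_out"}, when)
    push(spool, {"phase": "creation", "error": None}, when)
    first_pop = pop_bucket(spool, "2024-10-29")
    second_pop = pop_bucket(spool, "2024-10-29")
```

After the first pop the bucket is empty, so `second_pop` is an empty list.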
+
+## Privacy Considerations
+
+Users of VPN apps that submit NEL-like reports that end up being
+aggregated and resubmitted to the OONI collector MUST be asked for
+their informed consent. The informed consent SHOULD clearly
+specify the purpose of the data collection (i.e., collecting
+data for evaluating the effectiveness of specific protocol stacks
+in creating usable VPN tunnels). Additionally, users MUST be
+able to opt-out of the process at any time.
+
+Additionally, the aggregation period and the amount of information
+disclosed in the aggregated measurements submitted to OONI MUST
+take into account the anonymity set.
+
+
+## Security Considerations
+
+In principle, it is not possible to absolutely trust measurements
+submitted by unknown parties. The attack from which we want to
+defend is the injection of bogus aggregate measurements, which has
+more impact than the injection of bogus individual OONI measurements,
+since less information needs to be submitted to the OONI collector
+to have a significant impact.
+
+OONI is aware of the issue posed by the injection of bogus
+measurements, and they are considering implementing an anonymous
+probe ID mechanism to mitigate this issue.
+
+A future version of this specification will consider integrating
+this functionality into the submitter, to facilitate OONI's job
+of identifying reliable data sources.
+
+The related problem of how to evaluate the reliability of the VPN app
+instances is out of the scope of this document.
+
+
+## LEAP Implementation Details
+
+(This section is non-normative and describes the architecture
+of LEAP with respect to data collection and submission. It
+mainly serves the purpose of explaining the original context
+in which we implemented this specification in production.)
+
+LEAP is currently collecting logs from docker-compose-based field
+testing clients. There is a logs processing pipeline that transforms
+these textual logs into CSV files, which are shown in a dashboard.
+
+The initial implementation of this specification could be as
+simple as a script that processes the CSV files, transforms
+its rows into aggregated measurements, and submits them to the
+OONI collector. This statement about simplicity is grounded
+in the understanding that the existing logs processing pipeline
+is already able to deduplicate incoming field testing reports.
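
Such a script could start from something like the following sketch. The layout of the CSV files produced by the LEAP pipeline is not described in this document, so the column names used here (`probe_asn`, `probe_cc`, `protocol`, `error`) are hypothetical, as is the `aggregate_csv` helper.

```python
import csv
import io
from collections import defaultdict


def aggregate_csv(text: str) -> dict:
    """Group rows by (probe_asn, probe_cc, protocol) and count failures."""
    groups = defaultdict(lambda: {"total": 0, "failed": 0})
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["probe_asn"], row["probe_cc"], row["protocol"])
        groups[key]["total"] += 1
        if row["error"]:  # an empty error column means success
            groups[key]["failed"] += 1
    return {
        key: {
            "sample_size": counts["total"],
            "failure_ratio": counts["failed"] / counts["total"],
        }
        for key, counts in groups.items()
    }


# Hypothetical rows, one per tunnel creation attempt.
SAMPLE = """probe_asn,probe_cc,protocol,error
AS1234,IT,openvpn+obfs4,
AS1234,IT,openvpn+obfs4,tcp.timed_out
AS1234,IT,openvpn+obfs4,tcp.timed_out
AS5678,FR,openvpn,
"""
aggregates = aggregate_csv(SAMPLE)
```

Each key of `aggregates` corresponds to one measurement to submit, with `probe_asn` and `probe_cc` going into the envelope and the ratio into a `network-error` statement.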
+
+To make the system compatible with the design described above, we
+will also need to modify the logs pipeline to put the CSV files
+into a spool directory organised in daily buckets.
+
+This leads us to the following systems architecture:
+
+```
+
+.-----------.                  .---------------.
+| FT Client | --> <<push>> --> | Logs pipeline | --> {spool}
+`-----------'                  `---------------'
+
+                        .-----------.                    .------.
+{spool} --> <<pop>> --> | submitter | --> <<submit>> --> | OONI |
+                        `-----------'                    `------'
+```
+
+where `FT Client` is the field testing client, and it is intended
+that the `<<pop>>` operation includes both processing and removing
+a given CSV file from the corresponding bucket.
+
+A future version of this specification will address the problem of
+extending this architecture to account for the submission of NEL-like
+reports from the Bitmask-VPN app.