we want to be able to sort the gateways, not just select one.
protobuf is still magic to me; I might need to read the docs to understand how to extend it.
I'm still trying to wrap my head around everything; the example doesn't even have a main function. Who calls that? How?
I'm not sure the complexity of lb is needed. AFAIK it just provides a communication mechanism and a selector algorithm, and I have the feeling that a specific implementation of these two things for what we need would be pretty small. Is lb adding complexity to our code, or will it make it simpler?
The work we need to do to integrate that into our setup:
decide which metrics we want to collect (CPU, bandwidth, memory, ...)
implement an agent able to collect these metrics (see the sketch after this list)
integrate this agent in our float platform
integrate lb into getmyip.
extend the getmyip gateway list to provide some metrics on the status of the gateways (will be useful for the client-side gateway selector)
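For the agent item above, a minimal sketch of what collecting these metrics could look like on a Linux gateway (the chosen metrics, the interface name and the plain printing are assumptions; a real agent would hand the values to lb's reporting code instead):

```go
// Minimal sketch of per-gateway metric collection on Linux; the chosen
// metrics, the interface name ("eth0") and the plain printing are all
// assumptions; a real agent would feed these values to lb's reporting code.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// loadAvg returns the 1-minute load average from /proc/loadavg.
func loadAvg() (float64, error) {
	data, err := os.ReadFile("/proc/loadavg")
	if err != nil {
		return 0, err
	}
	return strconv.ParseFloat(strings.Fields(string(data))[0], 64)
}

// txBytes returns the total transmitted bytes of a network interface,
// parsed from /proc/net/dev (column 9 is transmit bytes).
func txBytes(iface string) (uint64, error) {
	f, err := os.Open("/proc/net/dev")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if !strings.HasPrefix(line, iface+":") {
			continue
		}
		fields := strings.Fields(strings.TrimPrefix(line, iface+":"))
		if len(fields) < 9 {
			break
		}
		return strconv.ParseUint(fields[8], 10, 64)
	}
	return 0, fmt.Errorf("interface %q not found", iface)
}

func main() {
	for {
		load, _ := loadAvg()
		tx, _ := txBytes("eth0")
		fmt.Printf("load1=%.2f tx_bytes=%d\n", load, tx)
		time.Sleep(30 * time.Second)
	}
}
```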
internally lb operates as a scoring function; Select() just happens to return the lowest-scoring entry, but adding an API that returns multiple results should be relatively trivial (there are small caveats about how that changes the predictive-model semantics, but I think they are relatively minor, and the system would still work reasonably well under the assumption that "everyone picks the first result" even if it isn't true)
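To make the "sort, don't just select" idea concrete, here is a rough sketch of what returning the full ordering could look like (the Gateway type and its Score field are made up for illustration, not lb's actual types):

```go
// Rough sketch of "return the full ordering" instead of a single pick;
// Gateway and its Score field are illustrative, not lb's actual types.
package main

import (
	"fmt"
	"sort"
)

type Gateway struct {
	Host  string
	Score float64 // lower is better, like the entry Select() returns
}

// rank orders gateways from best (lowest score) to worst, which is what a
// "sorted Select" API could hand back to clients.
func rank(gws []Gateway) []Gateway {
	out := append([]Gateway(nil), gws...)
	sort.Slice(out, func(i, j int) bool { return out[i].Score < out[j].Score })
	return out
}

func main() {
	fmt.Println(rank([]Gateway{
		{"gw1.example.net", 0.7},
		{"gw2.example.net", 0.2},
		{"gw3.example.net", 0.5},
	}))
}
```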
how to extend the lb protobuf definition is the purpose of the ai3/tools/live exercise, and should be exemplified here (once one knows how these things are done in protobufs)
you're right that the example code isn't a fully finished product yet: what is missing is the top-level main() that would instantiate a Director and the various APIs...
lb is meant to encapsulate the minimum amount of complexity necessary to do this job into something you don't need to worry about too much, so the purpose is to make your code simpler (kind of the point of libraries, I guess). Obviously this judgement is ultimately yours, though I do suspect that eventually you would end up reproducing most of what's there (I mean, lb itself is "pretty small" imho)
I would add to that list the somewhat related question (needed for all of this to be effective, as we've discussed previously) about churn rates and the ability to signal the client to switch to a less-congested gateway at the first possible opportunity (i.e., background-refresh of the priority list and changing remotes on reconnect). I'm creating #7 to track that.
The interesting thing about the example implementation, besides the missing main(), is that it is basically a complete reference implementation of a service that uses lb to autoscale RTMP/HTTP proxies. It has all the bits that we need:
there is a per-server agent that reports back utilization (bypassing prometheus etc)
there is a service that can answer the question "which server should I pick?" (this one serves HTTP redirects, but it can do anything the client needs; a rough sketch follows below)
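For reference, that redirect-serving frontend boils down to something like this sketch (pickServer is a stand-in for whatever the Director answers; all names and the port are assumptions):

```go
// Sketch of the "which server should I pick?" frontend; pickServer is a
// stand-in for the actual selection logic backed by lb.
package main

import (
	"log"
	"net/http"
)

// pickServer stands in for whatever answers "which server should I pick?".
func pickServer() string {
	return "https://proxy1.example.net"
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Redirect the client to the chosen backend; a different
		// frontend could just as well return JSON or anything else.
		http.Redirect(w, r, pickServer(), http.StatusTemporaryRedirect)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```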
The ansible stuff in the live repository is sort of unrelated, but it has interesting bits for the platform: it builds a VM to be imaged and restored on demand as many times as necessary, and that VM does all the proper initialization, registration and shutdown steps via systemd. The result is a nice bring-up that starts things, reports presence so DNS entries can be created, waits until Let's Encrypt provides a cert, and then starts serving. It also handles shutdown, redirecting everyone back to the main service to 'drain' the node before actually shutting down.
@micah Do we need any kind of authentication and transport encryption for the server/agent communication? Or will the platform provide a secure channel (like stunnel or whatever), so it's better to leave these pieces to the platform side?
If I understand the architecture properly, the communication would happen over the float service mesh, which already provides transport encryption and, I believe, is mutually authenticated via x509 certificates.
I still need to internalize min/max and target fullness, but I'm starting to see how to use it.
AFAIK the server side is meant to run independently, with things like getmyip doing a gRPC Select call to get the selected node. I'm not sure whether we want the balancer to run as a separate binary, or whether our platform would be simpler with it integrated into the same binary as getmyip. I'm thinking about going with the single-binary option for now.
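One way to keep the single-binary question open: have getmyip depend only on a tiny selector interface, which can later be backed either by an embedded balancer or by a gRPC client to a separate one. A sketch (all names are illustrative assumptions, not lb's or getmyip's real API):

```go
// Sketch of keeping the single-binary question open: getmyip depends only
// on a small Selector interface, which can later be backed either by an
// embedded lb instance or by a gRPC client to a separate balancer.
package main

import (
	"context"
	"fmt"
)

// Selector is the only thing getmyip needs from the balancer.
type Selector interface {
	Select(ctx context.Context) (node string, err error)
}

// staticSelector is a placeholder in-process implementation; in the real
// single-binary setup this would wrap the embedded balancer.
type staticSelector struct{ node string }

func (s staticSelector) Select(ctx context.Context) (string, error) {
	return s.node, nil
}

func main() {
	var sel Selector = staticSelector{node: "gw1.example.net"}
	node, err := sel.Select(context.Background())
	if err != nil {
		panic(err)
	}
	fmt.Println("selected gateway:", node)
}
```

That way the choice between one binary and two becomes a wiring detail rather than an architectural one.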
We'll need to extend lb with another call that, instead of selecting a single node, lists them in the right order. I think that will be my next step, and it might produce a patch for lb (#11 (closed))
We could extend the lb protocol so each gateway reports its metadata (location, IP address, transport, ...) and getmyip stops using eip-service.json. That would require each deployed agent to know all this information, which I guess is easy with float. And at some point it might make eip-service.json completely obsolete, as the clients will not need it.
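Roughly the metadata each agent would have to report (field names are assumptions; in practice this would live as a message in lb's protobuf definition, with the Go type generated from it rather than hand-written):

```go
// Sketch of the per-gateway metadata each agent could report; field names
// are assumptions, and the real thing would be a protobuf message in lb's
// definition rather than this hand-written struct.
package main

import (
	"encoding/json"
	"fmt"
)

// GatewayInfo is roughly what an agent would need to know about its own
// gateway in order to replace the eip-service.json data.
type GatewayInfo struct {
	Host       string   `json:"host"`
	IPAddress  string   `json:"ip_address"`
	Location   string   `json:"location"`
	Transports []string `json:"transports"` // e.g. openvpn, obfs4
}

func main() {
	info := GatewayInfo{
		Host:       "gw1.example.net",
		IPAddress:  "192.0.2.10",
		Location:   "Amsterdam",
		Transports: []string{"openvpn", "obfs4"},
	}
	out, _ := json.MarshalIndent(info, "", "  ")
	fmt.Println(string(out))
}
```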
There are a couple of design decisions in lb that I'm confused about:
Why does agent.Start start the agent in the background? I would expect agents not to do anything besides sending updates. I just wrote an infinite loop at the end of the agent's main function (I'm wondering if there is anything better I can do than for/sleep):
for { time.Sleep(time.Hour) }
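One alternative I'm considering instead of the sleep loop, assuming the agent really does everything in background goroutines after Start(): block on a termination signal (or simply `select {}` if no cleanup is needed):

```go
// Possible alternative to the sleep loop: block the main goroutine until a
// termination signal arrives, then exit cleanly.
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// agent.Start(...) would go here; it returns immediately and keeps
	// reporting from its own goroutines.

	// Wait for SIGINT/SIGTERM instead of sleeping in a loop.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)
	log.Printf("received %v, shutting down", <-sig)
}
```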
Why does bwmon work on transferred bytes instead of a rate of used bandwidth? I guess it assumes every node has the same bandwidth, which is something I don't think we can assume across LEAP providers.
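To make the concern concrete, this is the kind of normalization I have in mind (purely illustrative numbers and names): turn the byte counters into a rate and divide by each node's own link capacity, so gateways with different bandwidth become comparable:

```go
// Illustrative sketch of normalizing byte counters into a utilization
// fraction of each node's own link capacity; the capacities and samples
// are made up.
package main

import (
	"fmt"
	"time"
)

// utilization turns two byte-counter samples into a fraction of the node's
// capacity: (bits transferred per second) / (capacity in bits per second).
func utilization(prevBytes, curBytes uint64, interval time.Duration, capacityBitsPerSec float64) float64 {
	bitsPerSec := float64(curBytes-prevBytes) * 8 / interval.Seconds()
	return bitsPerSec / capacityBitsPerSec
}

func main() {
	// Two nodes moving the same bytes over 30s, but with different link
	// capacities (1 Gbit/s vs 100 Mbit/s): same counters, very different load.
	const transferred = 300_000_000 // bytes moved during the interval
	fmt.Printf("1G node:   %.0f%% utilized\n", 100*utilization(0, transferred, 30*time.Second, 1e9))
	fmt.Printf("100M node: %.0f%% utilized\n", 100*utilization(0, transferred, 30*time.Second, 1e8))
}
```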