The current proto definitions and RPCs were quickly thrown together just to get something running end to end. This means little to no real thought was put into the definitions, the API design, or generally into how things are going to work on the server. This issue aims to outline the problems that need solving; luckily, these are a little clearer now that I've been running tests against the current API.
User journey
There are two types of user journeys: the developer's and the end-user's. The developer is anyone who deploys and builds on top of fleetd; they manage, configure and monitor devices through fleetd, and they build and deploy the software that runs on devices with the fleetd agent. The developer cares about the developer experience (easy to use, available tooling, easy to debug and resolve issues, etc.).
The end-user is the consumer or company that installs the devices running the fleetd agent and the developer's software. In most cases they are responsible for setting up the physical device, e.g. connecting it to their own network, either through mass-deployment tooling (mostly companies) or through a mobile app (typically consumers). Here, robustness and low failure rates are critical, and when anything goes wrong, issue resolution is key, as the developer typically cannot be physically present. A hardware device shipped to the end-user is usually impossible to access physically afterwards (and the need for it should be zero), so making it easy for non-technical users to resolve problems is essential.
Fleetd needs to serve both: enable the developer to rapidly build, ship and monitor software for their devices, while ensuring a smooth, frictionless experience for the end-user without the developer having to do a lot of work (batteries included).
Needs
I can use some of the learnings from the v1 protos: what worked and what didn't work quite as I wanted it to. Given the user journeys, here are the needs, outlined:
Device configuration
Let's start at the beginning of the device lifecycle: setting it up with fleetd. Say the developer orders a batch of ESP32s or Raspberry Pis. How should they (most effortlessly) be able to set them up with fleetd and load configuration onto the devices?
First off, the fleetd agent has to be installed onto the system, so fleetd needs to provide tooling to batch-install the agent software or even flash firmware onto these devices. It might even make sense to provide a simple utility in the fleetd core software, while a more production-ready tool could be part of a cloud/platform offering. Both should use the same mechanism, though, so that anyone could write their own programs for custom production setups, for example.
The configuration we currently allow comes from either A) environment variables on the system (Linux only), B) a config file loaded onto the system, or C) mDNS discovery (typically via a mobile app).
A and B mostly use the same mechanism: if no complete configuration is found in a file on-device, look up the environment variables. If neither yields a configuration, the device can either go into a faulty/stuck state or fall back to mDNS.
C uses mDNS: the device agent uses the onboard WiFi antenna to bring up and broadcast its own network, which e.g. a mobile app can join, and uses mDNS to advertise the device agent's own server running on some port. The mobile app can then connect to the agent's server and use ConnectRPC to send a configuration request. Given a valid configuration, the values are persisted on the device as well.
I don't know the final configuration keys needed in the end, but currently we need the following:
- If remotely controlled, a remote server URL. This should point to a fleetd-compatible server serving the defined RPC services; it could also point to a server on the same network, if running local-only. I don't know yet, but there might be a case for allowing devices to be re-configured, though a consumer product would typically only allow this after e.g. a physical press of a reset button. Regardless, the fleetd agent should expose some interface that the developer's on-device software can use to e.g. toggle reset allowance.
There is also the use case where the device is entirely pre-configured with both the fleetd agent and the software it should run, such that it could be connected and start running even inside a Faraday cage. Of course, I'd assume most use cases include the ability to remotely distribute and upgrade software via a fleetd-compatible remote server; I'm just noting that it should work regardless.
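To make this concrete, here is a minimal sketch of what the agent-side configuration RPC could look like; all names (package, service, messages) are placeholders, not final API:
syntax = "proto3";

package agent.v1;

// Configuration as delivered by a local client (e.g. a mobile
// app over the setup network) or loaded from file/env on boot.
message DeviceConfig {
  // URL of a fleetd-compatible remote server. May point to a
  // server on the same network when running local-only.
  string remote_server_url = 1;
}

message ConfigureRequest {
  DeviceConfig config = 1;
}

message ConfigureResponse {
  // Echoes what was validated and persisted on the device.
  DeviceConfig applied = 1;
}

service ConfigService {
  rpc Configure(ConfigureRequest) returns (ConfigureResponse);
}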
Remote registration
Typically, the developer wants all devices connected to a remote: a fleetd-compatible server running the ConnectRPC services we define here. It could run the implementation fleetd provides, but it could also be a custom implementation.
Once the device and agent are configured properly, it is time for the device to register with the remote. The remote then assigns a human-readable name and an identifier. This identifier is remotely managed, in contrast to the local identifier, which is generated by the agent itself. This allows for reuse across e.g. different environments, or for resetting and re-registering with another server if reconfigured.
As for the registration, some key metadata about the device should be included: arch, OS, CPU, RAM, etc. This is basic system info that is informative for the developer, but most importantly it will be used to target software rollouts.
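A rough sketch of what registration could look like (field names and the exact system info are placeholders):
syntax = "proto3";

package fleetd.v1;

// Basic system info, used both for display and for targeting
// software rollouts.
message SystemInfo {
  string arch = 1;   // e.g. "arm64"
  string os = 2;     // e.g. "linux"
  string cpu = 3;
  uint64 ram_bytes = 4;
}

message RegisterDeviceRequest {
  // The local identifier generated by the agent itself.
  string local_id = 1;
  SystemInfo system_info = 2;
}

message RegisterDeviceResponse {
  // The remotely managed identifier, distinct from local_id.
  string device_id = 1;
  // Human-readable name assigned by the remote.
  string name = 2;
}

service DeviceService {
  rpc RegisterDevice(RegisterDeviceRequest) returns (RegisterDeviceResponse);
}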
When the device is successfully registered, it has exited its "setup mode" and is now ready to run whatever software the developer chooses to deploy.
Deploying software
I haven't come up with a good name for them yet, but the developer should be able to deploy arbitrary software to all devices they manage. For embedded devices, these are usually just native binaries. Some use cases may require nixpacks or Docker/OCI support, but for now binaries are the primary format. Binaries are rolled out as part of a rollout/wave/campaign or whatever you might call it. Such a rollout contains a) the software to run (a binary download URL), b) some metadata (e.g. signature, file data), and c) identifiers for the devices that should receive the notice and then either download the update (if the software is already installed) or install it fresh.
A rollout is typically created by the developer in a CI/CD pipeline like GitHub Actions, and ergonomics here are key. The rollout has to be created with certain targets, and it should be clear and visible which devices a rollout would target before it is rolled out. This could play a larger role in the managed offering, but we should expose APIs for it regardless. One example might be a CLI that can visualize the impact of a rollout.
The server should track the install progress of every device for a given rollout: is it downloaded, is it installing, are there errors, is it installed? The server needs to keep track of all devices and their status, as an update gone wrong might brick a customer's entire device fleet, be it a production line losing millions in revenue for every hour of downtime, or your cousin Albert messing around with his RPi cluster at home.
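As a sketch only (all names are placeholders), a rollout and its per-device status might look like this, following the enum conventions outlined under Principles below:
syntax = "proto3";

package fleetd.v1;

message Rollout {
  string id = 1;
  // Download URL for the binary to run.
  string binary_url = 2;
  // Artifact metadata, e.g. for verifying the download.
  string signature = 3;
  uint64 size_bytes = 4;
  // Devices that should receive the notice.
  repeated string target_device_ids = 5;
}

enum RolloutState {
  ROLLOUT_STATE_UNSPECIFIED = 0;
  ROLLOUT_STATE_DOWNLOADING = 1;
  ROLLOUT_STATE_INSTALLING = 2;
  ROLLOUT_STATE_INSTALLED = 3;
  ROLLOUT_STATE_FAILED = 4;
}

// Tracked by the server for every device in a rollout.
message RolloutDeviceStatus {
  string rollout_id = 1;
  string device_id = 2;
  RolloutState state = 3;
  string error = 4; // set when state is FAILED
}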
Lifecycle management
The agent needs to expose an interface (ConnectRPC) that allows either a remote or a local client (like a mobile app) to manage its lifecycle and state: stopping software, manually starting software, getting deployment status, device metrics, etc. Observability is key here.
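A sketch of such an agent-side service, under the assumption that deployed software is addressed by an id (all names hypothetical):
syntax = "proto3";

package agent.v1;

message StartSoftwareRequest {
  string software_id = 1;
}
message StartSoftwareResponse {}

message StopSoftwareRequest {
  string software_id = 1;
}
message StopSoftwareResponse {}

message GetStatusRequest {}
message GetStatusResponse {
  // Deployment status, device metrics, etc. would go here.
  string status = 1;
}

// Served by the agent itself, callable by the remote or by a
// local client such as a mobile app.
service LifecycleService {
  rpc StartSoftware(StartSoftwareRequest) returns (StartSoftwareResponse);
  rpc StopSoftware(StopSoftwareRequest) returns (StopSoftwareResponse);
  rpc GetStatus(GetStatusRequest) returns (GetStatusResponse);
}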
Observability
It is crucial that every device is able to trace and log relevant output and send it to the remote server, if any. Still, any logs and traces should be offline-first: log lines should persist locally (in memory or on disk) before being sent to the remote. Only when the remote log drain acks the log or trace data should the agent remove the local copy. This ensures no data is lost.
Data like metrics, performance and analytics should be reported to the remote on a regular, configurable interval. The metric data format should be compatible with OpenTelemetry (otel).
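One way to model the ack semantics, sketched with placeholder names: the agent batches locally persisted lines and deletes a batch only after the server acknowledges it.
syntax = "proto3";

package fleetd.v1;

import "google/protobuf/timestamp.proto";

message LogEntry {
  google.protobuf.Timestamp time = 1;
  string line = 2;
}

message PushLogsRequest {
  string device_id = 1;
  // Monotonic id of the locally persisted batch, so the agent
  // knows which local segment is safe to delete after the ack.
  uint64 batch_id = 2;
  repeated LogEntry entries = 3;
}

message PushLogsResponse {
  // The agent removes its local copy only once this ack arrives.
  uint64 acked_batch_id = 1;
}

service LogService {
  rpc PushLogs(PushLogsRequest) returns (PushLogsResponse);
}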
Self-updates
The fleetd agent should be able to look for updates for itself, either by querying the remote, or (maybe ideally) by the remote keeping a table of outdated agents with compatible updates available and prompting the developer(s) to update explicitly via e.g. a portal or any client connected to the remote.
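For the agent-initiated variant, a minimal sketch (hypothetical names):
syntax = "proto3";

package fleetd.v1;

message CheckAgentUpdateRequest {
  string device_id = 1;
  string agent_version = 2;
}

message CheckAgentUpdateResponse {
  // Left empty when no compatible update is available.
  string available_version = 1;
  string binary_url = 2;
}

service AgentUpdateService {
  rpc CheckAgentUpdate(CheckAgentUpdateRequest) returns (CheckAgentUpdateResponse);
}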
Authorization and authentication
I haven't gotten this far yet, but every device registered with the remote should have some way to identify impostor agents, every RPC should be guarded, etc.
Schemas
Protobuf files reside within proto/{domain}/{version} and will always generate at least Go code. Each schema has the following package option to create the package names used: {domain}pb for messages and {domain}rpc for ConnectRPC code.
option go_package = "fleetd.sh/gen/fleetd/v1;fleetpb";
Currently, these domains are either "fleetd", "agent" or "health". The former two are named after where the RPC services run: "fleetd" services run on the central server, while "agent" services run on each device, e.g. for discovery or local control. The health package is just a copy of the gRPC health protocol.
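If the same pattern holds for the agent domain, a schema under proto/agent/v1 would declare something like:
option go_package = "fleetd.sh/gen/agent/v1;agentpb";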
Principles
Package Structure
- Use reverse domain notation: package domain.service.v1
- Keep one service per package
- Version packages explicitly with a v1, v2 suffix
Message Design
- Prefix request/response with the RPC name: CreateUserRequest, CreateUserResponse
- Use singular names for single entities: User, not Users
- Make required fields required through validation
- Add field comments for non-obvious fields
- Group related fields together
- Reserve deleted field numbers and names
Field Numbering
- Start with field 1
- Leave gaps between fields (1, 3, 4, 6...)
- Group related fields in blocks of 10
- Never reuse field numbers
- Reserve removed field numbers
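For example, after removing two fields from a hypothetical message:
message Device {
  reserved 2, 5;           // numbers of removed fields, never reused
  reserved "old_hostname"; // name of a removed field
  string id = 1;
  string name = 3;
}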
Enums
- Prefix values with the enum name: STATUS_ACTIVE in a Status enum
- Always start with UNSPECIFIED = 0
- Leave gaps between values for future additions
- Add _UNSPECIFIED suffix to default value
Service Design
- One service per domain concept
- Use consistent verb prefixes (Create/Get/List/Update/Delete)
- Stream responses for large collections
- Make services focused and cohesive
- Document service purpose and scope
Types & Standards
- Use well-known types (google.protobuf.Timestamp)
- Consistent casing: snake_case for protobuf
- Reuse common patterns (pagination, filtering)
- Define reusable types in separate packages
- Follow buf.build style guide
Versioning & Compatibility
- Never change field numbers
- Only add optional fields
- Use deprecation comments before removal
- Keep backwards compatibility
- Version breaking changes in new package
Documentation
- Document message and field purpose
- Add service-level documentation
- Include examples in comments
- Document deprecation reasons
- Keep docs up to date
Common Patterns
// Reusable status
enum Status {
STATUS_UNSPECIFIED = 0;
STATUS_ACTIVE = 1;
STATUS_INACTIVE = 2;
}
// Pagination
message ListUsersRequest {
int32 page_size = 1;
string page_token = 2;
}
// Standard response
message StandardResponse {
string trace_id = 1;
repeated Error errors = 2;
google.protobuf.Timestamp server_time = 3;
}
// Error details
message Error {
string code = 1;
string message = 2;
string details = 3;
string help_url = 4;
}
Thoughts and design
Lastly, all of this should consider the following: first and foremost, build this for the developer and the actual users of the software. Ergonomics is super important, as are UX and design. Keep in mind that the agent and server software are open-source and should be built with readability and maintainability in mind. The closed-source managed service is primarily just a (good) client on top of the fleetd server.