Developer's Guide
This document contains maintainer-specific information.
Introduction
Kubernetes Operators are designed to automate the behavior of human operators for pieces of software. The Grafana Agent Operator, in particular, is based on the very popular Prometheus Operator:
We use the same v1 CRDs from the official project.
We aim to generate the same remote_write and scrape_configs that the Prometheus Operator does.
That being said, we're not fully compatible, and the Grafana Agent Operator has the same trade-offs that the Grafana Agent does: no recording rules, no alerts, no local storage for querying metrics.
The public Grafana Agent Operator design doc goes into more detail about the context and design decisions being made.
Updating CRDs
The make generate-crds command at the root of this repository generates CRDs and other code used by the operator. It calls the generate-crds script in a container. If you wish to call this script manually, you must also install controller-gen and gen-crd-api-reference-docs. Make sure to keep their versions in sync with what's defined in the Dockerfile.
Use the following to run the script in a container:
Testing Locally
Create a k3d cluster (requires k3d v4.x):
Deploy Prometheus
An example Prometheus server is provided in ./example-prometheus.yaml. Deploy it with the following, from the root of the repository:
You can view it at http://prometheus.k3d.localhost:30080 once the k3d cluster is running.
Apply the CRDs
Generated CRDs used by the operator can be found in the Production folder. Deploy them from the root of the repository with:
Run the Operator
Now that the CRDs are applied, you can run the operator from the root of the repository:
Apply a GrafanaAgent custom resource
Finally, you can apply an example GrafanaAgent custom resource. One is provided for you. From the root of the repository, run:
If you are running the operator, you should see it pick up the change and start mutating the cluster.
Development Architecture
This project makes heavy use of the Kubernetes SIG Controller Runtime project. That project has its own documentation, but for a high level overview of how it relates to this project:
The Grafana Agent Operator is composed of a single controller. A controller is responsible for responding to changes to Kubernetes resources.
Controllers can be notified about changes to:
One Primary resource (i.e., the GrafanaAgent CR)
Any number of secondary resources used to deploy the managed software (e.g., ServiceMonitor, PodMonitors). This is done using a custom event handler, which we'll detail below.
Any number of resources the Operator deploys (ConfigMaps, Secrets, StatefulSets). This is done using ownerReferences.
Controllers have one reconciler. The reconciler handles updating managed resources for one specific primary resource. The GrafanaAgent CRD is the primary resource, and the reconciler will handle updating managed resources for all discovered GrafanaAgent CRs. Each reconcile request is for a specific CR, such as agent-1 or agent-2.
A manager initializes all controllers for a project. It provides a caching Kubernetes client and propagates Kubernetes events to controllers.
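To illustrate how each reconcile request targets one specific CR, here is a minimal sketch of a request queue in plain Go. The Request and Queue types are hypothetical stand-ins written for this example; they are not the Controller Runtime types, and real work queues also handle rate limiting and retries.

```go
package main

import "fmt"

// Request identifies one primary resource (a GrafanaAgent CR).
// Hypothetical stand-in for the request type a controller receives.
type Request struct {
	Namespace string
	Name      string
}

// Queue is a simplified work queue that collapses duplicate requests,
// so several events for the same CR lead to a single reconcile.
type Queue struct {
	pending map[Request]bool
	order   []Request
}

// NewQueue returns an empty queue.
func NewQueue() *Queue {
	return &Queue{pending: make(map[Request]bool)}
}

// Add enqueues a request unless an identical one is already waiting.
func (q *Queue) Add(r Request) {
	if q.pending[r] {
		return
	}
	q.pending[r] = true
	q.order = append(q.order, r)
}

// Drain calls reconcile once per pending request, in arrival order.
func (q *Queue) Drain(reconcile func(Request)) {
	for len(q.order) > 0 {
		r := q.order[0]
		q.order = q.order[1:]
		delete(q.pending, r)
		reconcile(r)
	}
}

func main() {
	q := NewQueue()
	q.Add(Request{Namespace: "default", Name: "agent-1"})
	q.Add(Request{Namespace: "default", Name: "agent-2"})
	q.Add(Request{Namespace: "default", Name: "agent-1"}) // duplicate, collapsed

	q.Drain(func(r Request) {
		fmt.Printf("reconciling %s/%s\n", r.Namespace, r.Name)
	})
}
```

The single reconciler function serves every CR; only the request it receives changes from call to call.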
An EnqueueRequestForSelector event handler was added to handle changes to secondary resources, which is not a concept in the official Controller Runtime project. It works by allowing the reconciler to request events for a given primary resource whenever one of its secondary resources changes. This means that multiple primary resources can watch the same ServiceMonitor, and each of them is reconciled when it changes.
Event handlers are specific to a resource, so there is one EnqueueRequestForSelector handler per secondary resource type.
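The idea behind a selector-based event handler can be sketched in plain Go as follows. This is a simplified model, not the operator's actual implementation: the SelectorHandler type and its Notify, StopNotify, and Requests methods are hypothetical names chosen for illustration, and only exact-match label selectors are modeled.

```go
package main

import (
	"fmt"
	"sort"
)

// Key names a primary resource, e.g. "default/agent".
type Key string

// SelectorHandler is a hypothetical model of one event handler for one
// secondary resource type. Each primary resource registers the label
// selector it is interested in.
type SelectorHandler struct {
	selectors map[Key]map[string]string
}

// NewSelectorHandler returns a handler with no registered interest.
func NewSelectorHandler() *SelectorHandler {
	return &SelectorHandler{selectors: make(map[Key]map[string]string)}
}

// Notify records that changes to objects matching selector should
// trigger a reconcile of primary.
func (h *SelectorHandler) Notify(primary Key, selector map[string]string) {
	h.selectors[primary] = selector
}

// StopNotify removes the interest of primary, e.g. after it is deleted.
func (h *SelectorHandler) StopNotify(primary Key) {
	delete(h.selectors, primary)
}

// Requests returns, in sorted order, every primary resource whose
// selector matches the labels of the changed secondary object.
func (h *SelectorHandler) Requests(labels map[string]string) []Key {
	var out []Key
	for primary, sel := range h.selectors {
		if matches(sel, labels) {
			out = append(out, primary)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

// matches reports whether every selector pair is present in labels.
// An empty selector matches everything.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	serviceMonitors := NewSelectorHandler()
	serviceMonitors.Notify("default/agent-1", map[string]string{"instance": "primary"})
	serviceMonitors.Notify("default/agent-2", map[string]string{"instance": "primary"})

	// A ServiceMonitor labeled instance=primary changes: both agents reconcile.
	fmt.Println(serviceMonitors.Requests(map[string]string{"instance": "primary"}))

	// After agent-1 is deleted, it no longer receives events.
	serviceMonitors.StopNotify("default/agent-1")
	fmt.Println(serviceMonitors.Requests(map[string]string{"instance": "primary"}))
}
```

Keeping one handler per secondary resource type means a changed ServiceMonitor is only ever compared against selectors registered for ServiceMonitors.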
Reconciles are supposed to be idempotent, so deletes, updates, and creates should be treated the same. All managed resources are deployed with ownerReferences set, so managed resources will be automatically deleted by Kubernetes' garbage collector when the primary resource gets deleted by the user.
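As a rough illustration of idempotent reconciles and owner-based cleanup, the sketch below uses an in-memory map in place of the Kubernetes API. Every name here (Object, Cluster, Apply, Reconcile, CollectGarbage) is hypothetical, and the Owner field is only a simplified model of ownerReferences.

```go
package main

import "fmt"

// Object is a hypothetical, minimal managed resource.
type Object struct {
	Name  string
	Owner string // owning primary resource; models ownerReferences
	Spec  string
}

// Cluster is an in-memory stand-in for the Kubernetes API.
type Cluster map[string]Object

// Apply creates the object if it is missing and updates it if it
// differs from the desired state. It reports whether anything changed.
func (c Cluster) Apply(desired Object) bool {
	if current, ok := c[desired.Name]; ok && current == desired {
		return false
	}
	c[desired.Name] = desired
	return true
}

// Reconcile derives the desired objects for a primary resource and
// applies them. It does not care whether the trigger was a create,
// an update, or a delete of some watched resource.
func Reconcile(c Cluster, primary, config string) int {
	desired := []Object{
		{Name: primary + "-config", Owner: primary, Spec: config},
		{Name: primary + "-statefulset", Owner: primary, Spec: "uses " + primary + "-config"},
	}
	changed := 0
	for _, obj := range desired {
		if c.Apply(obj) {
			changed++
		}
	}
	return changed
}

// CollectGarbage removes every object owned by a deleted primary
// resource, similar in spirit to Kubernetes' garbage collector.
func CollectGarbage(c Cluster, deletedOwner string) {
	for name, obj := range c {
		if obj.Owner == deletedOwner {
			delete(c, name)
		}
	}
}

func main() {
	cluster := Cluster{}
	fmt.Println("first reconcile changed:", Reconcile(cluster, "agent", "v1"))
	fmt.Println("second reconcile changed:", Reconcile(cluster, "agent", "v1"))
	CollectGarbage(cluster, "agent")
	fmt.Println("objects left after owner deletion:", len(cluster))
}
```

Because Reconcile always computes the full desired state, running it again with unchanged input leaves the cluster untouched.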
Flow
This section walks through what happens when a user deploys a new GrafanaAgent CR:
A GrafanaAgent CR default/agent gets deployed to a cluster.
The controller's event handlers get notified about the event and queue a reconcile request for default/agent.
The reconciler discovers all secondary MetricsInstance resources referenced by default/agent.
The reconciler discovers all secondary ServiceMonitor, PodMonitor, and Probe resources that are referenced by the discovered MetricsInstance resources.
The reconciler informs the appropriate EnqueueRequestForSelector event handlers that changes to those resources should cause a new reconcile for default/agent.
The reconciler discovers all Secrets referenced across all current resources. The contents of the secrets are held in memory to statically configure Grafana Agent fields that do not support reading in from a file (e.g., basic auth username).
All the discovered secrets are copied to a new Secret in the default namespace. This is done in case a ServiceMonitor is found in a different namespace than where the Agent will be deployed.
A new Secret is created for the configuration of the Grafana Agent.
A StatefulSet is generated for the Grafana Agent.
When default/agent gets deleted, all EnqueueRequestForSelector event handlers get notified to stop sending events for default/agent.
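The secret-copying step in this flow can be sketched in plain Go as below. This is an illustration under assumptions, not the operator's code: the SecretRef type, the CopySecrets function, and the namespace_name_key format of the combined keys are all hypothetical.

```go
package main

import "fmt"

// SecretRef points at one key of a Secret in some namespace.
type SecretRef struct {
	Namespace string
	Name      string
	Key       string
}

// flatKey builds the key used in the combined Secret. The format is an
// assumption made for this sketch. Namespaces and Secret names cannot
// contain underscores, so the first two underscores are unambiguous.
func flatKey(r SecretRef) string {
	return fmt.Sprintf("%s_%s_%s", r.Namespace, r.Name, r.Key)
}

// CopySecrets gathers every referenced value, possibly from several
// namespaces, into one data map suitable for a single Secret that
// lives next to the Agent.
func CopySecrets(store map[SecretRef][]byte, refs []SecretRef) (map[string][]byte, error) {
	out := make(map[string][]byte, len(refs))
	for _, r := range refs {
		value, ok := store[r]
		if !ok {
			return nil, fmt.Errorf("secret %s/%s key %q not found", r.Namespace, r.Name, r.Key)
		}
		out[flatKey(r)] = value
	}
	return out, nil
}

func main() {
	// A ServiceMonitor in the "monitoring" namespace references this secret,
	// while the Agent itself runs in "default".
	ref := SecretRef{Namespace: "monitoring", Name: "auth", Key: "password"}
	store := map[SecretRef][]byte{ref: []byte("s3cr3t")}

	data, err := CopySecrets(store, []SecretRef{ref})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for key := range data {
		fmt.Println("combined Secret key:", key)
	}
}
```

Copying the values into one Secret in the Agent's namespace lets the Agent pods mount them even when the original Secret lives elsewhere.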