From 50eb140412d576e5fd7ae55dd719d53e14bbd20f Mon Sep 17 00:00:00 2001 From: Boden Fuller Date: Thu, 1 Jan 2026 21:09:15 -0500 Subject: [PATCH 1/2] feat(enterprise): add enterprise enablement features for Issue #476 Implements enterprise-grade deployment capabilities for kagent: ## High Availability - PodDisruptionBudget templates for controller and UI - Configurable minAvailable/maxUnavailable settings ## Observability - ServiceMonitor for Prometheus Operator integration - PrometheusRule with default alerts (down, error rate, latency) - Structured JSON audit logging with request tracing ## Security - OAuth2/OIDC authentication with JWT validation - JWKS caching with auto-refresh - Configurable claims extraction (user, roles, scopes) ## Multi-tenancy - Namespace-scoped controller via watchNamespaces - Namespace isolation validation in reconcilers - Cluster-wide mode when no namespaces specified ## OpenShift Support - Security context compatibility with restricted-v2 SCC - Deployment guide for Routes, SCCs, and PSS Tested on OpenShift 4.16 with: - Pods: Running (controller, UI) - PDBs: Created with minAvailable=1 - ServiceMonitor: Prometheus scraping active - PrometheusRule: Alerting rules loaded - Audit logs: Structured JSON with request IDs Closes: #476 Signed-off-by: Boden Fuller --- docs/openshift-deployment-guide.md | 575 ++++++++++++++ docs/security-context.md | 302 +++++++ .../controller/reconciler/reconciler.go | 63 ++ .../controller/reconciler/reconciler_test.go | 154 ++++ .../translator/agent/security_context_test.go | 614 +++++++++++++++ go/internal/httpserver/auth/oauth2.go | 446 +++++++++++ go/internal/httpserver/auth/oauth2_test.go | 736 ++++++++++++++++++ go/internal/httpserver/middleware.go | 151 ++++ .../httpserver/middleware_audit_test.go | 358 +++++++++ go/internal/httpserver/server.go | 1 + go/pkg/app/app.go | 1 + helm/kagent/templates/pdb.yaml | 40 + helm/kagent/templates/prometheusrule.yaml | 65 ++ helm/kagent/templates/servicemonitor.yaml | 41 + helm/kagent/tests/pdb_test.yaml | 158 ++++ helm/kagent/values.yaml | 88 ++- 16 files changed, 3791 insertions(+), 2 deletions(-) create mode 100644 docs/openshift-deployment-guide.md create mode 100644 docs/security-context.md create mode 100644 go/internal/httpserver/auth/oauth2.go create mode 100644 go/internal/httpserver/auth/oauth2_test.go create mode 100644 go/internal/httpserver/middleware_audit_test.go create mode 100644 helm/kagent/templates/pdb.yaml create mode 100644 helm/kagent/templates/prometheusrule.yaml create mode 100644 helm/kagent/templates/servicemonitor.yaml create mode 100644 helm/kagent/tests/pdb_test.yaml diff --git a/docs/openshift-deployment-guide.md b/docs/openshift-deployment-guide.md new file mode 100644 index 000000000..fa0a26b9d --- /dev/null +++ b/docs/openshift-deployment-guide.md @@ -0,0 +1,575 @@ +# OpenShift Deployment Guide + +This guide covers deploying kagent on Red Hat OpenShift Container Platform. OpenShift provides additional security features and enterprise capabilities on top of Kubernetes, which require specific configuration considerations. 
+ +## Table of Contents + +- [Prerequisites](#prerequisites) +- [OpenShift vs Kubernetes: Key Differences](#openshift-vs-kubernetes-key-differences) +- [Security Context Constraints (SCCs)](#security-context-constraints-sccs) + - [Understanding SCCs](#understanding-sccs) + - [Configuring kagent for SCCs](#configuring-kagent-for-sccs) + - [Creating Custom SCCs](#creating-custom-sccs) +- [Pod Security Standards (PSS)](#pod-security-standards-pss) + - [restricted-v2 Compatibility](#restricted-v2-compatibility) + - [Configuring Security Contexts](#configuring-security-contexts) +- [Routes vs Ingress](#routes-vs-ingress) + - [Automatic Route Creation](#automatic-route-creation) + - [Manual Route Configuration](#manual-route-configuration) + - [TLS Configuration](#tls-configuration) +- [Namespace Patterns and Multi-Tenancy](#namespace-patterns-and-multi-tenancy) + - [Single-Namespace Deployment](#single-namespace-deployment) + - [Multi-Namespace Deployment](#multi-namespace-deployment) + - [Project-Based Isolation](#project-based-isolation) +- [Deployment Scenarios](#deployment-scenarios) + - [Basic Deployment](#basic-deployment) + - [Production Deployment](#production-deployment) + - [Air-Gapped Deployment](#air-gapped-deployment) +- [Troubleshooting](#troubleshooting) + +## Prerequisites + +Before deploying kagent on OpenShift, ensure you have: + +- OpenShift Container Platform 4.12 or later +- `oc` CLI installed and configured +- Cluster-admin or project-admin privileges (depending on deployment scope) +- Helm 3.x installed +- Access to container images (either from `cr.kagent.dev` or your internal registry) + +## OpenShift vs Kubernetes: Key Differences + +OpenShift extends Kubernetes with additional security and operational features: + +| Feature | Kubernetes | OpenShift | +|---------|-----------|-----------| +| Ingress | Ingress resources | Routes (native), Ingress (supported) | +| Security | Pod Security Standards | Security Context Constraints (SCCs) | +| Projects | Namespaces | Projects (namespaces with additional annotations) | +| Registry | External registries | Built-in integrated registry | +| Networking | CNI plugins | OpenShift SDN or OVN-Kubernetes | + +## Security Context Constraints (SCCs) + +### Understanding SCCs + +Security Context Constraints (SCCs) are OpenShift's mechanism for controlling pod security. They define what actions pods can perform and what resources they can access. SCCs are more granular than Kubernetes Pod Security Standards. + +OpenShift includes several built-in SCCs: + +| SCC | Description | Use Case | +|-----|-------------|----------| +| `restricted-v2` | Most restrictive, default for authenticated users | Standard workloads | +| `restricted` | Legacy restricted policy | Backward compatibility | +| `nonroot-v2` | Allows any non-root UID | Workloads needing specific UIDs | +| `hostnetwork` | Allows host networking | Networking components | +| `privileged` | Full access | System components only | + +### Configuring kagent for SCCs + +The kagent Helm chart supports security context configuration through `values.yaml`. For most OpenShift deployments, the default `restricted-v2` SCC works well with proper security context settings. 
+ +To verify which SCC your pods are using: + +```bash +oc get pod -n kagent -o yaml | grep scc +``` + +Configure security contexts in your `values.yaml`: + +```yaml +# Pod-level security context (applies to all containers in the pod) +podSecurityContext: + runAsNonRoot: true + seccompProfile: + type: RuntimeDefault + +# Container-level security context +securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + runAsNonRoot: true + seccompProfile: + type: RuntimeDefault + +# Controller-specific security context (optional override) +controller: + securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + runAsNonRoot: true +``` + +### Creating Custom SCCs + +If your deployment requires capabilities not provided by built-in SCCs, create a custom SCC: + +```yaml +apiVersion: security.openshift.io/v1 +kind: SecurityContextConstraints +metadata: + name: kagent-scc +allowHostDirVolumePlugin: false +allowHostIPC: false +allowHostNetwork: false +allowHostPID: false +allowHostPorts: false +allowPrivilegeEscalation: false +allowPrivilegedContainer: false +allowedCapabilities: [] +defaultAddCapabilities: [] +fsGroup: + type: MustRunAs +readOnlyRootFilesystem: false +requiredDropCapabilities: + - ALL +runAsUser: + type: MustRunAsNonRoot +seLinuxContext: + type: MustRunAs +supplementalGroups: + type: RunAsAny +volumes: + - configMap + - downwardAPI + - emptyDir + - persistentVolumeClaim + - projected + - secret +users: + - system:serviceaccount:kagent:kagent-controller + - system:serviceaccount:kagent:kagent-ui +``` + +Apply the custom SCC: + +```bash +oc apply -f kagent-scc.yaml +``` + +## Pod Security Standards (PSS) + +### restricted-v2 Compatibility + +OpenShift 4.12+ uses the `restricted-v2` SCC by default, which aligns with the Kubernetes Pod Security Standards `restricted` profile. The kagent Helm chart is designed to be compatible with this profile. + +Key requirements for `restricted-v2` compatibility: + +1. **Non-root execution**: Containers must run as non-root users +2. **Read-only root filesystem**: Recommended but not required +3. **No privilege escalation**: `allowPrivilegeEscalation: false` +4. **Dropped capabilities**: All capabilities should be dropped +5. **Seccomp profile**: Must use `RuntimeDefault` or `Localhost` + +### Configuring Security Contexts + +For full `restricted-v2` compliance, use this configuration: + +```yaml +podSecurityContext: + runAsNonRoot: true + seccompProfile: + type: RuntimeDefault + +securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + readOnlyRootFilesystem: true + runAsNonRoot: true + seccompProfile: + type: RuntimeDefault +``` + +Note: If using SQLite storage (the default), the controller needs write access to the SQLite volume. The chart mounts this as an emptyDir, which works with read-only root filesystem. + +## Routes vs Ingress + +### Automatic Route Creation + +The kagent Helm chart automatically creates an OpenShift Route when deployed on OpenShift. This is detected via the `route.openshift.io/v1` API availability. 
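+
+In Helm templates this kind of detection is typically expressed as a
+capabilities check; a minimal sketch of the pattern (the `kagent.fullname`
+helper and the target port are illustrative — the chart's actual template may
+differ):
+
+```yaml
+{{- if .Capabilities.APIVersions.Has "route.openshift.io/v1" }}
+apiVersion: route.openshift.io/v1
+kind: Route
+metadata:
+  name: {{ include "kagent.fullname" . }}-ui
+spec:
+  to:
+    kind: Service
+    name: {{ include "kagent.fullname" . }}-ui
+  port:
+    targetPort: 8080
+  tls:
+    termination: edge
+    insecureEdgeTerminationPolicy: Redirect
+{{- end }}
+```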
+ +The Route is created for the UI service with the following defaults: + +- TLS termination: Edge +- Insecure traffic: Redirect to HTTPS + +To verify the Route was created: + +```bash +oc get route -n kagent +``` + +### Manual Route Configuration + +If you need custom Route configuration, you can disable the automatic Route and create your own: + +First, examine the auto-generated Route: + +```bash +oc get route kagent-ui -n kagent -o yaml +``` + +Create a custom Route: + +```yaml +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: kagent-ui + namespace: kagent + labels: + app.kubernetes.io/name: kagent + app.kubernetes.io/component: ui +spec: + host: kagent.apps.your-cluster.example.com + to: + kind: Service + name: kagent-ui + weight: 100 + port: + targetPort: 8080 + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + wildcardPolicy: None +``` + +### TLS Configuration + +For production deployments, configure TLS certificates: + +**Using cluster default certificate (edge termination):** + +```yaml +spec: + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect +``` + +**Using custom certificate:** + +```yaml +spec: + tls: + termination: edge + certificate: | + -----BEGIN CERTIFICATE----- + ... + -----END CERTIFICATE----- + key: | + -----BEGIN RSA PRIVATE KEY----- + ... + -----END RSA PRIVATE KEY----- + caCertificate: | + -----BEGIN CERTIFICATE----- + ... + -----END CERTIFICATE----- + insecureEdgeTerminationPolicy: Redirect +``` + +**Passthrough to application TLS:** + +```yaml +spec: + tls: + termination: passthrough +``` + +## Namespace Patterns and Multi-Tenancy + +### Single-Namespace Deployment + +The simplest deployment pattern puts all kagent components in a single namespace: + +```bash +# Create project (OpenShift's enhanced namespace) +oc new-project kagent + +# Install CRDs (cluster-scoped, requires cluster-admin) +helm install kagent-crds ./helm/kagent-crds -n kagent + +# Install kagent +helm install kagent ./helm/kagent -n kagent \ + --set providers.openAI.apiKey=$OPENAI_API_KEY +``` + +### Multi-Namespace Deployment + +For larger deployments, you may want to separate components: + +```yaml +# values-multi-namespace.yaml + +# Override namespace for CRDs (installed separately) +namespaceOverride: "" + +# Controller watches specific namespaces +controller: + watchNamespaces: + - kagent-agents-dev + - kagent-agents-staging + - kagent-agents-prod +``` + +Deploy agents in separate namespaces: + +```bash +# Create agent namespaces +oc new-project kagent-agents-dev +oc new-project kagent-agents-staging +oc new-project kagent-agents-prod + +# Label namespaces for controller discovery +oc label namespace kagent-agents-dev kagent.dev/managed=true +oc label namespace kagent-agents-staging kagent.dev/managed=true +oc label namespace kagent-agents-prod kagent.dev/managed=true +``` + +### Project-Based Isolation + +OpenShift Projects provide additional isolation through: + +1. **Network Policies**: Automatic network isolation between projects +2. **RBAC**: Project-scoped roles and bindings +3. 
**Resource Quotas**: Per-project resource limits + +Configure network policies to allow kagent controller communication: + +```yaml +apiVersion: networking.k8s.io/v1 +kind: NetworkPolicy +metadata: + name: allow-kagent-controller + namespace: kagent-agents-dev +spec: + podSelector: {} + policyTypes: + - Ingress + ingress: + - from: + - namespaceSelector: + matchLabels: + kubernetes.io/metadata.name: kagent +``` + +## Deployment Scenarios + +### Basic Deployment + +For development or testing environments: + +```bash +# Create project +oc new-project kagent + +# Install CRDs +helm install kagent-crds ./helm/kagent-crds -n kagent + +# Install with minimal configuration +helm install kagent ./helm/kagent -n kagent \ + --set providers.default=ollama \ + --set providers.ollama.config.host=ollama.kagent.svc.cluster.local:11434 +``` + +### Production Deployment + +For production environments, consider these settings: + +```yaml +# values-production.yaml + +# Use PostgreSQL for persistence +database: + type: postgres + postgres: + url: postgres://user:pass@postgresql.kagent.svc.cluster.local:5432/kagent + +# Security hardening +podSecurityContext: + runAsNonRoot: true + seccompProfile: + type: RuntimeDefault + +securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + readOnlyRootFilesystem: true + runAsNonRoot: true + seccompProfile: + type: RuntimeDefault + +# Resource limits +controller: + resources: + requests: + cpu: 200m + memory: 256Mi + limits: + cpu: 2 + memory: 1Gi + replicas: 2 + +ui: + resources: + requests: + cpu: 100m + memory: 256Mi + limits: + cpu: 1 + memory: 512Mi + replicas: 2 + +# Enable observability +otel: + tracing: + enabled: true + exporter: + otlp: + endpoint: http://jaeger-collector.observability.svc.cluster.local:4317 + insecure: true +``` + +Deploy with production values: + +```bash +helm install kagent ./helm/kagent -n kagent \ + -f values-production.yaml \ + --set providers.openAI.apiKeySecretRef=kagent-openai +``` + +### Air-Gapped Deployment + +For disconnected or air-gapped environments: + +1. **Mirror images to internal registry:** + +```bash +# Pull images +podman pull cr.kagent.dev/kagent-dev/kagent/controller:latest +podman pull cr.kagent.dev/kagent-dev/kagent/ui:latest +podman pull cr.kagent.dev/kagent-dev/kagent/app:latest + +# Tag for internal registry +podman tag cr.kagent.dev/kagent-dev/kagent/controller:latest \ + registry.internal.example.com/kagent/controller:latest + +# Push to internal registry +podman push registry.internal.example.com/kagent/controller:latest +``` + +2. **Configure Helm values for internal registry:** + +```yaml +# values-airgapped.yaml + +registry: registry.internal.example.com/kagent + +imagePullSecrets: + - name: internal-registry-secret + +controller: + image: + registry: registry.internal.example.com + repository: kagent/controller + +ui: + image: + registry: registry.internal.example.com + repository: kagent/ui +``` + +3. **Create image pull secret:** + +```bash +oc create secret docker-registry internal-registry-secret \ + --docker-server=registry.internal.example.com \ + --docker-username=user \ + --docker-password=password \ + -n kagent +``` + +## Troubleshooting + +### Common Issues + +**1. Pods stuck in CrashLoopBackOff with permission errors** + +This usually indicates SCC issues. Check which SCC is being used: + +```bash +oc get pod -n kagent -o yaml | grep scc +oc adm policy who-can use scc restricted-v2 +``` + +**2. 
Route not accessible** + +Verify the Route is created and has an assigned host: + +```bash +oc get route -n kagent +oc describe route kagent-ui -n kagent +``` + +Check if the router pods are running: + +```bash +oc get pods -n openshift-ingress +``` + +**3. Controller cannot watch other namespaces** + +Ensure proper RBAC is configured for cross-namespace access: + +```bash +oc get clusterrolebinding | grep kagent +oc auth can-i list agents --as=system:serviceaccount:kagent:kagent-controller -n other-namespace +``` + +**4. Image pull errors in air-gapped environment** + +Verify the image pull secret is correctly configured: + +```bash +oc get secret internal-registry-secret -n kagent +oc describe pod -n kagent | grep -A5 "Events" +``` + +### Useful Commands + +```bash +# View all kagent resources +oc get all -n kagent + +# Check pod security context +oc get pod -n kagent -o jsonpath='{.spec.securityContext}' + +# View SCC for a specific service account +oc adm policy who-can use scc restricted-v2 -n kagent + +# Debug network connectivity +oc debug pod/ -n kagent + +# View controller logs +oc logs -f deployment/kagent-controller -n kagent + +# View UI logs +oc logs -f deployment/kagent-ui -n kagent +``` + +### Getting Help + +If you encounter issues: + +1. Check the [kagent documentation](https://kagent.dev/docs/kagent) +2. Search existing [GitHub issues](https://github.com/kagent-dev/kagent/issues) +3. Join the [CNCF Slack #kagent channel](https://cloud-native.slack.com/archives/C08ETST0076) +4. Join the [kagent Discord server](https://discord.gg/Fu3k65f2k3) diff --git a/docs/security-context.md b/docs/security-context.md new file mode 100644 index 000000000..b32b9e7b4 --- /dev/null +++ b/docs/security-context.md @@ -0,0 +1,302 @@ +# Security Context Configuration + +This document describes how kagent handles security contexts for agent deployments, including compatibility with OpenShift Security Context Constraints (SCC) and Kubernetes Pod Security Standards (PSS). + +## Overview + +kagent allows users to configure security contexts at both the pod and container levels through the Agent CRD. These settings are passed through to the generated Kubernetes Deployment, enabling compliance with various security policies. + +## Configuration Options + +### Pod Security Context + +Configure pod-level security settings via `spec.declarative.deployment.podSecurityContext`: + +```yaml +apiVersion: kagent.dev/v1alpha2 +kind: Agent +metadata: + name: secure-agent +spec: + type: Declarative + declarative: + systemMessage: "A secure agent" + modelConfig: my-model + deployment: + podSecurityContext: + runAsNonRoot: true + runAsUser: 1000 + runAsGroup: 1000 + fsGroup: 1000 + seccompProfile: + type: RuntimeDefault +``` + +### Container Security Context + +Configure container-level security settings via `spec.declarative.deployment.securityContext`: + +```yaml +apiVersion: kagent.dev/v1alpha2 +kind: Agent +metadata: + name: secure-agent +spec: + type: Declarative + declarative: + systemMessage: "A secure agent" + modelConfig: my-model + deployment: + securityContext: + runAsNonRoot: true + runAsUser: 1000 + runAsGroup: 1000 + allowPrivilegeEscalation: false + readOnlyRootFilesystem: true + capabilities: + drop: + - ALL + seccompProfile: + type: RuntimeDefault +``` + +## OpenShift SCC Compatibility + +kagent supports deployment under various OpenShift Security Context Constraints. Below is a compatibility matrix showing which SCCs can be used with different kagent configurations. 
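+
+Note that for any SCC other than the one granted by default (`restricted-v2`
+for most workloads), the agent's service account must be explicitly granted
+access to that SCC. A typical grant looks like this (the service account and
+namespace names are illustrative):
+
+```bash
+oc adm policy add-scc-to-user anyuid -z my-agent-sa -n my-agents
+```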
+
+### SCC Compatibility Matrix
+
+| SCC | Basic Agent | Agent with Skills | Notes |
+|-----|-------------|-------------------|-------|
+| `restricted` | Yes | No | Most secure, requires UID range allocation |
+| `restricted-v2` | Yes | No | Updated restricted with seccomp requirements |
+| `anyuid` | Yes | No | Allows a specific UID without range restrictions |
+| `nonroot` | Yes | No | Must run as non-root user |
+| `nonroot-v2` | Yes | No | Updated nonroot with seccomp requirements |
+| `privileged` | Yes | Yes | Required for agents with skills/sandbox |
+
+### Restricted SCC Configuration
+
+For OpenShift's `restricted` or `restricted-v2` SCC:
+
+```yaml
+apiVersion: kagent.dev/v1alpha2
+kind: Agent
+metadata:
+  name: restricted-agent
+spec:
+  type: Declarative
+  declarative:
+    systemMessage: "Agent running under restricted SCC"
+    modelConfig: my-model
+    deployment:
+      podSecurityContext:
+        runAsNonRoot: true
+        # OpenShift will assign UID from namespace range
+        fsGroup: 1000680000 # Use your namespace's allocated range
+        seccompProfile:
+          type: RuntimeDefault
+      securityContext:
+        runAsNonRoot: true
+        allowPrivilegeEscalation: false
+        capabilities:
+          drop:
+            - ALL
+        seccompProfile:
+          type: RuntimeDefault
+```
+
+### Anyuid SCC Configuration
+
+For OpenShift's `anyuid` SCC (when you need a specific UID):
+
+```yaml
+apiVersion: kagent.dev/v1alpha2
+kind: Agent
+metadata:
+  name: anyuid-agent
+spec:
+  type: Declarative
+  declarative:
+    systemMessage: "Agent running under anyuid SCC"
+    modelConfig: my-model
+    deployment:
+      podSecurityContext:
+        runAsUser: 1000
+        runAsGroup: 1000
+        fsGroup: 1000
+      securityContext:
+        runAsUser: 1000
+        runAsGroup: 1000
+        allowPrivilegeEscalation: false
+        capabilities:
+          drop:
+            - ALL
+```
+
+### Privileged SCC Configuration
+
+For agents that require skills or sandboxed code execution:
+
+```yaml
+apiVersion: kagent.dev/v1alpha2
+kind: Agent
+metadata:
+  name: privileged-agent
+spec:
+  type: Declarative
+  skills:
+    refs:
+      - my-skill:latest
+  declarative:
+    systemMessage: "Agent with skills requiring privileged SCC"
+    modelConfig: my-model
+    deployment:
+      podSecurityContext:
+        runAsUser: 0
+        runAsGroup: 0
+```
+
+**Important**: When an agent uses skills, kagent automatically sets `privileged: true` on the container security context to enable sandbox functionality.
+
+## Pod Security Standards (PSS) Compliance
+
+Basic kagent agents can run under any Pod Security Standards level, while agents with skills require the `privileged` profile. 
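+
+PSS levels are enforced per namespace through the `pod-security.kubernetes.io`
+labels. For example, to enforce the `restricted` profile on a namespace (the
+namespace name is illustrative):
+
+```bash
+kubectl label namespace my-agents pod-security.kubernetes.io/enforce=restricted
+```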
The table below shows compatibility: + +### PSS Compatibility Matrix + +| PSS Level | Basic Agent | Agent with Skills | Notes | +|-----------|-------------|-------------------|-------| +| `privileged` | Yes | Yes | No restrictions | +| `baseline` | Yes | No | Prevents known privilege escalations | +| `restricted` | Yes | No | Highly restrictive, current best practices | + +### PSS Restricted Profile Configuration + +For namespaces enforcing the `restricted` PSS profile: + +```yaml +apiVersion: kagent.dev/v1alpha2 +kind: Agent +metadata: + name: pss-restricted-agent +spec: + type: Declarative + declarative: + systemMessage: "PSS restricted compliant agent" + modelConfig: my-model + deployment: + podSecurityContext: + runAsNonRoot: true + runAsUser: 65534 # nobody user + runAsGroup: 65534 + fsGroup: 65534 + seccompProfile: + type: RuntimeDefault + securityContext: + runAsNonRoot: true + runAsUser: 65534 + runAsGroup: 65534 + allowPrivilegeEscalation: false + readOnlyRootFilesystem: true + capabilities: + drop: + - ALL + seccompProfile: + type: RuntimeDefault + # Add volumes for writable paths when using readOnlyRootFilesystem + volumes: + - name: tmp + emptyDir: {} + - name: cache + emptyDir: {} + volumeMounts: + - name: tmp + mountPath: /tmp + - name: cache + mountPath: /cache +``` + +### PSS Baseline Profile Configuration + +For namespaces enforcing the `baseline` PSS profile: + +```yaml +apiVersion: kagent.dev/v1alpha2 +kind: Agent +metadata: + name: pss-baseline-agent +spec: + type: Declarative + declarative: + systemMessage: "PSS baseline compliant agent" + modelConfig: my-model + deployment: + securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + add: + - NET_BIND_SERVICE # Allowed in baseline +``` + +## Sandbox and Privileged Mode + +When an agent uses skills (via `spec.skills.refs`), kagent automatically enables privileged mode for the container. This is required for: + +- Container-based sandbox execution +- Isolated code execution environments +- Skill package installation and execution + +The privileged mode is applied regardless of user-provided security context settings, though other security context fields (like `runAsUser`) are preserved. + +### Implications + +1. **OpenShift**: Requires the `privileged` SCC to be assigned to the service account +2. **Kubernetes PSS**: Cannot run in namespaces with `restricted` or `baseline` enforcement +3. **Security**: Carefully review which agents require skills and limit their use in production + +## Best Practices + +1. **Start Restricted**: Begin with the most restrictive settings and relax only as needed +2. **Use Non-Root**: Always prefer running as non-root (`runAsNonRoot: true`) +3. **Drop Capabilities**: Drop all capabilities and add back only what's needed +4. **Enable Seccomp**: Use `RuntimeDefault` seccomp profile for defense in depth +5. **Read-Only Filesystem**: Use `readOnlyRootFilesystem: true` with explicit writable volume mounts +6. **Limit Skills Usage**: Only use skills when necessary, as they require privileged mode +7. 
**Namespace Isolation**: Run privileged agents in dedicated namespaces with appropriate PSS/SCC policies + +## Troubleshooting + +### Common Issues + +**Pod fails to start with "container has runAsNonRoot and image will run as root"** +- Ensure both `runAsNonRoot: true` and a non-zero `runAsUser` are set + +**Pod fails with SCC/PSS policy violations** +- Check the namespace's PSS enforcement level: `kubectl get ns -o yaml` +- For OpenShift, verify SCC assignments: `oc get scc` and `oc adm policy who-can use scc/` + +**Agent with skills fails to start** +- Verify the privileged SCC is available (OpenShift) +- Ensure the namespace allows privileged pods (Kubernetes PSS) + +### Verification Commands + +```bash +# Check which SCC a pod is using (OpenShift) +oc get pod -o yaml | grep scc + +# Check namespace PSS enforcement (Kubernetes) +kubectl get ns -o yaml | grep pod-security + +# Describe pod for security context issues +kubectl describe pod +``` + +## References + +- [Kubernetes Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/) +- [OpenShift Security Context Constraints](https://docs.openshift.com/container-platform/latest/authentication/managing-security-context-constraints.html) +- [Kubernetes Security Context](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/) diff --git a/go/internal/controller/reconciler/reconciler.go b/go/internal/controller/reconciler/reconciler.go index 258fd62a9..777936590 100644 --- a/go/internal/controller/reconciler/reconciler.go +++ b/go/internal/controller/reconciler/reconciler.go @@ -58,6 +58,11 @@ type kagentReconciler struct { defaultModelConfig types.NamespacedName + // watchedNamespaces contains the list of namespaces the controller is allowed to operate on. + // If empty, all namespaces are allowed (cluster-wide mode). + // If non-empty, only operations within these namespaces are permitted (namespace-scoped mode). + watchedNamespaces []string + // TODO: Remove this lock since we have a DB which we can batch anyway upsertLock sync.Mutex } @@ -67,16 +72,50 @@ func NewKagentReconciler( kube client.Client, dbClient database.Client, defaultModelConfig types.NamespacedName, + watchedNamespaces []string, ) KagentReconciler { return &kagentReconciler{ adkTranslator: translator, kube: kube, dbClient: dbClient, defaultModelConfig: defaultModelConfig, + watchedNamespaces: watchedNamespaces, + } +} + +// isNamespaceAllowed checks if the given namespace is allowed based on the watched namespaces configuration. +// Returns true if: +// - watchedNamespaces is empty (cluster-wide mode, all namespaces allowed) +// - namespace is in the watchedNamespaces list (namespace-scoped mode) +func (a *kagentReconciler) isNamespaceAllowed(namespace string) bool { + if len(a.watchedNamespaces) == 0 { + return true } + return slices.Contains(a.watchedNamespaces, namespace) +} + +// validateNamespaceIsolation checks if an operation in the given namespace is allowed. +// Returns an error if namespace isolation is violated. +func (a *kagentReconciler) validateNamespaceIsolation(namespace string) error { + if !a.isNamespaceAllowed(namespace) { + return fmt.Errorf("namespace %q is not in the list of watched namespaces; controller is in namespace-scoped mode watching: %v", namespace, a.watchedNamespaces) + } + return nil +} + +// IsNamespaceScopedMode returns true if the controller is operating in namespace-scoped mode +// (i.e., watching specific namespaces rather than all namespaces). 
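+// For example, a reconciler constructed with watchedNamespaces = []string{"team-a", "team-b"}
+// reports true, while one constructed with a nil or empty slice (cluster-wide mode) reports false.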
+func (a *kagentReconciler) IsNamespaceScopedMode() bool { + return len(a.watchedNamespaces) > 0 } func (a *kagentReconciler) ReconcileKagentAgent(ctx context.Context, req ctrl.Request) error { + // Enforce namespace isolation in namespace-scoped mode + if err := a.validateNamespaceIsolation(req.Namespace); err != nil { + reconcileLog.Info("Skipping agent reconciliation due to namespace isolation", "agent", req.NamespacedName, "error", err) + return nil + } + // TODO(sbx0r): missing finalizer logic agent := &v1alpha2.Agent{} if err := a.kube.Get(ctx, req.NamespacedName, agent); err != nil { @@ -171,6 +210,12 @@ func (a *kagentReconciler) reconcileAgentStatus(ctx context.Context, agent *v1al } func (a *kagentReconciler) ReconcileKagentMCPService(ctx context.Context, req ctrl.Request) error { + // Enforce namespace isolation in namespace-scoped mode + if err := a.validateNamespaceIsolation(req.Namespace); err != nil { + reconcileLog.Info("Skipping MCP service reconciliation due to namespace isolation", "service", req.NamespacedName, "error", err) + return nil + } + service := &corev1.Service{} if err := a.kube.Get(ctx, req.NamespacedName, service); err != nil { if apierrors.IsNotFound(err) { @@ -214,6 +259,12 @@ type secretRef struct { } func (a *kagentReconciler) ReconcileKagentModelConfig(ctx context.Context, req ctrl.Request) error { + // Enforce namespace isolation in namespace-scoped mode + if err := a.validateNamespaceIsolation(req.Namespace); err != nil { + reconcileLog.Info("Skipping model config reconciliation due to namespace isolation", "modelConfig", req.NamespacedName, "error", err) + return nil + } + modelConfig := &v1alpha2.ModelConfig{} if err := a.kube.Get(ctx, req.NamespacedName, modelConfig); err != nil { if apierrors.IsNotFound(err) { @@ -337,6 +388,12 @@ func (a *kagentReconciler) reconcileModelConfigStatus(ctx context.Context, model } func (a *kagentReconciler) ReconcileKagentMCPServer(ctx context.Context, req ctrl.Request) error { + // Enforce namespace isolation in namespace-scoped mode + if err := a.validateNamespaceIsolation(req.Namespace); err != nil { + reconcileLog.Info("Skipping MCP server reconciliation due to namespace isolation", "mcpServer", req.NamespacedName, "error", err) + return nil + } + mcpServer := &v1alpha1.MCPServer{} if err := a.kube.Get(ctx, req.NamespacedName, mcpServer); err != nil { if apierrors.IsNotFound(err) { @@ -375,6 +432,12 @@ func (a *kagentReconciler) ReconcileKagentMCPServer(ctx context.Context, req ctr } func (a *kagentReconciler) ReconcileKagentRemoteMCPServer(ctx context.Context, req ctrl.Request) error { + // Enforce namespace isolation in namespace-scoped mode + if err := a.validateNamespaceIsolation(req.Namespace); err != nil { + reconcileLog.Info("Skipping remote MCP server reconciliation due to namespace isolation", "remoteMCPServer", req.NamespacedName, "error", err) + return nil + } + nns := req.NamespacedName serverRef := nns.String() l := reconcileLog.WithValues("remoteMCPServer", serverRef) diff --git a/go/internal/controller/reconciler/reconciler_test.go b/go/internal/controller/reconciler/reconciler_test.go index 4cd6bd5ea..27ba14a7f 100644 --- a/go/internal/controller/reconciler/reconciler_test.go +++ b/go/internal/controller/reconciler/reconciler_test.go @@ -207,3 +207,157 @@ func TestAgentIDConsistency(t *testing.T) { assert.Equal(t, storeID, deleteID) } + +// TestIsNamespaceAllowed tests the namespace isolation logic +func TestIsNamespaceAllowed(t *testing.T) { + tests := []struct { + name string + 
watchedNamespaces []string + namespace string + expected bool + }{ + { + name: "cluster-wide mode (empty watchedNamespaces) allows all namespaces", + watchedNamespaces: []string{}, + namespace: "any-namespace", + expected: true, + }, + { + name: "cluster-wide mode allows default namespace", + watchedNamespaces: []string{}, + namespace: "default", + expected: true, + }, + { + name: "namespace-scoped mode allows watched namespace", + watchedNamespaces: []string{"ns1", "ns2", "ns3"}, + namespace: "ns1", + expected: true, + }, + { + name: "namespace-scoped mode allows another watched namespace", + watchedNamespaces: []string{"ns1", "ns2", "ns3"}, + namespace: "ns3", + expected: true, + }, + { + name: "namespace-scoped mode blocks non-watched namespace", + watchedNamespaces: []string{"ns1", "ns2"}, + namespace: "ns3", + expected: false, + }, + { + name: "namespace-scoped mode blocks default namespace if not in list", + watchedNamespaces: []string{"ns1", "ns2"}, + namespace: "default", + expected: false, + }, + { + name: "single namespace mode allows only that namespace", + watchedNamespaces: []string{"production"}, + namespace: "production", + expected: true, + }, + { + name: "single namespace mode blocks other namespaces", + watchedNamespaces: []string{"production"}, + namespace: "staging", + expected: false, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + r := &kagentReconciler{ + watchedNamespaces: tt.watchedNamespaces, + } + result := r.isNamespaceAllowed(tt.namespace) + assert.Equal(t, tt.expected, result) + }) + } +} + +// TestValidateNamespaceIsolation tests the namespace isolation validation +func TestValidateNamespaceIsolation(t *testing.T) { + tests := []struct { + name string + watchedNamespaces []string + namespace string + expectError bool + }{ + { + name: "cluster-wide mode returns no error", + watchedNamespaces: []string{}, + namespace: "any-namespace", + expectError: false, + }, + { + name: "namespace-scoped mode allows watched namespace", + watchedNamespaces: []string{"allowed-ns"}, + namespace: "allowed-ns", + expectError: false, + }, + { + name: "namespace-scoped mode returns error for non-watched namespace", + watchedNamespaces: []string{"allowed-ns"}, + namespace: "blocked-ns", + expectError: true, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + r := &kagentReconciler{ + watchedNamespaces: tt.watchedNamespaces, + } + err := r.validateNamespaceIsolation(tt.namespace) + if tt.expectError { + assert.Error(t, err) + assert.Contains(t, err.Error(), "namespace-scoped mode") + assert.Contains(t, err.Error(), tt.namespace) + } else { + assert.NoError(t, err) + } + }) + } +} + +// TestIsNamespaceScopedMode tests the namespace scoped mode detection +func TestIsNamespaceScopedMode(t *testing.T) { + tests := []struct { + name string + watchedNamespaces []string + expected bool + }{ + { + name: "empty watchedNamespaces is not namespace-scoped", + watchedNamespaces: []string{}, + expected: false, + }, + { + name: "nil watchedNamespaces is not namespace-scoped", + watchedNamespaces: nil, + expected: false, + }, + { + name: "single namespace is namespace-scoped", + watchedNamespaces: []string{"ns1"}, + expected: true, + }, + { + name: "multiple namespaces is namespace-scoped", + watchedNamespaces: []string{"ns1", "ns2", "ns3"}, + expected: true, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + r := &kagentReconciler{ + watchedNamespaces: tt.watchedNamespaces, + } + result := 
r.IsNamespaceScopedMode() + assert.Equal(t, tt.expected, result) + }) + } +} diff --git a/go/internal/controller/translator/agent/security_context_test.go b/go/internal/controller/translator/agent/security_context_test.go index 71f1d5296..8f477d88e 100644 --- a/go/internal/controller/translator/agent/security_context_test.go +++ b/go/internal/controller/translator/agent/security_context_test.go @@ -349,3 +349,617 @@ func TestSecurityContext_WithSandbox(t *testing.T) { assert.True(t, *containerSecurityContext.Privileged, "Privileged should be true when sandbox is needed") assert.Equal(t, int64(1000), *containerSecurityContext.RunAsUser, "User-provided runAsUser should still be set") } + +// OpenShift SCC (Security Context Constraints) Test Scenarios +// These tests verify compatibility with OpenShift's SCC policies: +// - restricted: Most restrictive, requires specific UID/GID ranges +// - anyuid: Allows running as any user ID +// - privileged: Allows privileged containers + +// TestSecurityContext_OpenShiftSCC_Restricted tests that agents can run under +// OpenShift's restricted SCC which enforces: +// - RunAsNonRoot: true +// - Specific UID/GID ranges (typically allocated by the namespace) +// - No privilege escalation +// - Capabilities dropped to minimum +func TestSecurityContext_OpenShiftSCC_Restricted(t *testing.T) { + ctx := context.Background() + + // OpenShift restricted SCC requires specific security settings + // UID ranges are typically allocated per-namespace (e.g., 1000680000-1000689999) + agent := &v1alpha2.Agent{ + ObjectMeta: metav1.ObjectMeta{ + Name: "test-agent", + Namespace: "test", + }, + Spec: v1alpha2.AgentSpec{ + Type: v1alpha2.AgentType_Declarative, + Declarative: &v1alpha2.DeclarativeAgentSpec{ + SystemMessage: "Test agent", + ModelConfig: "test-model", + Deployment: &v1alpha2.DeclarativeDeploymentSpec{ + SharedDeploymentSpec: v1alpha2.SharedDeploymentSpec{ + PodSecurityContext: &corev1.PodSecurityContext{ + RunAsNonRoot: ptr.To(true), + FSGroup: ptr.To(int64(1000680000)), + SeccompProfile: &corev1.SeccompProfile{ + Type: corev1.SeccompProfileTypeRuntimeDefault, + }, + }, + SecurityContext: &corev1.SecurityContext{ + RunAsNonRoot: ptr.To(true), + AllowPrivilegeEscalation: ptr.To(false), + Capabilities: &corev1.Capabilities{ + Drop: []corev1.Capability{"ALL"}, + }, + SeccompProfile: &corev1.SeccompProfile{ + Type: corev1.SeccompProfileTypeRuntimeDefault, + }, + }, + }, + }, + }, + }, + } + + modelConfig := &v1alpha2.ModelConfig{ + ObjectMeta: metav1.ObjectMeta{ + Name: "test-model", + Namespace: "test", + }, + Spec: v1alpha2.ModelConfigSpec{ + Provider: "OpenAI", + Model: "gpt-4o", + }, + } + + scheme := schemev1.Scheme + err := v1alpha2.AddToScheme(scheme) + require.NoError(t, err) + + kubeClient := fake.NewClientBuilder(). + WithScheme(scheme). + WithObjects(agent, modelConfig). 
+ Build() + + defaultModel := types.NamespacedName{ + Namespace: "test", + Name: "test-model", + } + translatorInstance := translator.NewAdkApiTranslator(kubeClient, defaultModel, nil) + + result, err := translatorInstance.TranslateAgent(ctx, agent) + require.NoError(t, err) + + var deployment *appsv1.Deployment + for _, obj := range result.Manifest { + if dep, ok := obj.(*appsv1.Deployment); ok { + deployment = dep + break + } + } + require.NotNil(t, deployment) + podTemplate := &deployment.Spec.Template + + // Verify pod security context meets restricted SCC requirements + podSecurityContext := podTemplate.Spec.SecurityContext + require.NotNil(t, podSecurityContext, "Pod securityContext should be set for restricted SCC") + assert.True(t, *podSecurityContext.RunAsNonRoot, "RunAsNonRoot must be true for restricted SCC") + assert.Equal(t, int64(1000680000), *podSecurityContext.FSGroup, "FSGroup should be in OpenShift allocated range") + require.NotNil(t, podSecurityContext.SeccompProfile, "SeccompProfile should be set") + assert.Equal(t, corev1.SeccompProfileTypeRuntimeDefault, podSecurityContext.SeccompProfile.Type) + + // Verify container security context meets restricted SCC requirements + containerSecurityContext := podTemplate.Spec.Containers[0].SecurityContext + require.NotNil(t, containerSecurityContext, "Container securityContext should be set for restricted SCC") + assert.True(t, *containerSecurityContext.RunAsNonRoot, "Container runAsNonRoot must be true") + assert.False(t, *containerSecurityContext.AllowPrivilegeEscalation, "AllowPrivilegeEscalation must be false") + require.NotNil(t, containerSecurityContext.Capabilities, "Capabilities should be set") + assert.Contains(t, containerSecurityContext.Capabilities.Drop, corev1.Capability("ALL"), "Should drop ALL capabilities") + assert.Empty(t, containerSecurityContext.Capabilities.Add, "No capabilities should be added for restricted SCC") +} + +// TestSecurityContext_OpenShiftSCC_Anyuid tests that agents can run under +// OpenShift's anyuid SCC which allows: +// - Running as any user ID (including root) +// - Still requires security best practices for containers +func TestSecurityContext_OpenShiftSCC_Anyuid(t *testing.T) { + ctx := context.Background() + + // Anyuid SCC allows running as specific UID without OpenShift range restrictions + agent := &v1alpha2.Agent{ + ObjectMeta: metav1.ObjectMeta{ + Name: "test-agent", + Namespace: "test", + }, + Spec: v1alpha2.AgentSpec{ + Type: v1alpha2.AgentType_Declarative, + Declarative: &v1alpha2.DeclarativeAgentSpec{ + SystemMessage: "Test agent", + ModelConfig: "test-model", + Deployment: &v1alpha2.DeclarativeDeploymentSpec{ + SharedDeploymentSpec: v1alpha2.SharedDeploymentSpec{ + PodSecurityContext: &corev1.PodSecurityContext{ + RunAsUser: ptr.To(int64(1000)), + RunAsGroup: ptr.To(int64(1000)), + FSGroup: ptr.To(int64(1000)), + }, + SecurityContext: &corev1.SecurityContext{ + RunAsUser: ptr.To(int64(1000)), + RunAsGroup: ptr.To(int64(1000)), + AllowPrivilegeEscalation: ptr.To(false), + Capabilities: &corev1.Capabilities{ + Drop: []corev1.Capability{"ALL"}, + }, + }, + }, + }, + }, + }, + } + + modelConfig := &v1alpha2.ModelConfig{ + ObjectMeta: metav1.ObjectMeta{ + Name: "test-model", + Namespace: "test", + }, + Spec: v1alpha2.ModelConfigSpec{ + Provider: "OpenAI", + Model: "gpt-4o", + }, + } + + scheme := schemev1.Scheme + err := v1alpha2.AddToScheme(scheme) + require.NoError(t, err) + + kubeClient := fake.NewClientBuilder(). + WithScheme(scheme). + WithObjects(agent, modelConfig). 
+ Build() + + defaultModel := types.NamespacedName{ + Namespace: "test", + Name: "test-model", + } + translatorInstance := translator.NewAdkApiTranslator(kubeClient, defaultModel, nil) + + result, err := translatorInstance.TranslateAgent(ctx, agent) + require.NoError(t, err) + + var deployment *appsv1.Deployment + for _, obj := range result.Manifest { + if dep, ok := obj.(*appsv1.Deployment); ok { + deployment = dep + break + } + } + require.NotNil(t, deployment) + podTemplate := &deployment.Spec.Template + + // Verify pod security context for anyuid SCC + podSecurityContext := podTemplate.Spec.SecurityContext + require.NotNil(t, podSecurityContext) + assert.Equal(t, int64(1000), *podSecurityContext.RunAsUser, "Anyuid allows specific user ID") + assert.Equal(t, int64(1000), *podSecurityContext.RunAsGroup) + assert.Equal(t, int64(1000), *podSecurityContext.FSGroup) + + // Verify container security context for anyuid SCC + containerSecurityContext := podTemplate.Spec.Containers[0].SecurityContext + require.NotNil(t, containerSecurityContext) + assert.Equal(t, int64(1000), *containerSecurityContext.RunAsUser) + assert.False(t, *containerSecurityContext.AllowPrivilegeEscalation, "Should still prevent privilege escalation") +} + +// TestSecurityContext_OpenShiftSCC_Privileged tests agents that require +// privileged access, such as those using sandboxed code execution. +// Note: This SCC should only be used when absolutely necessary. +func TestSecurityContext_OpenShiftSCC_Privileged(t *testing.T) { + ctx := context.Background() + + // When skills are used, kagent requires privileged mode for sandbox + agent := &v1alpha2.Agent{ + ObjectMeta: metav1.ObjectMeta{ + Name: "test-agent", + Namespace: "test", + }, + Spec: v1alpha2.AgentSpec{ + Type: v1alpha2.AgentType_Declarative, + Skills: &v1alpha2.SkillForAgent{ + Refs: []string{"code-exec-skill:latest"}, + }, + Declarative: &v1alpha2.DeclarativeAgentSpec{ + SystemMessage: "Test agent", + ModelConfig: "test-model", + Deployment: &v1alpha2.DeclarativeDeploymentSpec{ + SharedDeploymentSpec: v1alpha2.SharedDeploymentSpec{ + PodSecurityContext: &corev1.PodSecurityContext{ + RunAsUser: ptr.To(int64(0)), + RunAsGroup: ptr.To(int64(0)), + }, + }, + }, + }, + }, + } + + modelConfig := &v1alpha2.ModelConfig{ + ObjectMeta: metav1.ObjectMeta{ + Name: "test-model", + Namespace: "test", + }, + Spec: v1alpha2.ModelConfigSpec{ + Provider: "OpenAI", + Model: "gpt-4o", + }, + } + + scheme := schemev1.Scheme + err := v1alpha2.AddToScheme(scheme) + require.NoError(t, err) + + kubeClient := fake.NewClientBuilder(). + WithScheme(scheme). + WithObjects(agent, modelConfig). 
+ Build() + + defaultModel := types.NamespacedName{ + Namespace: "test", + Name: "test-model", + } + translatorInstance := translator.NewAdkApiTranslator(kubeClient, defaultModel, nil) + + result, err := translatorInstance.TranslateAgent(ctx, agent) + require.NoError(t, err) + + var deployment *appsv1.Deployment + for _, obj := range result.Manifest { + if dep, ok := obj.(*appsv1.Deployment); ok { + deployment = dep + break + } + } + require.NotNil(t, deployment) + podTemplate := &deployment.Spec.Template + + // Verify privileged container for sandbox + containerSecurityContext := podTemplate.Spec.Containers[0].SecurityContext + require.NotNil(t, containerSecurityContext) + assert.True(t, *containerSecurityContext.Privileged, "Privileged must be true for sandbox execution") + + // Pod security context should still be respected + podSecurityContext := podTemplate.Spec.SecurityContext + require.NotNil(t, podSecurityContext) + assert.Equal(t, int64(0), *podSecurityContext.RunAsUser, "Should run as root when privileged") +} + +// Pod Security Standards (PSS) Test Scenarios +// These tests verify compatibility with Kubernetes Pod Security Standards: +// - restricted: Highly restricted, following current pod hardening best practices +// - baseline: Minimally restrictive, prevents known privilege escalations +// - privileged: Unrestricted policy + +// TestSecurityContext_PSS_RestrictedV2 tests that agents can run under the +// Pod Security Standards "restricted" profile (v2), which is the most secure +// and requires: +// - Running as non-root +// - Seccomp profile set to RuntimeDefault or Localhost +// - Dropping all capabilities +// - No privilege escalation +// - Read-only root filesystem (recommended) +func TestSecurityContext_PSS_RestrictedV2(t *testing.T) { + ctx := context.Background() + + // PSS restricted profile (v2) configuration + agent := &v1alpha2.Agent{ + ObjectMeta: metav1.ObjectMeta{ + Name: "test-agent", + Namespace: "test", + }, + Spec: v1alpha2.AgentSpec{ + Type: v1alpha2.AgentType_Declarative, + Declarative: &v1alpha2.DeclarativeAgentSpec{ + SystemMessage: "Test agent", + ModelConfig: "test-model", + Deployment: &v1alpha2.DeclarativeDeploymentSpec{ + SharedDeploymentSpec: v1alpha2.SharedDeploymentSpec{ + PodSecurityContext: &corev1.PodSecurityContext{ + RunAsNonRoot: ptr.To(true), + RunAsUser: ptr.To(int64(65534)), // nobody user + RunAsGroup: ptr.To(int64(65534)), + FSGroup: ptr.To(int64(65534)), + SeccompProfile: &corev1.SeccompProfile{ + Type: corev1.SeccompProfileTypeRuntimeDefault, + }, + }, + SecurityContext: &corev1.SecurityContext{ + RunAsNonRoot: ptr.To(true), + RunAsUser: ptr.To(int64(65534)), + RunAsGroup: ptr.To(int64(65534)), + AllowPrivilegeEscalation: ptr.To(false), + ReadOnlyRootFilesystem: ptr.To(true), + Capabilities: &corev1.Capabilities{ + Drop: []corev1.Capability{"ALL"}, + }, + SeccompProfile: &corev1.SeccompProfile{ + Type: corev1.SeccompProfileTypeRuntimeDefault, + }, + }, + }, + }, + }, + }, + } + + modelConfig := &v1alpha2.ModelConfig{ + ObjectMeta: metav1.ObjectMeta{ + Name: "test-model", + Namespace: "test", + }, + Spec: v1alpha2.ModelConfigSpec{ + Provider: "OpenAI", + Model: "gpt-4o", + }, + } + + scheme := schemev1.Scheme + err := v1alpha2.AddToScheme(scheme) + require.NoError(t, err) + + kubeClient := fake.NewClientBuilder(). + WithScheme(scheme). + WithObjects(agent, modelConfig). 
+ Build() + + defaultModel := types.NamespacedName{ + Namespace: "test", + Name: "test-model", + } + translatorInstance := translator.NewAdkApiTranslator(kubeClient, defaultModel, nil) + + result, err := translatorInstance.TranslateAgent(ctx, agent) + require.NoError(t, err) + + var deployment *appsv1.Deployment + for _, obj := range result.Manifest { + if dep, ok := obj.(*appsv1.Deployment); ok { + deployment = dep + break + } + } + require.NotNil(t, deployment) + podTemplate := &deployment.Spec.Template + + // Verify pod security context meets PSS restricted requirements + podSecurityContext := podTemplate.Spec.SecurityContext + require.NotNil(t, podSecurityContext, "Pod securityContext is required for PSS restricted") + assert.True(t, *podSecurityContext.RunAsNonRoot, "RunAsNonRoot must be true for PSS restricted") + assert.Equal(t, int64(65534), *podSecurityContext.RunAsUser, "Should run as nobody user") + require.NotNil(t, podSecurityContext.SeccompProfile, "SeccompProfile is required for PSS restricted") + assert.Equal(t, corev1.SeccompProfileTypeRuntimeDefault, podSecurityContext.SeccompProfile.Type) + + // Verify container security context meets PSS restricted requirements + containerSecurityContext := podTemplate.Spec.Containers[0].SecurityContext + require.NotNil(t, containerSecurityContext, "Container securityContext is required") + assert.True(t, *containerSecurityContext.RunAsNonRoot, "RunAsNonRoot must be true") + assert.False(t, *containerSecurityContext.AllowPrivilegeEscalation, "AllowPrivilegeEscalation must be false") + assert.True(t, *containerSecurityContext.ReadOnlyRootFilesystem, "ReadOnlyRootFilesystem should be true") + + // Verify capabilities are dropped + require.NotNil(t, containerSecurityContext.Capabilities, "Capabilities must be configured") + assert.Contains(t, containerSecurityContext.Capabilities.Drop, corev1.Capability("ALL"), "Must drop ALL capabilities") + assert.Empty(t, containerSecurityContext.Capabilities.Add, "No capabilities should be added for PSS restricted") + + // Verify seccomp profile at container level + require.NotNil(t, containerSecurityContext.SeccompProfile, "Container seccompProfile is required") + assert.Equal(t, corev1.SeccompProfileTypeRuntimeDefault, containerSecurityContext.SeccompProfile.Type) +} + +// TestSecurityContext_PSS_Baseline tests the Pod Security Standards "baseline" +// profile which prevents known privilege escalations but is more permissive. 
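+// Unlike restricted, baseline still permits selectively re-adding capabilities
+// such as NET_BIND_SERVICE, which this test exercises.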
+func TestSecurityContext_PSS_Baseline(t *testing.T) { + ctx := context.Background() + + // PSS baseline profile - minimal restrictions + agent := &v1alpha2.Agent{ + ObjectMeta: metav1.ObjectMeta{ + Name: "test-agent", + Namespace: "test", + }, + Spec: v1alpha2.AgentSpec{ + Type: v1alpha2.AgentType_Declarative, + Declarative: &v1alpha2.DeclarativeAgentSpec{ + SystemMessage: "Test agent", + ModelConfig: "test-model", + Deployment: &v1alpha2.DeclarativeDeploymentSpec{ + SharedDeploymentSpec: v1alpha2.SharedDeploymentSpec{ + SecurityContext: &corev1.SecurityContext{ + AllowPrivilegeEscalation: ptr.To(false), + Capabilities: &corev1.Capabilities{ + Drop: []corev1.Capability{"ALL"}, + Add: []corev1.Capability{"NET_BIND_SERVICE"}, // Allowed in baseline + }, + }, + }, + }, + }, + }, + } + + modelConfig := &v1alpha2.ModelConfig{ + ObjectMeta: metav1.ObjectMeta{ + Name: "test-model", + Namespace: "test", + }, + Spec: v1alpha2.ModelConfigSpec{ + Provider: "OpenAI", + Model: "gpt-4o", + }, + } + + scheme := schemev1.Scheme + err := v1alpha2.AddToScheme(scheme) + require.NoError(t, err) + + kubeClient := fake.NewClientBuilder(). + WithScheme(scheme). + WithObjects(agent, modelConfig). + Build() + + defaultModel := types.NamespacedName{ + Namespace: "test", + Name: "test-model", + } + translatorInstance := translator.NewAdkApiTranslator(kubeClient, defaultModel, nil) + + result, err := translatorInstance.TranslateAgent(ctx, agent) + require.NoError(t, err) + + var deployment *appsv1.Deployment + for _, obj := range result.Manifest { + if dep, ok := obj.(*appsv1.Deployment); ok { + deployment = dep + break + } + } + require.NotNil(t, deployment) + podTemplate := &deployment.Spec.Template + + // Verify container security context meets PSS baseline requirements + containerSecurityContext := podTemplate.Spec.Containers[0].SecurityContext + require.NotNil(t, containerSecurityContext) + assert.False(t, *containerSecurityContext.AllowPrivilegeEscalation, "AllowPrivilegeEscalation should be false") + + // Baseline allows NET_BIND_SERVICE capability + require.NotNil(t, containerSecurityContext.Capabilities) + assert.Contains(t, containerSecurityContext.Capabilities.Add, corev1.Capability("NET_BIND_SERVICE"), "NET_BIND_SERVICE is allowed in baseline") +} + +// TestSecurityContext_PSS_RestrictedV2_WithVolumes tests PSS restricted compliance +// when the agent requires writable volumes (since ReadOnlyRootFilesystem is true). 
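+// Writable paths (/tmp and /cache) are supplied as emptyDir volumes declared
+// in the agent's deployment spec and verified on the rendered pod template.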
+func TestSecurityContext_PSS_RestrictedV2_WithVolumes(t *testing.T) { + ctx := context.Background() + + // PSS restricted with emptyDir volumes for writable paths + agent := &v1alpha2.Agent{ + ObjectMeta: metav1.ObjectMeta{ + Name: "test-agent", + Namespace: "test", + }, + Spec: v1alpha2.AgentSpec{ + Type: v1alpha2.AgentType_Declarative, + Declarative: &v1alpha2.DeclarativeAgentSpec{ + SystemMessage: "Test agent", + ModelConfig: "test-model", + Deployment: &v1alpha2.DeclarativeDeploymentSpec{ + SharedDeploymentSpec: v1alpha2.SharedDeploymentSpec{ + PodSecurityContext: &corev1.PodSecurityContext{ + RunAsNonRoot: ptr.To(true), + RunAsUser: ptr.To(int64(1000)), + FSGroup: ptr.To(int64(1000)), + SeccompProfile: &corev1.SeccompProfile{ + Type: corev1.SeccompProfileTypeRuntimeDefault, + }, + }, + SecurityContext: &corev1.SecurityContext{ + RunAsNonRoot: ptr.To(true), + AllowPrivilegeEscalation: ptr.To(false), + ReadOnlyRootFilesystem: ptr.To(true), + Capabilities: &corev1.Capabilities{ + Drop: []corev1.Capability{"ALL"}, + }, + SeccompProfile: &corev1.SeccompProfile{ + Type: corev1.SeccompProfileTypeRuntimeDefault, + }, + }, + Volumes: []corev1.Volume{ + { + Name: "tmp", + VolumeSource: corev1.VolumeSource{ + EmptyDir: &corev1.EmptyDirVolumeSource{}, + }, + }, + { + Name: "cache", + VolumeSource: corev1.VolumeSource{ + EmptyDir: &corev1.EmptyDirVolumeSource{}, + }, + }, + }, + VolumeMounts: []corev1.VolumeMount{ + { + Name: "tmp", + MountPath: "/tmp", + }, + { + Name: "cache", + MountPath: "/cache", + }, + }, + }, + }, + }, + }, + } + + modelConfig := &v1alpha2.ModelConfig{ + ObjectMeta: metav1.ObjectMeta{ + Name: "test-model", + Namespace: "test", + }, + Spec: v1alpha2.ModelConfigSpec{ + Provider: "OpenAI", + Model: "gpt-4o", + }, + } + + scheme := schemev1.Scheme + err := v1alpha2.AddToScheme(scheme) + require.NoError(t, err) + + kubeClient := fake.NewClientBuilder(). + WithScheme(scheme). + WithObjects(agent, modelConfig). 
+ Build() + + defaultModel := types.NamespacedName{ + Namespace: "test", + Name: "test-model", + } + translatorInstance := translator.NewAdkApiTranslator(kubeClient, defaultModel, nil) + + result, err := translatorInstance.TranslateAgent(ctx, agent) + require.NoError(t, err) + + var deployment *appsv1.Deployment + for _, obj := range result.Manifest { + if dep, ok := obj.(*appsv1.Deployment); ok { + deployment = dep + break + } + } + require.NotNil(t, deployment) + podTemplate := &deployment.Spec.Template + + // Verify PSS restricted compliance + containerSecurityContext := podTemplate.Spec.Containers[0].SecurityContext + require.NotNil(t, containerSecurityContext) + assert.True(t, *containerSecurityContext.ReadOnlyRootFilesystem, "ReadOnlyRootFilesystem should be true") + + // Verify writable volumes are present + volumeNames := make(map[string]bool) + for _, vol := range podTemplate.Spec.Volumes { + volumeNames[vol.Name] = true + } + assert.True(t, volumeNames["tmp"], "tmp volume should be present") + assert.True(t, volumeNames["cache"], "cache volume should be present") + + // Verify volume mounts + var container = podTemplate.Spec.Containers[0] + mountPaths := make(map[string]bool) + for _, mount := range container.VolumeMounts { + mountPaths[mount.MountPath] = true + } + assert.True(t, mountPaths["/tmp"], "/tmp mount should be present") + assert.True(t, mountPaths["/cache"], "/cache mount should be present") +} diff --git a/go/internal/httpserver/auth/oauth2.go b/go/internal/httpserver/auth/oauth2.go new file mode 100644 index 000000000..f27a57001 --- /dev/null +++ b/go/internal/httpserver/auth/oauth2.go @@ -0,0 +1,446 @@ +/* +Copyright 2025. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
+*/ + +package auth + +import ( + "context" + "encoding/json" + "errors" + "fmt" + "net/http" + "net/url" + "strings" + "sync" + "time" + + "github.com/lestrrat-go/jwx/v2/jwk" + "github.com/lestrrat-go/jwx/v2/jwt" + + "github.com/kagent-dev/kagent/go/pkg/auth" +) + +var ( + // ErrMissingToken is returned when no bearer token is provided + ErrMissingToken = errors.New("missing bearer token") + // ErrInvalidToken is returned when the token is invalid or expired + ErrInvalidToken = errors.New("invalid or expired token") + // ErrTokenValidation is returned when token validation fails + ErrTokenValidation = errors.New("token validation failed") + // ErrDiscoveryFailed is returned when OIDC discovery fails + ErrDiscoveryFailed = errors.New("OIDC discovery failed") +) + +// OAuth2Config contains configuration for the OAuth2 authenticator +type OAuth2Config struct { + // IssuerURL is the OIDC issuer URL (e.g., https://accounts.google.com) + IssuerURL string + // ClientID is the OAuth2 client ID for token validation + ClientID string + // Audience is the expected audience claim in the token + Audience string + // RequiredScopes are the scopes that must be present in the token + RequiredScopes []string + // UserIDClaim is the claim to use as the user ID (default: "sub") + UserIDClaim string + // RolesClaim is the claim to use for roles (default: "roles") + RolesClaim string + // SkipIssuerValidation disables issuer validation (not recommended for production) + SkipIssuerValidation bool + // SkipExpiryValidation disables expiry validation (not recommended for production) + SkipExpiryValidation bool + // HTTPClient is an optional custom HTTP client for OIDC discovery + HTTPClient *http.Client + // JWKSCacheDuration is how long to cache the JWKS (default: 1 hour) + JWKSCacheDuration time.Duration +} + +// oidcDiscoveryDocument represents the OIDC discovery document +type oidcDiscoveryDocument struct { + Issuer string `json:"issuer"` + AuthorizationEndpoint string `json:"authorization_endpoint"` + TokenEndpoint string `json:"token_endpoint"` + JWKSURI string `json:"jwks_uri"` + UserInfoEndpoint string `json:"userinfo_endpoint,omitempty"` + ScopesSupported []string `json:"scopes_supported,omitempty"` + ClaimsSupported []string `json:"claims_supported,omitempty"` +} + +// OAuth2Authenticator implements the auth.AuthProvider interface for OAuth2/OIDC +type OAuth2Authenticator struct { + config OAuth2Config + httpClient *http.Client + + // JWKS cache + jwksMu sync.RWMutex + jwksCache jwk.Set + jwksCacheTime time.Time + + // Discovery cache + discoveryMu sync.RWMutex + discoveryCache *oidcDiscoveryDocument + discoveryCacheTime time.Time +} + +// OAuth2Session implements the auth.Session interface +type OAuth2Session struct { + principal auth.Principal + accessToken string + claims map[string]interface{} + expiresAt time.Time +} + +func (s *OAuth2Session) Principal() auth.Principal { + return s.principal +} + +// Claims returns the raw JWT claims +func (s *OAuth2Session) Claims() map[string]interface{} { + return s.claims +} + +// AccessToken returns the original access token +func (s *OAuth2Session) AccessToken() string { + return s.accessToken +} + +// ExpiresAt returns when the session expires +func (s *OAuth2Session) ExpiresAt() time.Time { + return s.expiresAt +} + +// NewOAuth2Authenticator creates a new OAuth2 authenticator +func NewOAuth2Authenticator(config OAuth2Config) (*OAuth2Authenticator, error) { + if config.IssuerURL == "" { + return nil, fmt.Errorf("issuer URL is required") + } + + // Set 
defaults + if config.UserIDClaim == "" { + config.UserIDClaim = "sub" + } + if config.RolesClaim == "" { + config.RolesClaim = "roles" + } + if config.JWKSCacheDuration == 0 { + config.JWKSCacheDuration = time.Hour + } + + httpClient := config.HTTPClient + if httpClient == nil { + httpClient = &http.Client{ + Timeout: 30 * time.Second, + } + } + + return &OAuth2Authenticator{ + config: config, + httpClient: httpClient, + }, nil +} + +// Authenticate validates the bearer token and returns a session +func (a *OAuth2Authenticator) Authenticate(ctx context.Context, reqHeaders http.Header, query url.Values) (auth.Session, error) { + // Extract bearer token from Authorization header + token := extractBearerToken(reqHeaders) + if token == "" { + // Also check query parameter for WebSocket connections + token = query.Get("access_token") + } + if token == "" { + return nil, ErrMissingToken + } + + // Validate the token + session, err := a.validateToken(ctx, token) + if err != nil { + return nil, fmt.Errorf("%w: %v", ErrInvalidToken, err) + } + + return session, nil +} + +// UpstreamAuth adds authentication to upstream requests +func (a *OAuth2Authenticator) UpstreamAuth(r *http.Request, session auth.Session, upstreamPrincipal auth.Principal) error { + if session == nil { + return nil + } + + // Forward the user ID in a header + if session.Principal().User.ID != "" { + r.Header.Set("X-User-Id", session.Principal().User.ID) + } + + // If we have an OAuth2Session, forward the access token + if oauth2Session, ok := session.(*OAuth2Session); ok && oauth2Session.accessToken != "" { + r.Header.Set("Authorization", "Bearer "+oauth2Session.accessToken) + } + + return nil +} + +// validateToken validates a JWT token and extracts claims +func (a *OAuth2Authenticator) validateToken(ctx context.Context, tokenString string) (*OAuth2Session, error) { + // Get JWKS for validation + keySet, err := a.getJWKS(ctx) + if err != nil { + return nil, fmt.Errorf("failed to get JWKS: %w", err) + } + + // Build validation options + validateOpts := []jwt.ValidateOption{ + jwt.WithContext(ctx), + } + + // If skipping expiry validation, reset default validators to avoid automatic exp check + if a.config.SkipExpiryValidation { + validateOpts = append(validateOpts, jwt.WithResetValidators(true)) + } + + if !a.config.SkipIssuerValidation && a.config.IssuerURL != "" { + validateOpts = append(validateOpts, jwt.WithIssuer(a.config.IssuerURL)) + } + + if a.config.Audience != "" { + validateOpts = append(validateOpts, jwt.WithAudience(a.config.Audience)) + } + + // Parse and validate the token + // Disable automatic validation during parsing so we can control it + parseOpts := []jwt.ParseOption{ + jwt.WithKeySet(keySet), + jwt.WithValidate(false), // We'll validate manually after parsing + } + + token, err := jwt.ParseString(tokenString, parseOpts...) 
+ if err != nil { + return nil, fmt.Errorf("failed to parse token: %w", err) + } + + // Validate the token + if err := jwt.Validate(token, validateOpts...); err != nil { + return nil, fmt.Errorf("token validation failed: %w", err) + } + + // Check required scopes if configured + if len(a.config.RequiredScopes) > 0 { + if err := a.validateScopes(token); err != nil { + return nil, err + } + } + + // Extract claims for session + claims := make(map[string]interface{}) + for iter := token.Iterate(ctx); iter.Next(ctx); { + pair := iter.Pair() + claims[pair.Key.(string)] = pair.Value + } + + // Extract user ID + userID := "" + if idClaim, ok := claims[a.config.UserIDClaim]; ok { + if id, ok := idClaim.(string); ok { + userID = id + } + } + + // Extract roles + var roles []string + if rolesClaim, ok := claims[a.config.RolesClaim]; ok { + switch r := rolesClaim.(type) { + case []interface{}: + for _, role := range r { + if s, ok := role.(string); ok { + roles = append(roles, s) + } + } + case []string: + roles = r + case string: + // Some providers return roles as a space-separated string + roles = strings.Fields(r) + } + } + + // Get expiration time + expiresAt := token.Expiration() + + return &OAuth2Session{ + principal: auth.Principal{ + User: auth.User{ + ID: userID, + Roles: roles, + }, + }, + accessToken: tokenString, + claims: claims, + expiresAt: expiresAt, + }, nil +} + +// validateScopes checks if the token has all required scopes +func (a *OAuth2Authenticator) validateScopes(token jwt.Token) error { + scopeClaim, ok := token.Get("scope") + if !ok { + // Try "scp" claim (used by some providers like Azure AD) + scopeClaim, ok = token.Get("scp") + } + + if !ok { + return fmt.Errorf("token missing scope claim") + } + + var tokenScopes []string + switch s := scopeClaim.(type) { + case string: + tokenScopes = strings.Fields(s) + case []interface{}: + for _, scope := range s { + if str, ok := scope.(string); ok { + tokenScopes = append(tokenScopes, str) + } + } + case []string: + tokenScopes = s + default: + return fmt.Errorf("unexpected scope claim type: %T", scopeClaim) + } + + scopeSet := make(map[string]bool) + for _, s := range tokenScopes { + scopeSet[s] = true + } + + for _, required := range a.config.RequiredScopes { + if !scopeSet[required] { + return fmt.Errorf("missing required scope: %s", required) + } + } + + return nil +} + +// getJWKS retrieves the JWKS, using cache when available +func (a *OAuth2Authenticator) getJWKS(ctx context.Context) (jwk.Set, error) { + a.jwksMu.RLock() + if a.jwksCache != nil && time.Since(a.jwksCacheTime) < a.config.JWKSCacheDuration { + defer a.jwksMu.RUnlock() + return a.jwksCache, nil + } + a.jwksMu.RUnlock() + + // Need to refresh cache + a.jwksMu.Lock() + defer a.jwksMu.Unlock() + + // Double-check after acquiring write lock + if a.jwksCache != nil && time.Since(a.jwksCacheTime) < a.config.JWKSCacheDuration { + return a.jwksCache, nil + } + + // Get JWKS URI from discovery + discovery, err := a.getDiscoveryDocument(ctx) + if err != nil { + return nil, err + } + + // Fetch JWKS + keySet, err := jwk.Fetch(ctx, discovery.JWKSURI, jwk.WithHTTPClient(a.httpClient)) + if err != nil { + return nil, fmt.Errorf("failed to fetch JWKS: %w", err) + } + + a.jwksCache = keySet + a.jwksCacheTime = time.Now() + + return keySet, nil +} + +// getDiscoveryDocument retrieves the OIDC discovery document +func (a *OAuth2Authenticator) getDiscoveryDocument(ctx context.Context) (*oidcDiscoveryDocument, error) { + a.discoveryMu.RLock() + if a.discoveryCache != nil && 
time.Since(a.discoveryCacheTime) < a.config.JWKSCacheDuration { + defer a.discoveryMu.RUnlock() + return a.discoveryCache, nil + } + a.discoveryMu.RUnlock() + + a.discoveryMu.Lock() + defer a.discoveryMu.Unlock() + + // Double-check after acquiring write lock + if a.discoveryCache != nil && time.Since(a.discoveryCacheTime) < a.config.JWKSCacheDuration { + return a.discoveryCache, nil + } + + // Build discovery URL + issuerURL := strings.TrimSuffix(a.config.IssuerURL, "/") + discoveryURL := issuerURL + "/.well-known/openid-configuration" + + req, err := http.NewRequestWithContext(ctx, http.MethodGet, discoveryURL, nil) + if err != nil { + return nil, fmt.Errorf("failed to create discovery request: %w", err) + } + + resp, err := a.httpClient.Do(req) + if err != nil { + return nil, fmt.Errorf("%w: %v", ErrDiscoveryFailed, err) + } + defer resp.Body.Close() + + if resp.StatusCode != http.StatusOK { + return nil, fmt.Errorf("%w: status %d", ErrDiscoveryFailed, resp.StatusCode) + } + + var discovery oidcDiscoveryDocument + if err := json.NewDecoder(resp.Body).Decode(&discovery); err != nil { + return nil, fmt.Errorf("failed to decode discovery document: %w", err) + } + + a.discoveryCache = &discovery + a.discoveryCacheTime = time.Now() + + return &discovery, nil +} + +// extractBearerToken extracts the bearer token from the Authorization header +func extractBearerToken(headers http.Header) string { + authHeader := headers.Get("Authorization") + if authHeader == "" { + return "" + } + + // Check for Bearer prefix (case-insensitive) + if len(authHeader) > 7 && strings.EqualFold(authHeader[:7], "bearer ") { + return authHeader[7:] + } + + return "" +} + +// ClearCache clears the JWKS and discovery caches +func (a *OAuth2Authenticator) ClearCache() { + a.jwksMu.Lock() + a.jwksCache = nil + a.jwksMu.Unlock() + + a.discoveryMu.Lock() + a.discoveryCache = nil + a.discoveryMu.Unlock() +} + +// Ensure OAuth2Authenticator implements auth.AuthProvider +var _ auth.AuthProvider = (*OAuth2Authenticator)(nil) diff --git a/go/internal/httpserver/auth/oauth2_test.go b/go/internal/httpserver/auth/oauth2_test.go new file mode 100644 index 000000000..180540777 --- /dev/null +++ b/go/internal/httpserver/auth/oauth2_test.go @@ -0,0 +1,736 @@ +/* +Copyright 2025. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
+*/ + +package auth + +import ( + "context" + "crypto/rand" + "crypto/rsa" + "encoding/json" + "net/http" + "net/http/httptest" + "net/url" + "testing" + "time" + + "github.com/lestrrat-go/jwx/v2/jwa" + "github.com/lestrrat-go/jwx/v2/jwk" + "github.com/lestrrat-go/jwx/v2/jwt" + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/kagent-dev/kagent/go/pkg/auth" +) + +// testOIDCServer creates a mock OIDC server for testing +type testOIDCServer struct { + server *httptest.Server + privateKey *rsa.PrivateKey + keyID string +} + +func newTestOIDCServer(t *testing.T) *testOIDCServer { + t.Helper() + + // Generate RSA key pair for signing + privateKey, err := rsa.GenerateKey(rand.Reader, 2048) + require.NoError(t, err) + + keyID := "test-key-1" + + // Create JWK from public key + publicJWK, err := jwk.FromRaw(privateKey.PublicKey) + require.NoError(t, err) + err = publicJWK.Set(jwk.KeyIDKey, keyID) + require.NoError(t, err) + err = publicJWK.Set(jwk.AlgorithmKey, jwa.RS256) + require.NoError(t, err) + err = publicJWK.Set(jwk.KeyUsageKey, "sig") + require.NoError(t, err) + + keySet := jwk.NewSet() + err = keySet.AddKey(publicJWK) + require.NoError(t, err) + + ts := &testOIDCServer{ + privateKey: privateKey, + keyID: keyID, + } + + // Create test server + mux := http.NewServeMux() + + // OIDC discovery endpoint + mux.HandleFunc("/.well-known/openid-configuration", func(w http.ResponseWriter, r *http.Request) { + discovery := map[string]interface{}{ + "issuer": ts.server.URL, + "authorization_endpoint": ts.server.URL + "/authorize", + "token_endpoint": ts.server.URL + "/token", + "jwks_uri": ts.server.URL + "/jwks", + "userinfo_endpoint": ts.server.URL + "/userinfo", + } + w.Header().Set("Content-Type", "application/json") + json.NewEncoder(w).Encode(discovery) + }) + + // JWKS endpoint + mux.HandleFunc("/jwks", func(w http.ResponseWriter, r *http.Request) { + w.Header().Set("Content-Type", "application/json") + json.NewEncoder(w).Encode(keySet) + }) + + ts.server = httptest.NewServer(mux) + + return ts +} + +func (ts *testOIDCServer) Close() { + ts.server.Close() +} + +func (ts *testOIDCServer) URL() string { + return ts.server.URL +} + +func (ts *testOIDCServer) createToken(t *testing.T, claims map[string]interface{}, expiry time.Duration) string { + t.Helper() + + builder := jwt.NewBuilder() + + // Set standard claims + builder.Issuer(ts.server.URL) + builder.IssuedAt(time.Now()) + builder.Expiration(time.Now().Add(expiry)) + + // Set custom claims + for k, v := range claims { + builder.Claim(k, v) + } + + token, err := builder.Build() + require.NoError(t, err) + + // Create signing key + privateJWK, err := jwk.FromRaw(ts.privateKey) + require.NoError(t, err) + err = privateJWK.Set(jwk.KeyIDKey, ts.keyID) + require.NoError(t, err) + err = privateJWK.Set(jwk.AlgorithmKey, jwa.RS256) + require.NoError(t, err) + + // Sign the token + signed, err := jwt.Sign(token, jwt.WithKey(jwa.RS256, privateJWK)) + require.NoError(t, err) + + return string(signed) +} + +func TestNewOAuth2Authenticator(t *testing.T) { + tests := []struct { + name string + config OAuth2Config + expectError bool + }{ + { + name: "valid config", + config: OAuth2Config{ + IssuerURL: "https://example.com", + ClientID: "test-client", + }, + expectError: false, + }, + { + name: "missing issuer URL", + config: OAuth2Config{ + ClientID: "test-client", + }, + expectError: true, + }, + { + name: "default claims are set", + config: OAuth2Config{ + IssuerURL: "https://example.com", + }, + expectError: 
false, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + auth, err := NewOAuth2Authenticator(tt.config) + if tt.expectError { + assert.Error(t, err) + assert.Nil(t, auth) + } else { + assert.NoError(t, err) + assert.NotNil(t, auth) + } + }) + } +} + +func TestOAuth2Authenticator_Authenticate(t *testing.T) { + ts := newTestOIDCServer(t) + defer ts.Close() + + authenticator, err := NewOAuth2Authenticator(OAuth2Config{ + IssuerURL: ts.URL(), + UserIDClaim: "sub", + RolesClaim: "roles", + }) + require.NoError(t, err) + + tests := []struct { + name string + setupHeaders func() http.Header + setupQuery func() url.Values + expectedUserID string + expectedRoles []string + expectError bool + errorContains string + }{ + { + name: "valid token in Authorization header", + setupHeaders: func() http.Header { + token := ts.createToken(t, map[string]interface{}{ + "sub": "user123", + "roles": []interface{}{"admin", "user"}, + }, time.Hour) + h := http.Header{} + h.Set("Authorization", "Bearer "+token) + return h + }, + setupQuery: func() url.Values { return url.Values{} }, + expectedUserID: "user123", + expectedRoles: []string{"admin", "user"}, + expectError: false, + }, + { + name: "valid token in query parameter", + setupHeaders: func() http.Header { + return http.Header{} + }, + setupQuery: func() url.Values { + token := ts.createToken(t, map[string]interface{}{ + "sub": "user456", + }, time.Hour) + q := url.Values{} + q.Set("access_token", token) + return q + }, + expectedUserID: "user456", + expectError: false, + }, + { + name: "missing token", + setupHeaders: func() http.Header { + return http.Header{} + }, + setupQuery: func() url.Values { return url.Values{} }, + expectError: true, + errorContains: "missing bearer token", + }, + { + name: "expired token", + setupHeaders: func() http.Header { + token := ts.createToken(t, map[string]interface{}{ + "sub": "user789", + }, -time.Hour) // Already expired + h := http.Header{} + h.Set("Authorization", "Bearer "+token) + return h + }, + setupQuery: func() url.Values { return url.Values{} }, + expectError: true, + errorContains: "invalid or expired token", + }, + { + name: "invalid token format", + setupHeaders: func() http.Header { + h := http.Header{} + h.Set("Authorization", "Bearer invalid-token") + return h + }, + setupQuery: func() url.Values { return url.Values{} }, + expectError: true, + errorContains: "invalid or expired token", + }, + { + name: "bearer prefix case insensitive", + setupHeaders: func() http.Header { + token := ts.createToken(t, map[string]interface{}{ + "sub": "user-case", + }, time.Hour) + h := http.Header{} + h.Set("Authorization", "BEARER "+token) + return h + }, + setupQuery: func() url.Values { return url.Values{} }, + expectedUserID: "user-case", + expectError: false, + }, + { + name: "roles as space-separated string", + setupHeaders: func() http.Header { + token := ts.createToken(t, map[string]interface{}{ + "sub": "user-roles", + "roles": "role1 role2 role3", + }, time.Hour) + h := http.Header{} + h.Set("Authorization", "Bearer "+token) + return h + }, + setupQuery: func() url.Values { return url.Values{} }, + expectedUserID: "user-roles", + expectedRoles: []string{"role1", "role2", "role3"}, + expectError: false, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + ctx := context.Background() + session, err := authenticator.Authenticate(ctx, tt.setupHeaders(), tt.setupQuery()) + + if tt.expectError { + assert.Error(t, err) + if tt.errorContains != "" { + 
assert.Contains(t, err.Error(), tt.errorContains) + } + return + } + + require.NoError(t, err) + require.NotNil(t, session) + + principal := session.Principal() + assert.Equal(t, tt.expectedUserID, principal.User.ID) + if len(tt.expectedRoles) > 0 { + assert.Equal(t, tt.expectedRoles, principal.User.Roles) + } + + // Verify it's an OAuth2Session + oauth2Session, ok := session.(*OAuth2Session) + require.True(t, ok) + assert.NotEmpty(t, oauth2Session.AccessToken()) + assert.NotNil(t, oauth2Session.Claims()) + }) + } +} + +func TestOAuth2Authenticator_RequiredScopes(t *testing.T) { + ts := newTestOIDCServer(t) + defer ts.Close() + + authenticator, err := NewOAuth2Authenticator(OAuth2Config{ + IssuerURL: ts.URL(), + RequiredScopes: []string{"read", "write"}, + }) + require.NoError(t, err) + + tests := []struct { + name string + scopes interface{} + expectError bool + }{ + { + name: "has all required scopes as string", + scopes: "read write delete", + expectError: false, + }, + { + name: "has all required scopes as array", + scopes: []interface{}{"read", "write", "admin"}, + expectError: false, + }, + { + name: "missing required scope", + scopes: "read", + expectError: true, + }, + { + name: "no scopes", + scopes: nil, + expectError: true, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + claims := map[string]interface{}{ + "sub": "user123", + } + if tt.scopes != nil { + claims["scope"] = tt.scopes + } + + token := ts.createToken(t, claims, time.Hour) + headers := http.Header{} + headers.Set("Authorization", "Bearer "+token) + + ctx := context.Background() + _, err := authenticator.Authenticate(ctx, headers, url.Values{}) + + if tt.expectError { + assert.Error(t, err) + } else { + assert.NoError(t, err) + } + }) + } +} + +func TestOAuth2Authenticator_AudienceValidation(t *testing.T) { + ts := newTestOIDCServer(t) + defer ts.Close() + + authenticator, err := NewOAuth2Authenticator(OAuth2Config{ + IssuerURL: ts.URL(), + Audience: "my-api", + }) + require.NoError(t, err) + + tests := []struct { + name string + audience interface{} + expectError bool + }{ + { + name: "matching audience", + audience: "my-api", + expectError: false, + }, + { + name: "audience in array", + audience: []interface{}{"other-api", "my-api"}, + expectError: false, + }, + { + name: "wrong audience", + audience: "wrong-api", + expectError: true, + }, + { + name: "missing audience", + audience: nil, + expectError: true, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + claims := map[string]interface{}{ + "sub": "user123", + } + if tt.audience != nil { + claims["aud"] = tt.audience + } + + token := ts.createToken(t, claims, time.Hour) + headers := http.Header{} + headers.Set("Authorization", "Bearer "+token) + + ctx := context.Background() + _, err := authenticator.Authenticate(ctx, headers, url.Values{}) + + if tt.expectError { + assert.Error(t, err) + } else { + assert.NoError(t, err) + } + }) + } +} + +func TestOAuth2Authenticator_UpstreamAuth(t *testing.T) { + authenticator, err := NewOAuth2Authenticator(OAuth2Config{ + IssuerURL: "https://example.com", + }) + require.NoError(t, err) + + tests := []struct { + name string + session auth.Session + expectedUserID string + expectedAuth string + }{ + { + name: "forwards user ID and token", + session: &OAuth2Session{ + principal: auth.Principal{ + User: auth.User{ + ID: "user123", + }, + }, + accessToken: "test-token", + }, + expectedUserID: "user123", + expectedAuth: "Bearer test-token", + }, + { + name: "nil session", 
+ session: nil, + expectedUserID: "", + expectedAuth: "", + }, + { + name: "session without token", + session: &OAuth2Session{ + principal: auth.Principal{ + User: auth.User{ + ID: "user456", + }, + }, + }, + expectedUserID: "user456", + expectedAuth: "", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + req := httptest.NewRequest(http.MethodGet, "/test", nil) + + err := authenticator.UpstreamAuth(req, tt.session, auth.Principal{}) + require.NoError(t, err) + + if tt.expectedUserID != "" { + assert.Equal(t, tt.expectedUserID, req.Header.Get("X-User-Id")) + } else { + assert.Empty(t, req.Header.Get("X-User-Id")) + } + + if tt.expectedAuth != "" { + assert.Equal(t, tt.expectedAuth, req.Header.Get("Authorization")) + } else { + assert.Empty(t, req.Header.Get("Authorization")) + } + }) + } +} + +func TestExtractBearerToken(t *testing.T) { + tests := []struct { + name string + authorization string + expected string + }{ + { + name: "valid bearer token", + authorization: "Bearer abc123", + expected: "abc123", + }, + { + name: "bearer lowercase", + authorization: "bearer xyz789", + expected: "xyz789", + }, + { + name: "BEARER uppercase", + authorization: "BEARER TOKEN123", + expected: "TOKEN123", + }, + { + name: "empty header", + authorization: "", + expected: "", + }, + { + name: "basic auth", + authorization: "Basic dXNlcjpwYXNz", + expected: "", + }, + { + name: "no token after bearer", + authorization: "Bearer", + expected: "", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + headers := http.Header{} + if tt.authorization != "" { + headers.Set("Authorization", tt.authorization) + } + result := extractBearerToken(headers) + assert.Equal(t, tt.expected, result) + }) + } +} + +func TestOAuth2Session(t *testing.T) { + session := &OAuth2Session{ + principal: auth.Principal{ + User: auth.User{ + ID: "test-user", + Roles: []string{"admin"}, + }, + Agent: auth.Agent{ + ID: "agent-1", + }, + }, + accessToken: "test-token", + claims: map[string]interface{}{ + "sub": "test-user", + "email": "test@example.com", + }, + expiresAt: time.Now().Add(time.Hour), + } + + t.Run("Principal", func(t *testing.T) { + p := session.Principal() + assert.Equal(t, "test-user", p.User.ID) + assert.Equal(t, []string{"admin"}, p.User.Roles) + assert.Equal(t, "agent-1", p.Agent.ID) + }) + + t.Run("AccessToken", func(t *testing.T) { + assert.Equal(t, "test-token", session.AccessToken()) + }) + + t.Run("Claims", func(t *testing.T) { + claims := session.Claims() + assert.Equal(t, "test-user", claims["sub"]) + assert.Equal(t, "test@example.com", claims["email"]) + }) + + t.Run("ExpiresAt", func(t *testing.T) { + assert.False(t, session.ExpiresAt().IsZero()) + assert.True(t, session.ExpiresAt().After(time.Now())) + }) +} + +func TestOAuth2Authenticator_CacheClearing(t *testing.T) { + ts := newTestOIDCServer(t) + defer ts.Close() + + authenticator, err := NewOAuth2Authenticator(OAuth2Config{ + IssuerURL: ts.URL(), + JWKSCacheDuration: time.Hour, + }) + require.NoError(t, err) + + // Make a request to populate the cache + token := ts.createToken(t, map[string]interface{}{ + "sub": "user123", + }, time.Hour) + headers := http.Header{} + headers.Set("Authorization", "Bearer "+token) + + ctx := context.Background() + _, err = authenticator.Authenticate(ctx, headers, url.Values{}) + require.NoError(t, err) + + // Clear the cache + authenticator.ClearCache() + + // Verify cache is cleared by checking internal state + authenticator.jwksMu.RLock() + assert.Nil(t, 
authenticator.jwksCache) + authenticator.jwksMu.RUnlock() + + authenticator.discoveryMu.RLock() + assert.Nil(t, authenticator.discoveryCache) + authenticator.discoveryMu.RUnlock() + + // Should still work after cache is cleared (will refetch) + _, err = authenticator.Authenticate(ctx, headers, url.Values{}) + require.NoError(t, err) +} + +func TestOAuth2Authenticator_SkipValidation(t *testing.T) { + ts := newTestOIDCServer(t) + defer ts.Close() + + t.Run("skip expiry validation", func(t *testing.T) { + authenticator, err := NewOAuth2Authenticator(OAuth2Config{ + IssuerURL: ts.URL(), + SkipExpiryValidation: true, + }) + require.NoError(t, err) + + // Create an expired token + token := ts.createToken(t, map[string]interface{}{ + "sub": "user123", + }, -time.Hour) + + headers := http.Header{} + headers.Set("Authorization", "Bearer "+token) + + ctx := context.Background() + session, err := authenticator.Authenticate(ctx, headers, url.Values{}) + require.NoError(t, err) + assert.Equal(t, "user123", session.Principal().User.ID) + }) + + t.Run("skip issuer validation", func(t *testing.T) { + authenticator, err := NewOAuth2Authenticator(OAuth2Config{ + IssuerURL: ts.URL(), + SkipIssuerValidation: true, + }) + require.NoError(t, err) + + // The token issuer from ts.createToken matches ts.URL(), + // so this should work regardless + token := ts.createToken(t, map[string]interface{}{ + "sub": "user456", + }, time.Hour) + + headers := http.Header{} + headers.Set("Authorization", "Bearer "+token) + + ctx := context.Background() + session, err := authenticator.Authenticate(ctx, headers, url.Values{}) + require.NoError(t, err) + assert.Equal(t, "user456", session.Principal().User.ID) + }) +} + +func TestOAuth2Authenticator_CustomClaims(t *testing.T) { + ts := newTestOIDCServer(t) + defer ts.Close() + + authenticator, err := NewOAuth2Authenticator(OAuth2Config{ + IssuerURL: ts.URL(), + UserIDClaim: "email", + RolesClaim: "groups", + }) + require.NoError(t, err) + + token := ts.createToken(t, map[string]interface{}{ + "sub": "user123", + "email": "test@example.com", + "groups": []interface{}{"developers", "admins"}, + }, time.Hour) + + headers := http.Header{} + headers.Set("Authorization", "Bearer "+token) + + ctx := context.Background() + session, err := authenticator.Authenticate(ctx, headers, url.Values{}) + require.NoError(t, err) + + // Should use email as user ID instead of sub + assert.Equal(t, "test@example.com", session.Principal().User.ID) + // Should use groups as roles instead of roles + assert.Equal(t, []string{"developers", "admins"}, session.Principal().User.Roles) +} + +// Verify interface compliance +var _ auth.Session = (*OAuth2Session)(nil) +var _ auth.AuthProvider = (*OAuth2Authenticator)(nil) diff --git a/go/internal/httpserver/middleware.go b/go/internal/httpserver/middleware.go index 23170352d..bc25eac97 100644 --- a/go/internal/httpserver/middleware.go +++ b/go/internal/httpserver/middleware.go @@ -2,12 +2,163 @@ package httpserver import ( "net/http" + "os" + "regexp" + "strings" "time" + "github.com/google/uuid" "github.com/kagent-dev/kagent/go/internal/httpserver/handlers" + "github.com/kagent-dev/kagent/go/pkg/auth" ctrllog "sigs.k8s.io/controller-runtime/pkg/log" ) +// AuditLogConfig holds configuration for the audit logging middleware +type AuditLogConfig struct { + // Enabled controls whether audit logging is active + Enabled bool + // LogLevel controls the verbosity level (0=Info, 1+=Debug) + LogLevel int + // IncludeHeaders specifies which headers to include in audit logs 
(for compliance) + IncludeHeaders []string +} + +// DefaultAuditLogConfig returns the default audit logging configuration +func DefaultAuditLogConfig() AuditLogConfig { + enabled := os.Getenv("KAGENT_AUDIT_LOG_ENABLED") != "false" + return AuditLogConfig{ + Enabled: enabled, + LogLevel: 0, + IncludeHeaders: []string{}, + } +} + +// requestIDKey is the context key for request ID +type requestIDKey struct{} + +// namespacePattern matches namespace in API paths like /api/agents/{namespace}/{name} +var namespacePattern = regexp.MustCompile(`^/api/[^/]+/([^/]+)(?:/|$)`) + +// auditLoggingMiddleware creates a middleware for structured audit logging +// It logs compliance-ready audit trail with user, namespace, action, result, and duration +func auditLoggingMiddleware(config AuditLogConfig) func(http.Handler) http.Handler { + return func(next http.Handler) http.Handler { + return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if !config.Enabled { + next.ServeHTTP(w, r) + return + } + + start := time.Now() + + // Generate or extract request ID for correlation + requestID := r.Header.Get("X-Request-ID") + if requestID == "" { + requestID = uuid.New().String() + } + + // Extract user information from auth session + userID := "anonymous" + userRoles := []string{} + if session, ok := auth.AuthSessionFrom(r.Context()); ok && session != nil { + principal := session.Principal() + if principal.User.ID != "" { + userID = principal.User.ID + } + userRoles = principal.User.Roles + } + + // Extract namespace from path or header + namespace := extractNamespace(r) + + // Build action string (HTTP method + path pattern) + action := r.Method + " " + r.URL.Path + + // Create audit logger with structured fields + auditLog := ctrllog.Log.WithName("audit").WithValues( + "request_id", requestID, + "timestamp", start.UTC().Format(time.RFC3339Nano), + "user", userID, + "user_roles", userRoles, + "namespace", namespace, + "action", action, + "method", r.Method, + "path", r.URL.Path, + "remote_addr", r.RemoteAddr, + "user_agent", r.Header.Get("User-Agent"), + ) + + // Include specified headers for compliance if configured + for _, header := range config.IncludeHeaders { + if val := r.Header.Get(header); val != "" { + auditLog = auditLog.WithValues("header_"+strings.ToLower(strings.ReplaceAll(header, "-", "_")), val) + } + } + + // Wrap response writer to capture status code + ww := newStatusResponseWriter(w) + + // Log request start at configured level + auditLog.V(config.LogLevel).Info("Audit: request started") + + // Serve the request + next.ServeHTTP(ww, r) + + // Calculate duration + duration := time.Since(start) + + // Determine result category for compliance + resultCategory := categorizeResult(ww.status) + + // Log request completion with full audit trail + auditLog.Info("Audit: request completed", + "status", ww.status, + "result", resultCategory, + "duration_ms", duration.Milliseconds(), + "duration", duration.String(), + ) + }) + } +} + +// extractNamespace extracts the namespace from the request path or headers +func extractNamespace(r *http.Request) string { + // First, try to extract from the URL path pattern + // Patterns like /api/agents/{namespace}/{name} or /api/sessions/agent/{namespace}/{name} + matches := namespacePattern.FindStringSubmatch(r.URL.Path) + if len(matches) > 1 { + return matches[1] + } + + // Try query parameter + if ns := r.URL.Query().Get("namespace"); ns != "" { + return ns + } + + // Try header + if ns := r.Header.Get("X-Namespace"); ns != "" { + return ns + } + + 
return "unknown" +} + +// categorizeResult returns a human-readable result category for the status code +func categorizeResult(status int) string { + switch { + case status >= 200 && status < 300: + return "success" + case status >= 300 && status < 400: + return "redirect" + case status >= 400 && status < 500: + return "client_error" + case status >= 500: + return "server_error" + default: + return "unknown" + } +} + func loggingMiddleware(next http.Handler) http.Handler { return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { start := time.Now() diff --git a/go/internal/httpserver/middleware_audit_test.go b/go/internal/httpserver/middleware_audit_test.go new file mode 100644 index 000000000..897a3dab8 --- /dev/null +++ b/go/internal/httpserver/middleware_audit_test.go @@ -0,0 +1,358 @@ +package httpserver + +import ( + "net/http" + "net/http/httptest" + "testing" + + "github.com/kagent-dev/kagent/go/pkg/auth" +) + +// mockSession implements auth.Session for testing +type mockSession struct { + principal auth.Principal +} + +func (m *mockSession) Principal() auth.Principal { + return m.principal +} + +func TestAuditLoggingMiddleware_Disabled(t *testing.T) { + config := AuditLogConfig{ + Enabled: false, + } + + handlerCalled := false + handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + handlerCalled = true + w.WriteHeader(http.StatusOK) + }) + + middleware := auditLoggingMiddleware(config) + wrappedHandler := middleware(handler) + + req := httptest.NewRequest(http.MethodGet, "/api/agents/default/test", nil) + rec := httptest.NewRecorder() + + wrappedHandler.ServeHTTP(rec, req) + + if !handlerCalled { + t.Error("Expected handler to be called when audit logging is disabled") + } + if rec.Code != http.StatusOK { + t.Errorf("Expected status 200, got %d", rec.Code) + } +} + +func TestAuditLoggingMiddleware_Enabled(t *testing.T) { + config := AuditLogConfig{ + Enabled: true, + LogLevel: 0, + } + + handlerCalled := false + handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + handlerCalled = true + w.WriteHeader(http.StatusCreated) + }) + + middleware := auditLoggingMiddleware(config) + wrappedHandler := middleware(handler) + + req := httptest.NewRequest(http.MethodPost, "/api/agents/my-namespace/my-agent", nil) + rec := httptest.NewRecorder() + + wrappedHandler.ServeHTTP(rec, req) + + if !handlerCalled { + t.Error("Expected handler to be called when audit logging is enabled") + } + if rec.Code != http.StatusCreated { + t.Errorf("Expected status 201, got %d", rec.Code) + } +} + +func TestAuditLoggingMiddleware_WithAuthenticatedUser(t *testing.T) { + config := AuditLogConfig{ + Enabled: true, + LogLevel: 0, + } + + handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.WriteHeader(http.StatusOK) + }) + + middleware := auditLoggingMiddleware(config) + wrappedHandler := middleware(handler) + + // Create request with auth session in context + req := httptest.NewRequest(http.MethodGet, "/api/agents/test-ns/test-agent", nil) + session := &mockSession{ + principal: auth.Principal{ + User: auth.User{ + ID: "test-user-123", + Roles: []string{"admin", "user"}, + }, + }, + } + ctx := auth.AuthSessionTo(req.Context(), session) + req = req.WithContext(ctx) + + rec := httptest.NewRecorder() + wrappedHandler.ServeHTTP(rec, req) + + if rec.Code != http.StatusOK { + t.Errorf("Expected status 200, got %d", rec.Code) + } +} + +func TestAuditLoggingMiddleware_WithRequestID(t *testing.T) { + config := AuditLogConfig{ + Enabled: 
true, + LogLevel: 0, + } + + handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.WriteHeader(http.StatusOK) + }) + + middleware := auditLoggingMiddleware(config) + wrappedHandler := middleware(handler) + + req := httptest.NewRequest(http.MethodGet, "/api/agents", nil) + req.Header.Set("X-Request-ID", "custom-request-id-12345") + rec := httptest.NewRecorder() + + wrappedHandler.ServeHTTP(rec, req) + + if rec.Code != http.StatusOK { + t.Errorf("Expected status 200, got %d", rec.Code) + } +} + +func TestAuditLoggingMiddleware_WithIncludeHeaders(t *testing.T) { + config := AuditLogConfig{ + Enabled: true, + LogLevel: 0, + IncludeHeaders: []string{"X-Correlation-ID", "X-Trace-ID"}, + } + + handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.WriteHeader(http.StatusOK) + }) + + middleware := auditLoggingMiddleware(config) + wrappedHandler := middleware(handler) + + req := httptest.NewRequest(http.MethodGet, "/api/agents", nil) + req.Header.Set("X-Correlation-ID", "corr-12345") + req.Header.Set("X-Trace-ID", "trace-67890") + rec := httptest.NewRecorder() + + wrappedHandler.ServeHTTP(rec, req) + + if rec.Code != http.StatusOK { + t.Errorf("Expected status 200, got %d", rec.Code) + } +} + +func TestExtractNamespace(t *testing.T) { + tests := []struct { + name string + path string + query string + header string + expected string + }{ + { + name: "extract from path - agents", + path: "/api/agents/my-namespace/my-agent", + expected: "my-namespace", + }, + { + name: "extract from path - sessions", + path: "/api/sessions/agent/test-ns/test-agent", + expected: "agent", // The first segment after /api/sessions/ + }, + { + name: "extract from path - tools", + path: "/api/tools/production/my-tool", + expected: "production", + }, + { + name: "extract from query parameter", + path: "/api/agents", + query: "namespace=query-ns", + expected: "query-ns", + }, + { + name: "extract from header", + path: "/api/agents", + header: "header-ns", + expected: "header-ns", + }, + { + name: "fallback to unknown", + path: "/api/agents", + expected: "unknown", + }, + { + name: "path takes precedence over query", + path: "/api/agents/path-ns/agent", + query: "namespace=query-ns", + expected: "path-ns", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + url := tt.path + if tt.query != "" { + url += "?" 
+ tt.query + } + req := httptest.NewRequest(http.MethodGet, url, nil) + if tt.header != "" { + req.Header.Set("X-Namespace", tt.header) + } + + result := extractNamespace(req) + + if result != tt.expected { + t.Errorf("extractNamespace() = %q, want %q", result, tt.expected) + } + }) + } +} + +func TestCategorizeResult(t *testing.T) { + tests := []struct { + status int + expected string + }{ + {200, "success"}, + {201, "success"}, + {204, "success"}, + {299, "success"}, + {301, "redirect"}, + {302, "redirect"}, + {304, "redirect"}, + {400, "client_error"}, + {401, "client_error"}, + {403, "client_error"}, + {404, "client_error"}, + {422, "client_error"}, + {500, "server_error"}, + {502, "server_error"}, + {503, "server_error"}, + {100, "unknown"}, + {199, "unknown"}, + } + + for _, tt := range tests { + t.Run(http.StatusText(tt.status), func(t *testing.T) { + result := categorizeResult(tt.status) + if result != tt.expected { + t.Errorf("categorizeResult(%d) = %q, want %q", tt.status, result, tt.expected) + } + }) + } +} + +func TestDefaultAuditLogConfig(t *testing.T) { + // Test default config (without env var set) + config := DefaultAuditLogConfig() + + if !config.Enabled { + t.Error("Expected audit logging to be enabled by default") + } + if config.LogLevel != 0 { + t.Errorf("Expected LogLevel 0, got %d", config.LogLevel) + } + if len(config.IncludeHeaders) != 0 { + t.Errorf("Expected empty IncludeHeaders, got %v", config.IncludeHeaders) + } +} + +func TestAuditLoggingMiddleware_ErrorStatus(t *testing.T) { + config := AuditLogConfig{ + Enabled: true, + LogLevel: 0, + } + + handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.WriteHeader(http.StatusInternalServerError) + }) + + middleware := auditLoggingMiddleware(config) + wrappedHandler := middleware(handler) + + req := httptest.NewRequest(http.MethodGet, "/api/agents", nil) + rec := httptest.NewRecorder() + + wrappedHandler.ServeHTTP(rec, req) + + if rec.Code != http.StatusInternalServerError { + t.Errorf("Expected status 500, got %d", rec.Code) + } +} + +func TestAuditLoggingMiddleware_AnonymousUser(t *testing.T) { + config := AuditLogConfig{ + Enabled: true, + LogLevel: 0, + } + + handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.WriteHeader(http.StatusOK) + }) + + middleware := auditLoggingMiddleware(config) + wrappedHandler := middleware(handler) + + // Request without auth session + req := httptest.NewRequest(http.MethodGet, "/api/health", nil) + rec := httptest.NewRecorder() + + wrappedHandler.ServeHTTP(rec, req) + + if rec.Code != http.StatusOK { + t.Errorf("Expected status 200, got %d", rec.Code) + } +} + +// TestAuditLoggingMiddleware_AllHTTPMethods tests that all HTTP methods are logged +func TestAuditLoggingMiddleware_AllHTTPMethods(t *testing.T) { + config := AuditLogConfig{ + Enabled: true, + LogLevel: 0, + } + + methods := []string{ + http.MethodGet, + http.MethodPost, + http.MethodPut, + http.MethodDelete, + http.MethodPatch, + } + + for _, method := range methods { + t.Run(method, func(t *testing.T) { + handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.WriteHeader(http.StatusOK) + }) + + middleware := auditLoggingMiddleware(config) + wrappedHandler := middleware(handler) + + req := httptest.NewRequest(method, "/api/agents/ns/agent", nil) + rec := httptest.NewRecorder() + + wrappedHandler.ServeHTTP(rec, req) + + if rec.Code != http.StatusOK { + t.Errorf("Expected status 200 for %s, got %d", method, rec.Code) + } + }) + } +} + diff --git 
a/go/internal/httpserver/server.go b/go/internal/httpserver/server.go index 0875cf025..6be6caa4a 100644 --- a/go/internal/httpserver/server.go +++ b/go/internal/httpserver/server.go @@ -226,6 +226,7 @@ func (s *HTTPServer) setupRoutes() { // Use middleware for common functionality s.router.Use(auth.AuthnMiddleware(s.authenticator)) + s.router.Use(auditLoggingMiddleware(DefaultAuditLogConfig())) s.router.Use(contentTypeMiddleware) s.router.Use(loggingMiddleware) s.router.Use(errorHandlerMiddleware) diff --git a/go/pkg/app/app.go b/go/pkg/app/app.go index 8ee3b9589..312d0392d 100644 --- a/go/pkg/app/app.go +++ b/go/pkg/app/app.go @@ -379,6 +379,7 @@ func Start(getExtensionConfig GetExtensionConfig) { mgr.GetClient(), dbClient, cfg.DefaultModelConfig, + watchNamespacesList, ) if err := (&controller.ServiceController{ diff --git a/helm/kagent/templates/pdb.yaml b/helm/kagent/templates/pdb.yaml new file mode 100644 index 000000000..dcb67d310 --- /dev/null +++ b/helm/kagent/templates/pdb.yaml @@ -0,0 +1,40 @@ +{{- if .Values.pdb.enabled }} +--- +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + name: {{ include "kagent.fullname" . }}-controller + namespace: {{ include "kagent.namespace" . }} + labels: + {{- include "kagent.controller.labels" . | nindent 4 }} +spec: + {{- if .Values.pdb.controller.minAvailable }} + minAvailable: {{ .Values.pdb.controller.minAvailable }} + {{- else if .Values.pdb.controller.maxUnavailable }} + maxUnavailable: {{ .Values.pdb.controller.maxUnavailable }} + {{- else }} + minAvailable: 1 + {{- end }} + selector: + matchLabels: + {{- include "kagent.controller.selectorLabels" . | nindent 6 }} +--- +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + name: {{ include "kagent.fullname" . }}-ui + namespace: {{ include "kagent.namespace" . }} + labels: + {{- include "kagent.ui.labels" . | nindent 4 }} +spec: + {{- if .Values.pdb.ui.minAvailable }} + minAvailable: {{ .Values.pdb.ui.minAvailable }} + {{- else if .Values.pdb.ui.maxUnavailable }} + maxUnavailable: {{ .Values.pdb.ui.maxUnavailable }} + {{- else }} + minAvailable: 1 + {{- end }} + selector: + matchLabels: + {{- include "kagent.ui.selectorLabels" . | nindent 6 }} +{{- end }} diff --git a/helm/kagent/templates/prometheusrule.yaml b/helm/kagent/templates/prometheusrule.yaml new file mode 100644 index 000000000..c59a6d96a --- /dev/null +++ b/helm/kagent/templates/prometheusrule.yaml @@ -0,0 +1,65 @@ +{{- if and .Values.metrics.enabled .Values.metrics.prometheusRule.enabled }} +apiVersion: monitoring.coreos.com/v1 +kind: PrometheusRule +metadata: + name: {{ include "kagent.fullname" . }}-controller + namespace: {{ include "kagent.namespace" . }} + labels: + {{- include "kagent.controller.labels" . | nindent 4 }} + {{- with .Values.metrics.prometheusRule.labels }} + {{- toYaml . | nindent 4 }} + {{- end }} +spec: + groups: + - name: kagent-controller + rules: + {{- if .Values.metrics.prometheusRule.defaultRules.controllerDown }} + - alert: KagentControllerDown + expr: absent(up{job="{{ include "kagent.fullname" . }}-controller"} == 1) + for: 5m + labels: + severity: critical + {{- with .Values.metrics.prometheusRule.alertLabels }} + {{- toYaml . | nindent 12 }} + {{- end }} + annotations: + summary: Kagent controller is down + description: "Kagent controller has been down for more than 5 minutes." + {{- end }} + {{- if .Values.metrics.prometheusRule.defaultRules.highErrorRate }} + - alert: KagentHighErrorRate + expr: | + ( + sum(rate(http_requests_total{job="{{ include "kagent.fullname" . 
}}-controller", status=~"5.."}[5m])) + / + sum(rate(http_requests_total{job="{{ include "kagent.fullname" . }}-controller"}[5m])) + ) > {{ .Values.metrics.prometheusRule.thresholds.errorRatePercent | default 0.05 }} + for: 5m + labels: + severity: warning + {{- with .Values.metrics.prometheusRule.alertLabels }} + {{- toYaml . | nindent 12 }} + {{- end }} + annotations: + summary: Kagent controller high error rate + description: "Kagent controller error rate is above {{ mul (.Values.metrics.prometheusRule.thresholds.errorRatePercent | default 0.05) 100 }}% for more than 5 minutes." + {{- end }} + {{- if .Values.metrics.prometheusRule.defaultRules.highLatency }} + - alert: KagentHighLatency + expr: | + histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="{{ include "kagent.fullname" . }}-controller"}[5m])) by (le)) + > {{ .Values.metrics.prometheusRule.thresholds.latencySeconds | default 5 }} + for: 5m + labels: + severity: warning + {{- with .Values.metrics.prometheusRule.alertLabels }} + {{- toYaml . | nindent 12 }} + {{- end }} + annotations: + summary: Kagent controller high latency + description: "Kagent controller 99th percentile latency is above {{ .Values.metrics.prometheusRule.thresholds.latencySeconds | default 5 }}s for more than 5 minutes." + {{- end }} + {{- with .Values.metrics.prometheusRule.additionalRules }} + {{- toYaml . | nindent 8 }} + {{- end }} +{{- end }} diff --git a/helm/kagent/templates/servicemonitor.yaml b/helm/kagent/templates/servicemonitor.yaml new file mode 100644 index 000000000..c8ac2083c --- /dev/null +++ b/helm/kagent/templates/servicemonitor.yaml @@ -0,0 +1,41 @@ +{{- if and .Values.metrics.enabled .Values.metrics.serviceMonitor.enabled }} +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: {{ include "kagent.fullname" . }}-controller + namespace: {{ include "kagent.namespace" . }} + labels: + {{- include "kagent.controller.labels" . | nindent 4 }} + {{- with .Values.metrics.serviceMonitor.labels }} + {{- toYaml . | nindent 4 }} + {{- end }} +spec: + selector: + matchLabels: + {{- include "kagent.controller.selectorLabels" . | nindent 6 }} + namespaceSelector: + matchNames: + - {{ include "kagent.namespace" . }} + endpoints: + - port: controller + {{- with .Values.metrics.serviceMonitor.interval }} + interval: {{ . }} + {{- end }} + {{- with .Values.metrics.serviceMonitor.scrapeTimeout }} + scrapeTimeout: {{ . }} + {{- end }} + path: {{ .Values.metrics.serviceMonitor.path | default "/metrics" }} + scheme: {{ .Values.metrics.serviceMonitor.scheme | default "http" }} + {{- with .Values.metrics.serviceMonitor.tlsConfig }} + tlsConfig: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.metrics.serviceMonitor.relabelings }} + relabelings: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.metrics.serviceMonitor.metricRelabelings }} + metricRelabelings: + {{- toYaml . 
| nindent 8 }} + {{- end }} +{{- end }} diff --git a/helm/kagent/tests/pdb_test.yaml b/helm/kagent/tests/pdb_test.yaml new file mode 100644 index 000000000..933e2ee83 --- /dev/null +++ b/helm/kagent/tests/pdb_test.yaml @@ -0,0 +1,158 @@ +suite: test pod disruption budgets +templates: + - pdb.yaml +tests: + - it: should not render PDBs when disabled + set: + pdb: + enabled: false + asserts: + - hasDocuments: + count: 0 + + - it: should render PDBs when enabled + set: + pdb: + enabled: true + controller: + minAvailable: 1 + ui: + minAvailable: 1 + asserts: + - hasDocuments: + count: 2 + + - it: should render controller PDB with correct name and labels + set: + pdb: + enabled: true + controller: + minAvailable: 1 + ui: + minAvailable: 1 + documentIndex: 0 + asserts: + - isKind: + of: PodDisruptionBudget + - equal: + path: metadata.name + value: RELEASE-NAME-controller + - equal: + path: metadata.labels["app.kubernetes.io/component"] + value: controller + + - it: should render ui PDB with correct name and labels + set: + pdb: + enabled: true + controller: + minAvailable: 1 + ui: + minAvailable: 1 + documentIndex: 1 + asserts: + - isKind: + of: PodDisruptionBudget + - equal: + path: metadata.name + value: RELEASE-NAME-ui + - equal: + path: metadata.labels["app.kubernetes.io/component"] + value: ui + + - it: should use minAvailable for controller when specified + set: + pdb: + enabled: true + controller: + minAvailable: 2 + ui: + minAvailable: 1 + documentIndex: 0 + asserts: + - equal: + path: spec.minAvailable + value: 2 + + - it: should use maxUnavailable for controller when specified + set: + pdb: + enabled: true + controller: + maxUnavailable: 1 + ui: + minAvailable: 1 + documentIndex: 0 + asserts: + - equal: + path: spec.maxUnavailable + value: 1 + + - it: should use minAvailable for ui when specified + set: + pdb: + enabled: true + controller: + minAvailable: 1 + ui: + minAvailable: 2 + documentIndex: 1 + asserts: + - equal: + path: spec.minAvailable + value: 2 + + - it: should use maxUnavailable for ui when specified + set: + pdb: + enabled: true + controller: + minAvailable: 1 + ui: + maxUnavailable: 1 + documentIndex: 1 + asserts: + - equal: + path: spec.maxUnavailable + value: 1 + + - it: should default to minAvailable 1 when neither is specified for controller + set: + pdb: + enabled: true + controller: {} + ui: + minAvailable: 1 + documentIndex: 0 + asserts: + - equal: + path: spec.minAvailable + value: 1 + + - it: should match controller selector labels + set: + pdb: + enabled: true + controller: + minAvailable: 1 + ui: + minAvailable: 1 + documentIndex: 0 + asserts: + - equal: + path: spec.selector.matchLabels["app.kubernetes.io/component"] + value: controller + + - it: should match ui selector labels + set: + pdb: + enabled: true + controller: + minAvailable: 1 + ui: + minAvailable: 1 + documentIndex: 1 + asserts: + - equal: + path: spec.selector.matchLabels["app.kubernetes.io/component"] + value: ui diff --git a/helm/kagent/values.yaml b/helm/kagent/values.yaml index 1bb8168e6..0aa26dd00 100644 --- a/helm/kagent/values.yaml +++ b/helm/kagent/values.yaml @@ -46,6 +46,26 @@ tolerations: [] # -- Node labels to match for `Pod` [scheduling](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/). 
nodeSelector: {} +# ============================================================================== +# HIGH AVAILABILITY CONFIGURATION +# ============================================================================== + +pdb: + # -- Enable PodDisruptionBudgets for controller and UI deployments + enabled: false + controller: + # -- Minimum number of controller pods that must be available during disruption + # Only one of minAvailable or maxUnavailable can be set + minAvailable: 1 + # -- Maximum number of controller pods that can be unavailable during disruption + # maxUnavailable: 1 + ui: + # -- Minimum number of UI pods that must be available during disruption + # Only one of minAvailable or maxUnavailable can be set + minAvailable: 1 + # -- Maximum number of UI pods that can be unavailable during disruption + # maxUnavailable: 1 + # ============================================================================== # DATABASE CONFIGURATION # ============================================================================== @@ -76,8 +96,11 @@ controller: maxBufSize: 1Mi # 1024 * 1024 initialBufSize: 4Ki # 4 * 1024 timeout: 600s # 600 seconds - # -- Namespaces the controller should watch. - # If empty, the controller will watch ALL available namespaces. + # -- Namespaces the controller should watch (namespace-scoped mode). + # When set, the controller enforces namespace isolation and only reconciles + # resources within the specified namespaces. This enables multi-tenant + # deployments where different tenants' resources are isolated. + # If empty, the controller watches ALL namespaces (cluster-wide mode). # @default -- [] (watches all available namespaces) watchNamespaces: [] # - watch-ns-1 @@ -380,3 +403,64 @@ otel: endpoint: http://host.docker.internal:4317 timeout: 15 insecure: true + +# ============================================================================== +# MONITORING CONFIGURATION (Prometheus Operator) +# ============================================================================== + +metrics: + # -- Enable metrics (controller exposes metrics by default) + enabled: true + + serviceMonitor: + # -- Enable ServiceMonitor for Prometheus Operator + enabled: false + # -- Additional labels for the ServiceMonitor + labels: {} + # -- Scrape interval (e.g., "30s") + # @default -- Prometheus default + interval: "" + # -- Scrape timeout (e.g., "10s") + # @default -- Prometheus default + scrapeTimeout: "" + # -- Metrics path + path: /metrics + # -- Scheme to use for scraping (http or https) + scheme: http + # -- TLS configuration for scraping + tlsConfig: {} + # -- RelabelConfigs to apply to samples before scraping + relabelings: [] + # -- MetricRelabelConfigs to apply to samples before ingestion + metricRelabelings: [] + + prometheusRule: + # -- Enable PrometheusRule for alerting + enabled: false + # -- Additional labels for the PrometheusRule + labels: {} + # -- Additional labels to add to all alerts + alertLabels: {} + # -- Enable/disable default alerting rules + defaultRules: + # -- Alert when controller is down + controllerDown: true + # -- Alert on high error rate + highErrorRate: true + # -- Alert on high latency + highLatency: true + # -- Thresholds for default alerts + thresholds: + # -- Error rate threshold (0.05 = 5%) + errorRatePercent: 0.05 + # -- Latency threshold in seconds (p99) + latencySeconds: 5 + # -- Additional custom alerting rules + additionalRules: [] + # - alert: CustomAlert + # expr: some_metric > 100 + # for: 5m + # labels: + # severity: warning + # annotations: + # 
summary: Custom alert triggered From 4a25b6d73e350842105f3f1e90e673e54bd56886 Mon Sep 17 00:00:00 2001 From: Boden Fuller Date: Thu, 1 Jan 2026 21:09:41 -0500 Subject: [PATCH 2/2] docs: add EP-476 design proposal and streamline documentation - Add design/EP-476-enterprise-enablement.md with full feature spec - Condense OpenShift deployment guide to match kagent style - Merge security context docs into OpenShift guide - Remove verbose reference docs (better suited for kagent.dev) The design proposal documents: - OAuth2/OIDC authentication requirements - Namespace isolation architecture - Audit logging specification - HA and observability patterns Signed-off-by: Boden Fuller --- design/EP-476-enterprise-enablement.md | 117 ++++ docs/openshift-deployment-guide.md | 579 ++---------------- docs/security-context.md | 302 --------- .../httpserver/middleware_audit_test.go | 11 +- 4 files changed, 180 insertions(+), 829 deletions(-) create mode 100644 design/EP-476-enterprise-enablement.md delete mode 100644 docs/security-context.md diff --git a/design/EP-476-enterprise-enablement.md b/design/EP-476-enterprise-enablement.md new file mode 100644 index 000000000..e1348b775 --- /dev/null +++ b/design/EP-476-enterprise-enablement.md @@ -0,0 +1,117 @@ +# EP-476: Enterprise Enablement + +* Issue: [#476](https://github.com/kagent-dev/kagent/issues/476) + +## Background + +This EP addresses enterprise deployment requirements for kagent, specifically authentication, multi-tenancy, and audit logging. These features are prerequisites for production deployment in regulated environments. + +## Motivation + +Currently kagent has limited support for enterprise deployment scenarios: + +1. **Authentication**: Only `UnsecureAuthenticator` is implemented +2. **Multi-tenancy**: Controller operates cluster-wide with no namespace isolation +3. **Audit logging**: Basic HTTP logging without compliance-ready audit trail + +### Goals + +1. Implement OAuth2/OIDC authentication provider +2. Add namespace-scoped controller mode for multi-tenancy +3. Add structured audit logging for compliance + +### Non-Goals + +- RBAC authorization (future work) +- Multi-cluster support (separate EP) +- Air-gapped installation (documentation only) + +## Implementation Details + +### OAuth2/OIDC Authentication + +New `OAuth2Authenticator` implementing `auth.AuthProvider`: + +```go +// go/internal/httpserver/auth/oauth2.go +type OAuth2Config struct { + IssuerURL string + ClientID string + Audience string + RequiredScopes []string + UserIDClaim string // default: "sub" + RolesClaim string // default: "roles" +} +``` + +Features: +- JWT validation with JWKS caching +- Configurable claims extraction +- Scope and audience validation +- Bearer token from header or query parameter + +### Namespace-Scoped Controller + +Add `watchedNamespaces` parameter to reconciler: + +```go +// go/internal/controller/reconciler/reconciler.go +type kagentReconciler struct { + // If empty, cluster-wide. If set, only these namespaces. + watchedNamespaces []string +} + +func (a *kagentReconciler) validateNamespaceIsolation(namespace string) error +``` + +All reconcile methods call `validateNamespaceIsolation()` before processing. 
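+
+As a minimal sketch (assuming the `watchedNamespaces` field above; the loop and
+error wording are illustrative, not the final implementation):
+
+```go
+func (a *kagentReconciler) validateNamespaceIsolation(namespace string) error {
+	// An empty list means cluster-wide mode: nothing to enforce.
+	if len(a.watchedNamespaces) == 0 {
+		return nil
+	}
+	for _, ns := range a.watchedNamespaces {
+		if ns == namespace {
+			return nil
+		}
+	}
+	return fmt.Errorf("namespace %q is outside the watched namespaces", namespace)
+}
+```
+
+Enforcing the boundary inside the reconciler, rather than relying on watch
+filtering alone, keeps isolation intact even if a cache predicate is
+misconfigured.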
+ +### Structured Audit Logging + +Middleware that logs compliance-ready JSON: + +```go +// go/internal/httpserver/middleware.go +type AuditLogConfig struct { + Enabled bool + LogLevel int + IncludeHeaders []string +} +``` + +Logged fields: `request_id`, `timestamp`, `user`, `user_roles`, `namespace`, `action`, `status`, `duration_ms` + +### Helm Configuration + +```yaml +# values.yaml +controller: + watchNamespaces: [] # empty = cluster-wide + +pdb: + enabled: false + controller: + minAvailable: 1 + +metrics: + serviceMonitor: + enabled: false +``` + +### Test Plan + +- Unit tests for OAuth2 token validation (8 test cases) +- Unit tests for namespace isolation (15 test cases) +- Unit tests for audit middleware (11 test cases) +- Integration tests with mock OIDC server + +## Alternatives + +1. **Use existing auth middleware**: Rejected - kagent needs session-based auth for A2A protocol +2. **Namespace isolation via RBAC only**: Rejected - controller still needs to enforce boundaries +3. **External audit logging**: Considered - middleware approach is simpler and integrates with existing logging + +## Open Questions + +1. Should OAuth2 config be a CRD or Helm values? (Currently Helm values) +2. Integration with OpenShift OAuth server? (Future work) diff --git a/docs/openshift-deployment-guide.md b/docs/openshift-deployment-guide.md index fa0a26b9d..55fd49541 100644 --- a/docs/openshift-deployment-guide.md +++ b/docs/openshift-deployment-guide.md @@ -1,575 +1,112 @@ # OpenShift Deployment Guide -This guide covers deploying kagent on Red Hat OpenShift Container Platform. OpenShift provides additional security features and enterprise capabilities on top of Kubernetes, which require specific configuration considerations. - -## Table of Contents - -- [Prerequisites](#prerequisites) -- [OpenShift vs Kubernetes: Key Differences](#openshift-vs-kubernetes-key-differences) -- [Security Context Constraints (SCCs)](#security-context-constraints-sccs) - - [Understanding SCCs](#understanding-sccs) - - [Configuring kagent for SCCs](#configuring-kagent-for-sccs) - - [Creating Custom SCCs](#creating-custom-sccs) -- [Pod Security Standards (PSS)](#pod-security-standards-pss) - - [restricted-v2 Compatibility](#restricted-v2-compatibility) - - [Configuring Security Contexts](#configuring-security-contexts) -- [Routes vs Ingress](#routes-vs-ingress) - - [Automatic Route Creation](#automatic-route-creation) - - [Manual Route Configuration](#manual-route-configuration) - - [TLS Configuration](#tls-configuration) -- [Namespace Patterns and Multi-Tenancy](#namespace-patterns-and-multi-tenancy) - - [Single-Namespace Deployment](#single-namespace-deployment) - - [Multi-Namespace Deployment](#multi-namespace-deployment) - - [Project-Based Isolation](#project-based-isolation) -- [Deployment Scenarios](#deployment-scenarios) - - [Basic Deployment](#basic-deployment) - - [Production Deployment](#production-deployment) - - [Air-Gapped Deployment](#air-gapped-deployment) -- [Troubleshooting](#troubleshooting) +This guide covers OpenShift-specific deployment considerations for kagent. 
## Prerequisites -Before deploying kagent on OpenShift, ensure you have: +- OpenShift 4.12+ +- `oc` CLI configured +- Helm 3.x -- OpenShift Container Platform 4.12 or later -- `oc` CLI installed and configured -- Cluster-admin or project-admin privileges (depending on deployment scope) -- Helm 3.x installed -- Access to container images (either from `cr.kagent.dev` or your internal registry) - -## OpenShift vs Kubernetes: Key Differences - -OpenShift extends Kubernetes with additional security and operational features: - -| Feature | Kubernetes | OpenShift | -|---------|-----------|-----------| -| Ingress | Ingress resources | Routes (native), Ingress (supported) | -| Security | Pod Security Standards | Security Context Constraints (SCCs) | -| Projects | Namespaces | Projects (namespaces with additional annotations) | -| Registry | External registries | Built-in integrated registry | -| Networking | CNI plugins | OpenShift SDN or OVN-Kubernetes | - -## Security Context Constraints (SCCs) - -### Understanding SCCs - -Security Context Constraints (SCCs) are OpenShift's mechanism for controlling pod security. They define what actions pods can perform and what resources they can access. SCCs are more granular than Kubernetes Pod Security Standards. - -OpenShift includes several built-in SCCs: - -| SCC | Description | Use Case | -|-----|-------------|----------| -| `restricted-v2` | Most restrictive, default for authenticated users | Standard workloads | -| `restricted` | Legacy restricted policy | Backward compatibility | -| `nonroot-v2` | Allows any non-root UID | Workloads needing specific UIDs | -| `hostnetwork` | Allows host networking | Networking components | -| `privileged` | Full access | System components only | - -### Configuring kagent for SCCs - -The kagent Helm chart supports security context configuration through `values.yaml`. For most OpenShift deployments, the default `restricted-v2` SCC works well with proper security context settings. 
- -To verify which SCC your pods are using: - -```bash -oc get pod -n kagent -o yaml | grep scc -``` - -Configure security contexts in your `values.yaml`: - -```yaml -# Pod-level security context (applies to all containers in the pod) -podSecurityContext: - runAsNonRoot: true - seccompProfile: - type: RuntimeDefault - -# Container-level security context -securityContext: - allowPrivilegeEscalation: false - capabilities: - drop: - - ALL - runAsNonRoot: true - seccompProfile: - type: RuntimeDefault - -# Controller-specific security context (optional override) -controller: - securityContext: - allowPrivilegeEscalation: false - capabilities: - drop: - - ALL - runAsNonRoot: true -``` - -### Creating Custom SCCs - -If your deployment requires capabilities not provided by built-in SCCs, create a custom SCC: - -```yaml -apiVersion: security.openshift.io/v1 -kind: SecurityContextConstraints -metadata: - name: kagent-scc -allowHostDirVolumePlugin: false -allowHostIPC: false -allowHostNetwork: false -allowHostPID: false -allowHostPorts: false -allowPrivilegeEscalation: false -allowPrivilegedContainer: false -allowedCapabilities: [] -defaultAddCapabilities: [] -fsGroup: - type: MustRunAs -readOnlyRootFilesystem: false -requiredDropCapabilities: - - ALL -runAsUser: - type: MustRunAsNonRoot -seLinuxContext: - type: MustRunAs -supplementalGroups: - type: RunAsAny -volumes: - - configMap - - downwardAPI - - emptyDir - - persistentVolumeClaim - - projected - - secret -users: - - system:serviceaccount:kagent:kagent-controller - - system:serviceaccount:kagent:kagent-ui -``` - -Apply the custom SCC: +## Installation ```bash -oc apply -f kagent-scc.yaml -``` - -## Pod Security Standards (PSS) - -### restricted-v2 Compatibility - -OpenShift 4.12+ uses the `restricted-v2` SCC by default, which aligns with the Kubernetes Pod Security Standards `restricted` profile. The kagent Helm chart is designed to be compatible with this profile. - -Key requirements for `restricted-v2` compatibility: - -1. **Non-root execution**: Containers must run as non-root users -2. **Read-only root filesystem**: Recommended but not required -3. **No privilege escalation**: `allowPrivilegeEscalation: false` -4. **Dropped capabilities**: All capabilities should be dropped -5. **Seccomp profile**: Must use `RuntimeDefault` or `Localhost` - -### Configuring Security Contexts - -For full `restricted-v2` compliance, use this configuration: - -```yaml -podSecurityContext: - runAsNonRoot: true - seccompProfile: - type: RuntimeDefault - -securityContext: - allowPrivilegeEscalation: false - capabilities: - drop: - - ALL - readOnlyRootFilesystem: true - runAsNonRoot: true - seccompProfile: - type: RuntimeDefault -``` - -Note: If using SQLite storage (the default), the controller needs write access to the SQLite volume. The chart mounts this as an emptyDir, which works with read-only root filesystem. - -## Routes vs Ingress - -### Automatic Route Creation - -The kagent Helm chart automatically creates an OpenShift Route when deployed on OpenShift. This is detected via the `route.openshift.io/v1` API availability. 
- -The Route is created for the UI service with the following defaults: - -- TLS termination: Edge -- Insecure traffic: Redirect to HTTPS +# Create namespace +oc new-project kagent -To verify the Route was created: +# Install CRDs +helm install kagent-crds ./helm/kagent-crds/ -n kagent -```bash -oc get route -n kagent +# Install kagent with OpenShift Route +helm install kagent ./helm/kagent/ -n kagent \ + --set providers.openAI.apiKey=$OPENAI_API_KEY \ + --set openshift.enabled=true ``` -### Manual Route Configuration +## Security Context Constraints -If you need custom Route configuration, you can disable the automatic Route and create your own: +kagent runs with `restricted-v2` SCC by default. No special SCCs required. -First, examine the auto-generated Route: +For custom SCCs: ```bash -oc get route kagent-ui -n kagent -o yaml -``` - -Create a custom Route: - -```yaml -apiVersion: route.openshift.io/v1 -kind: Route -metadata: - name: kagent-ui - namespace: kagent - labels: - app.kubernetes.io/name: kagent - app.kubernetes.io/component: ui -spec: - host: kagent.apps.your-cluster.example.com - to: - kind: Service - name: kagent-ui - weight: 100 - port: - targetPort: 8080 - tls: - termination: edge - insecureEdgeTerminationPolicy: Redirect - wildcardPolicy: None -``` +# View current SCC +oc get pod -n kagent -o yaml | grep -A5 securityContext -### TLS Configuration - -For production deployments, configure TLS certificates: - -**Using cluster default certificate (edge termination):** - -```yaml -spec: - tls: - termination: edge - insecureEdgeTerminationPolicy: Redirect +# Grant specific SCC (if needed) +oc adm policy add-scc-to-user anyuid -z kagent-controller -n kagent ``` -**Using custom certificate:** +## Routes -```yaml -spec: - tls: - termination: edge - certificate: | - -----BEGIN CERTIFICATE----- - ... - -----END CERTIFICATE----- - key: | - -----BEGIN RSA PRIVATE KEY----- - ... - -----END RSA PRIVATE KEY----- - caCertificate: | - -----BEGIN CERTIFICATE----- - ... 
- -----END CERTIFICATE----- - insecureEdgeTerminationPolicy: Redirect -``` - -**Passthrough to application TLS:** +The Helm chart creates an OpenShift Route when `openshift.enabled=true`: ```yaml -spec: - tls: - termination: passthrough +# values.yaml +openshift: + enabled: true + route: + host: kagent.apps.example.com # optional + tls: + termination: edge ``` -## Namespace Patterns and Multi-Tenancy - -### Single-Namespace Deployment - -The simplest deployment pattern puts all kagent components in a single namespace: +Access the UI: ```bash -# Create project (OpenShift's enhanced namespace) -oc new-project kagent - -# Install CRDs (cluster-scoped, requires cluster-admin) -helm install kagent-crds ./helm/kagent-crds -n kagent - -# Install kagent -helm install kagent ./helm/kagent -n kagent \ - --set providers.openAI.apiKey=$OPENAI_API_KEY -``` - -### Multi-Namespace Deployment - -For larger deployments, you may want to separate components: - -```yaml -# values-multi-namespace.yaml - -# Override namespace for CRDs (installed separately) -namespaceOverride: "" - -# Controller watches specific namespaces -controller: - watchNamespaces: - - kagent-agents-dev - - kagent-agents-staging - - kagent-agents-prod +oc get route kagent-ui -n kagent -o jsonpath='{.spec.host}' ``` -Deploy agents in separate namespaces: +## Pod Security Standards -```bash -# Create agent namespaces -oc new-project kagent-agents-dev -oc new-project kagent-agents-staging -oc new-project kagent-agents-prod - -# Label namespaces for controller discovery -oc label namespace kagent-agents-dev kagent.dev/managed=true -oc label namespace kagent-agents-staging kagent.dev/managed=true -oc label namespace kagent-agents-prod kagent.dev/managed=true -``` +kagent is compatible with PSS `restricted` profile: -### Project-Based Isolation +| Setting | Value | +|---------|-------| +| `runAsNonRoot` | true | +| `allowPrivilegeEscalation` | false | +| `capabilities.drop` | ALL | +| `seccompProfile` | RuntimeDefault | -OpenShift Projects provide additional isolation through: - -1. **Network Policies**: Automatic network isolation between projects -2. **RBAC**: Project-scoped roles and bindings -3. 
**Resource Quotas**: Per-project resource limits - -Configure network policies to allow kagent controller communication: - -```yaml -apiVersion: networking.k8s.io/v1 -kind: NetworkPolicy -metadata: - name: allow-kagent-controller - namespace: kagent-agents-dev -spec: - podSelector: {} - policyTypes: - - Ingress - ingress: - - from: - - namespaceSelector: - matchLabels: - kubernetes.io/metadata.name: kagent -``` - -## Deployment Scenarios - -### Basic Deployment - -For development or testing environments: - -```bash -# Create project -oc new-project kagent - -# Install CRDs -helm install kagent-crds ./helm/kagent-crds -n kagent - -# Install with minimal configuration -helm install kagent ./helm/kagent -n kagent \ - --set providers.default=ollama \ - --set providers.ollama.config.host=ollama.kagent.svc.cluster.local:11434 -``` - -### Production Deployment - -For production environments, consider these settings: +## High Availability ```yaml -# values-production.yaml - -# Use PostgreSQL for persistence -database: - type: postgres - postgres: - url: postgres://user:pass@postgresql.kagent.svc.cluster.local:5432/kagent - -# Security hardening -podSecurityContext: - runAsNonRoot: true - seccompProfile: - type: RuntimeDefault - -securityContext: - allowPrivilegeEscalation: false - capabilities: - drop: - - ALL - readOnlyRootFilesystem: true - runAsNonRoot: true - seccompProfile: - type: RuntimeDefault - -# Resource limits +# values-ha.yaml controller: - resources: - requests: - cpu: 200m - memory: 256Mi - limits: - cpu: 2 - memory: 1Gi replicas: 2 ui: - resources: - requests: - cpu: 100m - memory: 256Mi - limits: - cpu: 1 - memory: 512Mi replicas: 2 -# Enable observability -otel: - tracing: - enabled: true - exporter: - otlp: - endpoint: http://jaeger-collector.observability.svc.cluster.local:4317 - insecure: true -``` - -Deploy with production values: - -```bash -helm install kagent ./helm/kagent -n kagent \ - -f values-production.yaml \ - --set providers.openAI.apiKeySecretRef=kagent-openai +pdb: + enabled: true + controller: + minAvailable: 1 + ui: + minAvailable: 1 ``` -### Air-Gapped Deployment - -For disconnected or air-gapped environments: - -1. **Mirror images to internal registry:** - ```bash -# Pull images -podman pull cr.kagent.dev/kagent-dev/kagent/controller:latest -podman pull cr.kagent.dev/kagent-dev/kagent/ui:latest -podman pull cr.kagent.dev/kagent-dev/kagent/app:latest - -# Tag for internal registry -podman tag cr.kagent.dev/kagent-dev/kagent/controller:latest \ - registry.internal.example.com/kagent/controller:latest - -# Push to internal registry -podman push registry.internal.example.com/kagent/controller:latest -``` - -2. **Configure Helm values for internal registry:** - -```yaml -# values-airgapped.yaml - -registry: registry.internal.example.com/kagent - -imagePullSecrets: - - name: internal-registry-secret - -controller: - image: - registry: registry.internal.example.com - repository: kagent/controller - -ui: - image: - registry: registry.internal.example.com - repository: kagent/ui -``` - -3. **Create image pull secret:** - -```bash -oc create secret docker-registry internal-registry-secret \ - --docker-server=registry.internal.example.com \ - --docker-username=user \ - --docker-password=password \ - -n kagent +helm upgrade kagent ./helm/kagent/ -n kagent -f values-ha.yaml ``` ## Troubleshooting -### Common Issues - -**1. Pods stuck in CrashLoopBackOff with permission errors** - -This usually indicates SCC issues. 
Check which SCC is being used: - -```bash -oc get pod -n kagent -o yaml | grep scc -oc adm policy who-can use scc restricted-v2 -``` - -**2. Route not accessible** - -Verify the Route is created and has an assigned host: - ```bash -oc get route -n kagent -oc describe route kagent-ui -n kagent -``` - -Check if the router pods are running: - -```bash -oc get pods -n openshift-ingress -``` +# Check pod status +oc get pods -n kagent -**3. Controller cannot watch other namespaces** - -Ensure proper RBAC is configured for cross-namespace access: - -```bash -oc get clusterrolebinding | grep kagent -oc auth can-i list agents --as=system:serviceaccount:kagent:kagent-controller -n other-namespace -``` - -**4. Image pull errors in air-gapped environment** - -Verify the image pull secret is correctly configured: - -```bash -oc get secret internal-registry-secret -n kagent -oc describe pod -n kagent | grep -A5 "Events" -``` - -### Useful Commands - -```bash -# View all kagent resources -oc get all -n kagent - -# Check pod security context -oc get pod -n kagent -o jsonpath='{.spec.securityContext}' - -# View SCC for a specific service account -oc adm policy who-can use scc restricted-v2 -n kagent - -# Debug network connectivity -oc debug pod/ -n kagent +# Check SCC violations +oc get events -n kagent | grep -i scc # View controller logs -oc logs -f deployment/kagent-controller -n kagent +oc logs -l app.kubernetes.io/component=controller -n kagent -# View UI logs -oc logs -f deployment/kagent-ui -n kagent +# Check Route status +oc describe route kagent-ui -n kagent ``` -### Getting Help - -If you encounter issues: +## See Also -1. Check the [kagent documentation](https://kagent.dev/docs/kagent) -2. Search existing [GitHub issues](https://github.com/kagent-dev/kagent/issues) -3. Join the [CNCF Slack #kagent channel](https://cloud-native.slack.com/archives/C08ETST0076) -4. Join the [kagent Discord server](https://discord.gg/Fu3k65f2k3) +- [Installation Guide](https://kagent.dev/docs/kagent/introduction/installation) +- [Helm Chart README](../helm/README.md) diff --git a/docs/security-context.md b/docs/security-context.md deleted file mode 100644 index b32b9e7b4..000000000 --- a/docs/security-context.md +++ /dev/null @@ -1,302 +0,0 @@ -# Security Context Configuration - -This document describes how kagent handles security contexts for agent deployments, including compatibility with OpenShift Security Context Constraints (SCC) and Kubernetes Pod Security Standards (PSS). - -## Overview - -kagent allows users to configure security contexts at both the pod and container levels through the Agent CRD. These settings are passed through to the generated Kubernetes Deployment, enabling compliance with various security policies. 
- -## Configuration Options - -### Pod Security Context - -Configure pod-level security settings via `spec.declarative.deployment.podSecurityContext`: - -```yaml -apiVersion: kagent.dev/v1alpha2 -kind: Agent -metadata: - name: secure-agent -spec: - type: Declarative - declarative: - systemMessage: "A secure agent" - modelConfig: my-model - deployment: - podSecurityContext: - runAsNonRoot: true - runAsUser: 1000 - runAsGroup: 1000 - fsGroup: 1000 - seccompProfile: - type: RuntimeDefault -``` - -### Container Security Context - -Configure container-level security settings via `spec.declarative.deployment.securityContext`: - -```yaml -apiVersion: kagent.dev/v1alpha2 -kind: Agent -metadata: - name: secure-agent -spec: - type: Declarative - declarative: - systemMessage: "A secure agent" - modelConfig: my-model - deployment: - securityContext: - runAsNonRoot: true - runAsUser: 1000 - runAsGroup: 1000 - allowPrivilegeEscalation: false - readOnlyRootFilesystem: true - capabilities: - drop: - - ALL - seccompProfile: - type: RuntimeDefault -``` - -## OpenShift SCC Compatibility - -kagent supports deployment under various OpenShift Security Context Constraints. Below is a compatibility matrix showing which SCCs can be used with different kagent configurations. - -### SCC Compatibility Matrix - -| SCC | Basic Agent | Agent with Skills | Notes | -|-----|-------------|-------------------|-------| -| `restricted` | Yes | No | Most secure, requires UID range allocation | -| `restricted-v2` | Yes | No | Updated restricted with seccomp requirements | -| `anyuid` | Yes | No | Allows specific UID without range restrictions | -| `nonroot` | Yes | No | Must run as non-root user | -| `nonroot-v2` | Yes | No | Updated nonroot with seccomp requirements | -| `privileged` | Yes | Yes | Required for agents with skills/sandbox | - -### Restricted SCC Configuration - -For OpenShift's `restricted` or `restricted-v2` SCC: - -```yaml -apiVersion: kagent.dev/v1alpha2 -kind: Agent -metadata: - name: restricted-agent -spec: - type: Declarative - declarative: - systemMessage: "Agent running under restricted SCC" - modelConfig: my-model - deployment: - podSecurityContext: - runAsNonRoot: true - # OpenShift will assign UID from namespace range - fsGroup: 1000680000 # Use your namespace's allocated range - seccompProfile: - type: RuntimeDefault - securityContext: - runAsNonRoot: true - allowPrivilegeEscalation: false - capabilities: - drop: - - ALL - seccompProfile: - type: RuntimeDefault -``` - -### Anyuid SCC Configuration - -For OpenShift's `anyuid` SCC (when you need a specific UID): - -```yaml -apiVersion: kagent.dev/v1alpha2 -kind: Agent -metadata: - name: anyuid-agent -spec: - type: Declarative - declarative: - systemMessage: "Agent running under anyuid SCC" - modelConfig: my-model - deployment: - podSecurityContext: - runAsUser: 1000 - runAsGroup: 1000 - fsGroup: 1000 - securityContext: - runAsUser: 1000 - runAsGroup: 1000 - allowPrivilegeEscalation: false - capabilities: - drop: - - ALL -``` - -### Privileged SCC Configuration - -For agents that require skills or sandboxed code execution: - -```yaml -apiVersion: kagent.dev/v1alpha2 -kind: Agent -metadata: - name: privileged-agent -spec: - type: Declarative - skills: - refs: - - my-skill:latest - declarative: - systemMessage: "Agent with skills requiring privileged SCC" - modelConfig: my-model - deployment: - podSecurityContext: - runAsUser: 0 - runAsGroup: 0 -``` - -**Important**: When an agent uses skills, kagent automatically sets `privileged: true` on the container 
security context to enable sandbox functionality. - -## Pod Security Standards (PSS) Compliance - -kagent supports Kubernetes Pod Security Standards at all levels. The table below shows compatibility: - -### PSS Compatibility Matrix - -| PSS Level | Basic Agent | Agent with Skills | Notes | -|-----------|-------------|-------------------|-------| -| `privileged` | Yes | Yes | No restrictions | -| `baseline` | Yes | No | Prevents known privilege escalations | -| `restricted` | Yes | No | Highly restrictive, current best practices | - -### PSS Restricted Profile Configuration - -For namespaces enforcing the `restricted` PSS profile: - -```yaml -apiVersion: kagent.dev/v1alpha2 -kind: Agent -metadata: - name: pss-restricted-agent -spec: - type: Declarative - declarative: - systemMessage: "PSS restricted compliant agent" - modelConfig: my-model - deployment: - podSecurityContext: - runAsNonRoot: true - runAsUser: 65534 # nobody user - runAsGroup: 65534 - fsGroup: 65534 - seccompProfile: - type: RuntimeDefault - securityContext: - runAsNonRoot: true - runAsUser: 65534 - runAsGroup: 65534 - allowPrivilegeEscalation: false - readOnlyRootFilesystem: true - capabilities: - drop: - - ALL - seccompProfile: - type: RuntimeDefault - # Add volumes for writable paths when using readOnlyRootFilesystem - volumes: - - name: tmp - emptyDir: {} - - name: cache - emptyDir: {} - volumeMounts: - - name: tmp - mountPath: /tmp - - name: cache - mountPath: /cache -``` - -### PSS Baseline Profile Configuration - -For namespaces enforcing the `baseline` PSS profile: - -```yaml -apiVersion: kagent.dev/v1alpha2 -kind: Agent -metadata: - name: pss-baseline-agent -spec: - type: Declarative - declarative: - systemMessage: "PSS baseline compliant agent" - modelConfig: my-model - deployment: - securityContext: - allowPrivilegeEscalation: false - capabilities: - drop: - - ALL - add: - - NET_BIND_SERVICE # Allowed in baseline -``` - -## Sandbox and Privileged Mode - -When an agent uses skills (via `spec.skills.refs`), kagent automatically enables privileged mode for the container. This is required for: - -- Container-based sandbox execution -- Isolated code execution environments -- Skill package installation and execution - -The privileged mode is applied regardless of user-provided security context settings, though other security context fields (like `runAsUser`) are preserved. - -### Implications - -1. **OpenShift**: Requires the `privileged` SCC to be assigned to the service account -2. **Kubernetes PSS**: Cannot run in namespaces with `restricted` or `baseline` enforcement -3. **Security**: Carefully review which agents require skills and limit their use in production - -## Best Practices - -1. **Start Restricted**: Begin with the most restrictive settings and relax only as needed -2. **Use Non-Root**: Always prefer running as non-root (`runAsNonRoot: true`) -3. **Drop Capabilities**: Drop all capabilities and add back only what's needed -4. **Enable Seccomp**: Use `RuntimeDefault` seccomp profile for defense in depth -5. **Read-Only Filesystem**: Use `readOnlyRootFilesystem: true` with explicit writable volume mounts -6. **Limit Skills Usage**: Only use skills when necessary, as they require privileged mode -7. 
**Namespace Isolation**: Run privileged agents in dedicated namespaces with appropriate PSS/SCC policies - -## Troubleshooting - -### Common Issues - -**Pod fails to start with "container has runAsNonRoot and image will run as root"** -- Ensure both `runAsNonRoot: true` and a non-zero `runAsUser` are set - -**Pod fails with SCC/PSS policy violations** -- Check the namespace's PSS enforcement level: `kubectl get ns -o yaml` -- For OpenShift, verify SCC assignments: `oc get scc` and `oc adm policy who-can use scc/` - -**Agent with skills fails to start** -- Verify the privileged SCC is available (OpenShift) -- Ensure the namespace allows privileged pods (Kubernetes PSS) - -### Verification Commands - -```bash -# Check which SCC a pod is using (OpenShift) -oc get pod -o yaml | grep scc - -# Check namespace PSS enforcement (Kubernetes) -kubectl get ns -o yaml | grep pod-security - -# Describe pod for security context issues -kubectl describe pod -``` - -## References - -- [Kubernetes Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/) -- [OpenShift Security Context Constraints](https://docs.openshift.com/container-platform/latest/authentication/managing-security-context-constraints.html) -- [Kubernetes Security Context](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/) diff --git a/go/internal/httpserver/middleware_audit_test.go b/go/internal/httpserver/middleware_audit_test.go index 897a3dab8..6ef4e24c4 100644 --- a/go/internal/httpserver/middleware_audit_test.go +++ b/go/internal/httpserver/middleware_audit_test.go @@ -158,11 +158,11 @@ func TestAuditLoggingMiddleware_WithIncludeHeaders(t *testing.T) { func TestExtractNamespace(t *testing.T) { tests := []struct { - name string - path string - query string - header string - expected string + name string + path string + query string + header string + expected string }{ { name: "extract from path - agents", @@ -355,4 +355,3 @@ func TestAuditLoggingMiddleware_AllHTTPMethods(t *testing.T) { }) } } -