data_architecture
Arne Molland edited this page Sep 16, 2025 · 1 revision
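fleetd collects several kinds of data from thousands of edge devices. Each kind has a different access pattern, so each is stored in a system suited to it: PostgreSQL (with TimescaleDB) for relational metadata, VictoriaMetrics for metrics, Loki for logs, ClickHouse for analytics, and Valkey/Redis for sessions.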
Data Types (relational metadata, stored in PostgreSQL):
- Device registry (ID, name, metadata, API keys)
- User accounts and authentication
- Organizations and tenancy
- Update campaigns and configurations
- Webhook configurations
Why PostgreSQL:
- ACID compliance for critical data
- Rich querying capabilities
- Foreign key constraints
- Row-level security for multi-tenancy
TimescaleDB Extension:
- Automatic partitioning for time-series tables
- Compression for historical data
- Continuous aggregates for real-time dashboards
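As a rough sketch of how the PostgreSQL and TimescaleDB features above might be applied, the Python helpers below only build SQL statements; they do not connect to a database. The table name `device_events`, the `device_id` and `org_id` columns, the `app.current_org` session setting, and the 7-day compression delay are assumptions made for illustration and are not taken from fleetd. The statements use the classic `create_hypertable` call form; newer TimescaleDB releases also offer other forms.

```python
def timeseries_table_statements(
    table: str,
    time_column: str = "time",
    chunk_interval: str = "1 day",
    retention: str = "30 days",
    compress_after: str = "7 days",  # assumed value, not from the source
) -> list[str]:
    """Build SQL that partitions, compresses, and expires a time-series table.

    Identifiers are interpolated directly, so they must come from trusted code.
    """
    return [
        # Automatic partitioning into time-based chunks.
        f"SELECT create_hypertable('{table}', '{time_column}', "
        f"chunk_time_interval => INTERVAL '{chunk_interval}');",
        # Enable compression, grouping compressed rows by device (assumed column).
        f"ALTER TABLE {table} SET (timescaledb.compress, "
        f"timescaledb.compress_segmentby = 'device_id');",
        f"SELECT add_compression_policy('{table}', INTERVAL '{compress_after}');",
        # Drop chunks older than the retention window.
        f"SELECT add_retention_policy('{table}', INTERVAL '{retention}');",
    ]


def tenant_isolation_statements(
    table: str,
    tenant_column: str = "org_id",        # assumed column name
    setting: str = "app.current_org",     # assumed session setting
) -> list[str]:
    """Build SQL that restricts rows to the tenant named in a session setting."""
    return [
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY;",
        f"CREATE POLICY tenant_isolation ON {table} "
        f"USING ({tenant_column}::text = current_setting('{setting}'));",
    ]


for statement in timeseries_table_statements("device_events"):
    print(statement)
for statement in tenant_isolation_statements("devices"):
    print(statement)
```

With a policy like this, the application would set `app.current_org` on each connection or transaction before querying. Note that table owners bypass row-level security unless it is forced on the table.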
Data Types (metrics, stored in VictoriaMetrics):
- Device metrics (CPU, memory, disk, network)
- Performance counters
- Custom application metrics
- Real-time telemetry
Why VictoriaMetrics:
- Prometheus-compatible (drop-in replacement)
- Higher compression than Prometheus (up to 10x claimed; the actual ratio depends on the workload)
- Handles millions of active time series
- Built-in retention policies; downsampling is available in the Enterprise edition
- MetricsQL for advanced queries
Data Flow:
Device → Agent → Fleet Server → VictoriaMetrics → Grafana
VictoriaMetrics → Long-term storage (ClickHouse)
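To make the Fleet Server → VictoriaMetrics hop more concrete, here is a minimal sketch that renders one device report in the Prometheus text exposition format, which VictoriaMetrics can ingest through its `/api/v1/import/prometheus` endpoint. The `fleetd_` metric prefix, the `device_id` label, and the function names are assumptions for illustration; the actual wire format used by fleetd is not specified here.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_prometheus_lines(device_id: str, metrics: dict[str, float], timestamp_ms: int) -> str:
    """Render one device report as Prometheus text exposition lines.

    Each line is: name{labels} value timestamp_in_milliseconds
    """
    label = escape_label_value(device_id)
    lines = [
        f'fleetd_{name}{{device_id="{label}"}} {value} {timestamp_ms}'
        for name, value in sorted(metrics.items())
    ]
    return "\n".join(lines) + "\n"


body = to_prometheus_lines(
    "dev-1",
    {"cpu_percent": 12.5, "memory_used_bytes": 1048576.0},
    1_700_000_000_000,
)
print(body, end="")
```

The resulting text would be sent as the body of an HTTP POST to the VictoriaMetrics URL from the configuration. Batching several devices into one request is usually preferable to one request per report.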
Data Types (logs, stored in Loki):
- Device system logs
- Application logs
- Audit trails
- Error logs and stack traces
Why Loki:
- Indexes metadata, not content (cost-effective)
- Native Grafana integration
- Kubernetes-native if needed
- LogQL query language
- Multi-tenancy support
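Because Loki indexes only labels, the shape of a push request shows where the cost saving comes from: a small label set identifies the stream, and the log text travels as unindexed values. The sketch below builds the JSON body accepted by Loki's `/loki/api/v1/push` endpoint. The label names (`job`, `level`) and the helper name are assumptions for illustration; labels should stay low in cardinality.

```python
import json


def loki_push_payload(labels: dict[str, str], entries: list[tuple[int, str]]) -> dict:
    """Build a Loki push body for one stream.

    entries holds (unix_time_in_nanoseconds, log_line) pairs. Loki expects the
    timestamp as a string of nanoseconds.
    """
    return {
        "streams": [
            {
                "stream": dict(labels),  # indexed: keep this set small
                "values": [[str(ts_ns), line] for ts_ns, line in entries],  # not indexed
            }
        ]
    }


payload = loki_push_payload(
    {"job": "fleetd-agent", "level": "error"},
    [(1_700_000_000_000_000_000, "disk usage above threshold on /var")],
)
print(json.dumps(payload, indent=2))
```

The body would be POSTed with `Content-Type: application/json`. In a multi-tenant Loki setup the tenant is normally selected with the `X-Scope-OrgID` request header rather than a label.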
Data Types (analytics, stored in ClickHouse):
- Aggregated metrics for reporting
- Historical trends
- Capacity planning data
- Business intelligence queries
Why ClickHouse:
- Columnar storage with high compression (up to 100x on highly repetitive columns; typical ratios are lower)
- Sub-second OLAP queries
- SQL interface
- Materialized views for pre-aggregation
- Excellent for time-series analytics
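The pre-aggregation that a materialized view would maintain can be illustrated in plain Python. This is not ClickHouse code; it is only a sketch of the rollup logic (count, average, and maximum per device per hour), and the tuple layout and function name are assumptions.

```python
from collections import defaultdict


def hourly_rollup(samples):
    """Aggregate (device_id, unix_seconds, value) samples into hourly buckets.

    Returns {(device_id, hour_start_seconds): {"count", "avg", "max"}}.
    """
    acc = defaultdict(lambda: [0, 0.0, float("-inf")])  # count, sum, max
    for device_id, ts, value in samples:
        bucket = acc[(device_id, ts - ts % 3600)]  # truncate to the start of the hour
        bucket[0] += 1
        bucket[1] += value
        bucket[2] = max(bucket[2], value)
    return {
        key: {"count": count, "avg": total / count, "max": peak}
        for key, (count, total, peak) in acc.items()
    }


samples = [("dev-1", 3600, 10.0), ("dev-1", 3700, 30.0), ("dev-1", 7200, 5.0)]
for key, stats in sorted(hourly_rollup(samples).items()):
    print(key, stats)
```

Reports and trend queries then read the small rollup instead of scanning raw points, which is the main reason to pre-aggregate.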
graph TB
subgraph "Edge Devices"
D1[Device 1]
D2[Device 2]
DN[Device N]
end
subgraph "Fleet Server"
API[API Gateway]
DP[Data Processor]
end
subgraph "Storage Layer"
PG[(PostgreSQL<br/>+ TimescaleDB)]
VM[(VictoriaMetrics)]
LOKI[(Loki)]
CH[(ClickHouse)]
REDIS[(Valkey/Redis)]
end
subgraph "Query Layer"
GF[Grafana]
WEB[Web UI]
ANALYTICS[Analytics API]
end
D1 --> API
D2 --> API
DN --> API
API --> DP
DP -->|Metadata| PG
DP -->|Metrics| VM
DP -->|Logs| LOKI
DP -->|Sessions| REDIS
VM -->|Downsample| CH
PG -->|Archive| CH
GF --> VM
GF --> LOKI
GF --> PG
WEB --> API
WEB --> PG
ANALYTICS --> CH
ANALYTICS --> PG
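The Data Processor's fan-out in the diagram (Metadata, Metrics, Logs, Sessions) can be sketched as a simple routing table. The enum, the backend names as strings, and the batching function are hypothetical; they only mirror the edges drawn above.

```python
from collections import defaultdict
from enum import Enum


class Kind(Enum):
    METADATA = "metadata"
    METRICS = "metrics"
    LOGS = "logs"
    SESSIONS = "sessions"


# One entry per labelled edge leaving the Data Processor in the diagram.
ROUTES = {
    Kind.METADATA: "postgresql",
    Kind.METRICS: "victoriametrics",
    Kind.LOGS: "loki",
    Kind.SESSIONS: "valkey",
}


def group_by_backend(records):
    """Group (kind, payload) pairs into one batch per storage backend."""
    batches = defaultdict(list)
    for kind, payload in records:
        batches[ROUTES[kind]].append(payload)
    return dict(batches)


incoming = [
    (Kind.METRICS, {"cpu_percent": 12.5}),
    (Kind.LOGS, "service restarted"),
    (Kind.METRICS, {"cpu_percent": 14.0}),
]
print(group_by_backend(incoming))
```

Grouping before writing lets each backend receive one batched request per flush instead of one request per record.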
- Migrate from SQLite to PostgreSQL
- Add TimescaleDB extension
- Set up connection pooling with PgBouncer
- Implement proper migrations
- Deploy VictoriaMetrics
- Configure remote write from agents
- Set up retention policies
- Create Grafana dashboards
- Deploy Loki
- Configure Promtail/Fluent Bit on devices
- Set up log forwarding
- Create log exploration dashboards
- Deploy ClickHouse
- Set up data pipelines from VictoriaMetrics
- Create materialized views
- Build analytics API
For 10,000 devices reporting every 30 seconds:
- Raw metrics: ~100 metrics/device × 10,000 devices = 1M data points per 30 s (2,880 intervals/day, so ~2.9 billion points/day)
- Metrics storage: ~2 bytes/point after compression ≈ 5.8 GB/day
- Metrics at 1-year retention: ~2.1 TB
- Logs: assuming 1 KB/log line and 100 lines/minute/device
- Log storage: ~1.4 TB/day (before compression)
- With 10:1 compression: ~140 GB/day
- 30-day retention: ~4.2 TB
- Device metadata: ~10 KB/device = 100 MB
- Configurations: ~1 GB
- Total: <10 GB for core data
- Downsampled metrics (ClickHouse): ~200 GB/year
- Aggregated reports (ClickHouse): ~50 GB/year
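The metrics and log estimates above follow from simple arithmetic, which the sketch below reproduces so the inputs can be varied. It assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes) and uses the device count, interval, and sizes stated above; the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400
MINUTES_PER_DAY = 1_440


def metrics_bytes_per_day(devices, metrics_per_device, interval_s, bytes_per_point):
    """Compressed metric volume per day, in bytes."""
    samples_per_series = SECONDS_PER_DAY // interval_s
    return devices * metrics_per_device * samples_per_series * bytes_per_point


def logs_bytes_per_day(devices, lines_per_minute, bytes_per_line, compression_ratio=1.0):
    """Log volume per day, in bytes, after dividing by the compression ratio."""
    raw = devices * lines_per_minute * MINUTES_PER_DAY * bytes_per_line
    return raw / compression_ratio


metrics_day = metrics_bytes_per_day(10_000, 100, 30, 2)
logs_raw_day = logs_bytes_per_day(10_000, 100, 1_000)
logs_day = logs_bytes_per_day(10_000, 100, 1_000, compression_ratio=10)

print(f"metrics: {metrics_day / 1e9:.2f} GB/day, {metrics_day * 365 / 1e12:.2f} TB/year")
print(
    f"logs: {logs_raw_day / 1e12:.2f} TB/day raw, "
    f"{logs_day / 1e9:.0f} GB/day compressed, "
    f"{logs_day * 30 / 1e12:.2f} TB per 30 days"
)
```

Unrounded, this gives 5.76 GB/day and 2.10 TB/year for metrics, and 1.44 TB/day raw, 144 GB/day compressed, and 4.32 TB per 30 days for logs. The bytes-per-point and compression-ratio inputs are the least certain and are worth measuring on real traffic.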
# PostgreSQL
DATABASE_URL=postgresql://user:pass@localhost:5432/fleetd
DATABASE_POOL_SIZE=25
# TimescaleDB
TIMESCALE_CHUNK_INTERVAL=1d
TIMESCALE_RETENTION_DAYS=30
# VictoriaMetrics
VICTORIA_METRICS_URL=http://localhost:8428
METRICS_RETENTION_DAYS=365
METRICS_DOWNSAMPLING=5m:30d,1h:90d,1d:1y
# Loki
LOKI_URL=http://localhost:3100
LOKI_RETENTION_DAYS=30
# ClickHouse
CLICKHOUSE_URL=http://localhost:8123
CLICKHOUSE_DATABASE=fleetd_analytics
# Valkey/Redis
VALKEY_ADDR=localhost:6379
- Ingestion Rate: Points/second into VictoriaMetrics
- Query Latency: P95 response times
- Storage Growth: GB/day per system
- Device Check-in Rate: Devices/minute
- Error Rates: Failed writes, timeouts
- Device offline > 5 minutes
- Ingestion rate drop > 20%
- Storage usage > 80%
- Query latency > 1s (P95)
- Failed writes > 1%
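The five thresholds above can be read as alert conditions. A minimal sketch of evaluating them against one snapshot of measurements follows; the `HealthSnapshot` fields, the alert names, and the idea of comparing the ingestion rate against a stored baseline are assumptions, since the text does not say how the 20% drop is measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSnapshot:
    seconds_since_checkin: float    # for a single device
    ingestion_rate: float           # points/second right now
    baseline_ingestion_rate: float  # points/second considered normal (assumed input)
    storage_used_fraction: float    # 0.0 to 1.0
    p95_query_latency_s: float
    failed_write_fraction: float    # 0.0 to 1.0


def firing_alerts(s: HealthSnapshot) -> list[str]:
    """Return the names of the alert conditions that the snapshot violates."""
    alerts = []
    if s.seconds_since_checkin > 5 * 60:
        alerts.append("device_offline")
    # A drop of more than 20% means the current rate is below 80% of baseline.
    if s.baseline_ingestion_rate > 0 and s.ingestion_rate < 0.8 * s.baseline_ingestion_rate:
        alerts.append("ingestion_drop")
    if s.storage_used_fraction > 0.80:
        alerts.append("storage_high")
    if s.p95_query_latency_s > 1.0:
        alerts.append("query_latency_high")
    if s.failed_write_fraction > 0.01:
        alerts.append("failed_writes_high")
    return alerts


print(firing_alerts(HealthSnapshot(600, 700, 1000, 0.9, 1.5, 0.02)))
```

In practice these rules would more likely live in the alerting system that queries VictoriaMetrics than in application code; the sketch only pins down the comparisons.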
- TLS for all data in transit
- Encryption at rest for sensitive data
- API key rotation every 90 days
- Row-level security in PostgreSQL
- Tenant isolation in VictoriaMetrics
- Label-based access in Loki
- GDPR: 30-day log retention, right to deletion
- Data residency: Regional deployments
- Audit logging: All access logged
- PostgreSQL: Daily full + hourly incremental
- VictoriaMetrics: Daily snapshots
- Loki: S3 backend with versioning
- ClickHouse: Weekly full backups
- RPO (Recovery Point Objective): 1 hour
- RTO (Recovery Time Objective): 4 hours
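One way to sanity-check the backup schedule against the RPO is to treat the interval between backups as the worst-case data loss. That is a simplifying assumption: it ignores replication, write-ahead logs, and the option of rebuilding a store from upstream data. Loki is left out because its S3 backend is not interval-based. The dictionary and function names are illustrative.

```python
from datetime import timedelta

RPO = timedelta(hours=1)

# Shortest backup interval listed for each system above.
BACKUP_INTERVALS = {
    "postgresql": timedelta(hours=1),       # hourly incremental
    "victoriametrics": timedelta(days=1),   # daily snapshots
    "clickhouse": timedelta(weeks=1),       # weekly full backups
}


def systems_exceeding_rpo(intervals: dict[str, timedelta], rpo: timedelta) -> list[str]:
    """Return systems whose backup interval alone cannot satisfy the RPO."""
    return sorted(name for name, interval in intervals.items() if interval > rpo)


print(systems_exceeding_rpo(BACKUP_INTERVALS, RPO))
```

Under this assumption VictoriaMetrics and ClickHouse exceed the 1-hour RPO on backups alone, so the target would have to be met another way for those stores, for example through replication or by treating them as rebuildable. Which applies here is not stated in this page.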
- Hot (0-7 days): NVMe SSD
- Warm (7-30 days): Standard SSD
- Cold (30+ days): Object storage (S3)
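The tier boundaries above map directly to a lookup by data age. In the sketch below, the handling of the exact boundary days is an assumption, since the ranges in the list overlap at 7 and 30 days: data aged exactly 7 days is treated as warm and exactly 30 days as cold.

```python
def storage_tier(age_days: float) -> str:
    """Return the storage tier for data of the given age in days."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < 7:
        return "hot"    # NVMe SSD
    if age_days < 30:
        return "warm"   # standard SSD
    return "cold"       # object storage (S3)


for age in (0, 6.5, 7, 29, 30, 400):
    print(age, storage_tier(age))
```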
- Raw metrics: 30 days
- Downsampled metrics: 1 year
- Logs: 30 days (configurable)
- Analytics: 2 years
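These retention periods line up with the `METRICS_DOWNSAMPLING=5m:30d,1h:90d,1d:1y` setting in the configuration. A minimal parser sketch follows, assuming each comma-separated rule means `resolution:keep_for` and that `y` means 365 days; neither interpretation is confirmed by this page, and the function names are illustrative.

```python
import re
from datetime import timedelta

_UNITS = {
    "m": timedelta(minutes=1),
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
    "y": timedelta(days=365),  # assumption: one year is 365 days
}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '5m', '1h', '30d', or '1y'."""
    match = re.fullmatch(r"(\d+)([mhdy])", text.strip())
    if match is None:
        raise ValueError(f"invalid duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def parse_downsampling(spec: str) -> list[tuple[timedelta, timedelta]]:
    """Parse 'resolution:keep_for' rules separated by commas."""
    rules = []
    for part in spec.split(","):
        resolution, keep_for = part.split(":")
        rules.append((parse_duration(resolution), parse_duration(keep_for)))
    return rules


for resolution, keep_for in parse_downsampling("5m:30d,1h:90d,1d:1y"):
    print(f"keep {resolution} resolution for {keep_for.days} days")
```

Validating the value at startup, rather than when the first rule is applied, makes a malformed setting fail early.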
- Kafka for event streaming
- Multiple VictoriaMetrics clusters
- ClickHouse sharding
- Edge aggregation nodes
- Anomaly detection on metrics
- Predictive maintenance
- Automated root cause analysis
- Local metric aggregation
- Edge-based alerting
- Federated learning