Monitoring System
Overview
The EDURange Cloud Monitoring System provides real-time resource usage metrics and visualization for the platform. It collects data on CPU, memory, network traffic, and challenge pod status, making this information available through both a dedicated API and Prometheus metrics endpoints. This system helps administrators monitor the health and performance of their EDURange Cloud deployment.
Architecture
The monitoring system consists of several interconnected components:
- Monitoring Service: A dedicated microservice that collects metrics from various sources and exposes them via REST API endpoints.
- Prometheus: An open-source monitoring and alerting toolkit that stores time-series data.
- Dashboard Integration: Components in the EDURange dashboard that visualize the collected metrics.
- Kubernetes Integration: RBAC permissions and service accounts that allow the monitoring service to access cluster metrics.
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │
│    Dashboard    │◄────┤   Monitoring    │◄────┤   Prometheus    │
│   (Frontend)    │     │    Service      │     │                 │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                 ▲                       ▲
                                 │                       │
                                 │                       │
                                 ▼                       │
                        ┌─────────────────┐              │
                        │                 │              │
                        │   Kubernetes    │──────────────┘
                        │      API        │
                        │                 │
                        └─────────────────┘
Monitoring Service
The monitoring service is a Flask application that collects metrics from multiple sources:
- Kubernetes API: For pod and node information
- Prometheus: For network traffic and other system metrics
- psutil: As a fallback for local system metrics when Prometheus data is unavailable
Key Features
- Metrics Collection: Gathers CPU, memory, network, and challenge pod metrics
- Caching: Implements a TTL-based cache to reduce API load
- History Tracking: Maintains a 24-hour history of metrics for trend visualization
- Fallback Mechanisms: Switches to secondary sources (such as psutil) so that data is still reported when a primary source fails
- Prometheus Integration: Exposes collected metrics in Prometheus format
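To illustrate the caching feature listed above, here is a minimal sketch of a TTL-based cache. The class name `TTLCache`, the method `get_or_compute`, and the injectable `clock` parameter are hypothetical and chosen for illustration only; the service's actual cache may be structured differently.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny time-to-live cache (hypothetical sketch, not the service's code).

    An entry is considered fresh for `ttl` seconds after it was stored.
    """

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        """Return the cached value for `key`, recomputing it once it has expired."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # still fresh: skip the expensive upstream call
        value = compute()
        self._entries[key] = (now, value)
        return value
```

With a TTL of 15 seconds (the documented default for METRICS_CACHE_TTL), repeated requests inside that window would reuse the stored value instead of querying the Kubernetes API or Prometheus again.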
API Endpoints
Endpoint | Method | Description |
---|---|---|
/api/metrics/current | GET | Returns current values for all metrics |
/api/metrics/history | GET | Returns historical data for the specified metric type |
/api/pods | GET | Returns detailed information about challenge pods |
/api/nodes | GET | Returns information about cluster nodes |
/health | GET | Health check endpoint |
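
A client could call these endpoints roughly as follows. This is a sketch under assumptions: the base URL passed in is a placeholder, the query parameter name `type` for selecting a metric type is hypothetical (the table does not name it), and the responses are assumed to be JSON.

```python
import json
import urllib.request
from typing import Any, Optional
from urllib.parse import urlencode

# Endpoint paths taken from the table above.
ENDPOINTS = {
    "current": "/api/metrics/current",
    "history": "/api/metrics/history",
    "pods": "/api/pods",
    "nodes": "/api/nodes",
    "health": "/health",
}


def endpoint_url(base_url: str, name: str, metric_type: Optional[str] = None) -> str:
    """Build the full URL for one of the monitoring service endpoints."""
    url = base_url.rstrip("/") + ENDPOINTS[name]
    if metric_type is not None:
        # ASSUMPTION: the parameter name "type" is hypothetical.
        url += "?" + urlencode({"type": metric_type})
    return url


def fetch_json(url: str, timeout: float = 5.0) -> Any:
    """GET a URL and decode the body as JSON (assumes a JSON response)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch_json(endpoint_url(base, "current"))` would retrieve the current metric values, where `base` is wherever the monitoring service is reachable in your deployment.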
Prometheus Integration
The monitoring service both consumes data from and exposes data to Prometheus:
Metrics Exposed
- edurange_cpu_usage_system: CPU usage by system components (%)
- edurange_cpu_usage_challenges: CPU usage by challenge pods (%)
- edurange_cpu_usage_total: Total CPU usage (%)
- edurange_memory_used: Memory used (%)
- edurange_memory_available: Memory available (%)
- edurange_memory_total_bytes: Total memory (bytes)
- edurange_memory_used_bytes: Used memory (bytes)
- edurange_network_inbound: Network inbound traffic (MB/s)
- edurange_network_outbound: Network outbound traffic (MB/s)
- edurange_network_total: Total network traffic (MB/s)
- edurange_challenge_count_total: Total number of challenge pods
- edurange_challenge_count_running: Number of running challenge pods
- edurange_challenge_count_pending: Number of pending challenge pods
- edurange_challenge_count_failed: Number of failed challenge pods
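For orientation, the sketch below renders a few of these metrics in the Prometheus text exposition format, which is what a scrape of the metrics endpoint returns. It assumes the metrics are gauges and uses a hand-written helper, `render_gauges`, which is hypothetical; the service may well rely on a Prometheus client library instead.

```python
from typing import Mapping


def render_gauges(values: Mapping[str, float], help_text: Mapping[str, str]) -> str:
    """Render metric values as Prometheus text-format gauges.

    ASSUMPTION: all metrics are treated as gauges for this illustration.
    """
    lines = []
    for name, value in values.items():
        if name in help_text:
            lines.append(f"# HELP {name} {help_text[name]}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_gauges(
            {"edurange_cpu_usage_total": 42.5, "edurange_challenge_count_running": 3},
            {"edurange_cpu_usage_total": "Total CPU usage (%)"},
        ),
        end="",
    )
```

The sample values here are made up; only the metric names and descriptions come from the list above.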
Metrics Consumed
- node_network_receive_bytes_total: Network receive bytes
- node_network_transmit_bytes_total: Network transmit bytes
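Because both consumed metrics are cumulative byte counters, a traffic rate has to be derived from them, typically with the PromQL `rate()` function. The sketch below builds such a query and the matching URL for Prometheus's instant-query HTTP API (`/api/v1/query`). The 5-minute window, the `sum(...)` aggregation, and the helper names are assumptions for illustration, not the service's confirmed queries.

```python
from urllib.parse import urlencode


def rate_query(metric: str, window: str = "5m") -> str:
    """PromQL for the per-second rate of a counter, summed over all series.

    ASSUMPTION: the window and the sum() aggregation are illustrative choices.
    """
    return f"sum(rate({metric}[{window}]))"


def instant_query_url(prometheus_url: str, promql: str) -> str:
    """URL for a Prometheus instant query via GET /api/v1/query."""
    return prometheus_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})
```

For example, `instant_query_url(prometheus_url, rate_query("node_network_receive_bytes_total"))` yields a URL whose JSON response contains the inbound rate in bytes per second.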
Dashboard Integration
The dashboard integrates with the monitoring service to display resource usage metrics:
- Resource Usage Charts: Visualize CPU, memory, network, and challenge pod metrics
- Real-time Updates: Automatically refresh data every minute
- Historical View: Display 24-hour history with time-based navigation
- Current Time Indicator: Show the current position in the timeline
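The 24-hour historical view needs a time-bounded series of samples to draw from. One way to keep such a series, sketched here under assumptions, is a buffer that discards samples older than the retention window whenever a new one arrives. The class `MetricHistory` and its methods are hypothetical names and do not describe the service's actual storage.

```python
import time
from collections import deque
from typing import Deque, List, Optional, Tuple


class MetricHistory:
    """Keeps (timestamp, value) samples no older than the retention window."""

    def __init__(self, retention_hours: float = 24.0) -> None:
        self.retention_seconds = retention_hours * 3600
        self._samples: Deque[Tuple[float, float]] = deque()

    def add(self, value: float, timestamp: Optional[float] = None) -> None:
        """Append a sample, then drop everything older than the window."""
        ts = time.time() if timestamp is None else timestamp
        self._samples.append((ts, value))
        cutoff = ts - self.retention_seconds
        # Samples arrive in time order, so expired ones are always at the left.
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def samples(self) -> List[Tuple[float, float]]:
        """Return the retained samples, oldest first."""
        return list(self._samples)
```

A chart that refreshes every minute would then read `samples()` and plot the values against their timestamps.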
Installation and Configuration
The monitoring system is installed using the install-monitoring.sh script, which:
- Installs Prometheus and related components in the Kubernetes cluster
- Deploys the monitoring service with appropriate RBAC permissions
- Configures ServiceMonitor for Prometheus integration
- Sets up proper network access between components
Environment Variables
The monitoring service can be configured using the following environment variables:
Variable | Description | Default |
---|---|---|
PROMETHEUS_URL | URL of the Prometheus server | http://prometheus-kube-prometheus-prometheus.monitoring:9090 |
METRICS_CACHE_TTL | Cache time-to-live in seconds | 15 |
METRICS_PORT | Port for Prometheus metrics endpoint | 9100 |
HISTORY_RETENTION_HOURS | Hours to retain metrics history | 24 |
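
As a sketch of how these variables might be read at startup, the following loader applies the defaults from the table when a variable is unset. The `MonitoringConfig` dataclass and `load_config` function are hypothetical names; parsing the numeric values as integers is also an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class MonitoringConfig:
    prometheus_url: str
    metrics_cache_ttl: int  # seconds
    metrics_port: int
    history_retention_hours: int


def load_config(env: Mapping[str, str] = os.environ) -> MonitoringConfig:
    """Read settings from the environment, using the documented defaults."""
    return MonitoringConfig(
        prometheus_url=env.get(
            "PROMETHEUS_URL",
            "http://prometheus-kube-prometheus-prometheus.monitoring:9090",
        ),
        # ASSUMPTION: numeric settings are parsed as integers.
        metrics_cache_ttl=int(env.get("METRICS_CACHE_TTL", "15")),
        metrics_port=int(env.get("METRICS_PORT", "9100")),
        history_retention_hours=int(env.get("HISTORY_RETENTION_HOURS", "24")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.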
Network Traffic Monitoring
Network traffic monitoring is a critical component that:
- Queries Prometheus for network interface metrics
- Filters out virtual interfaces (lo, veth, docker, etc.)
- Calculates inbound, outbound, and total traffic rates
- Falls back to psutil for direct system metrics if Prometheus is unavailable
- Enforces minimum values so that the dashboard charts still render when traffic is very low
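The filtering and rate calculation steps above can be sketched as follows. The input is assumed to be per-interface rates in bytes per second (for example, from a Prometheus `rate()` query); the prefix list covers only the interfaces named above, the floor value `MIN_RATE_MBPS` is hypothetical, and 1 MB is taken as 1,000,000 bytes. The psutil fallback is left out to keep the example dependency-free.

```python
from typing import Dict, Mapping

# Prefixes named in the text; the real filter may cover more interface types.
VIRTUAL_PREFIXES = ("lo", "veth", "docker")

# ASSUMPTION: hypothetical floor so charts never receive an exact zero.
MIN_RATE_MBPS = 0.01

BYTES_PER_MB = 1_000_000  # ASSUMPTION: decimal megabytes


def is_physical(interface: str) -> bool:
    """True if the interface name does not start with a virtual prefix."""
    return not interface.startswith(VIRTUAL_PREFIXES)


def traffic_rates(
    rx_bytes_per_s: Mapping[str, float],
    tx_bytes_per_s: Mapping[str, float],
) -> Dict[str, float]:
    """Sum per-interface rates over physical interfaces, in MB/s, with a floor."""
    inbound = sum(v for k, v in rx_bytes_per_s.items() if is_physical(k)) / BYTES_PER_MB
    outbound = sum(v for k, v in tx_bytes_per_s.items() if is_physical(k)) / BYTES_PER_MB
    inbound = max(inbound, MIN_RATE_MBPS)
    outbound = max(outbound, MIN_RATE_MBPS)
    return {"inbound": inbound, "outbound": outbound, "total": inbound + outbound}
```

Note that a floor of this kind trades accuracy for display continuity: a reported value at the minimum may mean either very low traffic or no data at all.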
Troubleshooting
Common issues and their solutions:
No Network Traffic Data
- Check if Prometheus is running:
kubectl get pods -n monitoring
- Verify ServiceMonitor configuration:
kubectl get servicemonitor -n monitoring
- Check monitoring service logs:
kubectl logs -l app=monitoring-service
- Ensure node-exporter is collecting network metrics:
kubectl get pods -n monitoring | grep node-exporter
Missing or Incomplete Metrics
- Verify RBAC permissions for the monitoring service
- Check if the monitoring service can access the Kubernetes API
- Ensure Prometheus is scraping the monitoring service endpoints
- Check for errors in the monitoring service logs
Dashboard Not Showing Metrics
- Verify the monitoring service is running
- Check if the dashboard can reach the monitoring service API
- Inspect browser network requests for API errors
- Ensure the correct environment variables are set in the dashboard configuration
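To check reachability from a machine or pod other than the browser, a small probe against the /health endpoint can help. This is a sketch: the function name `is_healthy` is hypothetical, the base URL depends on your deployment, and a healthy service is assumed to answer with HTTP 200.

```python
import urllib.request


def is_healthy(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200.

    ASSUMPTION: a 200 status means healthy; the response body is ignored.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection failures, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError both derive from OSError).
        return False
```

A `False` result from inside the cluster but not from the service's own pod would point toward a network access problem rather than a crashed service.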
Security Considerations
The monitoring system is designed with security in mind:
- Uses dedicated service accounts with minimal permissions
- Implements proper RBAC for Kubernetes API access
- Exposes metrics only on internal cluster networks
- Validates and sanitizes all input data
- Implements rate limiting to prevent abuse
Future Enhancements
Planned improvements to the monitoring system:
- Alerting: Integration with alert managers for proactive notification
- Custom Metrics: Support for user-defined metrics
- Extended History: Longer retention periods for historical data
- Grafana Dashboards: Pre-configured dashboards for detailed analysis
- Resource Prediction: ML-based prediction of resource usage trends