Nebula Agent
Agent daemon for managing models on Nebula GPU nodes.
Features
Core Daemon (cmd/nebula-agent/main.go)
- CLI based on urfave/cli with flag and environment variable support
- Structured logging (slog) with JSON and text format support
- Graceful shutdown with 30-second timeout
- gRPC interceptors for logging and panic recovery
- gRPC reflection for debugging
- Prometheus metrics server
- Automatic node capability detection at startup
Business Logic (compute/daemon/agent.go)
- Structured logging for all operations
- Improved error handling
- Support for:
- GPU management via NVML
- Docker container management
- Model caching
- Deployment state monitoring
gRPC Server (platform/api/grpc/server/agent_server.go)
- Detailed logging of all gRPC requests
- Conversion between protobuf and internal types
- Error handling with clear messages
Usage
Running the Agent
# With configuration file
./nebula-agent --config /etc/nebula/agent.yaml
# With environment variables
export NEBULA_NODE_ID="my-node-123"
export NEBULA_CACHE_PATH="/var/lib/nebula/models"
export NEBULA_LOG_LEVEL="debug"
./nebula-agent
# With CLI flags
./nebula-agent --log-level debug --log-format json
CLI Options
| Option | Description | Default |
|---|---|---|
--config, -c | Configuration file path | /etc/nebula/agent.yaml |
--log-level, -l | Log level: debug, info, warn, error | info |
--log-format | Log format: text, json | text |
--help, -h | Help | - |
--version, -v | Version | - |
Configuration
Example configuration (agent.yaml):
agent:
node_id: "test-node-123"
cache_path: "/var/lib/nebula/models"
version: "0.1.0"
grpc_port: 9091
control_plane:
endpoint: "localhost:9090"
tls: false
docker:
socket: "/var/run/docker.sock"
network: "nebula-network"
metrics:
enabled: true
port: 9100
path: "/metrics"
logging:
level: "info"
format: "json"
cache:
max_size_gb: 100
cleanup_policy: "lru"
Environment Variables
| Variable | Description |
|---|---|
NEBULA_CONFIG | Configuration path |
NEBULA_NODE_ID | Node ID |
NEBULA_CACHE_PATH | Model cache path |
NEBULA_CONTROL_PLANE | Control plane address |
NEBULA_LOG_LEVEL | Log level |
NEBULA_LOG_FORMAT | Log format |
NEBULA_METRICS_PORT | Metrics port |
gRPC API
The agent provides the following gRPC methods:
| Method | Description |
|---|---|
| GetCapabilities | Get node capabilities (GPU, CPU, memory) |
| PrepareModel | Prepare model (download and cache) |
| StartRuntime | Start runtime container with model |
| StopRuntime | Stop runtime container |
| GetStats | Get resource usage statistics |
| GetHealth | Check agent health |
Usage with grpcurl
# Get capabilities
grpcurl -plaintext localhost:9091 proto.AgentService/GetCapabilities
# Check health
grpcurl -plaintext localhost:9091 proto.AgentService/GetHealth
# Get statistics
grpcurl -plaintext -d '{"node_id": "test-node"}' localhost:9091 proto.AgentService/GetStats
Provisioner
For automatic agent installation on remote nodes:
import "nebula/platform/service/provisioning/ssh"
config := &ssh.ProvisionConfig{
// SSH connection
Host: "192.168.1.100",
Port: 22,
User: "ubuntu",
KeyPath: "~/.ssh/id_rsa",
// Agent settings
NodeName: "gpu-node-1",
AgentVersion: "0.1.0",
CachePath: "/var/lib/nebula/models",
GRPCPort: 9091,
MetricsPort: 9100,
// Options
Interactive: true,
}
provisioner, err := ssh.NewProvisioner(config)
if err != nil {
log.Fatal(err)
}
defer provisioner.Close()
nodeInfo, err := provisioner.Provision(context.Background())
System Requirements
- Go 1.24+
- Docker (for running models)
- NVIDIA GPU + drivers (optional, for GPU workloads)
- NVIDIA Container Runtime (optional, for GPU workloads)
Architecture
cmd/nebula-agent/ - Entry point
└── main.go - CLI, logging, gRPC server
compute/
├── daemon/ - Agent implementation
├── cache/ - Model cache management
├── docker/ - Container management
├── gpu/ - GPU monitoring via NVML
├── runtime/ - Runtime implementations
└── storage/ - State persistence
Logging
The agent uses structured logging (slog) at all levels:
level=INFO msg="Starting Nebula Node Agent" version=0.1.0
level=INFO msg="Configuration loaded" node_id=test-node-123 cache_path=/tmp/nebula-test-cache grpc_port=9091
level=INFO msg="Agent initialized successfully"
level=INFO msg="Node capabilities discovered" gpus=0 cpu_cores=10 memory_gb=16
level=INFO msg="Starting metrics server" address=:9100
level=INFO msg="gRPC server starting" port=9091
level=INFO msg="Nebula Agent is ready and accepting requests"
Building
# Build for current platform
go build -o nebula-agent ./cmd/nebula-agent
# Cross-compile for Linux
GOOS=linux GOARCH=amd64 go build -o nebula-agent-linux ./cmd/nebula-agent
# Cross-compile for Linux ARM64
GOOS=linux GOARCH=arm64 go build -o nebula-agent-linux-arm64 ./cmd/nebula-agent