Nebula Agent

Agent daemon for managing models on Nebula GPU nodes.

Features

Core Daemon (cmd/nebula-agent/main.go)

  • CLI based on urfave/cli with flag and environment variable support
  • Structured logging (slog) with JSON and text format support
  • Graceful shutdown with 30-second timeout
  • gRPC interceptors for logging and panic recovery
  • gRPC reflection for debugging
  • Prometheus metrics server
  • Automatic node capability detection at startup

Business Logic (compute/daemon/agent.go)

  • Structured logging for all operations
  • Improved error handling
  • Support for:
    • GPU management via NVML
    • Docker container management
    • Model caching
    • Deployment state monitoring
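
To illustrate the model caching item, here is a sketch of size-bounded least-recently-used eviction, the policy named by `cleanup_policy: "lru"` in the example configuration. The type and method names (`lruCache`, `Add`, `Touch`) are assumptions for illustration and are not taken from the agent's cache package.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is one cached model: a hypothetical name plus its size on disk.
type entry struct {
	name   string
	sizeGB int
}

// lruCache tracks cached models and evicts the least recently used ones
// once the total size exceeds maxGB.
type lruCache struct {
	maxGB  int
	usedGB int
	order  *list.List               // front = most recently used
	items  map[string]*list.Element // model name -> element in order
}

func newLRUCache(maxGB int) *lruCache {
	return &lruCache{maxGB: maxGB, order: list.New(), items: map[string]*list.Element{}}
}

// Touch marks a model as recently used and reports whether it is cached.
func (c *lruCache) Touch(name string) bool {
	el, ok := c.items[name]
	if ok {
		c.order.MoveToFront(el)
	}
	return ok
}

// Add records a model and returns the names evicted to make room for it.
func (c *lruCache) Add(name string, sizeGB int) []string {
	if c.Touch(name) {
		return nil
	}
	c.items[name] = c.order.PushFront(entry{name, sizeGB})
	c.usedGB += sizeGB

	var evicted []string
	// Evict from the back, but never evict the model that was just added.
	for c.usedGB > c.maxGB && c.order.Len() > 1 {
		oldest := c.order.Back()
		e := oldest.Value.(entry)
		c.order.Remove(oldest)
		delete(c.items, e.name)
		c.usedGB -= e.sizeGB
		evicted = append(evicted, e.name)
	}
	return evicted
}

func main() {
	c := newLRUCache(100)
	c.Add("model-a", 40)
	c.Add("model-b", 40)
	c.Touch("model-a")                // model-b is now the oldest
	fmt.Println(c.Add("model-c", 40)) // prints [model-b]
}
```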

gRPC Server (platform/api/grpc/server/agent_server.go)

  • Detailed logging of all gRPC requests
  • Conversion between protobuf and internal types
  • Error handling with clear messages

Usage

Running the Agent

# With configuration file
./nebula-agent --config /etc/nebula/agent.yaml

# With environment variables
export NEBULA_NODE_ID="my-node-123"
export NEBULA_CACHE_PATH="/var/lib/nebula/models"
export NEBULA_LOG_LEVEL="debug"
./nebula-agent

# With CLI flags
./nebula-agent --log-level debug --log-format json

CLI Options

Option           Description                          Default
--config, -c     Configuration file path              /etc/nebula/agent.yaml
--log-level, -l  Log level: debug, info, warn, error  info
--log-format     Log format: text, json               text
--help, -h       Help                                 -
--version, -v    Version                              -

Configuration

Example configuration (agent.yaml):

agent:
  node_id: "test-node-123"
  cache_path: "/var/lib/nebula/models"
  version: "0.1.0"
  grpc_port: 9091

control_plane:
  endpoint: "localhost:9090"
  tls: false

docker:
  socket: "/var/run/docker.sock"
  network: "nebula-network"

metrics:
  enabled: true
  port: 9100
  path: "/metrics"

logging:
  level: "info"
  format: "json"

cache:
  max_size_gb: 100
  cleanup_policy: "lru"
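
A configuration like this is usually loaded into a typed struct and checked before the agent starts. The sketch below covers only some of the sections and assumes the struct layout and the `Validate` rules, neither of which is taken from the agent's source. The `yaml` tags match the keys shown above, and decoding the file would need a YAML library that is not shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// Config mirrors part of the example agent.yaml. The struct and field
// names are assumptions; only the yaml keys come from the example.
type Config struct {
	Agent struct {
		NodeID    string `yaml:"node_id"`
		CachePath string `yaml:"cache_path"`
		GRPCPort  int    `yaml:"grpc_port"`
	} `yaml:"agent"`
	Metrics struct {
		Enabled bool   `yaml:"enabled"`
		Port    int    `yaml:"port"`
		Path    string `yaml:"path"`
	} `yaml:"metrics"`
	Cache struct {
		MaxSizeGB     int    `yaml:"max_size_gb"`
		CleanupPolicy string `yaml:"cleanup_policy"`
	} `yaml:"cache"`
}

func validPort(p int) bool { return p > 0 && p <= 65535 }

// Validate collects every problem it finds instead of stopping at the first.
func (c *Config) Validate() error {
	var errs []error
	if c.Agent.NodeID == "" {
		errs = append(errs, errors.New("agent.node_id is required"))
	}
	if !validPort(c.Agent.GRPCPort) {
		errs = append(errs, fmt.Errorf("agent.grpc_port %d is out of range", c.Agent.GRPCPort))
	}
	if c.Metrics.Enabled {
		if !validPort(c.Metrics.Port) {
			errs = append(errs, fmt.Errorf("metrics.port %d is out of range", c.Metrics.Port))
		}
		if c.Metrics.Port == c.Agent.GRPCPort {
			errs = append(errs, errors.New("metrics.port must differ from agent.grpc_port"))
		}
	}
	if c.Cache.MaxSizeGB <= 0 {
		errs = append(errs, errors.New("cache.max_size_gb must be positive"))
	}
	return errors.Join(errs...) // nil when errs is empty
}

func main() {
	var cfg Config
	cfg.Agent.NodeID = "test-node-123"
	cfg.Agent.GRPCPort = 9091
	cfg.Metrics.Enabled = true
	cfg.Metrics.Port = 9100
	cfg.Cache.MaxSizeGB = 100
	cfg.Cache.CleanupPolicy = "lru"
	fmt.Println("validation error:", cfg.Validate())
}
```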

Environment Variables

Variable              Description
NEBULA_CONFIG         Configuration path
NEBULA_NODE_ID        Node ID
NEBULA_CACHE_PATH     Model cache path
NEBULA_CONTROL_PLANE  Control plane address
NEBULA_LOG_LEVEL      Log level
NEBULA_LOG_FORMAT     Log format
NEBULA_METRICS_PORT   Metrics port
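
This README does not spell out what happens when a CLI flag and its environment variable are both set. The sketch below assumes the common order: an explicit flag wins, then the environment variable, then the default. It uses the standard `flag` package instead of urfave/cli to stay self-contained, and `logLevel` is a hypothetical helper.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// logLevel resolves the log level from args and the environment.
// Assumed precedence: explicit flag > NEBULA_LOG_LEVEL > default "info".
func logLevel(args []string, getenv func(string) string) (string, error) {
	fs := flag.NewFlagSet("nebula-agent", flag.ContinueOnError)
	level := fs.String("log-level", "info", "log level: debug, info, warn, error")
	if err := fs.Parse(args); err != nil {
		return "", err
	}

	// Visit only walks flags that were actually set on the command line.
	explicit := false
	fs.Visit(func(f *flag.Flag) {
		if f.Name == "log-level" {
			explicit = true
		}
	})
	if explicit {
		return *level, nil
	}
	if v := getenv("NEBULA_LOG_LEVEL"); v != "" {
		return v, nil
	}
	return *level, nil
}

func main() {
	level, err := logLevel(os.Args[1:], os.Getenv)
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("log level:", level)
}
```

Passing `getenv` as a parameter keeps the function testable without touching the process environment.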

gRPC API

The agent provides the following gRPC methods:

Method           Description
GetCapabilities  Get node capabilities (GPU, CPU, memory)
PrepareModel     Prepare model (download and cache)
StartRuntime     Start runtime container with model
StopRuntime      Stop runtime container
GetStats         Get resource usage statistics
GetHealth        Check agent health

Usage with grpcurl

# Get capabilities
grpcurl -plaintext localhost:9091 proto.AgentService/GetCapabilities

# Check health
grpcurl -plaintext localhost:9091 proto.AgentService/GetHealth

# Get statistics
grpcurl -plaintext -d '{"node_id": "test-node"}' localhost:9091 proto.AgentService/GetStats

Provisioner

For automatic agent installation on remote nodes:

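package main

import (
	"context"
	"log"

	"nebula/platform/service/provisioning/ssh"
)

func main() {
	if err := provision(context.Background()); err != nil {
		log.Fatal(err)
	}
}

// provision installs the agent on a remote node over SSH.
func provision(ctx context.Context) error {
	config := &ssh.ProvisionConfig{
		// SSH connection
		Host:    "192.168.1.100",
		Port:    22,
		User:    "ubuntu",
		KeyPath: "~/.ssh/id_rsa",

		// Agent settings
		NodeName:     "gpu-node-1",
		AgentVersion: "0.1.0",
		CachePath:    "/var/lib/nebula/models",
		GRPCPort:     9091,
		MetricsPort:  9100,

		// Options
		Interactive: true,
	}

	provisioner, err := ssh.NewProvisioner(config)
	if err != nil {
		return err
	}
	defer provisioner.Close()

	nodeInfo, err := provisioner.Provision(ctx)
	if err != nil {
		return err
	}
	log.Printf("provisioned node: %+v", nodeInfo)
	return nil
}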

System Requirements

  • Go 1.24+
  • Docker (for running models)
  • NVIDIA GPU + drivers (optional, for GPU workloads)
  • NVIDIA Container Runtime (optional, for GPU workloads)

Architecture

cmd/nebula-agent/   - Entry point
└── main.go         - CLI, logging, gRPC server

compute/
├── daemon/         - Agent implementation
├── cache/          - Model cache management
├── docker/         - Container management
├── gpu/            - GPU monitoring via NVML
├── runtime/        - Runtime implementations
└── storage/        - State persistence

Logging

The agent uses structured logging (slog) throughout. Sample startup output in text format:

level=INFO msg="Starting Nebula Node Agent" version=0.1.0
level=INFO msg="Configuration loaded" node_id=test-node-123 cache_path=/tmp/nebula-test-cache grpc_port=9091
level=INFO msg="Agent initialized successfully"
level=INFO msg="Node capabilities discovered" gpus=0 cpu_cores=10 memory_gb=16
level=INFO msg="Starting metrics server" address=:9100
level=INFO msg="gRPC server starting" port=9091
level=INFO msg="Nebula Agent is ready and accepting requests"

Building

# Build for current platform
go build -o nebula-agent ./cmd/nebula-agent

# Cross-compile for Linux
GOOS=linux GOARCH=amd64 go build -o nebula-agent-linux ./cmd/nebula-agent

# Cross-compile for Linux ARM64
GOOS=linux GOARCH=arm64 go build -o nebula-agent-linux-arm64 ./cmd/nebula-agent