Nebula Agent

Agent daemon for managing models on Nebula GPU nodes.

Features

Core Daemon (`cmd/nebula-agent/main.go`)

- CLI based on urfave/cli with flag and environment variable support
- Structured logging (slog) with JSON and text format support
- Graceful shutdown with 30-second timeout
- gRPC interceptors for logging and panic recovery
- gRPC reflection for debugging
- Prometheus metrics server
- Automatic node capability detection at startup
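
The graceful shutdown with a 30-second timeout listed above can be sketched with the standard library alone. This is a minimal illustration, not the agent's actual code: `shutdownWithTimeout` and `drain` are hypothetical names, and in a real daemon the `stop` callback would typically wrap the server's graceful-stop call.

```go
package main

import (
	"fmt"
	"time"
)

// shutdownWithTimeout runs stop in a goroutine and waits up to timeout for
// it to return. It reports true if stop finished in time and false if the
// deadline expired first (the caller can then force an exit).
func shutdownWithTimeout(stop func(), timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		defer close(done)
		stop()
	}()

	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case <-done:
		return true
	case <-timer.C:
		return false
	}
}

func main() {
	// Hypothetical stand-in for draining in-flight requests.
	drain := func() { time.Sleep(50 * time.Millisecond) }

	if shutdownWithTimeout(drain, 30*time.Second) {
		fmt.Println("shutdown completed gracefully")
	} else {
		fmt.Println("shutdown timed out; forcing exit")
	}
}
```

Running the stop function in its own goroutine keeps the caller in control of the deadline even if draining never returns.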

Business Logic (`compute/daemon/agent.go`)

- Structured logging for all operations
- Improved error handling
- Support for:
  - GPU management via NVML
  - Docker container management
  - Model caching
  - Deployment state monitoring

gRPC Server (`platform/api/grpc/server/agent_server.go`)

- Detailed logging of all gRPC requests
- Conversion between protobuf and internal types
- Error handling with clear messages
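
The request logging and panic recovery described above follow a common wrapper pattern. The sketch below models it without the gRPC library: `Handler` is a simplified stand-in for a unary gRPC handler, and `withLoggingAndRecovery`, `okHandler`, and `panickingHandler` are hypothetical names. A real interceptor would use the signature defined by the gRPC package instead.

```go
package main

import (
	"context"
	"fmt"
	"log/slog"
	"os"
	"time"
)

// Handler is a library-free stand-in for a unary gRPC handler.
type Handler func(ctx context.Context, req any) (any, error)

// withLoggingAndRecovery wraps next so that every call is logged with its
// method name and duration, and a panic inside next is converted into an
// error instead of crashing the process. A nil logger falls back to
// slog.Default().
func withLoggingAndRecovery(logger *slog.Logger, method string, next Handler) Handler {
	if logger == nil {
		logger = slog.Default()
	}
	return func(ctx context.Context, req any) (resp any, err error) {
		start := time.Now()
		defer func() {
			if r := recover(); r != nil {
				err = fmt.Errorf("panic in %s: %v", method, r)
			}
			logger.Info("request handled",
				"method", method,
				"duration", time.Since(start),
				"error", err)
		}()
		return next(ctx, req)
	}
}

// okHandler and panickingHandler are toy handlers for the demonstration.
func okHandler(ctx context.Context, req any) (any, error) { return "ok", nil }

func panickingHandler(ctx context.Context, req any) (any, error) { panic("boom") }

func main() {
	logger := slog.New(slog.NewTextHandler(os.Stderr, nil))
	ctx := context.Background()

	resp, err := withLoggingAndRecovery(logger, "GetHealth", okHandler)(ctx, nil)
	fmt.Println(resp, err)

	resp, err = withLoggingAndRecovery(logger, "StartRuntime", panickingHandler)(ctx, nil)
	fmt.Println(resp, err)
}
```

Using named return values lets the deferred function replace the result after a panic, so the caller always receives an ordinary error.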

Usage

Running the Agent

# With configuration file
./nebula-agent --config /etc/nebula/agent.yaml

# With environment variables
export NEBULA_NODE_ID="my-node-123"
export NEBULA_CACHE_PATH="/var/lib/nebula/models"
export NEBULA_LOG_LEVEL="debug"
./nebula-agent

# With CLI flags
./nebula-agent --log-level debug --log-format json

CLI Options

Option	Description	Default
`--config, -c`	Configuration file path	`/etc/nebula/agent.yaml`
`--log-level, -l`	Log level: debug, info, warn, error	`info`
`--log-format`	Log format: text, json	`text`
`--help, -h`	Show help	-
`--version, -v`	Print version	-

Configuration

Example configuration (agent.yaml):

agent:
  node_id: "test-node-123"
  cache_path: "/var/lib/nebula/models"
  version: "0.1.0"
  grpc_port: 9091

control_plane:
  endpoint: "localhost:9090"
  tls: false

docker:
  socket: "/var/run/docker.sock"
  network: "nebula-network"

metrics:
  enabled: true
  port: 9100
  path: "/metrics"

logging:
  level: "info"
  format: "json"

cache:
  max_size_gb: 100
  cleanup_policy: "lru"
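
As an illustration of how such a file might be checked after loading, here is a minimal Go sketch. The `Config` type, its field names, and the validation rules are assumptions; only the YAML keys and the allowed log level and format values come from this document. The sketch models a subset of the sections and leaves out the parsing step, which would need a third-party YAML library.

```go
package main

import (
	"errors"
	"fmt"
)

// Config mirrors part of the example agent.yaml. The struct and field
// names are hypothetical; only the yaml keys come from the example.
type Config struct {
	Agent struct {
		NodeID    string `yaml:"node_id"`
		CachePath string `yaml:"cache_path"`
		GRPCPort  int    `yaml:"grpc_port"`
	} `yaml:"agent"`
	Metrics struct {
		Enabled bool   `yaml:"enabled"`
		Port    int    `yaml:"port"`
		Path    string `yaml:"path"`
	} `yaml:"metrics"`
	Logging struct {
		Level  string `yaml:"level"`
		Format string `yaml:"format"`
	} `yaml:"logging"`
}

func validPort(p int) bool { return p >= 1 && p <= 65535 }

// Validate collects every problem it finds and returns them joined,
// or nil when the configuration passes all checks.
func (c *Config) Validate() error {
	var errs []error
	if c.Agent.NodeID == "" {
		errs = append(errs, errors.New("agent.node_id must not be empty"))
	}
	if c.Agent.CachePath == "" {
		errs = append(errs, errors.New("agent.cache_path must not be empty"))
	}
	if !validPort(c.Agent.GRPCPort) {
		errs = append(errs, fmt.Errorf("agent.grpc_port %d is out of range", c.Agent.GRPCPort))
	}
	if c.Metrics.Enabled {
		if !validPort(c.Metrics.Port) {
			errs = append(errs, fmt.Errorf("metrics.port %d is out of range", c.Metrics.Port))
		} else if c.Metrics.Port == c.Agent.GRPCPort {
			errs = append(errs, errors.New("metrics.port must differ from agent.grpc_port"))
		}
	}
	switch c.Logging.Level {
	case "debug", "info", "warn", "error":
	default:
		errs = append(errs, fmt.Errorf("logging.level %q is not supported", c.Logging.Level))
	}
	switch c.Logging.Format {
	case "text", "json":
	default:
		errs = append(errs, fmt.Errorf("logging.format %q is not supported", c.Logging.Format))
	}
	return errors.Join(errs...)
}

func main() {
	var cfg Config
	cfg.Agent.NodeID = "test-node-123"
	cfg.Agent.CachePath = "/var/lib/nebula/models"
	cfg.Agent.GRPCPort = 9091
	cfg.Metrics.Enabled = true
	cfg.Metrics.Port = 9100
	cfg.Metrics.Path = "/metrics"
	cfg.Logging.Level = "info"
	cfg.Logging.Format = "json"

	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid configuration:", err)
		return
	}
	fmt.Println("configuration is valid")
}
```

Collecting all errors before returning lets an operator fix a broken file in one pass instead of one failed start per mistake.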

Environment Variables

Variable	Description
`NEBULA_CONFIG`	Configuration file path
`NEBULA_NODE_ID`	Node ID
`NEBULA_CACHE_PATH`	Model cache path
`NEBULA_CONTROL_PLANE`	Control plane address
`NEBULA_LOG_LEVEL`	Log level
`NEBULA_LOG_FORMAT`	Log format
`NEBULA_METRICS_PORT`	Metrics port
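
A minimal sketch of applying these variables on top of already loaded settings follows. It assumes that environment variables take precedence over file values, which this document does not state, and the `Settings` type and `applyEnv` function are hypothetical. `NEBULA_CONFIG` is left out because it selects the file itself.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Settings holds the values that the documented NEBULA_* variables control.
// The type and its field names are hypothetical.
type Settings struct {
	NodeID       string
	CachePath    string
	ControlPlane string
	LogLevel     string
	LogFormat    string
	MetricsPort  int
}

// applyEnv overwrites fields of s with any NEBULA_* variables that lookup
// reports as set. Unset variables leave the existing value untouched.
// lookup has the same signature as os.LookupEnv so tests can inject a map.
func applyEnv(s *Settings, lookup func(string) (string, bool)) error {
	targets := map[string]*string{
		"NEBULA_NODE_ID":       &s.NodeID,
		"NEBULA_CACHE_PATH":    &s.CachePath,
		"NEBULA_CONTROL_PLANE": &s.ControlPlane,
		"NEBULA_LOG_LEVEL":     &s.LogLevel,
		"NEBULA_LOG_FORMAT":    &s.LogFormat,
	}
	for name, dst := range targets {
		if v, ok := lookup(name); ok {
			*dst = v
		}
	}
	if v, ok := lookup("NEBULA_METRICS_PORT"); ok {
		port, err := strconv.Atoi(v)
		if err != nil {
			return fmt.Errorf("NEBULA_METRICS_PORT: %w", err)
		}
		s.MetricsPort = port
	}
	return nil
}

func main() {
	// Start from defaults, then let the environment override them.
	s := Settings{LogLevel: "info", LogFormat: "text", MetricsPort: 9100}
	if err := applyEnv(&s, os.LookupEnv); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", s)
}
```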

gRPC API

The agent provides the following gRPC methods:

Method	Description
GetCapabilities	Get node capabilities (GPU, CPU, memory)
PrepareModel	Prepare model (download and cache)
StartRuntime	Start runtime container with model
StopRuntime	Stop runtime container
GetStats	Get resource usage statistics
GetHealth	Check agent health

Usage with grpcurl

# Get capabilities
grpcurl -plaintext localhost:9091 proto.AgentService/GetCapabilities

# Check health
grpcurl -plaintext localhost:9091 proto.AgentService/GetHealth

# Get statistics
grpcurl -plaintext -d '{"node_id": "test-node"}' localhost:9091 proto.AgentService/GetStats

Provisioner

For automatic agent installation on remote nodes:

import (
    "context"
    "log"

    "nebula/platform/service/provisioning/ssh"
)

config := &ssh.ProvisionConfig{
    // SSH connection
    Host:     "192.168.1.100",
    Port:     22,
    User:     "ubuntu",
    KeyPath:  "~/.ssh/id_rsa",

    // Agent settings
    NodeName:     "gpu-node-1",
    AgentVersion: "0.1.0",
    CachePath:    "/var/lib/nebula/models",
    GRPCPort:     9091,
    MetricsPort:  9100,

    // Options
    Interactive: true,
}

provisioner, err := ssh.NewProvisioner(config)
if err != nil {
    log.Fatal(err)
}
defer provisioner.Close()

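nodeInfo, err := provisioner.Provision(context.Background())
if err != nil {
    log.Fatalf("provisioning failed: %v", err) // note: exits without running the deferred Close
}
log.Printf("provisioned node: %+v", nodeInfo)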

System Requirements

- Go 1.24+
- Docker (for running models)
- NVIDIA GPU + drivers (optional, for GPU workloads)
- NVIDIA Container Runtime (optional, for GPU workloads)

Architecture

cmd/nebula-agent/          - Entry point
  └── main.go              - CLI, logging, gRPC server

compute/
  ├── daemon/              - Agent implementation
  ├── cache/               - Model cache management
  ├── docker/              - Container management
  ├── gpu/                 - GPU monitoring via NVML
  ├── runtime/             - Runtime implementations
  └── storage/             - State persistence

Logging

The agent emits structured logs (slog) for all operations. Example startup output in text format:

level=INFO msg="Starting Nebula Node Agent" version=0.1.0
level=INFO msg="Configuration loaded" node_id=test-node-123 cache_path=/tmp/nebula-test-cache grpc_port=9091
level=INFO msg="Agent initialized successfully"
level=INFO msg="Node capabilities discovered" gpus=0 cpu_cores=10 memory_gb=16
level=INFO msg="Starting metrics server" address=:9100
level=INFO msg="gRPC server starting" port=9091
level=INFO msg="Nebula Agent is ready and accepting requests"
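
A logger that honors the documented level and format options can be built with the standard `log/slog` package. This is a sketch under assumptions: `newLogger` is a hypothetical helper, not the agent's actual setup code. The standard text handler also adds a `time` attribute to each line by default, which the sample above omits.

```go
package main

import (
	"fmt"
	"io"
	"log/slog"
	"os"
)

// newLogger builds a slog.Logger that writes to w in the given format
// ("text" or "json") and drops records below the given level
// ("debug", "info", "warn", "error").
func newLogger(w io.Writer, format, level string) (*slog.Logger, error) {
	var lvl slog.Level
	// UnmarshalText accepts the level names case-insensitively.
	if err := lvl.UnmarshalText([]byte(level)); err != nil {
		return nil, fmt.Errorf("invalid log level %q: %w", level, err)
	}
	opts := &slog.HandlerOptions{Level: lvl}

	switch format {
	case "text":
		return slog.New(slog.NewTextHandler(w, opts)), nil
	case "json":
		return slog.New(slog.NewJSONHandler(w, opts)), nil
	default:
		return nil, fmt.Errorf("invalid log format %q", format)
	}
}

func main() {
	logger, err := newLogger(os.Stderr, "text", "info")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	logger.Info("Starting Nebula Node Agent", "version", "0.1.0")
	logger.Debug("this record is filtered out at info level")
}
```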

Building

# Build for current platform
go build -o nebula-agent ./cmd/nebula-agent

# Cross-compile for Linux
GOOS=linux GOARCH=amd64 go build -o nebula-agent-linux ./cmd/nebula-agent

# Cross-compile for Linux ARM64
GOOS=linux GOARCH=arm64 go build -o nebula-agent-linux-arm64 ./cmd/nebula-agent
