gRPC Connection Types

Nebula supports two methods of connecting to agents via gRPC:

Direct - Direct connection to agent's gRPC server
SSH - Connection via SSH tunnel

Architecture

In version 1, the Nebula agent operates fully autonomously and has no knowledge of the control plane.

Key Principles

Agent is autonomous: Nebula Agent is a standalone gRPC server that:
- Manages local resources (GPU, Docker)
- Runs model runtime containers
- Exports metrics
- Does NOT connect to control plane
- Does NOT initiate outgoing connections
nebulactl is the initiator: Only nebulactl:
- Initiates gRPC connections to agents
- Stores connection settings in local SQLite DB
- Manages deployments via agent gRPC API
Two connection methods:
- Direct - Direct connection to agent's gRPC port
- SSH - Connection via SSH tunnel (for agents behind NAT/firewall)

Interaction Diagram

┌─────────────┐          gRPC           ┌──────────────┐
│             │ ─────────────────────> │              │
│  nebulactl  │   (direct or SSH)      │ Nebula Agent │
│             │ <───────────────────── │   (gRPC)     │
└─────────────┘                         └──────────────┘
     │                                         │
     │                                         │
     v                                         v
┌─────────────┐                         ┌──────────────┐
│   SQLite    │                         │    Docker    │
│ (~/.nebula) │                         │   + GPUs     │
└─────────────┘                         └──────────────┘

Important: Agent does NOT know about nebulactl and contains no control plane references in its configuration.

Connection Type: Direct

Direct connection is used when:

- Agent is directly accessible over the network
- Agent's gRPC port is open and reachable

It is typically used for local agents.

Usage Example

# Add local node with direct connection
nebulactl node add localhost --local --connection-type direct

# Or explicitly specify connection-type for remote host
nebulactl node add 192.168.1.100 --connection-type direct --user admin

Technical Details

Connection: host:grpc_port
Transport: Direct TCP/IP
Requirements: gRPC port must be accessible
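
Before adding a node with a direct connection, it can help to confirm that the gRPC port is reachable. The following is a minimal sketch that uses only the Go standard library; agentAddr and checkReachable are hypothetical helper names, not part of nebulactl, and the host and port are the example values used elsewhere on this page.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"time"
)

// agentAddr builds the host:grpc_port dial target.
// net.JoinHostPort also adds brackets around IPv6 hosts.
func agentAddr(host string, grpcPort int) string {
	return net.JoinHostPort(host, strconv.Itoa(grpcPort))
}

// checkReachable reports whether a TCP connection to addr can be
// opened within timeout. It does not speak gRPC; it only checks the port.
func checkReachable(addr string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return fmt.Errorf("agent gRPC port not reachable at %s: %w", addr, err)
	}
	return conn.Close()
}

func main() {
	// Example host and port taken from this page (assumed values).
	addr := agentAddr("192.168.1.100", 9091)
	if err := checkReachable(addr, 2*time.Second); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("reachable:", addr)
}
```

A successful TCP check only shows that the port is open; nebulactl node test remains the way to verify that the agent itself responds.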

Connection Type: SSH

SSH tunneling is used when:

- Agent is behind NAT/firewall
- Direct access to the gRPC port is not possible
- Only SSH access to the server is available

Usage Example

# Add remote node with SSH tunneling (default for remote nodes)
nebulactl node add user@remote.server.com \
  --user admin \
  --port 22 \
  --key-path ~/.ssh/id_rsa

# Explicitly specify connection-type
nebulactl node add 10.0.0.50 \
  --connection-type ssh \
  --user root \
  --key-path ~/.ssh/nebula_key

Technical Details

SSH client establishes connection to remote server
Local listener is created on 127.0.0.1:<random-port>
SSH tunnel forwards traffic to remote 127.0.0.1:grpc_port
gRPC client connects to local listener

nebulactl -> SSH Client -> SSH Tunnel -> Remote Agent (127.0.0.1:9091)
              |                              ^
              v                              |
         Local Listener (127.0.0.1:random) --+
              ^
              |
         gRPC Client
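
The local-listener part of the flow above can be sketched as follows. This is an illustration under stated assumptions, not nebulactl's actual implementation: dialFunc, startForwarder, pipe, and roundTrip are hypothetical names, and the SSH hop is abstracted behind dialFunc so the example runs with the standard library alone. In a real tunnel the dial function would be the Dial method of an established SSH client (for example from golang.org/x/crypto/ssh); here net.Dial and a local echo server stand in for the SSH client and the agent.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// dialFunc opens a connection to the remote end of the tunnel.
// Assumption: with an SSH client this would be its Dial method.
type dialFunc func(network, addr string) (net.Conn, error)

// startForwarder listens on a random loopback port and pipes every
// accepted connection to remoteAddr through dial.
func startForwarder(dial dialFunc, remoteAddr string) (net.Listener, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0") // port 0: the OS picks a free port
	if err != nil {
		return nil, err
	}
	go func() {
		for {
			local, err := ln.Accept()
			if err != nil {
				return // listener was closed
			}
			go pipe(dial, local, remoteAddr)
		}
	}()
	return ln, nil
}

// pipe copies bytes in both directions until either side closes.
func pipe(dial dialFunc, local net.Conn, remoteAddr string) {
	defer local.Close()
	remote, err := dial("tcp", remoteAddr)
	if err != nil {
		return
	}
	defer remote.Close()
	done := make(chan struct{}, 2)
	go func() { io.Copy(remote, local); done <- struct{}{} }()
	go func() { io.Copy(local, remote); done <- struct{}{} }()
	<-done
}

// roundTrip starts an echo server in place of the agent, forwards to it,
// and sends msg through the local listener, returning the reply.
func roundTrip(msg string) (string, error) {
	echo, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer echo.Close()
	go func() {
		for {
			c, err := echo.Accept()
			if err != nil {
				return
			}
			go func() { defer c.Close(); io.Copy(c, c) }()
		}
	}()

	ln, err := startForwarder(net.Dial, echo.Addr().String())
	if err != nil {
		return "", err
	}
	defer ln.Close()

	// The gRPC client would connect here, to the local listener.
	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := conn.Write([]byte(msg)); err != nil {
		return "", err
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(conn, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	reply, err := roundTrip("ping")
	fmt.Println(reply, err)
}
```

Binding the listener to 127.0.0.1 keeps the forwarded port unreachable from other hosts, which matches the loopback addresses shown in the diagram.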

Database

Connection settings are stored in ~/.nebula/nebula.db:

CREATE TABLE nodes (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    host TEXT NOT NULL,
    grpc_port INTEGER NOT NULL,
    connection_type TEXT NOT NULL DEFAULT 'direct', -- 'direct' or 'ssh'
    ssh_user TEXT,
    ssh_port INTEGER,
    ssh_key_path TEXT,
    ...
);
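
For code that reads rows from this table, a Go struct mirroring the listed columns might look like the sketch below. The Node type and its Validate method are hypothetical and not taken from the Nebula source; in particular, the rule that ssh_user is required for SSH nodes is an assumption made for illustration, and the columns elided by "..." in the schema are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// Node mirrors the columns listed in the schema above.
// Nullable SSH columns are modelled as zero values for brevity.
type Node struct {
	ID             string
	Name           string
	Host           string
	GRPCPort       int
	ConnectionType string // "direct" or "ssh"
	SSHUser        string
	SSHPort        int
	SSHKeyPath     string
}

// Validate checks the fields needed to open a connection.
func (n Node) Validate() error {
	if n.Host == "" {
		return errors.New("host is required")
	}
	if n.GRPCPort <= 0 || n.GRPCPort > 65535 {
		return fmt.Errorf("invalid grpc_port %d", n.GRPCPort)
	}
	switch n.ConnectionType {
	case "direct":
		return nil
	case "ssh":
		// Assumption: an SSH node needs at least a user name.
		if n.SSHUser == "" {
			return errors.New("ssh_user is required for ssh connections")
		}
		return nil
	default:
		return fmt.Errorf("unknown connection_type %q", n.ConnectionType)
	}
}

func main() {
	n := Node{ID: "1", Name: "gpu-server-1", Host: "10.0.0.50",
		GRPCPort: 9091, ConnectionType: "ssh", SSHUser: "root", SSHPort: 22}
	fmt.Println(n.Validate())
}
```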

Code Usage

Creating Client

import (
    "context"
    "nebula/platform/client"
    "nebula/platform/storage"
)

// Get node from DB
node, err := store.GetNode(ctx, nodeID)
if err != nil {
    return err
}

// Create client (automatically determines connection type)
agentClient, err := client.NewAgentClient(ctx, node)
if err != nil {
    return err
}
defer agentClient.Close()

// Use client
health, err := agentClient.GetHealth(ctx)
if err != nil {
    return err
}

Supported Methods

GetHealth() - Agent health status
GetCapabilities() - Node resource information
GetStats() - Current metrics
PrepareModel() - Model preparation
StartRuntime() - Start runtime container
StopRuntime() - Stop runtime

Testing Connection

# Test connection to agent
nebulactl node test <node-id>

The command:

Establishes connection (direct or SSH)
Checks health endpoint
Gets capabilities
Outputs node information

Example Output

Testing connection to node: my-gpu-server
Connection Type: ssh
SSH: admin@192.168.1.100:22

Connecting to agent...
✅ Connection established

Testing health endpoint...
✅ Health Status: healthy
   Uptime: 3600 seconds

Testing capabilities endpoint...
✅ Capabilities retrieved
   CPU Cores: 32
   Memory: 128.00 GB
   OS: linux x86_64
   Driver Version: 535.129.03
   CUDA Version: 12.2
   GPUs: 4
     GPU 0: NVIDIA A100-SXM4-40GB (40.00 GB)
     GPU 1: NVIDIA A100-SXM4-40GB (40.00 GB)
     GPU 2: NVIDIA A100-SXM4-40GB (40.00 GB)
     GPU 3: NVIDIA A100-SXM4-40GB (40.00 GB)
   Supported Runtimes: [vllm tgi]

✅ All tests passed successfully!

CLI Commands

Add Node

# Direct connection (local)
nebulactl node add localhost --local

# Direct connection (remote with open gRPC port)
nebulactl node add 192.168.1.100 --connection-type direct

# SSH tunnel (default for remote nodes)
nebulactl node add 10.0.0.50 --user ubuntu --key-path ~/.ssh/key.pem

# SSH tunnel (explicit)
nebulactl node add remote.server.com \
  --connection-type ssh \
  --user admin \
  --port 2222 \
  --key-path ~/.ssh/custom_key

View Nodes

# List all nodes
nebulactl node list

# Node status
nebulactl node status <node-id>

# Test connection
nebulactl node test <node-id>

Security

SSH Authentication

Public key authentication is supported (recommended)
Password authentication is supported (not recommended)
TODO: Add proper host key verification

gRPC Security

Current version: insecure credentials
TODO: Add TLS support

Current Limitations

Host Key Verification: Uses ssh.InsecureIgnoreHostKey(), so any server host key is accepted
gRPC without TLS: The gRPC connection itself is not encrypted
SSH Tunnel Performance: Each connection creates a new tunnel

Agent Configuration

Example agent configuration (/etc/nebula/agent.yaml):

agent:
  node_id: "550e8400-e29b-41d4-a716-446655440000"
  node_name: "gpu-server-1"
  cache_path: "/var/lib/nebula/models"
  version: "0.1.0"
  grpc_port: 9091

docker:
  socket: "/var/run/docker.sock"
  network: "nebula-network"

metrics:
  enabled: true
  port: 9100
  path: "/metrics"

Note: Agent config does NOT contain:

control_plane section
Endpoints for outgoing connections
Information about nebulactl

The agent simply starts a gRPC server and waits for incoming connections.
