
gRPC Connection Types

Nebula supports two methods of connecting to agents via gRPC:

  1. Direct - Direct connection to agent's gRPC server
  2. SSH - Connection via SSH tunnel

Architecture

In version 1, the Nebula agent operates fully autonomously and has no knowledge of a control plane.

Key Principles

  1. Agent is autonomous: Nebula Agent is a standalone gRPC server that:

    • Manages local resources (GPU, Docker)
    • Runs model runtime containers
    • Exports metrics
    • Does NOT connect to control plane
    • Does NOT initiate outgoing connections
  2. nebulactl is the initiator: Only nebulactl:

    • Initiates gRPC connections to agents
    • Stores connection settings in local SQLite DB
    • Manages deployments via agent gRPC API
  3. Two connection methods:

    • Direct - Direct connection to agent's gRPC port
    • SSH - Connection via SSH tunnel (for agents behind NAT/firewall)

Interaction Diagram

┌─────────────┐          gRPC           ┌──────────────┐
│             │ ──────────────────────> │              │
│  nebulactl  │     (direct or SSH)     │ Nebula Agent │
│             │ <────────────────────── │    (gRPC)    │
└─────────────┘                         └──────────────┘
       │                                       │
       │                                       │
       v                                       v
┌─────────────┐                         ┌──────────────┐
│   SQLite    │                         │    Docker    │
│ (~/.nebula) │                         │    + GPUs    │
└─────────────┘                         └──────────────┘

Important: The agent does NOT know about nebulactl, and its configuration contains no control plane references.

Connection Type: Direct

Direct connection is used when:

  • Agent is directly accessible over network
  • Agent's gRPC port is open and available
  • Typically used for local agents

Usage Example

# Add local node with direct connection
nebulactl node add localhost --local --connection-type direct

# Or explicitly specify connection-type for remote host
nebulactl node add 192.168.1.100 --connection-type direct --user admin

Technical Details

  • Connection: host:grpc_port
  • Transport: Direct TCP/IP
  • Requirements: gRPC port must be accessible
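
As a minimal sketch of what "host:grpc_port over direct TCP" means in practice, the Go snippet below builds the dial target and checks that the port accepts TCP connections. The function names `directTarget` and `portReachable` are hypothetical and not part of Nebula; port 9091 is taken from the agent configuration example in this document. The check only confirms TCP reachability, not that a gRPC server is answering.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"time"
)

// directTarget builds the host:grpc_port dial target for a direct connection.
// net.JoinHostPort also brackets IPv6 literals, e.g. "[::1]:9091".
func directTarget(host string, grpcPort int) string {
	return net.JoinHostPort(host, strconv.Itoa(grpcPort))
}

// portReachable reports whether a TCP connection to target can be opened
// within timeout. It does not verify that a gRPC server is listening.
func portReachable(target string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", target, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	target := directTarget("192.168.1.100", 9091)
	fmt.Println(target, "reachable:", portReachable(target, 2*time.Second))
}
```

If the port is not reachable from the machine running nebulactl, the SSH connection type is the alternative.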

Connection Type: SSH

SSH tunneling is used when:

  • Agent is behind NAT/firewall
  • Direct access to gRPC port is not possible
  • Only SSH access to server is available

Usage Example

# Add a remote node over an SSH tunnel (the default for remote hosts)
nebulactl node add user@remote.server.com \
  --user admin \
  --port 22 \
  --key-path ~/.ssh/id_rsa

# Explicitly specify connection-type
nebulactl node add 10.0.0.50 \
  --connection-type ssh \
  --user root \
  --key-path ~/.ssh/nebula_key

Technical Details

  1. SSH client establishes connection to remote server
  2. Local listener is created on 127.0.0.1:<random-port>
  3. SSH tunnel forwards traffic to remote 127.0.0.1:grpc_port
  4. gRPC client connects to local listener

nebulactl -> SSH Client -> SSH Tunnel -> Remote Agent (127.0.0.1:9091)
    |                               ^
    v                               |
Local Listener (127.0.0.1:random) --+
    ^
    |
gRPC Client
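
The listener-and-forward part of these steps can be sketched with the Go standard library alone. This is an illustration under stated assumptions, not Nebula's implementation: the SSH hop is replaced by a plain TCP dial (`tcpDialer`), and a local echo server stands in for the agent's gRPC port. All names (`dialFunc`, `startForwarder`, `forward`, `startEcho`, `roundTrip`) are hypothetical. In a real tunnel, the dial function would open the connection through the established SSH client instead.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// dialFunc opens one connection to the agent's port. Assumption: a real
// tunnel would dial through the SSH connection; here it is plain TCP.
type dialFunc func() (net.Conn, error)

// tcpDialer returns a dialFunc that connects directly to addr.
func tcpDialer(addr string) dialFunc {
	return func() (net.Conn, error) { return net.Dial("tcp", addr) }
}

// startForwarder listens on 127.0.0.1 with an OS-assigned port (step 2) and
// pipes each accepted connection to one obtained from dial (step 3). The
// returned address is what the gRPC client would connect to (step 4).
func startForwarder(dial dialFunc) (string, func(), error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	go func() {
		for {
			local, err := ln.Accept()
			if err != nil {
				return // listener was closed
			}
			go forward(local, dial)
		}
	}()
	return ln.Addr().String(), func() { ln.Close() }, nil
}

// forward copies bytes in both directions until either side closes.
func forward(local net.Conn, dial dialFunc) {
	defer local.Close()
	remote, err := dial()
	if err != nil {
		return
	}
	defer remote.Close()
	go func() {
		io.Copy(remote, local)
		remote.Close() // unblock the copy below once the client is done
	}()
	io.Copy(local, remote)
}

// startEcho runs a TCP echo server standing in for the remote agent port.
func startEcho() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			go func() { defer c.Close(); io.Copy(c, c) }()
		}
	}()
	return ln.Addr().String(), nil
}

// roundTrip sends msg to addr and returns the bytes that come back.
func roundTrip(addr, msg string) (string, error) {
	c, err := net.Dial("tcp", addr)
	if err != nil {
		return "", err
	}
	defer c.Close()
	if _, err := c.Write([]byte(msg)); err != nil {
		return "", err
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(c, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	agentAddr, err := startEcho()
	if err != nil {
		panic(err)
	}
	localAddr, stop, err := startForwarder(tcpDialer(agentAddr))
	if err != nil {
		panic(err)
	}
	defer stop()
	reply, err := roundTrip(localAddr, "ping")
	fmt.Println("listener:", localAddr, "reply:", reply, "err:", err)
}
```

Binding to port 0 lets the operating system pick a free port, which matches the "random port" in step 2 and avoids collisions when several tunnels are open at once.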

Database

Connection settings are stored in ~/.nebula/nebula.db:

CREATE TABLE nodes (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    host TEXT NOT NULL,
    grpc_port INTEGER NOT NULL,
    connection_type TEXT NOT NULL DEFAULT 'direct', -- 'direct' or 'ssh'
    ssh_user TEXT,
    ssh_port INTEGER,
    ssh_key_path TEXT,
    ...
);
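
To show how a row of this table might look on the Go side, here is a hedged sketch of a struct covering the columns listed above, plus a basic sanity check. The `Node` type, its field names, and the `Validate` rules are assumptions made for illustration (in particular, requiring `ssh_user` for SSH nodes is a guess, not a documented constraint); the actual types in `nebula/platform/storage` may differ. The elided columns are not modeled.

```go
package main

import (
	"errors"
	"fmt"
)

// Node mirrors the columns shown in the nodes table. The SSH fields are
// pointers because those columns are nullable. Hypothetical type.
type Node struct {
	ID             string
	Name           string
	Host           string
	GRPCPort       int
	ConnectionType string // "direct" or "ssh"
	SSHUser        *string
	SSHPort        *int
	SSHKeyPath     *string
}

// Validate applies illustrative checks before a connection is attempted.
func (n Node) Validate() error {
	if n.Host == "" {
		return errors.New("host is required")
	}
	if n.GRPCPort < 1 || n.GRPCPort > 65535 {
		return fmt.Errorf("grpc_port %d out of range", n.GRPCPort)
	}
	switch n.ConnectionType {
	case "direct":
		return nil
	case "ssh":
		// Assumption: an SSH node needs at least a user name.
		if n.SSHUser == nil || *n.SSHUser == "" {
			return errors.New("ssh_user is required for ssh connections")
		}
		return nil
	default:
		return fmt.Errorf("unknown connection_type %q", n.ConnectionType)
	}
}

func main() {
	user := "admin"
	node := Node{
		ID:             "example-id",
		Name:           "gpu-server-1",
		Host:           "10.0.0.50",
		GRPCPort:       9091,
		ConnectionType: "ssh",
		SSHUser:        &user,
	}
	fmt.Println("validation error:", node.Validate())
}
```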

Code Usage

Creating Client

import (
	"context"

	"nebula/platform/client"
	"nebula/platform/storage"
)

// store is an already opened storage handle; ctx is the caller's context.

// Get node from DB
node, err := store.GetNode(ctx, nodeID)
if err != nil {
	return err
}

// Create client (automatically determines connection type)
agentClient, err := client.NewAgentClient(ctx, node)
if err != nil {
	return err
}
defer agentClient.Close()

// Use client
health, err := agentClient.GetHealth(ctx)
if err != nil {
	return err
}
// health now holds the agent's health status

Supported Methods

  • GetHealth() - Agent health status
  • GetCapabilities() - Node resource information
  • GetStats() - Current metrics
  • PrepareModel() - Model preparation
  • StartRuntime() - Start runtime container
  • StopRuntime() - Stop runtime

Testing Connection

# Test connection to agent
nebulactl node test <node-id>

The command:

  1. Establishes connection (direct or SSH)
  2. Checks health endpoint
  3. Gets capabilities
  4. Outputs node information
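
The ordering of these steps can be sketched as a small fail-fast routine. This is an illustration only: `agentChecker` is a hypothetical, simplified stand-in for the agent client (the real `GetHealth` and `GetCapabilities` return types are not shown in this document, so plain strings are used here), and `fakeAgent` replaces a live connection.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// agentChecker is a hypothetical subset of the agent client, reduced to
// string results so the sketch stays self-contained.
type agentChecker interface {
	GetHealth(ctx context.Context) (string, error)
	GetCapabilities(ctx context.Context) (string, error)
}

// testNode runs the checks in order and stops at the first failure,
// returning the report lines collected so far.
func testNode(ctx context.Context, c agentChecker) ([]string, error) {
	var report []string
	health, err := c.GetHealth(ctx)
	if err != nil {
		return report, fmt.Errorf("health check: %w", err)
	}
	report = append(report, "health: "+health)

	caps, err := c.GetCapabilities(ctx)
	if err != nil {
		return report, fmt.Errorf("capabilities: %w", err)
	}
	report = append(report, "capabilities: "+caps)
	return report, nil
}

// fakeAgent simulates an agent without any network access.
type fakeAgent struct{ healthErr error }

func (f fakeAgent) GetHealth(context.Context) (string, error) {
	if f.healthErr != nil {
		return "", f.healthErr
	}
	return "healthy", nil
}

func (f fakeAgent) GetCapabilities(context.Context) (string, error) {
	return "4 GPUs", nil
}

// runFake runs testNode against a healthy or an unreachable fake agent.
func runFake(healthy bool) ([]string, error) {
	agent := fakeAgent{}
	if !healthy {
		agent.healthErr = errors.New("agent unreachable")
	}
	return testNode(context.Background(), agent)
}

func main() {
	report, err := runFake(true)
	fmt.Println(report, err)
}
```

Stopping at the first failed step keeps the output focused on the actual problem: if the health check fails, the capabilities call is not attempted.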

Example Output

Testing connection to node: my-gpu-server
Connection Type: ssh
SSH: admin@192.168.1.100:22

Connecting to agent...
✅ Connection established

Testing health endpoint...
✅ Health Status: healthy
Uptime: 3600 seconds

Testing capabilities endpoint...
✅ Capabilities retrieved
CPU Cores: 32
Memory: 128.00 GB
OS: linux x86_64
Driver Version: 535.129.03
CUDA Version: 12.2
GPUs: 4
GPU 0: NVIDIA A100-SXM4-40GB (40.00 GB)
GPU 1: NVIDIA A100-SXM4-40GB (40.00 GB)
GPU 2: NVIDIA A100-SXM4-40GB (40.00 GB)
GPU 3: NVIDIA A100-SXM4-40GB (40.00 GB)
Supported Runtimes: [vllm tgi]

✅ All tests passed successfully!

CLI Commands

Add Node

# Direct connection (local)
nebulactl node add localhost --local

# Direct connection (remote with open gRPC port)
nebulactl node add 192.168.1.100 --connection-type direct

# SSH tunnel (default for remote nodes)
nebulactl node add 10.0.0.50 --user ubuntu --key-path ~/.ssh/key.pem

# SSH tunnel (explicit)
nebulactl node add remote.server.com \
  --connection-type ssh \
  --user admin \
  --port 2222 \
  --key-path ~/.ssh/custom_key

View Nodes

# List all nodes
nebulactl node list

# Node status
nebulactl node status <node-id>

# Test connection
nebulactl node test <node-id>

Security

SSH Authentication

  • Public key authentication is used (recommended)
  • Password authentication is supported (not recommended)
  • TODO: Add proper host key verification

gRPC Security

  • Current version: insecure credentials
  • TODO: Add TLS support

Current Limitations

  1. Host Key Verification: Uses ssh.InsecureIgnoreHostKey(), which accepts any host key without checking it
  2. gRPC without TLS: The gRPC channel itself is not encrypted
  3. SSH Tunnel Performance: Each connection creates a new tunnel
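
One possible direction for the host key limitation is to pin the server's key fingerprint and reject anything else. The sketch below shows only the comparison logic, using the standard library, and is not how Nebula currently works. `fingerprintSHA256` and `checkHostKey` are hypothetical names. With golang.org/x/crypto/ssh, a check like this would run inside the `HostKeyCallback` of the client config against the marshaled public key; that package also provides `ssh.FixedHostKey` and a `knownhosts` subpackage for the same purpose.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"fmt"
)

// fingerprintSHA256 formats a host key blob in the OpenSSH style:
// "SHA256:" followed by the unpadded base64 of the SHA-256 digest.
func fingerprintSHA256(keyBlob []byte) string {
	sum := sha256.Sum256(keyBlob)
	return "SHA256:" + base64.RawStdEncoding.EncodeToString(sum[:])
}

// checkHostKey returns an error unless keyBlob matches the pinned fingerprint.
func checkHostKey(keyBlob []byte, pinned string) error {
	got := fingerprintSHA256(keyBlob)
	if subtle.ConstantTimeCompare([]byte(got), []byte(pinned)) != 1 {
		return fmt.Errorf("host key mismatch: got %s", got)
	}
	return nil
}

func main() {
	// Placeholder bytes standing in for a marshaled public host key.
	blob := []byte("example-host-key-blob")
	pinned := fingerprintSHA256(blob)

	fmt.Println("same key:", checkHostKey(blob, pinned))
	fmt.Println("other key:", checkHostKey([]byte("different-key"), pinned))
}
```

The pinned fingerprint would have to be stored alongside the node's other connection settings; where exactly is a design decision this sketch leaves open.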

Agent Configuration

Example agent configuration (/etc/nebula/agent.yaml):

agent:
  node_id: "550e8400-e29b-41d4-a716-446655440000"
  node_name: "gpu-server-1"
  cache_path: "/var/lib/nebula/models"
  version: "0.1.0"
  grpc_port: 9091

docker:
  socket: "/var/run/docker.sock"
  network: "nebula-network"

metrics:
  enabled: true
  port: 9100
  path: "/metrics"

Note: Agent config does NOT contain:

  • control_plane section
  • Endpoints for outgoing connections
  • Information about nebulactl

The agent simply starts its gRPC server and waits for incoming connections.