Architecture Overview

Nebula is an open-source, cloud-agnostic platform for deploying and scaling Large Language Models (LLMs) across local, private, and cloud GPU infrastructure.

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         USER INTERFACES                         │
│        CLI (nebulactl) | Web UI (React) | TUI | REST API        │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                          CONTROL PLANE                          │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────────┐     │
│ │ Scheduler   │ │ Provisioner  │ │ REST/gRPC API Gateway  │     │
│ ├─────────────┤ ├──────────────┤ ├────────────────────────┤     │
│ │ SSH Setup   │ │ Node Manager │ │ OpenAI-compatible      │     │
│ └─────────────┘ └──────────────┘ └────────────────────────┘     │
│                                                                 │
│                          SQLite Store                           │
│               (Nodes, Deployments, Stats, Events)               │
└────────────────────┬────────────────────────────────────────────┘
                     │ gRPC (Agent Communication)
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                    COMPUTE PLANE (GPU Nodes)                    │
│ ┌────────────────────────────────────────────────────────┐      │
│ │               Node Agent (nebula-agent)                │      │
│ ├────────────────────────────────────────────────────────┤      │
│ │ • gRPC Server (Agent Service)                          │      │
│ │ • Docker Container Manager                             │      │
│ │ • GPU Monitoring (NVML integration)                    │      │
│ │ • Model Cache Manager                                  │      │
│ │ • State Management (SQLite agent.db)                   │      │
│ │ • Prometheus Metrics Export                            │      │
│ └───────────────┬──────────────┬────────────────────────┘      │
│                 │              │                                │
│                 ▼              ▼                                │
│ ┌────────────────────┐  ┌───────────────────┐                   │
│ │   Model Runtimes   │  │ Docker Containers │                   │
│ ├────────────────────┤  ├───────────────────┤                   │
│ │ • vLLM             │  │ • Port Binding    │                   │
│ │ • TGI              │  │ • GPU Passthrough │                   │
│ │ • Ollama           │  │                   │                   │
│ └────────────────────┘  └───────────────────┘                   │
└─────────────────────────────────────────────────────────────────┘

Two-Plane Architecture

Control Plane (platform/)

The orchestration logic manages cluster state, schedules deployments, and provisions nodes.

Responsibilities:

  • Receive and validate user commands
  • Provision and manage compute nodes
  • Schedule deployments to appropriate nodes
  • Maintain cluster state (nodes, deployments)
  • Monitor node health via heartbeats
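
Heartbeat-based health monitoring amounts to a staleness check over the last heartbeat each node reported. The sketch below illustrates the idea; the Node struct, field names, and timeout value are hypothetical, not Nebula's actual types.

// Minimal sketch of heartbeat-based node health checking.
// Node and its fields are illustrative, not Nebula's real domain model.
package main

import (
    "fmt"
    "time"
)

type Node struct {
    ID            string
    LastHeartbeat time.Time
    Healthy       bool
}

// markStaleNodes flags any node whose last heartbeat is older than timeout.
func markStaleNodes(nodes []*Node, timeout time.Duration, now time.Time) {
    for _, n := range nodes {
        n.Healthy = now.Sub(n.LastHeartbeat) <= timeout
    }
}

func main() {
    nodes := []*Node{
        {ID: "gpu-1", LastHeartbeat: time.Now().Add(-5 * time.Second)},
        {ID: "gpu-2", LastHeartbeat: time.Now().Add(-2 * time.Minute)},
    }
    markStaleNodes(nodes, 30*time.Second, time.Now())
    for _, n := range nodes {
        fmt.Printf("%s healthy=%v\n", n.ID, n.Healthy)
    }
}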

Compute Plane (compute/)

The nebula-agent daemon runs on every compute node and manages all local operations autonomously.

Responsibilities:

  • Download and cache model files
  • Pull Docker images and manage containers
  • Monitor GPU resources (NVML integration)
  • Execute deployment lifecycle (start, stop, health checks)
  • Persist local state for crash recovery (sketched after this list)
  • Expose Prometheus metrics
  • Report heartbeats to the orchestrator
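
The crash-recovery item above boils down to replaying persisted deployment records from agent.db at startup and reconciling them against what is actually running. A minimal sketch, assuming a simple deployments table; the schema, driver choice, and reconcile step are illustrative only.

package main

import (
    "database/sql"
    "log"

    _ "github.com/mattn/go-sqlite3" // assumed SQLite driver; the agent's actual driver may differ
)

type Deployment struct {
    ID, Image, Status string
}

// loadDeployments reads persisted deployment records so the agent can
// reconcile them against actually-running containers after a restart.
func loadDeployments(db *sql.DB) ([]Deployment, error) {
    rows, err := db.Query(`SELECT id, image, status FROM deployments`) // hypothetical schema
    if err != nil {
        return nil, err
    }
    defer rows.Close()
    var out []Deployment
    for rows.Next() {
        var d Deployment
        if err := rows.Scan(&d.ID, &d.Image, &d.Status); err != nil {
            return nil, err
        }
        out = append(out, d)
    }
    return out, rows.Err()
}

func main() {
    db, err := sql.Open("sqlite3", "agent.db")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
    deps, err := loadDeployments(db)
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("recovered %d deployment records", len(deps))
    // A real agent would now compare these records against running
    // containers and restart anything that should be up but is not.
}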

Core Components

1. nebulactl CLI

Location: cmd/nebulactl/

The primary user interface for all Nebula operations.

Commands:

  • nebulactl deploy - Deploy models via CLI flags or YAML spec
  • nebulactl deployment - Manage deployments (list, get, logs, restart, delete)
  • nebulactl node - Manage compute nodes (add, list, status, remove)
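
Since the CLI is built on urfave/cli v3 (see Technology Stack below), the command tree might be wired roughly as follows. Command names come from the list above; the flags and actions are hypothetical placeholders, not nebulactl's real ones.

package main

import (
    "context"
    "fmt"
    "log"
    "os"

    "github.com/urfave/cli/v3"
)

func main() {
    cmd := &cli.Command{
        Name:  "nebulactl",
        Usage: "deploy and manage LLMs on Nebula",
        Commands: []*cli.Command{
            {
                Name:  "deploy",
                Usage: "deploy a model via flags or a YAML spec",
                Flags: []cli.Flag{
                    &cli.StringFlag{Name: "model"}, // hypothetical flag
                    &cli.StringFlag{Name: "file"},  // hypothetical flag
                },
                Action: func(ctx context.Context, c *cli.Command) error {
                    fmt.Printf("deploying %s\n", c.String("model"))
                    return nil
                },
            },
            {Name: "deployment", Usage: "list, get, logs, restart, delete"},
            {Name: "node", Usage: "add, list, status, remove"},
        },
    }
    if err := cmd.Run(context.Background(), os.Args); err != nil {
        log.Fatal(err)
    }
}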

2. nebula-agent

Location: cmd/nebula-agent/, compute/

A stateful daemon running on each compute node.

gRPC API Endpoints:

  • GetCapabilities - Report node capabilities (GPUs, memory, etc.)
  • PrepareModel - Download and cache model files
  • StartRuntime - Launch model container
  • StopRuntime - Stop running deployment
  • GetStats - Real-time GPU and system metrics
  • GetHealth - Health check for deployment
  • ListDeployments - List all deployments on node
  • GetLogs - Stream container logs
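
From the control plane's side, each endpoint is an ordinary gRPC call against the agent. Below is a minimal sketch of invoking GetStats; the agentpb import path, client constructor, and request message are hypothetical stand-ins for the generated stubs, which this document does not show.

package main

import (
    "context"
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"

    agentpb "example.com/nebula/agentpb" // hypothetical generated stubs
)

func main() {
    // Dial the agent's gRPC server (port 9091 per the single-node example below).
    conn, err := grpc.Dial("localhost:9091",
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    client := agentpb.NewAgentServiceClient(conn) // hypothetical constructor
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    stats, err := client.GetStats(ctx, &agentpb.GetStatsRequest{}) // hypothetical message
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("node stats: %+v", stats)
}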

3. Scheduler

Location: platform/service/deployment/scheduler.go

Selects the optimal node for each deployment based on resource requirements.

Scheduling Logic:

  • For GPU deployments: select the node with the most available GPUs
  • For CPU deployments: select the node with the most available CPU cores
  • Validate GPU memory and type constraints
  • Enforce homogeneous GPU requirements for multi-GPU deployments
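
A condensed sketch of that selection policy follows. NodeInfo and the constraint fields are illustrative; the real scheduler in scheduler.go will differ in detail.

package main

import "fmt"

// NodeInfo is an illustrative stand-in for the scheduler's view of a node.
type NodeInfo struct {
    Name     string
    FreeGPUs int
    GPUType  string // homogeneous per node in this sketch
    GPUMemGB int
}

// pickGPUNode returns the node with the most free GPUs that satisfies the
// deployment's GPU-memory and GPU-type constraints, or nil if none qualifies.
func pickGPUNode(nodes []NodeInfo, needGPUs, needMemGB int, gpuType string) *NodeInfo {
    var best *NodeInfo
    for i := range nodes {
        n := &nodes[i]
        if n.FreeGPUs < needGPUs || n.GPUMemGB < needMemGB {
            continue
        }
        // Multi-GPU deployments require one GPU type across all devices.
        if gpuType != "" && n.GPUType != gpuType {
            continue
        }
        if best == nil || n.FreeGPUs > best.FreeGPUs {
            best = n
        }
    }
    return best
}

func main() {
    nodes := []NodeInfo{
        {Name: "a", FreeGPUs: 2, GPUType: "A100", GPUMemGB: 80},
        {Name: "b", FreeGPUs: 4, GPUType: "A100", GPUMemGB: 80},
    }
    if n := pickGPUNode(nodes, 2, 40, "A100"); n != nil {
        fmt.Println("scheduled on", n.Name)
    }
}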

4. GPU Monitor

Location: compute/gpu/gpu.go

Integrates with NVIDIA NVML to discover and monitor GPUs.

Capabilities:

  • Discover all NVIDIA GPUs on host
  • Track GPU UUID, model name, memory capacity
  • Monitor real-time metrics (utilization, memory, temperature, power)
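
A minimal go-nvml discovery loop in the spirit of gpu.go might look like this (error handling trimmed for brevity; a sketch, not the actual implementation):

package main

import (
    "fmt"
    "log"

    "github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
    if ret := nvml.Init(); ret != nvml.SUCCESS {
        log.Fatalf("NVML init failed: %v", nvml.ErrorString(ret))
    }
    defer nvml.Shutdown()

    count, ret := nvml.DeviceGetCount()
    if ret != nvml.SUCCESS {
        log.Fatalf("device count: %v", nvml.ErrorString(ret))
    }
    for i := 0; i < count; i++ {
        dev, _ := nvml.DeviceGetHandleByIndex(i)
        uuid, _ := dev.GetUUID()
        name, _ := dev.GetName()
        mem, _ := dev.GetMemoryInfo()        // bytes
        util, _ := dev.GetUtilizationRates() // percent
        temp, _ := dev.GetTemperature(nvml.TEMPERATURE_GPU)
        power, _ := dev.GetPowerUsage() // milliwatts
        fmt.Printf("%s %s mem=%d/%d util=%d%% temp=%dC power=%.1fW\n",
            uuid, name, mem.Used, mem.Total, util.Gpu, temp, float64(power)/1000)
    }
}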

5. Docker Manager

Location: compute/docker/manager.go

Manages the lifecycle of model containers.

Operations:

  • Pull Docker images for runtimes
  • Create containers with GPU passthrough
  • Start/stop/remove containers
  • Stream container logs
  • Health checks with configurable timeouts
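
Creating a model container with port binding and GPU passthrough via the Docker SDK looks roughly like the sketch below, assuming a recent SDK version. The image and port come from vLLM's defaults; the container name and command are hypothetical, and the real manager.go will differ.

package main

import (
    "context"
    "log"

    "github.com/docker/docker/api/types/container"
    "github.com/docker/docker/client"
    "github.com/docker/go-connections/nat"
)

func main() {
    ctx := context.Background()
    cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
    if err != nil {
        log.Fatal(err)
    }
    defer cli.Close()

    resp, err := cli.ContainerCreate(ctx,
        &container.Config{
            Image:        "vllm/vllm-openai", // runtime image from the table in section 6
            ExposedPorts: nat.PortSet{"8000/tcp": struct{}{}},
        },
        &container.HostConfig{
            PortBindings: nat.PortMap{
                "8000/tcp": {{HostIP: "0.0.0.0", HostPort: "8000"}},
            },
            // Request NVIDIA GPUs; Count: -1 means "all available".
            DeviceRequests: []container.DeviceRequest{
                {Driver: "nvidia", Count: -1, Capabilities: [][]string{{"gpu"}}},
            },
        },
        nil, nil, "nebula-vllm-demo") // hypothetical container name
    if err != nil {
        log.Fatal(err)
    }
    if err := cli.ContainerStart(ctx, resp.ID, container.StartOptions{}); err != nil {
        log.Fatal(err)
    }
    log.Println("started", resp.ID)
}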

6. Runtime Implementations

Location: compute/runtime/

Runtime   Image                                           Use Case                              Device Support
vLLM      vllm/vllm-openai                                OpenAI-compatible, high performance   GPU, CPU
TGI       ghcr.io/huggingface/text-generation-inference   Hugging Face models                   GPU
Ollama    ollama/ollama                                   Local models, quantized               GPU, CPU
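
Abstracting the engines behind a common interface lets the Docker manager stay agnostic about which runtime it is launching. The interface below is a hypothetical shape, not the actual contents of compute/runtime/; the vLLM image and port reflect that engine's documented defaults.

package main

import "fmt"

// Runtime is a hypothetical abstraction over vLLM, TGI, and Ollama.
type Runtime interface {
    Image() string              // Docker image to pull
    Args(model string) []string // container arguments for serving a model
    Port() int                  // port the engine listens on
}

type vllmRuntime struct{}

func (vllmRuntime) Image() string { return "vllm/vllm-openai" }
func (vllmRuntime) Args(model string) []string {
    return []string{"--model", model} // vLLM's OpenAI-compatible server takes --model
}
func (vllmRuntime) Port() int { return 8000 }

func main() {
    var r Runtime = vllmRuntime{}
    fmt.Println(r.Image(), r.Args("my-org/my-model"), r.Port())
}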

Technology Stack

Backend

  • Language: Go 1.24.0
  • RPC Framework: gRPC + Protocol Buffers 3
  • CLI Framework: urfave/cli v3
  • Containerization: Docker SDK
  • GPU Monitoring: NVIDIA go-nvml
  • Observability: Prometheus client
  • Database: SQLite

Frontend

  • Web UI: React + Tailwind CSS
  • TUI: Charmbracelet (bubbletea, lipgloss)

Directory Structure

nebula/
├── cmd/                # Executable entry points
│   ├── nebulactl/      # CLI tool
│   ├── nebula-agent/   # Agent daemon
│   └── nebulad/        # Control plane server
├── platform/           # Orchestration layer
│   ├── api/            # gRPC API definitions
│   ├── client/         # gRPC client wrapper
│   ├── domain/         # Domain models
│   ├── service/        # Business logic
│   └── storage/        # SQLite persistence
├── compute/            # Agent layer
│   ├── daemon/         # Agent daemon implementation
│   ├── docker/         # Docker container manager
│   ├── gpu/            # NVIDIA GPU monitoring
│   ├── cache/          # Model cache manager
│   ├── runtime/        # Runtime implementations
│   └── storage/        # Agent-side storage
├── shared/             # Shared utilities
└── ui/                 # Web UI

Deployment Patterns

Single Node (Local Development)

Developer Machine
└── nebulactl (CLI) ←→ nebula-agent (localhost:9091)
                        └── vLLM container (model serving)

Multi-Node (Enterprise)

Control Plane (separate server)
├─→ Agent Node 1 (GPU)
├─→ Agent Node 2 (GPU)
├─→ Agent Node 3 (GPU)
└─→ Agent Node 4 (CPU-only)