Local AI Infrastructure Overview
A distributed, self-hosted AI platform operating entirely on a private Tailscale mesh network. Specialized nodes handle GPU-accelerated inference, orchestration, transcription, and monitoring — all behind a unified OpenAI-compatible API surface with zero public internet exposure.
Performance Benchmarks
(Benchmark chart) Inference performance across model sizes and GPU configurations, with 14B models offering the best per-GPU efficiency and lowest latency.
Platform Architecture
Infrastructure Nodes at a Glance
Four specialized systems form a cohesive, self-hosted AI platform. Each node has a dedicated role — from raw GPU inference to centralized routing and user-facing interfaces — all connected through a private encrypted mesh.
Spark1
Primary GPU inference — Gemma4 31B via vLLM
Spark2
Secondary inference — Qwen2.5-32B via TensorRT-LLM
Spark3
Orchestration layer — LiteLLM, Open WebUI, OpenPlaud
Mac Mini
AI workstation — desktop client with centralized access
Tailscale
Private mesh — encrypted cross-device connectivity
Node 01
Spark1 — Primary GPU Inference Server
Hardware Specifications
  • Hardware Platform: NVIDIA DGX Spark
  • CPU: NVIDIA Grace
  • GPU: NVIDIA Blackwell
  • Unified Memory: 128 GB
  • AI Compute: Up to 1 PFLOP FP4
  • Storage: Local NVMe SSD
  • OS: Linux
  • Runtime: Docker

Active Model
Gemma4 31B NVFP4
Runtime: vLLM
Endpoint: http://spark1:8000/v1
Responsibilities
Main conversational AI workload for the entire platform
GPU-accelerated inference with concurrent multi-client support
OpenAI-compatible API serving for seamless client integration
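As a sketch of what that client integration looks like, the snippet below sends a chat completion straight to Spark1's vLLM endpoint using the standard OpenAI Python client. The served model name (gemma4), the API key, and the prompt are assumptions; vLLM accepts whatever name and key the server was launched with.

    # Sketch: query Spark1's OpenAI-compatible vLLM endpoint directly.
    # The model name, api_key, and prompt are assumptions.
    from openai import OpenAI

    client = OpenAI(base_url="http://spark1:8000/v1", api_key="not-needed")

    response = client.chat.completions.create(
        model="gemma4",  # assumed served model name
        messages=[{"role": "user", "content": "What is unified memory?"}],
    )
    print(response.choices[0].message.content)

In normal operation clients go through the LiteLLM router instead; direct node access like this is mainly useful for smoke-testing Spark1.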
Node 02
Spark2 — Secondary GPU Inference Server
Dedicated reasoning and instruction-tuned inference node. Spark2 provides an alternate backend using TensorRT-LLM for optimized NVIDIA runtime performance, handling structured instruction workflows and complex reasoning tasks.
Active Model
Qwen2.5-32B-Instruct running on TensorRT-LLM
Endpoint: http://spark2:8000/v1
LiteLLM Alias
Exposed as qwen3 through the central routing layer
Hardware Specifications
  • Hardware Platform: NVIDIA DGX Spark
  • CPU: NVIDIA Grace
  • GPU: NVIDIA Blackwell
  • Unified Memory: 128 GB
  • AI Compute: Up to 1 PFLOP FP4
  • Storage: Local NVMe SSD
  • OS: Linux
  • Runtime: Docker
Reasoning Workloads
Structured instruction and chain-of-thought tasks
Alternate Backend
TensorRT-LLM runtime for NVIDIA-optimized serving
OpenAI-Compatible API
Seamless integration via standard API clients
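Because of the qwen3 alias, clients never need Spark2's address. The hedged sketch below sends a reasoning prompt through the central router; the API key and prompt are placeholder assumptions.

    # Sketch: call Spark2's model through the central router via its alias.
    from openai import OpenAI

    client = OpenAI(base_url="http://spark3:4000/v1", api_key="placeholder")

    reply = client.chat.completions.create(
        model="qwen3",  # alias; LiteLLM dispatches to http://spark2:8000/v1
        messages=[{"role": "user",
                   "content": "Reason through 17 * 24 step by step."}],
    )
    print(reply.choices[0].message.content)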
Node 03
Spark3 — AI Application & Orchestration Server
The central nervous system of the platform. Spark3 hosts every application-layer service — routing, interfaces, transcription, and observability — tying all inference nodes into a unified, manageable stack.
Hardware Specifications
  • Hardware Platform: NVIDIA DGX Spark
  • CPU: NVIDIA Grace
  • GPU: NVIDIA Blackwell
  • Unified Memory: 128 GB
  • AI Compute: Up to 1 PFLOP FP4
  • Storage: Local NVMe SSD
  • OS: Linux
  • Runtime: Docker
Application Layer
Open WebUI — Browser-Based AI Interface
What It Provides
Open WebUI is the primary human-facing interface for the entire AI stack. It delivers a ChatGPT-style experience with full multi-model selection, persistent conversations, and remote access through Tailscale — all without exposing any service publicly.

Access: http://spark3:8080
Backend Request Flow
01. Open WebUI — User submits a prompt through the browser
02. LiteLLM Router — Request forwarded to spark3:4000/v1
03. Spark1 or Spark2 — Inference executed on the selected GPU node
04. Response — Streamed back through LiteLLM to the browser
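Any script can exercise steps 02 through 04 directly. The sketch below streams a completion through the LiteLLM router the same way Open WebUI does; the API key and prompt are placeholder assumptions.

    # Sketch of steps 02-04: prompt the router, stream the reply back.
    from openai import OpenAI

    client = OpenAI(base_url="http://spark3:4000/v1", api_key="placeholder")

    stream = client.chat.completions.create(
        model="gemma4",  # step 03: LiteLLM dispatches this alias to Spark1
        messages=[{"role": "user", "content": "Describe a mesh VPN in one line."}],
        stream=True,     # step 04: the response arrives chunk by chunk
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)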
Application Layer
LiteLLM — Central API Routing Layer
LiteLLM is the abstraction layer that makes the entire platform appear as a single OpenAI-compatible endpoint. Clients never address individual GPU nodes directly — they connect to LiteLLM, which handles model selection, aliasing, and request distribution.
Unified Endpoint
http://spark3:4000/v1

Model Aliases
  • gemma4 → Spark1 (vLLM)
  • qwen3 → Spark2 (TensorRT-LLM)
Core Responsibilities
Unified API Endpoint
Single entry point for all AI clients
Backend Model Routing
Transparent dispatch to Spark1 or Spark2
Alias Management
Human-friendly model names for clients
Multi-Model Orchestration
Centralized client access and failover
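The deployed proxy is configured in its own files, which are not reproduced here, but the alias-to-backend mapping can be sketched with LiteLLM's Python Router. Only the aliases, endpoints, and runtimes come from the sections above; the backend model identifiers and keys are assumptions.

    # Hedged sketch of the alias -> backend mapping using litellm's Router.
    from litellm import Router

    router = Router(model_list=[
        {   # alias "gemma4" -> vLLM server on Spark1
            "model_name": "gemma4",
            "litellm_params": {
                "model": "openai/gemma4",  # any OpenAI-compatible backend
                "api_base": "http://spark1:8000/v1",
                "api_key": "not-needed",
            },
        },
        {   # alias "qwen3" -> TensorRT-LLM server on Spark2
            "model_name": "qwen3",
            "litellm_params": {
                "model": "openai/qwen2.5-32b-instruct",
                "api_base": "http://spark2:8000/v1",
                "api_key": "not-needed",
            },
        },
    ])

    resp = router.completion(model="qwen3",
                             messages=[{"role": "user", "content": "ping"}])
    print(resp.choices[0].message.content)

Registering a second entry under the same model_name is how LiteLLM expresses failover: requests for that alias are spread across, or retried against, every matching deployment.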
Application Layer
OpenPlaud — Transcription & Note Platform
Self-hosted transcription and note-processing platform integrated with Plaud audio devices. All audio stays local — recordings are transcribed via Whisper STT, then summarized through the local LLM cluster via LiteLLM.
Real-World Example
A typical workflow: OpenPlaud turns a recorded lecture into searchable transcript notes, highlights, and a concise summary for fast review.
The pipeline is entirely local — no audio or transcripts leave the private network at any stage.
Features
  • Local Whisper STT transcription
  • Audio upload management
  • AI-generated summaries via Gemma4/Qwen
  • Browser-based interface
Storage & Access
Access: http://spark3:3000
Audio Volume: /var/lib/docker/volumes/openplaud_audio/_data
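OpenPlaud's internals are not reproduced here, but the shape of the pipeline fits in a few lines: local Whisper transcription followed by a summarization call to the LiteLLM endpoint. The Whisper model size, file name, and prompt below are assumptions.

    # Minimal sketch of the local pipeline (not OpenPlaud's actual code):
    # transcribe audio with Whisper, then summarize via the LiteLLM endpoint.
    import whisper
    from openai import OpenAI

    # 1. Local speech-to-text; "base" is an assumed model size
    stt = whisper.load_model("base")
    transcript = stt.transcribe("lecture.wav")["text"]

    # 2. Summarize through the local LLM cluster
    llm = OpenAI(base_url="http://spark3:4000/v1", api_key="placeholder")
    summary = llm.chat.completions.create(
        model="gemma4",
        messages=[{"role": "user",
                   "content": f"Summarize this lecture transcript:\n\n{transcript}"}],
    )
    print(summary.choices[0].message.content)

Both steps resolve to hosts on the Tailscale mesh, so audio and text never leave the private network.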
Client Node
Mac Mini — AI Workstation
Hardware Specifications
  • Platform: Apple Mac Mini
  • Processor: Apple M4
  • Memory: 16 GB Unified Memory
  • Storage: 256 GB SSD
  • OS: macOS
  • Networking: Tailscale Mesh Networking
Role in the Stack
The Mac Mini serves as the local desktop AI client and productivity workstation. It does not host any models — instead, it accesses the full centralized inference cluster through LiteLLM, providing a lightweight interface with zero GPU overhead on the workstation itself.

API Endpoint: http://spark3:4000/v1
Services
OpenClaw
Desktop AI interface for local productivity workflows
Remote Model Access
All inference routed through the centralized cluster
Benefits
  • Lightweight local interface — no GPU required
  • Shared centralized GPU resources across all clients
  • No duplicated model hosting or storage
  • Unified access to Gemma4 and Qwen models
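From the workstation, confirming what the cluster exposes is a short script against the unified endpoint; the API key is a placeholder assumption.

    # Sketch: from the Mac Mini, list the models LiteLLM exposes.
    from openai import OpenAI

    client = OpenAI(base_url="http://spark3:4000/v1", api_key="placeholder")
    for model in client.models.list():
        print(model.id)  # expected aliases: gemma4, qwen3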
Network Layer
Tailscale Private Mesh Network
Tailscale provides end-to-end encrypted connectivity between every node in the platform. No service is exposed to the public internet — all communication flows through the private mesh with internal DNS resolution and seamless mobile access.
  • 5 Connected Systems: all nodes on a single encrypted mesh
  • 0 Public Exposures: no service reachable from the internet
  • 2 GPU Nodes: Spark1 and Spark2 dedicated to inference
  • 1 Unified API: single LiteLLM endpoint for all clients