Vision / VLM

Vision sidecar for visual understanding and embodied AI learning.

Model Selection

The vision sidecar automatically selects the appropriate model based on available hardware:

Hardware    Model       Format
GPU nodes   MiniCPM     Native PyTorch; requires VRAM managed by vram_manager.detect_gpu().
CPU nodes   MobileVLM   ONNX runtime; no GPU required.
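
The selection rule in the table can be sketched as a small lookup. This is a minimal illustration, not the sidecar's actual code: the select_vision_model function, the ModelChoice type, and the boolean gpu_available argument are hypothetical. In the real service that flag would presumably be derived from vram_manager.detect_gpu(), whose return shape is not documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelChoice:
    model: str
    runtime: str


# Model and format names come from the table; the structure is illustrative.
GPU_CHOICE = ModelChoice(model="MiniCPM", runtime="Native PyTorch")
CPU_CHOICE = ModelChoice(model="MobileVLM", runtime="ONNX runtime")


def select_vision_model(gpu_available: bool) -> ModelChoice:
    """Return the GPU model when a GPU is present, else the CPU fallback."""
    return GPU_CHOICE if gpu_available else CPU_CHOICE
```

Keeping the two choices as constants means the hardware check stays in one place and the rest of the code only sees a model name and a runtime.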

API Endpoint

POST /visual_agent

Accepts an image (base64 or URL) and a text prompt. Returns the model's visual analysis as structured JSON.
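
As a rough sketch of what a client might send, the helper below builds a JSON request body with one image source and a prompt. The field names (prompt, image_base64, image_url) and the helper itself are assumptions made for illustration; the actual request schema of /visual_agent is not specified on this page and should be checked against the route in hart_intelligence_entry.py.

```python
import base64
import json
from typing import Optional


def build_visual_agent_request(
    prompt: str,
    image_bytes: Optional[bytes] = None,
    image_url: Optional[str] = None,
) -> str:
    """Build a JSON body carrying exactly one image source plus the prompt."""
    # Reject the call when both or neither image source is given.
    if (image_bytes is None) == (image_url is None):
        raise ValueError("provide exactly one of image_bytes or image_url")
    body = {"prompt": prompt}  # hypothetical field name
    if image_bytes is not None:
        # Hypothetical field name; raw bytes are base64-encoded for JSON transport.
        body["image_base64"] = base64.b64encode(image_bytes).decode("ascii")
    else:
        body["image_url"] = image_url  # hypothetical field name
    return json.dumps(body)
```

The returned string could then be sent as the body of POST /visual_agent with any HTTP client, assuming the endpoint accepts JSON.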

Embodied AI Learning

The vision sidecar integrates with the embodied AI learning pipeline:

  • Visual observations are fed into the agent's context during CREATE mode.
  • Recipes can include vision steps that are replayed in REUSE mode with cached visual features.
  • The sidecar supports iterative refinement, in which the agent observes, acts, and observes again.
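
The observe, act, observe cycle in the last point can be sketched as a bounded loop. Everything here is hypothetical: the refine function and its observe, act, and is_done callbacks stand in for whatever the pipeline actually uses (for example, observe might wrap a call to /visual_agent), and the step limit is an assumption added to keep the loop from running forever.

```python
from typing import Callable, List


def refine(
    observe: Callable[[], str],
    act: Callable[[str], None],
    is_done: Callable[[str], bool],
    max_steps: int = 5,
) -> List[str]:
    """Observe, then act and observe again until done or max_steps is reached."""
    observations = [observe()]
    for _ in range(max_steps):
        if is_done(observations[-1]):
            break
        act(observations[-1])            # act on the latest observation
        observations.append(observe())   # then look again
    return observations
```

Returning the full list of observations, rather than only the last one, makes it easy to inspect how the agent's view changed across steps.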

GPU Management

GPU detection and VRAM allocation are handled by vram_manager.detect_gpu() and vram_manager.clear_cuda_cache(). These two functions are the single source of truth for GPU state: use vram_manager.clear_cuda_cache() rather than calling torch.cuda.empty_cache() directly.
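
One way to follow this rule is to route cleanup through a single hook instead of scattering cache calls through the code. The sketch below is illustrative only: run_then_release is a hypothetical helper, and it assumes the cleanup function can be called with no arguments, which this page does not confirm for vram_manager.clear_cuda_cache().

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_then_release(task: Callable[[], T], clear_cache: Callable[[], None]) -> T:
    """Run a GPU task and always release cached VRAM through the given hook."""
    try:
        return task()
    finally:
        # Runs whether the task succeeds or raises, so cleanup is never skipped.
        clear_cache()
```

In the project, clear_cache would be vram_manager.clear_cuda_cache (under the no-argument assumption above), so no caller needs to touch torch.cuda.empty_cache() itself.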

Source Files

  • integrations/vision/ (vision sidecar implementation)
  • integrations/service_tools/vram_manager.py
  • hart_intelligence_entry.py (/visual_agent route)