Aspexilary AI

Deployment Model

Your infrastructure.
Our intelligence.

Every domain RAG system ships as a self-contained Docker Compose stack with a pre-built vector database. No ingestion pipeline. No GPU. No cloud dependency. Pull, start, query — under 60 seconds.

608
Domain systems
<60s
To first query
0
External API calls
3.5M+
Vector points

Every domain is a sealed, four-container stack. Nothing needs to be configured, built, or downloaded after delivery. The vector database arrives pre-loaded with embedded regulatory documents.

RAG API
FastAPI service with domain-tuned retrieval, citation formatting, and hash-chained audit logging.
Port 8001 · Python 3.12
Qdrant Vector DB
Pre-loaded with a binary snapshot of the domain's embedded corpus. HNSW-indexed. Ready on startup.
Port 6333 · Rust
Ollama LLM
Fine-tuned language model quantized to Q4_K_M. Runs on CPU — no GPU required. 5.6 GB RAM.
Port 11434 · Go
BGE-Large Embedder
Generates query embeddings at inference time, matching the embeddings used during corpus ingestion.
Port 8085 · Rust (TEI)
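The four-container stack above might be sketched as a single Compose file along these lines. This is an illustration, not the shipped configuration: the image names, tags, and volume paths are assumptions; only the service roles and ports come from the component list above.

```yaml
services:
  rag-api:
    image: aspexilary/rag-api:fuel          # hypothetical image name
    ports: ["8001:8001"]
    depends_on: [qdrant, ollama, embedder]
  qdrant:
    image: qdrant/qdrant
    ports: ["6333:6333"]
    volumes:
      - ./qdrant_snapshot:/qdrant/storage   # pre-loaded vector snapshot
  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
    volumes:
      - ./models:/root/.ollama              # Q4_K_M model weights
  embedder:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-latest
    ports: ["8085:80"]                      # TEI listens on 80 inside the container
    command: ["--model-id", "BAAI/bge-large-en-v1.5"]
```

Because the vector snapshot and model weights arrive as mounted volumes, `docker compose up -d` brings the whole stack to a queryable state with no build step.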

From delivery to a live, queryable API on any machine with Docker installed. No internet required after the initial pull.

terminal
# Pull the domain stack
docker compose pull

# Start everything
docker compose up -d

# Query immediately
curl localhost:8001/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the NFPA 855 requirements for BESS installations?"}'

Every domain runs on its own internal Docker network with no internet egress. This isn't a software filter — it's enforced at the kernel network layer. A compromised container in one domain has no network path to another domain's data or to the outside world.

network topology
# Each domain is a sealed unit

aspexilary_fuel_internal     (internal: true, no egress)
  rag-api       ←→  qdrant
  rag-api       ←→  ollama
  rag-api       ←→  embedder

aspexilary_parking_internal  (internal: true, no egress)
  rag-api       ←→  qdrant
  rag-api       ←→  ollama
  rag-api       ←→  embedder

Deploy one, deploy ten — they don't interfere.
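In Docker Compose terms, the sealing above comes down to one setting: the per-domain network is declared `internal: true`, so the kernel never installs a route from those containers to the outside world. A sketch, with service and network names assumed to match the topology diagram:

```yaml
networks:
  aspexilary_fuel_internal:
    internal: true          # no egress: containers on this network cannot reach outside

services:
  qdrant:
    networks: [aspexilary_fuel_internal]
  ollama:
    networks: [aspexilary_fuel_internal]
  embedder:
    networks: [aspexilary_fuel_internal]
  rag-api:
    # The API also joins a regular bridge network, since published ports
    # are not reachable over an internal-only network.
    networks: [aspexilary_fuel_internal, default]
    ports: ["8001:8001"]
```

Only the API container straddles the boundary; the database, model, and embedder have no path out at all.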

The alternative is shipping raw documents and an ingestion pipeline, which shifts the build work onto your team:

                        Aspexilary               DIY Pipeline
Time to first query     Under 60 seconds         4–8 hours
GPU required            No                       Recommended
Embedding pipeline      Pre-built snapshot       Run yourself
Document processing     Already done             PDF extraction + chunking
HNSW index build        Pre-indexed              Runs at startup
Debug failures          Tested before delivery   Your problem
Retrieval quality       Validated at build       Unknown until tested

Every domain runs on CPU out of the box. Add a GPU for faster inference. Here are real-world specs based on our production environment.

CPU Only
4-core CPU, 16 GB RAM, 20 GB disk per domain. The Q4_K_M model (5.6 GB) runs entirely in system memory. Inference: 8–15 tokens/sec. No GPU required.
1–2 domains · Eval & pilot
Entry GPU
8 GB VRAM (RTX 4060, A2000, T4). Model fits entirely in VRAM. Inference: 40–60 tokens/sec. Embedder stays on CPU (~800 MB RAM).
1–5 domains · Small team
Production GPU
16–24 GB VRAM (RTX 4090, A5000, L4). Room for the model + embedder on GPU. Inference: 80–120 tokens/sec. Multiple concurrent queries.
5–20 domains · Department
Enterprise
32+ GB VRAM (A100, H100, RTX 5090) or multi-GPU. Full FP16 model for maximum quality. 50+ concurrent users. On-demand gateway manages service lifecycle.
20–600+ domains · Organization
Component                   CPU Mode                  GPU Mode
LLM (Q4_K_M quantized)      5.6 GB system RAM         5.6 GB VRAM
LLM (FP16 full precision)   18 GB system RAM          18 GB VRAM
BGE-Large embedder          ~800 MB RAM               ~1.8 GB VRAM
Qdrant (per domain)         25–130 MB RAM             25–130 MB RAM
Disk (per domain)           8–15 GB                   8–15 GB
Docker Engine               24+                       24+ with NVIDIA Container Toolkit
Internet                    Not required after pull   Not required after pull
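As a rough sanity check on these figures: in a multi-domain CPU deployment the LLM and embedder are loaded once and shared, so total RAM scales as one copy of the models plus a small per-domain increment. The ~30 MB per-domain value below is an inferred estimate, not a measured figure.

```python
# Rough RAM estimate for an N-domain CPU deployment, using the table above.
# The LLM and embedder are shared; only Qdrant and the lightweight API
# container are duplicated per domain.
LLM_Q4_GB = 5.6        # Q4_K_M model, loaded once
EMBEDDER_GB = 0.8      # BGE-Large on CPU, loaded once
PER_DOMAIN_GB = 0.03   # Qdrant + API container, ~30 MB (inferred estimate)

def cpu_ram_estimate(n_domains: int) -> float:
    """Total resident RAM in GB for n_domains sharing one LLM and embedder."""
    return LLM_Q4_GB + EMBEDDER_GB + n_domains * PER_DOMAIN_GB

print(f"10 domains, shared: ~{cpu_ram_estimate(10):.1f} GB")
print(f"10 domains, naive:  ~{10 * (LLM_Q4_GB + EMBEDDER_GB + PER_DOMAIN_GB):.1f} GB")
```

The shared-model term dominates: going from one domain to ten changes the total by a few hundred megabytes, not by 10x the model weight.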

Multi-domain deployments share the LLM and embedder across all domains — only Qdrant and the API container are per-domain. Running 10 domains adds ~300 MB RAM total, not 10x the model weight. Our production environment runs 608 domains on a single RTX 5090 (32 GB VRAM) with an on-demand gateway that starts services as needed and stops them after 10 minutes of idle time.
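The on-demand gateway's lifecycle behavior can be sketched as a small scheduler that records the last query time per domain and stops idle stacks. The start/stop actions and the 10-minute window come from the description above; the class, method names, and compose invocation are illustrative assumptions.

```python
import subprocess
import time

IDLE_LIMIT = 10 * 60  # seconds: stop a domain after 10 minutes without queries

class DomainGateway:
    """Minimal sketch of an on-demand gateway for per-domain Compose stacks."""

    def __init__(self):
        self.last_seen = {}   # domain name -> timestamp of last query
        self.running = set()  # domains whose containers are currently up

    def handle_query(self, domain: str) -> None:
        """Start the domain's stack on first use, then record the access."""
        if domain not in self.running:
            self._compose(domain, "up", "-d")
            self.running.add(domain)
        self.last_seen[domain] = time.monotonic()

    def reap_idle(self) -> list[str]:
        """Stop every domain that has been idle longer than IDLE_LIMIT."""
        now = time.monotonic()
        stopped = []
        for domain in list(self.running):
            if now - self.last_seen[domain] > IDLE_LIMIT:
                self._compose(domain, "stop")
                self.running.discard(domain)
                stopped.append(domain)
        return stopped

    def _compose(self, domain: str, *args: str) -> None:
        # Assumes one Compose project per domain directory.
        subprocess.run(
            ["docker", "compose", "--project-directory", domain, *args],
            check=False,
        )
```

A periodic call to `reap_idle()` (e.g. from a background thread) keeps only actively queried domains resident, which is what makes hundreds of domains on one GPU feasible.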

Ready to deploy?
Browse 608 domain-specific RAG systems across 29 industry categories. Each one ships as a sealed Docker stack.
Browse domains

Questions about deployment

info@aspexilary.ai