Aspexilary AI

Deployment Model

Your infrastructure.
Our intelligence.

Every domain RAG system ships as a self-contained Docker Compose stack with a pre-built vector database. No ingestion pipeline. No GPU. No cloud dependency. Pull, start, query — under 60 seconds.

608
Domain systems
<60s
To first query
0
External API calls
3.5M+
Vector points

Every domain is a sealed, four-container stack. Nothing needs to be configured, built, or downloaded after delivery. The vector database arrives pre-loaded with embedded regulatory documents.

RAG API
FastAPI service with domain-tuned retrieval, citation formatting, and hash-chained audit logging.
Port 8001 · Python 3.12
Qdrant Vector DB
Pre-loaded with a binary snapshot of the domain's embedded corpus. HNSW-indexed. Ready on startup.
Port 6333 · Rust
Ollama LLM
Fine-tuned language model quantized to Q4_K_M. Runs on CPU — no GPU required. 5.6 GB RAM.
Port 11434 · Go
BGE-Large Embedder
Generates query embeddings at inference time, matching the embeddings used during corpus ingestion.
Port 8085 · Rust (TEI)
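The four-container stack above might be sketched as a single Compose file along these lines. This is an illustration, not the shipped configuration: the image names, tags, and volume paths are assumptions; only the service roles and ports come from the component list above.

```yaml
services:
  rag-api:
    image: aspexilary/rag-api:fuel          # hypothetical image name
    ports: ["8001:8001"]
    depends_on: [qdrant, ollama, embedder]
  qdrant:
    image: qdrant/qdrant
    ports: ["6333:6333"]
    volumes:
      - ./qdrant_snapshot:/qdrant/storage   # pre-loaded vector snapshot
  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
    volumes:
      - ./models:/root/.ollama              # Q4_K_M model weights
  embedder:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-latest
    ports: ["8085:80"]                      # TEI listens on 80 inside the container
    command: ["--model-id", "BAAI/bge-large-en-v1.5"]
```

Because the vector snapshot and model weights arrive as mounted volumes, `docker compose up -d` brings the whole stack to a queryable state with no build step.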

From delivery to a live, queryable API on any machine with Docker installed. No internet required after the initial pull.

terminal
# Pull the domain stack
docker compose pull

# Start everything
docker compose up -d

# Query immediately
curl localhost:8001/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the NFPA 855 requirements for BESS installations?"}'

Every domain runs on its own internal Docker network with no internet egress. This isn't a software filter — it's enforced at the kernel network layer. A compromised container in one domain has no network path to another domain's data or to the outside world.

network topology
# Each domain is a sealed unit

aspexilary_fuel_internal     (internal: true, no egress)
  rag-api       ←→  qdrant
  rag-api       ←→  ollama
  rag-api       ←→  embedder

aspexilary_parking_internal  (internal: true, no egress)
  rag-api       ←→  qdrant
  rag-api       ←→  ollama
  rag-api       ←→  embedder

Deploy one, deploy ten — they don't interfere.
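In Docker Compose terms, the sealing above comes down to one setting: the per-domain network is declared `internal: true`, so the kernel never installs a route from those containers to the outside world. A sketch, with service and network names assumed to match the topology diagram:

```yaml
networks:
  aspexilary_fuel_internal:
    internal: true          # no egress: containers on this network cannot reach outside

services:
  qdrant:
    networks: [aspexilary_fuel_internal]
  ollama:
    networks: [aspexilary_fuel_internal]
  embedder:
    networks: [aspexilary_fuel_internal]
  rag-api:
    # The API also joins a regular bridge network, since published ports
    # are not reachable over an internal-only network.
    networks: [aspexilary_fuel_internal, default]
    ports: ["8001:8001"]
```

Only the API container straddles the boundary; the database, model, and embedder have no path out at all.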

The alternative is shipping raw documents and an ingestion pipeline, which shifts the build work onto your team:

                        Aspexilary               DIY Pipeline
Time to first query     Under 60 seconds         4–8 hours
GPU required            No                       Recommended
Embedding pipeline      Pre-built snapshot       Run yourself
Document processing     Already done             PDF extraction + chunking
HNSW index build        Pre-indexed              Runs at startup
Debug failures          Tested before delivery   Your problem
Retrieval quality       Validated at build       Unknown until tested

Every domain runs on CPU out of the box. Add a GPU for faster inference. Here are real-world specs based on our production environment.

CPU Only
4-core CPU, 16 GB RAM, 20 GB disk per domain. The Q4_K_M model (5.6 GB) runs entirely in system memory. Inference: 8–15 tokens/sec. No GPU required.
1–2 domains · Eval & pilot
Entry GPU
8 GB VRAM (RTX 4060, A2000, T4). Model fits entirely in VRAM. Inference: 40–60 tokens/sec. Embedder stays on CPU (~800 MB RAM).
1–5 domains · Small team
Production GPU
16–24 GB VRAM (RTX 4090, A5000, L4). Room for the model + embedder on GPU. Inference: 80–120 tokens/sec. Multiple concurrent queries.
5–20 domains · Department
Enterprise
32+ GB VRAM (A100, H100, RTX 5090) or multi-GPU. Full FP16 model for maximum quality. 50+ concurrent users. On-demand gateway manages service lifecycle.
20–600+ domains · Organization
Component                   CPU Mode                  GPU Mode
LLM (Q4_K_M quantized)      5.6 GB system RAM         5.6 GB VRAM
LLM (FP16 full precision)   18 GB system RAM          18 GB VRAM
BGE-Large embedder          ~800 MB RAM               ~1.8 GB VRAM
Qdrant (per domain)         25–130 MB RAM             25–130 MB RAM
Disk (per domain)           8–15 GB                   8–15 GB
Docker Engine               24+                       24+ with NVIDIA Container Toolkit
Internet                    Not required after pull   Not required after pull
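As a rough sanity check on these figures: in a multi-domain CPU deployment the LLM and embedder are loaded once and shared, so total RAM scales as one copy of the models plus a small per-domain increment. The ~30 MB per-domain value below is an inferred estimate, not a measured figure.

```python
# Rough RAM estimate for an N-domain CPU deployment, using the table above.
# The LLM and embedder are shared; only Qdrant and the lightweight API
# container are duplicated per domain.
LLM_Q4_GB = 5.6        # Q4_K_M model, loaded once
EMBEDDER_GB = 0.8      # BGE-Large on CPU, loaded once
PER_DOMAIN_GB = 0.03   # Qdrant + API container, ~30 MB (inferred estimate)

def cpu_ram_estimate(n_domains: int) -> float:
    """Total resident RAM in GB for n_domains sharing one LLM and embedder."""
    return LLM_Q4_GB + EMBEDDER_GB + n_domains * PER_DOMAIN_GB

print(f"10 domains, shared: ~{cpu_ram_estimate(10):.1f} GB")
print(f"10 domains, naive:  ~{10 * (LLM_Q4_GB + EMBEDDER_GB + PER_DOMAIN_GB):.1f} GB")
```

The shared-model term dominates: going from one domain to ten changes the total by a few hundred megabytes, not by 10x the model weight.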

Multi-domain deployments share the LLM and embedder across all domains — only Qdrant and the API container are per-domain. Running 10 domains adds ~300 MB RAM total, not 10x the model weight. Our production environment runs 608 domains on a single RTX 5090 (32 GB VRAM) with an on-demand gateway that starts services as needed and stops them after 10 minutes of idle time.
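The on-demand gateway's lifecycle behavior can be sketched as a small scheduler that records the last query time per domain and stops idle stacks. The start/stop actions and the 10-minute window come from the description above; the class, method names, and compose invocation are illustrative assumptions.

```python
import subprocess
import time

IDLE_LIMIT = 10 * 60  # seconds: stop a domain after 10 minutes without queries

class DomainGateway:
    """Minimal sketch of an on-demand gateway for per-domain Compose stacks."""

    def __init__(self):
        self.last_seen = {}   # domain name -> timestamp of last query
        self.running = set()  # domains whose containers are currently up

    def handle_query(self, domain: str) -> None:
        """Start the domain's stack on first use, then record the access."""
        if domain not in self.running:
            self._compose(domain, "up", "-d")
            self.running.add(domain)
        self.last_seen[domain] = time.monotonic()

    def reap_idle(self) -> list[str]:
        """Stop every domain that has been idle longer than IDLE_LIMIT."""
        now = time.monotonic()
        stopped = []
        for domain in list(self.running):
            if now - self.last_seen[domain] > IDLE_LIMIT:
                self._compose(domain, "stop")
                self.running.discard(domain)
                stopped.append(domain)
        return stopped

    def _compose(self, domain: str, *args: str) -> None:
        # Assumes one Compose project per domain directory.
        subprocess.run(
            ["docker", "compose", "--project-directory", domain, *args],
            check=False,
        )
```

A periodic call to `reap_idle()` (e.g. from a background thread) keeps only actively queried domains resident, which is what makes hundreds of domains on one GPU feasible.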

Ready to deploy?
Browse 608 domain-specific RAG systems across 29 industry categories. Each one ships as a sealed Docker stack.
Browse domains

Questions about deployment

info@aspexilary.ai