On-premises RAG must stay current without leaking proprietary data. Here is how the architecture actually works — and why the answer is simpler than it looks.
A question we hear from every regulated-industry customer is some version of this: "If the AI is running on our servers, how do we know our queries aren't going out to the internet?" It is a legitimate concern, and the short answer is: you enforce it at the network layer, not the policy layer.
But that raises the next question immediately. If the query containers have no internet access, how does the system stay current? Regulations change. Case law evolves. Clinical guidelines are updated. An AI system frozen at ingestion time has a rapidly shrinking useful life in any compliance-sensitive domain.
The answer is a strict separation between two data flows that never touch each other.
Every enterprise RAG deployment we build operates on this principle: the query path and the data acquisition path are architecturally isolated. They share a vector store, but they share nothing else — not network access, not runtime context, not pipeline code.
The ingestion platform is the only component with internet access. It runs in a DMZ, has no visibility into enterprise documents, and never sees a user query. It fetches from a fixed allowlist of approved public sources — regulatory bodies, standards organizations, public databases — on a schedule. Everything it produces passes through a content scanner before entering the vector store.
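The allowlist check described above can be sketched in a few lines. This is a minimal illustration, not the production implementation: the hostnames are placeholders, and the function name `is_allowed` and the HTTPS-only rule are assumptions added for the example.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist. The article does not name the approved sources,
# so these hosts are placeholders.
ALLOWED_HOSTS = frozenset({"regulator.example", "standards.example"})


def is_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Return True only for HTTPS URLs whose host is exactly on the allowlist."""
    parts = urlsplit(url)
    # Assumption for this sketch: plain HTTP is rejected outright.
    if parts.scheme != "https":
        return False
    # .hostname is lowercased and excludes any userinfo or port, so
    # "https://regulator.example@evil.test/" resolves to "evil.test".
    host = parts.hostname or ""
    # Exact match only: a suffix or substring match would let
    # "regulator.example.evil.test" through.
    return host in allowed_hosts
```

Matching on the parsed hostname rather than on the raw URL string is the important design choice here; string prefix checks are easy to bypass with crafted URLs.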
The RAG query containers sit on an internal-only network. At the Docker level, they have no route to the internet. This is not an application setting that can drift; it is enforced by the network topology itself. A compromised RAG container cannot exfiltrate data to the internet because it has no route out.
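One way to make that claim checkable from inside a container is a small egress probe run at startup or on a schedule. The sketch below is an assumption layered on top of the article, not part of the described system: the function name `egress_blocked` and the injectable `connect` parameter are hypothetical, and the operator supplies the external targets to probe.

```python
import socket


def egress_blocked(targets, timeout=2.0, connect=socket.create_connection):
    """Return True if none of the (host, port) targets can be reached.

    `connect` defaults to socket.create_connection and is injectable so the
    check can be exercised without real network access.
    """
    for host, port in targets:
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # Simplification: any failure (no route, DNS failure, timeout,
            # refusal) is treated as "blocked" for this target.
            continue
        # A connection succeeded, so there is a route out.
        conn.close()
        return False
    return True
```

A probe like this does not replace the network-level enforcement; it only produces evidence that the enforcement is in effect, which can be recorded alongside the other audit events.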
The shared vector store is where the two paths meet, and the design ensures the meeting is one-directional and safe. The ingestion platform writes vectors in. The RAG containers read vectors out. There is no channel through which query content can travel back toward the ingestion platform or out to the internet.
Each domain gets two collections: one for public knowledge (updated by the ingestion platform on schedule) and one for the customer's proprietary documents (populated during onboarding, never touched by external pipelines). When a user submits a query, the RAG container searches both collections, merges the results, and sends the combined context to the local LLM for inference. The user gets answers grounded in both current public information and their own internal documents.
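The search-and-merge step can be illustrated with a short sketch. It assumes each collection returns hits with a similarity score where higher is better; the `Hit` type, the function names, and the `top_k` default are hypothetical and stand in for whatever the vector store client actually returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    text: str
    score: float      # assumed: similarity score, higher is better
    collection: str   # "public" or "private"


def merge_hits(public_hits, private_hits, top_k=5):
    """Combine hits from both collections and keep the top_k by score."""
    combined = list(public_hits) + list(private_hits)
    # sorted() is stable, so ties keep their original order (public first).
    return sorted(combined, key=lambda h: h.score, reverse=True)[:top_k]


def build_context(hits):
    """Join the merged hits into one context string for the local LLM,
    labelling each passage with the collection it came from."""
    return "\n\n".join(f"[{h.collection}] {h.text}" for h in hits)
```

Ranking by raw score across two collections is only meaningful when both were embedded with the same model and compared with the same metric. Keeping the collection label on each passage also lets the query log record where each piece of retrieved context came from.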
Not every customer has the same posture. The architecture supports three configurations:
One of the operational wins of this architecture is that adding a new knowledge domain (construction, healthcare, legal, HR, finance) does not require new ingestion infrastructure. The ingestion platform, embedding model, vector store, and LLM inference are all shared across domains. A new domain is a configuration entry and a container.
This matters for enterprise deployments where a single customer may need several specialized RAG instances. Because each new domain reuses the same internal-only query network and the same single ingestion platform, it adds no new internet-facing attack surface.
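To make "a configuration entry" concrete, here is a hedged sketch of what such an entry might look like. The `DomainConfig` fields, the collection naming scheme, and `register_domain` are all hypothetical; the article does not specify the actual configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainConfig:
    name: str
    public_collection: str    # written by the ingestion platform
    private_collection: str   # populated during onboarding


def register_domain(registry: dict, name: str) -> DomainConfig:
    """Add a domain entry with its two collections; reject duplicates."""
    if name in registry:
        raise ValueError(f"domain already registered: {name}")
    # Assumed naming convention, for illustration only.
    cfg = DomainConfig(
        name=name,
        public_collection=f"{name}_public",
        private_collection=f"{name}_private",
    )
    registry[name] = cfg
    return cfg
```

Nothing in the entry describes network access or ingestion infrastructure, which reflects the point above: those are properties of the shared platform, not of the individual domain.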
Every component produces structured, hash-chained audit logs. Ingestion events record the source URL, fetch timestamp, content hash, and destination collection. Query events record what was retrieved and from which collection. The chain of custody for every piece of data — from public source through embedding through retrieval through inference — is verifiable and tamper-evident.
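A hash chain of this kind can be sketched with the standard library alone. The example below is a minimal illustration under stated assumptions: SHA-256 as the hash, canonical JSON as the serialization, an all-zero genesis value, and the field names shown are choices made for the sketch, not details confirmed by the article.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry's prev_hash


def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the same event
    # always serializes to the same bytes.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: list, event: dict) -> dict:
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"event": event, "prev_hash": prev_hash,
             "hash": _digest(prev_hash, event)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash in order; any edit to an earlier entry breaks
    the link to every entry after it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

An ingestion event in this scheme would carry the fields named above (source URL, fetch timestamp, content hash, destination collection) as the `event` dictionary. A plain hash chain is tamper-evident only if the most recent hash is stored or witnessed somewhere the log writer cannot rewrite.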
The goal of this architecture is to make the compliance story not just technically true, but demonstrably true — so that when an auditor asks how you know enterprise queries are not going to the internet, the answer is not a policy document. It is a network diagram and a log file.