Building RAG Pipelines: The Data Access Layer
Key takeaway: Most RAG tutorials focus on vector search over unstructured documents. But enterprise RAG pipelines almost always need structured data from SQL databases. The retrieval layer between your LLM and your database requires an API gateway that enforces parameterized queries, role-based access, field masking, and audit logging. Getting this wrong means sending raw customer data straight into an LLM context window.
Retrieval-Augmented Generation (RAG) is the dominant pattern for connecting large language models to enterprise data. The concept is straightforward: instead of fine-tuning a model on proprietary data, you retrieve relevant context at inference time and inject it into the prompt. The LLM generates a response grounded in your actual data rather than its training corpus.
Most RAG implementations focus on unstructured data. Tutorials walk through chunking PDFs, generating embeddings, and querying vector databases. That covers one half of the problem. The other half, retrieving structured data from SQL databases, gets far less attention despite being where most enterprise data actually lives.
This article covers the data access layer for structured-data RAG. Specifically, how to build the retrieval component that sits between an LLM orchestrator and your relational databases, and why that component needs to be more than a database connection string.
RAG Architecture: Where Data Access Fits
A RAG pipeline has three stages: retrieval, augmentation, and generation. The retrieval stage fetches data relevant to the user's query. The augmentation stage formats that data and injects it into the LLM prompt as context. The generation stage produces the final response.
For structured data RAG against SQL databases, the retrieval path looks like this: the LLM agent determines what data it needs, sends a request to a retriever component, which calls an API gateway, which executes a query against the database, which returns JSON, which gets injected into the LLM's context window for generation. Each hop in that chain is a place where security, performance, and reliability can break down.
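The three-stage flow can be sketched as a minimal pipeline. Everything here is an illustrative placeholder, not a specific framework's API: `fake_gateway` stands in for a real gateway client and `fake_llm` for a real model call.

```python
import json

def retrieve(gateway_call, endpoint, params):
    # Retrieval: the agent asks the gateway, never the database directly.
    return gateway_call(endpoint, params)

def augment(prompt, records):
    # Augmentation: serialize the structured records into the prompt.
    context = json.dumps(records, indent=2)
    return f"Context:\n{context}\n\nQuestion: {prompt}"

def generate(llm, augmented_prompt):
    # Generation: the LLM answers grounded in the injected context.
    return llm(augmented_prompt)

def rag_answer(llm, gateway_call, prompt, endpoint, params):
    records = retrieve(gateway_call, endpoint, params)
    return generate(llm, augment(prompt, records))

# Stub components stand in for a real gateway client and LLM.
fake_gateway = lambda ep, p: [{"order_id": 7, "status": "shipped"}]
fake_llm = lambda text: f"Answer based on {text.count('order_id')} record(s)."

print(rag_answer(fake_llm, fake_gateway, "Where is order 7?",
                 "/api/v1/orders", {"customer_id": 42}))
```

The shape is the point: the orchestrator only ever sees an endpoint name, typed parameters, and a JSON response.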
The critical component is the layer between the retriever and the database. In naive implementations, this is a direct database connection with dynamically constructed SQL. The retriever builds a query string, executes it, and passes the results to the LLM. This works in a demo. In production, it is a vector for SQL injection, data exfiltration, and compliance violations.
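The difference between the naive pattern and the safe one is easy to demonstrate with `sqlite3` from the standard library. The table and the injected payload are hypothetical, but the mechanics are exactly what a dynamically constructed query exposes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 42), (2, 99)])

# Naive retriever: LLM-influenced input concatenated into the SQL string.
user_input = "42 OR 1=1"  # a prompt-injected value
leaked = conn.execute(
    f"SELECT * FROM orders WHERE customer_id = {user_input}"
).fetchall()
# The OR clause matches every row, not just customer 42's.

# Gateway-style retriever: typed parameter bound to a placeholder.
safe = conn.execute(
    "SELECT * FROM orders WHERE customer_id = ?", (42,)
).fetchall()

print(len(leaked), len(safe))
```

In the concatenated version the injected `OR 1=1` returns both customers' rows; the parameterized version returns only the authorized row. A gateway that casts `customer_id` to an integer would reject the payload before a query ever ran.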
The data access layer exists to mediate that interaction. It exposes database tables and views as API endpoints, enforces access controls, redacts sensitive fields, rate-limits requests, and logs every query. The mechanics of how LLMs access enterprise data depend on this layer being both functional and secure.
Structured Data RAG vs Unstructured Data RAG
Unstructured data RAG and structured data RAG solve different problems and impose different architectural requirements. The distinction matters because the tooling for each is different.
In unstructured RAG, the retrieval step involves semantic search. You convert a query into an embedding vector, search a vector database for similar chunks, and return the top-k results. The data is text. The retrieval mechanism is similarity-based. The main risk is returning irrelevant context, not leaking sensitive records. For a deeper look at the vector side of this, see the API layer for vector databases.
Structured data RAG is fundamentally different. The data is rows and columns in relational tables. The retrieval mechanism is precise: SQL queries that return exact records matching specific criteria. An LLM agent might need all orders for a given customer, inventory levels for a product category, or revenue by region for Q3. These are not fuzzy semantic searches. They are deterministic lookups that return structured records.
This precision creates a different risk profile. A bad semantic search returns irrelevant paragraphs. A bad SQL query returns unauthorized records, exposes PII, or modifies data through injection. The retrieval layer for structured data RAG needs access controls that do not exist in typical vector search pipelines.
There is also a schema problem. Vector databases store embeddings with minimal metadata. SQL databases have complex schemas with foreign keys, constraints, and relationships. The retrieval layer needs to understand or at least safely expose that schema so the LLM agent can request the right data. Letting an LLM generate arbitrary SQL against a production schema is not a viable strategy. The data access layer must constrain the query surface to predefined endpoints that map to safe, parameterized operations.
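One way to constrain the query surface is a registry that maps each endpoint name to a fixed SQL template and a type for each accepted parameter; nothing outside the registry is executable. This is a minimal sketch with hypothetical endpoint and table names, not any particular gateway's implementation:

```python
import sqlite3

# Each endpoint maps to exactly one parameterized statement plus the
# expected type of each parameter. The agent can only pick from this menu.
ENDPOINTS = {
    "orders_by_customer": (
        "SELECT id, status FROM orders WHERE customer_id = ?", (int,)),
    "inventory_by_category": (
        "SELECT sku, qty FROM inventory WHERE category = ?", (str,)),
}

def call_endpoint(conn, name, *raw_params):
    if name not in ENDPOINTS:
        raise KeyError(f"unknown endpoint: {name}")
    sql, types = ENDPOINTS[name]
    # Coerce each parameter to its declared type; injection payloads fail here.
    params = tuple(t(p) for t, p in zip(types, raw_params))
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, customer_id INTEGER)")
conn.execute("CREATE TABLE inventory (sku TEXT, qty INTEGER, category TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'shipped', 42)")

print(call_endpoint(conn, "orders_by_customer", "42"))
```

The agent never sees SQL; it sees a catalog of named operations with typed inputs, which is exactly the contract a REST gateway formalizes.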
Why the Retrieval Layer Needs an API Gateway
A direct database connection from an LLM orchestrator is the simplest possible architecture. It is also the most dangerous. An API gateway between the orchestrator and the database provides the control plane that production RAG pipelines require.
The first reason is query safety. An API gateway exposes specific endpoints, such as GET /api/v1/orders?customer_id=42, not a raw SQL interface. Each endpoint maps to a parameterized query. The LLM agent cannot construct arbitrary SQL because it never touches SQL. It calls REST endpoints with typed parameters. This eliminates SQL injection by design rather than by careful string escaping.
The second reason is abstraction. Database schemas change. Tables get renamed, columns get added, relationships get restructured. If your LLM agent constructs SQL directly, every schema change breaks the RAG pipeline. An API gateway decouples the agent from the schema. The endpoints stay stable even when the underlying tables change. You update the gateway configuration, not the LLM's tooling.
The third reason is observability. Every request through an API gateway can be logged with the requesting identity, the parameters, the response size, and the latency. When an LLM agent retrieves data for a RAG pipeline, you need to know what data was accessed, by which agent, for which user query. Direct database connections make this audit trail difficult to maintain. API gateways make it automatic.
Rate limiting is the fourth reason. LLM agents can be unpredictable in how many data requests they generate for a single user query. An agent reasoning through a complex question might hit the database dozens of times. Without rate limiting at the gateway, a single user prompt could generate enough database load to affect other applications. The gateway throttles requests per agent, per endpoint, and per time window.
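The throttling mechanics can be sketched with a fixed-window counter keyed by agent and endpoint. Production gateways implement this in configuration rather than application code; this is only the underlying idea:

```python
import time
from collections import defaultdict

class WindowLimiter:
    """Fixed-window rate limiter keyed by (agent, endpoint)."""

    def __init__(self, max_requests, window_seconds):
        self.max = max_requests
        self.window = window_seconds
        self.counts = defaultdict(lambda: (0.0, 0))  # key -> (window_start, n)

    def allow(self, agent, endpoint, now=None):
        now = time.monotonic() if now is None else now
        start, n = self.counts[(agent, endpoint)]
        if now - start >= self.window:   # new window: reset the count
            start, n = now, 0
        if n >= self.max:                # over budget: throttle
            return False
        self.counts[(agent, endpoint)] = (start, n + 1)
        return True

# Three requests per minute: an agent looping on a complex question
# gets cut off before it can overload the database.
limiter = WindowLimiter(max_requests=3, window_seconds=60)
results = [limiter.allow("agent-1", "/orders", now=t) for t in range(5)]
print(results)  # [True, True, True, False, False]
```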
DreamFactory, a platform that auto-generates REST and GraphQL APIs from SQL database schemas, provides this gateway layer out of the box. It connects to MySQL, PostgreSQL, SQL Server, Oracle, and other relational databases and produces documented API endpoints with built-in authentication, role-based access control, and request logging. For RAG pipelines, this means the retrieval layer gets a fully formed API without writing custom middleware.
Security Requirements for RAG Data Access
RAG pipelines introduce a new class of data security concerns. The data you retrieve does not just go into a report or a dashboard. It goes into an LLM's context window, where it influences generated text that a user will read. If sensitive data enters the context, it can surface in the response. Securing the API layer for AI data access is not optional; it is a prerequisite for production deployment.
Parameterized queries are the baseline. Every database interaction in the retrieval layer must use parameterized queries, never string concatenation. This is standard practice for web applications, but it is worth emphasizing because LLM-generated queries are a new attack surface. If your system allows an LLM to influence query construction, parameterization must be enforced at the gateway level, not trusted to the LLM.
Role-based access control (RBAC) determines what data each agent or user can retrieve. A customer-facing chatbot should not have access to internal HR tables. A financial reporting agent should not see raw employee records. The API gateway must enforce per-role permissions on every endpoint, restricting which tables, columns, and even which rows each role can access. Row-level filtering based on the requesting user's identity is essential for multi-tenant scenarios.
Field masking is the requirement that most teams miss. Even when an agent has legitimate access to a table, certain columns should never reach the LLM context. Social Security numbers, email addresses, credit card numbers, and other PII must be redacted or masked before the JSON response leaves the gateway. If a customer record includes an SSN column, the API response should return that field as a masked value or omit it entirely. The LLM cannot leak data it never received.
This is field-level security applied at the API layer. The database may store the full SSN, but the API endpoint configured for the RAG agent's role returns "ssn": "***-**-1234" or excludes the field from the response schema. This masking must happen at the gateway, not in the LLM orchestration code. If the full value reaches the orchestrator, it is already too late.
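A gateway-side masking pass is straightforward to sketch. The field names and policies here are illustrative, not a specific schema or product configuration:

```python
# Per-role masking policy: which fields to transform and which to drop.
MASK_POLICY = {
    "ssn": lambda v: "***-**-" + v[-4:],          # keep last four digits
    "email": lambda v: "***@" + v.split("@")[-1],  # keep domain only
}
DROP_FIELDS = {"credit_card"}  # never leaves the gateway in any form

def mask_record(record):
    out = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue
        rule = MASK_POLICY.get(field)
        out[field] = rule(value) if rule else value
    return out

row = {"name": "Ada", "ssn": "123-45-1234",
       "email": "ada@example.com", "credit_card": "4111111111111111"}
print(mask_record(row))
```

The masked record is all the orchestrator and the LLM ever see; the raw row never crosses the gateway boundary.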
Audit trails complete the security picture. Every data retrieval in a RAG pipeline should be logged with enough detail to reconstruct what happened: which user prompted the query, which agent executed it, which endpoint was called, what parameters were passed, and how many records were returned. This is not just for compliance. It is for debugging. When a RAG pipeline produces an incorrect answer, you need to trace back through the retrieval chain to find whether the problem was bad data, wrong query parameters, or hallucination.
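A sufficient audit entry is just a structured record per retrieval. The field names below are an illustrative minimum, not a standard log schema:

```python
import json
from datetime import datetime, timezone

def audit_entry(user_id, agent_id, endpoint, params, record_count, latency_ms):
    """One structured log line per retrieval: enough to reconstruct the call."""
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user_id,          # who prompted the query
        "agent": agent_id,        # which agent executed it
        "endpoint": endpoint,     # which endpoint was called
        "params": params,         # what parameters were passed
        "records": record_count,  # how many records came back
        "latency_ms": latency_ms,
    }

entry = audit_entry("u-17", "support-bot", "/api/v1/orders",
                    {"customer_id": 42}, record_count=3, latency_ms=41)
print(json.dumps(entry))
```

With entries like this, tracing a bad RAG answer back to the retrieval that fed it is a log query rather than an archaeology project.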
Token-based authentication ties everything together. Each LLM agent or orchestration service authenticates with an API key or OAuth token that maps to a specific role. The gateway validates the token, applies the role's permissions and field masks, and logs the request against that identity. Rotating keys and revoking access happens at the gateway without touching the database or the LLM configuration.
Integrating DreamFactory with LangChain and LlamaIndex
LangChain and LlamaIndex are the two dominant orchestration frameworks for building RAG pipelines. Both support tool-calling patterns where an LLM agent invokes external APIs as part of its reasoning chain. The integration point between these frameworks and a data access layer is the tool definition, which describes an API endpoint the agent can call.
In LangChain, you define tools using the Tool class or the @tool decorator. Each tool wraps an API call with a name, description, and input schema. For structured data RAG, each tool maps to an API endpoint on your data access layer. A tool called get_customer_orders would call GET /api/v1/orders?customer_id={id} and return the JSON response. The agent uses the tool description to decide when to call it and what parameters to pass.
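The anatomy of such a tool can be shown in plain Python without the framework dependency. This sketch mirrors the shape of a LangChain tool (name, description, input schema, callable) rather than using its actual API, and the gateway URL and endpoint are hypothetical; here the tool builds the request URL it would issue instead of performing the HTTP call:

```python
import urllib.parse

def make_tool(name, description, schema, fn):
    # A stand-in for a framework tool definition: the description is what
    # the agent reads when deciding whether to call the tool.
    return {"name": name, "description": description,
            "schema": schema, "call": fn}

def get_customer_orders(customer_id: int):
    # In production this would be an HTTP GET against the gateway;
    # here we only construct the request the tool would send.
    query = urllib.parse.urlencode({"customer_id": customer_id})
    return f"https://gateway.example.com/api/v1/orders?{query}"

orders_tool = make_tool(
    name="get_customer_orders",
    description="Return all orders for a given customer id.",
    schema={"customer_id": "integer"},
    fn=get_customer_orders,
)

print(orders_tool["call"](42))
```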
LlamaIndex takes a similar approach through its QueryEngineTool and custom tool abstractions. You define a tool that wraps a REST API call, give it a natural-language description, and register it with an agent. The framework handles the orchestration loop: the LLM decides which tools to call, LlamaIndex executes the calls, and the results are injected into the next prompt turn.
DreamFactory's auto-generated API endpoints map directly to these tool definitions. Each database table gets a REST endpoint with filtering, sorting, pagination, and field selection built in. The DreamFactory MCP (Model Context Protocol) server takes this further by exposing the API catalog in a format that LLM agents can discover and invoke programmatically. Instead of manually defining a tool for each endpoint, the MCP server lets the agent discover available data sources and their schemas at runtime.
The practical integration pattern works like this. DreamFactory connects to your SQL databases and generates REST APIs. You configure roles with appropriate table, column, and row-level permissions. You set up field masking for sensitive columns. Then you point your LangChain or LlamaIndex agent at the DreamFactory API using either manual tool definitions or the MCP server for dynamic discovery. The agent calls the API endpoints as tools during retrieval. The gateway enforces access controls, masks fields, rate-limits requests, and logs everything. The JSON response flows back through the orchestrator into the LLM context.
This separation of concerns is the key architectural insight. The LLM orchestration framework handles reasoning, tool selection, and prompt construction. The API gateway handles data access, security, and observability. Neither system needs to understand the other's internals. LangChain does not need to know your database schema. DreamFactory does not need to know your prompt templates. They communicate through a well-defined REST interface with documented endpoints and typed parameters.
For teams building structured data RAG in production, the data access layer is where most of the engineering complexity lives. The LLM side of the pipeline (prompt engineering, chain construction, and output parsing) gets the most attention. But the retrieval side, which delivers the right data to the LLM safely and reliably, determines whether the system works in production or only in demos. An API gateway purpose-built for database access eliminates an entire category of problems: query safety, access control, field masking, rate limiting, and audit logging. That infrastructure lets your team focus on the RAG logic instead of building and maintaining custom data plumbing.