How LLMs Access Enterprise Data: Patterns and Pitfalls
Key takeaway: There are five mainstream patterns for connecting large language models to enterprise databases. Most of them trade security for convenience. API-mediated retrieval is the only pattern that preserves existing access controls, prevents credential exposure, and fits into enterprise governance frameworks without requiring a new security model.
Large language models are useful only when they can reach the data that matters. For most enterprises, that data lives in relational databases: customer records, financial transactions, inventory systems, operational logs. The question is not whether to connect an LLM to that data. The question is how to do it without creating a security incident.
This article surveys five integration patterns that have emerged over the past two years. Each solves the data access problem differently, with distinct trade-offs in security, latency, cost, and complexity. If you are building an AI data gateway or evaluating architectures for LLM-powered applications, this is the landscape you need to understand.
Five Ways LLMs Access Enterprise Data
The patterns fall into two broad categories: direct database access and mediated access. Direct access means the LLM (or an agent acting on its behalf) connects to the database and executes queries. Mediated access means a service layer sits between the model and the data, enforcing rules the model cannot bypass.
The five patterns are: text-to-SQL generation, retrieval-augmented generation (RAG) over structured data via APIs, fine-tuning on proprietary datasets, function calling with tool use, and API-mediated retrieval through a gateway. These are not mutually exclusive. Production systems often combine two or three. But each introduces its own failure modes, and understanding those failure modes is the point of this survey.
The differences matter most at the boundaries: what happens when a user asks for data they should not see, when a query would be too expensive to run, or when the model hallucinates a valid-looking but destructive SQL statement.
Text-to-SQL: The Tempting Trap
Text-to-SQL is the most direct approach. The model receives a natural language question, generates a SQL query, and a runtime executes that query against a live database. The results come back as rows, which the model formats into a human-readable answer. It is elegant on a whiteboard. In production, it is a loaded gun.
The first problem is SQL injection via prompt. Traditional SQL injection exploits input fields in web forms. Prompt-based SQL injection is subtler. An attacker crafts a prompt that causes the model to generate a query containing malicious clauses: DROP TABLE, UNION SELECT from sensitive tables, or subqueries that exfiltrate data the user should not access. The model does not understand that it is being manipulated. It generates syntactically valid SQL because that is what it was told to do.
The second problem is credential exposure. For text-to-SQL to work, the runtime needs database credentials. Those credentials typically have broad read access (and sometimes write access) because the model needs to query arbitrary tables. If the runtime is compromised, or if credentials leak through logs or error messages, the attacker gets the same access the model had. There is no scoping mechanism inherent in the pattern.
The third problem is the absence of RBAC. A database connection string does not know who the end user is. If a sales representative and a chief financial officer both use the same text-to-SQL interface, they get the same data. Implementing per-user access control requires building an authorization layer on top of the SQL generation pipeline. At that point, you are building a gateway.
There are mitigations: read-only replicas, query validation layers that reject DDL and DML, allowlists of permitted tables and columns. But each mitigation adds complexity, and the resulting system starts to look like an API layer with extra steps. Text-to-SQL works for internal analytics where all users are trusted. For anything touching PII, financial records, or health data, the risks outweigh the convenience.
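To make the mitigation concrete, here is a minimal sketch of a query validation layer of the kind described above. The table names and keyword list are illustrative assumptions, and a coarse regex filter like this is no substitute for a real SQL parser in production; it only shows where such a check sits in the pipeline.

```python
import re

# Hypothetical allowlist: the only tables the runtime may touch.
ALLOWED_TABLES = {"orders", "products", "customers_public"}

# Statements that should never reach a read-only analytics path.
FORBIDDEN_KEYWORDS = re.compile(
    r"\b(drop|delete|update|insert|alter|truncate|grant|revoke)\b",
    re.IGNORECASE,
)

def validate_generated_sql(sql: str) -> bool:
    """Reject model-generated SQL that is not a plain SELECT against
    allowlisted tables. A coarse filter, not a parser: a production
    system should use a real SQL parser instead."""
    stripped = sql.strip().rstrip(";")
    if not stripped.lower().startswith("select"):
        return False
    if FORBIDDEN_KEYWORDS.search(stripped):
        return False
    # Naive table extraction: identifiers following FROM or JOIN.
    tables = re.findall(r"\b(?:from|join)\s+([a-z_][a-z0-9_]*)",
                        stripped, re.IGNORECASE)
    return all(t.lower() in ALLOWED_TABLES for t in tables)

print(validate_generated_sql("SELECT id, total FROM orders"))  # True
print(validate_generated_sql("DROP TABLE orders"))             # False
print(validate_generated_sql("SELECT * FROM salaries"))        # False
```

Notice how quickly the validator accumulates policy: an allowlist, a keyword denylist, table extraction. Each rule is another piece of the API layer being rebuilt by hand.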
RAG Over Structured Data: APIs as the Retrieval Layer
Retrieval-augmented generation is the dominant pattern for grounding LLM responses in factual data. In its most common form, RAG retrieves unstructured documents (PDFs, wiki pages, support tickets) from a vector database. But RAG works equally well with structured data when the retrieval layer is an API rather than a vector search.
The pattern works like this. The LLM receives a user query. A preprocessing step extracts the intent and relevant parameters: customer ID, date range, product category. Those parameters are used to call a REST API endpoint that returns the matching records as JSON. The JSON is injected into the model's context window alongside the original query. The model generates a response grounded in the retrieved data.
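The retrieval-and-grounding step can be sketched as follows. The endpoint path and parameter names are hypothetical, and the `fetch` callable stands in for an authenticated HTTP GET against the API; the point is that the model only ever sees the JSON the API chose to return.

```python
import json
from typing import Callable

def build_grounded_prompt(
    user_query: str,
    params: dict,
    fetch: Callable[[str, dict], dict],
) -> str:
    """Assemble an LLM prompt grounded in records retrieved from a
    REST endpoint. `fetch` wraps the real HTTP call (carrying the
    user's auth token), so the model never touches the database."""
    # The API enforces authn/authz; an unauthorized request fails
    # here with a 403 instead of returning a result set.
    records = fetch("/api/v2/orders", params)
    return (
        "Answer using only the JSON records below.\n"
        f"Records: {json.dumps(records)}\n"
        f"Question: {user_query}"
    )

# Usage with a stand-in fetch; production code would issue an
# authenticated request to the gateway instead.
fake_fetch = lambda path, params: {"orders": [{"id": 4521, "total": 99.5}]}
prompt = build_grounded_prompt(
    "What did customer 17 spend last week?",
    {"customer_id": 17, "range": "last_7_days"},
    fake_fetch,
)
```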
This is fundamentally different from text-to-SQL because the model never sees the database. It sees API responses. The API enforces authentication. The API enforces authorization. The API enforces rate limits. The API returns only the fields the caller is permitted to see. If the model asks for data the user cannot access, the API returns a 403, not a result set.
The trade-off is flexibility. Text-to-SQL can answer any question that SQL can express. API-mediated RAG can only answer questions the API endpoints were designed to support. If a user asks for a join across three tables that no endpoint covers, the system cannot answer. This is a feature, not a bug, from a security perspective. But it means the API surface must be designed thoughtfully. A well-designed data access layer for RAG pipelines anticipates the kinds of questions users will ask and exposes endpoints accordingly.
DreamFactory, a platform that auto-generates REST APIs from database schemas, reduces the effort required to build this retrieval layer. Instead of hand-coding endpoints for every table and relationship, you point it at a database and get a full CRUD API with role-based access control, API key management, and rate limiting already in place. The generated endpoints become the retrieval surface for RAG pipelines.
Fine-Tuning vs Real-Time Retrieval
Fine-tuning is the process of training a model on a proprietary dataset so that the knowledge is encoded in the model's weights. After fine-tuning, the model can answer questions about the data without any retrieval step. The data is baked in.
This sounds appealing until you consider three realities. First, fine-tuning is expensive. Training runs on enterprise-scale datasets cost thousands of dollars in compute and take hours or days. Every time the underlying data changes, you need to retrain. For data that changes daily (inventory, pricing, customer records), fine-tuning is permanently stale.
Second, fine-tuning has no access control mechanism. Once the data is in the model weights, it is accessible to every user of that model. You cannot fine-tune a model on salary data and then prevent a non-HR user from asking about salaries. The model does not distinguish between authorized and unauthorized queries. It simply knows what it knows.
Third, fine-tuning introduces hallucination risk at the data layer. A retrieval-based system can cite its sources: "this answer came from the /orders endpoint, record ID 4521." A fine-tuned model cannot point to where it learned a fact. When it generates an incorrect answer, there is no audit trail to diagnose the error. For regulated industries where data provenance matters (finance, healthcare, government), this is a non-starter.
Fine-tuning has valid use cases. Teaching a model industry-specific vocabulary, medical abbreviations, or codebase naming conventions is well-suited to it. But using fine-tuning as a data access pattern for live enterprise data is a misapplication of the technique.
Real-time retrieval via API keeps data where it belongs: in the database, governed by the same policies that govern every other access path. The model gets fresh data on every request. Access control is enforced per request. And when the answer is wrong, you can trace it back to the exact API call and response payload that produced it.
Why API-Mediated Access Wins for Enterprises
Function calling, introduced by OpenAI in mid-2023 and now supported by most major model providers, gives LLMs a structured way to invoke external tools. The model outputs a JSON object describing the function it wants to call and the arguments it wants to pass. The runtime executes the function and returns the result. This is a significant improvement over text-to-SQL because the model does not generate arbitrary code. It selects from a predefined set of functions.
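A minimal sketch of the selection-not-generation contrast: the runtime holds a registry of predefined tools (the schema shape loosely follows the JSON-Schema style used by major providers), and a model-emitted call is dispatched only if it names a registered tool. The `get_order` tool and its handler are illustrative assumptions, not any provider's actual API.

```python
import json

# A predefined tool the model may select; it cannot emit arbitrary SQL.
TOOLS = {
    "get_order": {
        "description": "Fetch one order by ID for the current user.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "integer"}},
            "required": ["order_id"],
        },
    }
}

def dispatch(tool_call_json: str) -> dict:
    """Execute a model-emitted tool call. Unknown tools are rejected
    rather than executed, which is the key contrast with text-to-SQL."""
    call = json.loads(tool_call_json)
    name, args = call["name"], call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    # Hypothetical handler; a real runtime would call the gateway API here.
    if name == "get_order":
        return {"order_id": args["order_id"], "status": "shipped"}

result = dispatch('{"name": "get_order", "arguments": {"order_id": 4521}}')
```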
But function calling is a mechanism, not a policy. It tells you how the model invokes a tool. It does not tell you who is allowed to invoke which tools, how many times per minute, or what data the tool should return for a given user. Those are governance questions, and they require an infrastructure layer to answer.
This is where API-mediated access becomes the clear winner for enterprise deployments. An API gateway sits between the model's function calls and the database. It provides authentication (who is making this request), authorization (what are they allowed to access), rate limiting (how often can they ask), field-level filtering (which columns should be visible), audit logging (what did they access and when), and request validation (is this a well-formed query that will not bring down the database).
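The gateway checks listed above can be sketched in a few lines. The roles, field policies, and rate limits here are invented for illustration, and a real gateway would back the request log with shared storage and emit audit records; the sketch only shows how the checks compose per request.

```python
import time
from collections import defaultdict

# Hypothetical per-role policy: visible fields and requests per minute.
POLICIES = {
    "sales_rep": {"fields": {"id", "status"}, "rpm": 30},
    "cfo":       {"fields": {"id", "status", "total", "margin"}, "rpm": 120},
}
_request_log = defaultdict(list)  # user -> timestamps; in-memory for the sketch

def gateway(user: str, role: str, record: dict) -> dict:
    """Apply authorization (role must exist), rate limiting (sliding
    one-minute window), and field-level filtering to one request.
    Audit logging is elided for brevity."""
    policy = POLICIES.get(role)
    if policy is None:
        raise PermissionError("unknown role")      # maps to HTTP 403
    now = time.time()
    window = [t for t in _request_log[user] if now - t < 60]
    if len(window) >= policy["rpm"]:
        raise RuntimeError("rate limit exceeded")  # maps to HTTP 429
    _request_log[user] = window + [now]
    return {k: v for k, v in record.items() if k in policy["fields"]}

row = {"id": 4521, "status": "shipped", "total": 99.5, "margin": 0.4}
print(gateway("alice", "sales_rep", row))  # {'id': 4521, 'status': 'shipped'}
```

The same `gateway` function serves every client, which is the consolidation argument: an LLM agent and a web application hit identical checks.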
None of these capabilities exist in the model. None of them exist in the function calling specification. They exist in infrastructure, and the standard infrastructure for mediating data access is an API. This is not a new idea. It is the same architecture that governs how web applications, mobile apps, and third-party integrations access enterprise data. LLMs are just another client.
The operational advantage is consolidation. When AI agents need database access, they use the same API that the web application uses. One set of access policies. One rate limiting configuration. One audit log. One set of API keys to rotate. The alternative, building a parallel access path for AI that bypasses the existing API layer, doubles the governance surface and guarantees that the two paths will drift out of sync.
DreamFactory generates these API layers automatically from existing database schemas, producing secured REST endpoints with role-based access control that both traditional applications and LLM-based systems can consume through the same interface. The generated APIs include per-role field masking, request throttling, and comprehensive audit trails, which are precisely the controls that text-to-SQL and fine-tuning cannot provide.
The pattern is straightforward. Expose your data through authenticated, authorized API endpoints. Point your RAG pipeline or function-calling runtime at those endpoints. Let the API layer handle security, governance, and access control. The model does what models are good at: understanding natural language and generating useful responses. The API does what APIs are good at: enforcing rules consistently, at scale, without exception. If your API layer can keep up with the volume and velocity of LLM-driven requests, connecting a model to your data is an integration project, not a research project.