← All posts May 6, 2026

Build a Simple RAG Pipeline with Ollama (No Cloud Required)

Retrieval-Augmented Generation (RAG) lets a language model answer questions using your documents instead of relying solely on what it learned during training. Most tutorials assume you're calling OpenAI or another cloud API. But if you're thinking about AI security, running everything locally with Ollama is a great way to experiment without sending your data anywhere, and it's a good way to understand the attack surface of RAG systems firsthand.

Here's a minimal RAG setup you can run on your own machine in about 20 minutes.

What You'll Need

Ollama installed (ollama pull llama3.2 and ollama pull nomic-embed-text)
Python 3.10+
chromadb and ollama Python packages (pip install chromadb ollama)

Step 1: Chunk and Embed Your Documents

import ollama
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")

documents = [
    "Ollama lets you run open-weight LLMs locally.",
    "RAG retrieves relevant context before generating an answer.",
    "Prompt injection can occur when untrusted text is passed to an LLM.",
]

for i, doc in enumerate(documents):
    response = ollama.embeddings(model="nomic-embed-text", prompt=doc)
    collection.add(
        ids=[str(i)],
        embeddings=[response["embedding"]],
        documents=[doc],
    )

Step 2: Retrieve Relevant Chunks

def retrieve(query, n_results=2):
    query_embedding = ollama.embeddings(model="nomic-embed-text", prompt=query)
    results = collection.query(
        query_embeddings=[query_embedding["embedding"]],
        n_results=n_results,
    )
    return results["documents"][0]

Step 3: Generate an Answer Using Retrieved Context

def ask(query):
    context = retrieve(query)
    prompt = f"""Use the following context to answer the question.

Context:
{chr(10).join(context)}

Question: {query}
Answer:"""

    response = ollama.generate(model="llama3.2", prompt=prompt)
    return response["response"]

print(ask("What is RAG?"))

That's it: a working local RAG loop of embed, store, retrieve, generate.

Why This Matters for Security

This toy example also demonstrates where RAG systems get risky in production:

Untrusted content in the context window. Anything retrieved and stuffed into the prompt is treated as trusted input by the model, including instructions an attacker planted in a document.
No access control. This demo has one collection and no user scoping. In a real app, mixing documents from different users or trust levels in the same vector store is how data leaks across accounts.
No input/output filtering. Nothing here sanitizes what goes in or comes out.

In the next post, I'll poison this same vector database with a malicious document and show how it can hijack the model's output, a preview of why RAG pipelines need the same security scrutiny as any other data pipeline.

Full code for this demo is on my GitHub. Link coming soon.

AI security RAG Ollama