TL;DR
- Apache 2.0 fork of n8n's self-hosted AI starter kit. Ten services in one Docker Compose: Ollama, n8n, Open WebUI, Supabase, Qdrant, Neo4j, Flowise, SearXNG, Langfuse, Caddy.
- Runs on NVIDIA, AMD ROCm, Apple Silicon (host Ollama), or plain CPU. Tested fine on a cheap CPU-only VPS with no GPU at all.
- Ships prebuilt n8n RAG and agent workflows. Bring your own model and your own data; nothing leaves your box.
System Requirements
| Requirement | Minimum |
| --- | --- |
| RAM | 8 GB |
| GPU | NVIDIA, AMD ROCm, or none (CPU only) |
| VRAM | 6 GB+ for a 7B model |
You do not need an OpenAI key to run a real AI agent. local-ai-packaged is an Apache 2.0 fork of the official n8n self-hosted AI starter kit, expanded by Cole Medin into a ten-service Docker Compose stack that runs local models, vector search, knowledge graphs, automations, and full LLM observability on your own machine. One clone, one env file, one Python command, and you have the same building blocks the cloud vendors charge by the token for.
What local-ai-packaged Actually Is
The original n8n self-hosted AI starter kit shipped n8n, Ollama, Postgres, and Qdrant in a Compose file. Cole Medin's fork keeps that core and adds everything you actually need to run a private agent in production: Supabase for data and auth, Open WebUI for chat, Flowise for no-code flows, Neo4j for knowledge graphs, Langfuse for tracing, SearXNG for private search, and Caddy as the reverse proxy with automatic HTTPS. It also ships prebuilt n8n workflows in n8n/backup/workflows/ and an n8n_pipe.py function that lets Open WebUI hand a chat off to an n8n agent.
Everything is open source. Nothing phones home. The whole project is licensed Apache 2.0 and lives at github.com/coleam00/local-ai-packaged.
Here is Cole walking through the stack himself in the project announcement video:
Watch: local-ai-packaged announcement, by Cole Medin, the project creator (YouTube)
What Is in the Box
Ten services, each one solving a real piece of the agent puzzle:
| Service | What it does | Default port |
| --- | --- | --- |
| Ollama | Local LLM runtime. Runs Llama 3.1, Qwen 2.5, DeepSeek, Mistral, anything on the Ollama hub. | 11434 |
| n8n | Workflow engine with 400+ nodes. The glue that turns models into agents and automations. | 5678 |
| Open WebUI | ChatGPT-style chat interface. Talks to Ollama directly or hands prompts off to n8n. | 3000 |
| Flowise | No-code visual builder for LangChain agents. The friendly path for non-developers. | 3001 |
| Supabase | Postgres plus auth, storage, and a REST API. Where your agent state and user data live. | 8000 |
| Qdrant | Vector database for RAG. Stores embeddings of your documents for semantic search. | 6333 |
| Neo4j | Graph database. Powers GraphRAG, where relationships matter as much as content. | 7474, 7687 |
| SearXNG | Privacy-first metasearch. Aggregates 200+ search engines without tracking you. | 8080 |
| Langfuse | LLM observability. Traces every prompt, response, and tool call so you can debug agents. | 3210 |
| Caddy | Reverse proxy with automatic HTTPS via Let's Encrypt. The public face of the stack. | 80, 443 |
The combination matters. A model alone is a parrot. A model plus a vector store gives you RAG (retrieval-augmented generation, where the model looks up real text before answering). Add a graph database and you have GraphRAG. Wrap it all in n8n and you have an agent that can call tools and chain steps. Trace it in Langfuse and you can ship to production with confidence. Most stacks give you one or two of these. This one gives you all five out of the box.
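The retrieval step is easiest to see in miniature. Here is a hedged sketch in plain Python: toy three-dimensional vectors stand in for real Ollama embeddings, and a list stands in for a Qdrant collection, but the ranking logic is the same cosine-similarity search the stack performs.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy corpus: in the real stack these vectors come from an Ollama
# embedding model and live in a Qdrant collection.
docs = [
    ("Caddy terminates TLS on ports 80 and 443", [0.9, 0.1, 0.0]),
    ("Qdrant stores document embeddings for RAG", [0.1, 0.9, 0.2]),
    ("Neo4j powers GraphRAG over entity graphs", [0.0, 0.2, 0.9]),
]

def retrieve(query_vec, k=1):
    # Rank every document by similarity to the query, keep the top k.
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# A query vector "near" the second document should surface the Qdrant line.
print(retrieve([0.2, 0.8, 0.1]))
```

The retrieved chunks then get pasted into the prompt before the model answers, which is the entire trick behind RAG.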
What You Will Need
local-ai-packaged is happy on modest hardware. Here is the honest minimum:
- A host machine running Linux, macOS, or Windows with WSL2. Linux is the smoothest path. Any cheap VPS with KVM or LXC virtualization works.
- Docker and Docker Compose v2. The official get.docker.com script installs both on Debian and Ubuntu in one line.
- Python 3.11 or newer. The start_services.py bootstrap script handles the Compose orchestration for you.
- RAM: 8 GB is the floor for the supporting services alone. 16 GB lets you run a 7B or 8B model alongside everything else. 32 GB is comfortable for 13B and 14B models.
- Disk: 25 GB free for the images and base data, plus whatever your model files weigh (a quantized Llama 3.1 8B is about 5 GB).
- Optional GPU: NVIDIA on Linux or Windows is the easy path. AMD on Linux works through ROCm. Apple Silicon cannot expose the Metal GPU to Docker, so on a Mac you run Ollama natively and point the stack at it.
Install With One Command
The whole bootstrap is five steps. Allow about ten minutes the first time, mostly waiting for image pulls.
1. Install Docker and Python
On a fresh Debian or Ubuntu box:
```bash
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
sudo apt install -y python3 python3-pip git
newgrp docker
```
2. Clone the repo
```bash
git clone -b stable https://github.com/coleam00/local-ai-packaged.git
cd local-ai-packaged
```
The stable branch is the tested release. The main branch is the development tip and breaks more often.
3. Generate your secrets
Copy the example env file and fill in the long list of secrets it expects:
```bash
cp .env.example .env
nano .env
```
You need to populate at least these keys before the first start:
- N8N_ENCRYPTION_KEY, N8N_USER_MANAGEMENT_JWT_SECRET
- POSTGRES_PASSWORD, JWT_SECRET, ANON_KEY, SERVICE_ROLE_KEY
- DASHBOARD_USERNAME, DASHBOARD_PASSWORD
- NEO4J_AUTH (format: neo4j/your-password)
- CLICKHOUSE_PASSWORD, MINIO_ROOT_PASSWORD
- LANGFUSE_SALT, NEXTAUTH_SECRET, ENCRYPTION_KEY
Generate each random value with:
```bash
openssl rand -hex 32
```
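If you would rather script the whole list at once, Python's stdlib can do the same job as openssl. A hedged sketch: the key names below match the list above, minus ANON_KEY and SERVICE_ROLE_KEY, which are Supabase JWTs signed with JWT_SECRET rather than plain random hex, and minus the username fields you pick yourself.

```python
import secrets

# Keys that take a plain random hex value. NEO4J_AUTH takes the
# neo4j/your-password form instead, so it is set by hand.
KEYS = [
    "N8N_ENCRYPTION_KEY", "N8N_USER_MANAGEMENT_JWT_SECRET",
    "POSTGRES_PASSWORD", "JWT_SECRET",
    "CLICKHOUSE_PASSWORD", "MINIO_ROOT_PASSWORD",
    "LANGFUSE_SALT", "NEXTAUTH_SECRET", "ENCRYPTION_KEY",
]

def make_env_lines():
    # token_hex(32) emits 64 hex characters, the same shape as
    # `openssl rand -hex 32`, and can never contain the forbidden @.
    return [f"{key}={secrets.token_hex(32)}" for key in KEYS]

print("\n".join(make_env_lines()))
```

Paste the output into .env, then fill in the remaining non-random keys by hand.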
One critical gotcha: your Postgres password cannot contain the @ character. The Supabase pooler treats it as a URL separator and the connection silently breaks.
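The failure mode is easy to see once you remember the connection string format: postgres://user:password@host:port/db. An unescaped @ in the password gives the URL two @ signs, and a parser that splits userinfo at the first one reads the wrong host. Percent-encoding (@ becomes %40) is the standard escape, as this small stdlib demo shows, but since the pooler still mishandles it in practice, the safe move is simply a password with no @ at all.

```python
from urllib.parse import quote

password = "s3cr3t@pass"  # contains the forbidden character

# Naive interpolation produces a URL with two @ signs. A parser that
# splits userinfo at the first @ now sees the host as "pass@db".
url = f"postgres://postgres:{password}@db:5432/postgres"
assert url.count("@") == 2

# Percent-encoding is the general-purpose escape: @ becomes %40.
safe = quote(password, safe="")
print(safe)  # s3cr3t%40pass
```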
4. Pick your profile and start
The start_services.py bootstrap takes a --profile flag that decides how Ollama runs:
```bash
# NVIDIA GPU on Linux or Windows
python3 start_services.py --profile gpu-nvidia

# AMD GPU on Linux
python3 start_services.py --profile gpu-amd

# CPU only (works anywhere, just slower)
python3 start_services.py --profile cpu

# Bring your own Ollama (e.g. native on macOS)
python3 start_services.py --profile none
```
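The decision logic behind the four flags is simple enough to write down. This is a hypothetical helper, not part of the repo, that maps detected hardware to the --profile value start_services.py expects:

```python
# Hypothetical helper: picks the --profile flag for start_services.py
# from the hardware you are on. Not shipped with the repo.
def pick_profile(has_nvidia: bool, has_rocm: bool, is_macos: bool) -> str:
    if is_macos:
        return "none"        # run Ollama natively to get Metal acceleration
    if has_nvidia:
        return "gpu-nvidia"  # requires nvidia-container-toolkit on the host
    if has_rocm:
        return "gpu-amd"     # Linux only
    return "cpu"             # works everywhere, just slower

print(pick_profile(has_nvidia=False, has_rocm=False, is_macos=True))  # none
```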
The first run pulls all the images and creates the volumes. Subsequent starts take seconds.
5. Open the services
Once the script reports everything healthy, browse to:
- n8n: http://localhost:5678
- Open WebUI: http://localhost:3000
- Supabase Studio: http://localhost:8000
- Flowise: http://localhost:3001
- Neo4j Browser: http://localhost:7474
- Langfuse: http://localhost:3210
- SearXNG: http://localhost:8080
Create the n8n owner account on your first visit to port 5678. The same applies to Open WebUI: the first user to sign up becomes the admin.
Wire It Together: Your First Local RAG Agent
The repo ships prebuilt workflows in n8n/backup/workflows/. Import the RAG workflow into n8n and you get a functioning agent that does this on every chat:
- Open WebUI sends the user message to n8n through the n8n_pipe.py function.
- n8n embeds the question with a local Ollama embedding model.
- It queries Qdrant for the top matching document chunks. This is the "retrieval" step in retrieval-augmented generation.
- It feeds the chunks plus the question to Llama 3.1 running in Ollama, and gets a grounded answer back.
- The full trace, prompt, response, latency, and token counts get logged to Langfuse for replay.
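The handoff in the first step is, underneath, just an HTTP POST to an n8n webhook. Here is a hedged sketch of what a pipe-style function sends; the webhook path and field names are illustrative, not the actual n8n_pipe.py contract, so check the imported workflow's Webhook node for the real payload shape.

```python
import json
import urllib.request

def build_payload(message: str, session_id: str) -> dict:
    # Field names are illustrative; the workflow's Webhook node
    # defines what it actually expects.
    return {"chatInput": message, "sessionId": session_id}

def call_n8n_agent(message: str, session_id: str,
                   webhook_url: str = "http://localhost:5678/webhook/invoke") -> dict:
    # Hypothetical webhook URL; requires the stack to be running.
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(build_payload(message, session_id)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

print(build_payload("What does clause 7 say?", "session-1"))
```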
To swap the model, pull a new one in the Ollama container:
```bash
docker exec -it ollama ollama pull qwen2.5:14b
```
Then change the model name in the n8n workflow node. To switch from vector RAG to GraphRAG, repoint the retrieval step at Neo4j and replace the chunk query with a Cypher query that walks the entity graph. Same workflow, different store, often dramatically better answers on document sets where relationships matter (legal contracts, org charts, codebases).
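To make the GraphRAG variant concrete, here is an illustrative Cypher query a graph retrieval step might run: start from entities mentioned in the question and walk relationships outward to collect connected facts. The labels and properties are examples for the sketch, not a schema the repo ships.

```python
# Illustrative GraphRAG retrieval: the node labels (Entity, Chunk) and
# properties are assumptions, not the repo's actual schema.
CYPHER = """
MATCH (e:Entity)-[*1..2]-(fact:Chunk)
WHERE e.name IN $mentioned_entities
RETURN DISTINCT fact.text AS text
LIMIT $k
"""

def build_query(entities: list[str], k: int = 5) -> tuple[str, dict]:
    # Parameterized query plus its parameter map, ready for a Neo4j
    # driver's session.run(query, **params) call.
    return CYPHER, {"mentioned_entities": entities, "k": k}

query, params = build_query(["Acme Corp", "indemnity"])
print(params)
```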
Hardware Acceleration: Pick Your Profile
The profile decides where Ollama does its work. Pick the one that matches your box:
| Profile | Best for | What gets accelerated | Caveat |
| --- | --- | --- | --- |
| gpu-nvidia | Any box with an NVIDIA card | Ollama uses CUDA for inference | Needs the nvidia-container-toolkit installed on the host |
| gpu-amd | Linux box with a recent Radeon | Ollama uses ROCm | Linux only; ROCm support is uneven across cards |
| cpu | Any machine, including a cheap VPS | Nothing, plain CPU inference | Slower, but works everywhere. 7B and 8B models are usable on a modern CPU. |
| none | macOS or any host where you already run Ollama | Whatever your local Ollama uses (Metal on Apple Silicon) | You point the stack at host.docker.internal:11434 instead of the bundled container |
The CPU profile is the surprise hero. We tested this on a Hetzner dedicated root server inside a CPU-only LXC container, no GPU at all, and Llama 3.1 8B answered RAG queries in five to fifteen seconds depending on context length. That is slow compared to a GPU, but it is fast enough for a personal knowledge base, a small team, or a side project. The bigger your CPU, the faster it goes. Nothing else in the stack changes.
Tips and Gotchas
Apple Silicon needs native Ollama. Docker on macOS cannot see the Metal GPU. Install Ollama from the official installer, run it on the host, then start the stack with --profile none and point n8n at http://host.docker.internal:11434.
n8n v2 disables risky nodes by default. The Local File Trigger and Execute Command nodes are in the deny list. Uncomment the NODES_EXCLUDE=[] line in the compose file if you actually want them, and understand the security trade-off before you do.
Supabase storage needs stub S3 settings. As of February 2026, the storage container fails to start without GLOBAL_S3_BUCKET=stub, REGION=stub, STORAGE_TENANT_ID=stub, plus a fake S3 access key and secret in your .env. The "stub" values are intentional. The storage service just refuses to boot without them set.
SearXNG needs a chmod. Before the first start, run chmod 755 searxng from the repo root or the container will refuse to write its config.
Public deployment goes through Caddy only. When you expose this stack to the internet, run python3 start_services.py --environment public, point your DNS A records at the box, and let Caddy handle ports 80 and 443. Do not publish any other port. UFW does not block Docker-published ports, so any exposed container port is reachable from outside even with the firewall on.
Container upgrades are a three-step dance (swap in your own profile on the last line):
```bash
docker compose -p localai down
docker compose -p localai pull
python3 start_services.py --profile gpu-nvidia
```
Postgres analytics can corrupt itself. If Langfuse or Supabase analytics refuses to start after a crash, delete supabase/docker/volumes/db/data and let Postgres reinitialize. You lose Supabase data, so back it up first.
What You Can Build With This
The stack is a building block. Four projects you can wire up in a weekend:
A private RAG over your own documents. Drop PDFs, Markdown notes, or Notion exports into a watched folder. n8n picks them up, splits them into chunks, embeds them with a local Ollama embedding model, and stores the vectors in Qdrant. Open WebUI then lets you chat with the entire archive. Nothing leaves your box. This replaces a paid ChatGPT Enterprise subscription for a single developer or a small team.
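The "splits them into chunks" step is the part you will most likely end up tuning. A minimal sketch of fixed-size chunking with overlap, which is one common baseline; real pipelines often split on headings or sentence boundaries instead:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Overlapping windows, so a fact that straddles a boundary still
    # lands intact inside at least one chunk.
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks

doc = "x" * 1200
print([len(c) for c in chunk_text(doc)])  # [500, 500, 300]
```

Each chunk then gets its own embedding and its own Qdrant point, so retrieval can return just the relevant slice of a long document.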
A meeting-notes agent. Run Whisper locally to transcribe a meeting recording, hand the transcript to Llama 3.1 for a summary, extract action items into a Neo4j graph that links people to tasks, and email everyone their personal action list. The whole pipeline is one n8n workflow. Langfuse traces every step so you can debug the prompts when the summary misses something.
A privacy-first research assistant. SearXNG fetches results from 200+ search engines without tracking. n8n hands the top hits to a local model that reads, condenses, and cites them. You get a research bot that never leaks your queries to Google or Bing, and Langfuse keeps a replayable history of every search session.
A no-code agent in Flowise. Non-developers on the team can drag and drop their own LangChain agents in Flowise, and they all share the same Ollama backend. One model, one box, no per-token bill, every member of the team running their own private GPT.
You now own a complete local AI stack. Clone the repo, fill the env file, pick a profile, run one command, and you have models, vector search, graphs, agents, and observability running on your own hardware. Start with one prebuilt workflow and a small Llama 3.1 model. Once one agent works end to end, the rest is copy and paste. The OpenAI bill can wait.
Tested On
We ran this stack on a small CPU-only LXC container on a Hetzner dedicated root server, with no GPU at all. Everything came up clean on the first try once the env file was filled, and Llama 3.1 8B answered RAG queries in about ten seconds end to end. Bigger boxes will be faster. Nothing else changes.
Tested on: Hetzner dedicated root server, CPU-only LXC container, no GPU
Profile: --profile cpu
Model: Llama 3.1 8B via Ollama
Date: 2026-04-11