The Business Case for Bringing It Home
Cloud computing democratised access to enterprise-grade infrastructure, but it also normalised a subtle form of vendor dependency. For a small consultancy, the death-by-a-thousand-cuts of SaaS pricing is real: $50 here for a managed database, $120 there for a CI runner, $200 for log aggregation, $300 for a Kubernetes control plane you do not fully control. Individually, each tool seems reasonable. Collectively, they erode margins and constrain architectural decisions.
We reached an inflection point in 2025. Our client workload was growing, our testing requirements were expanding, and our monthly cloud invoice was rising faster than either. We asked a simple question: could we host this ourselves without hiring a dedicated DevOps engineer?
The answer, we discovered, was yes — but only because the barrier to sophisticated self-hosting has dropped dramatically. Modern AI-assisted development means a two-person consultancy can write Terraform modules, Ansible playbooks, and Kubernetes manifests at a speed and quality that would have required a specialist team five years ago. The tools have matured, the documentation is exhaustive, and AI fills the gaps when you are staring at a YAML file wondering why a pod is stuck in Pending.
The Network Foundation: Ubiquiti and VLAN Segmentation
Every secure infrastructure starts with the network. We run a Ubiquiti Dream Machine Pro as our gateway and firewall, paired with a UniFi 24-port PoE switch. This is not hobbyist gear with an enterprise sticker — it is a genuinely capable platform that gives us routing, intrusion detection, VLAN trunking, and firewall rule visualisation in a single pane.
Our network is segmented into seven VLANs, each with a specific trust boundary:
| VLAN | Purpose | Internet Access | Notes |
|---|---|---|---|
| 10 | Management | No | Proxmox, SSH, IPMI only |
| 20 | Internal Tools | Optional | Coolify, Prometheus, Portainer |
| 30 | Client Applications | Yes | Per-client app workloads |
| 40 | Databases | No | App VLAN only; no direct WAN |
| 50 | Reverse Proxy | Yes | Caddy/Nginx; sole public ingress |
| 60 | k3s Cluster | Restricted | 11-node Kubernetes on Proxmox |
| 70 | Talos Cluster | Restricted | Immutable Kubernetes for GitOps |
The UDM Pro is the single source of truth for inter-VLAN routing. Proxmox does not route between networks; it merely attaches VMs to the correct virtual bridge. If a client application needs to talk to a database, the request traverses the firewall, where we log it, rate-limit it, and block it if the pattern changes. This is defence in depth without complexity: the firewall enforces policy, the hypervisor provides compute, and the guest OS handles application-level hardening.
Proxmox: The Virtualisation Layer
We run a three-node Proxmox cluster built from refurbished mini PCs. This is not a datacenter; it is a shelf in our office, and that is precisely the point. Proxmox gives us enterprise virtualisation — KVM, LXC, live migration, ZFS storage, and a robust API — without licensing costs.
Every VM is born from a Packer-built golden image: Ubuntu Server 24.04, QEMU guest agent, cloud-init, and baseline hardening baked in. From there, Terraform declares the VM’s existence: vCPUs, RAM, disk size, which Proxmox node it lives on, and critically, which VLAN its network interface attaches to. This means a client database VM and a public reverse-proxy VM can coexist on the same physical host while being logically air-gapped at the network layer.
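To give a sense of what "baseline hardening baked in" means in practice, here is a minimal cloud-init sketch in the spirit of our golden image; the user name, key, and package list are placeholders rather than our exact profile:

```yaml
#cloud-config
# Illustrative baseline only; the real Packer template carries more hardening.
package_update: true
packages:
  - qemu-guest-agent        # lets Proxmox query guest state and trigger clean shutdowns
  - unattended-upgrades     # automatic security patching
users:
  - name: ops               # hypothetical admin account
    groups: [sudo]
    shell: /bin/bash
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... # placeholder key
ssh_pwauth: false           # SSH keys only; no password authentication
runcmd:
  - systemctl enable --now qemu-guest-agent
```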
We follow a one-VM-per-concern principle unless there is an explicit justification to colocate. This makes debugging, scaling, and recovery simpler. When a client project ends, terraform destroy removes the VM and its storage. When a new project starts, terraform apply brings up an identically configured environment in minutes.
The Kubernetes Workhorse: k3s on VLAN 60
Our primary application platform is a k3s cluster: three control-plane nodes with embedded etcd, fronted by a kube-vip virtual IP for high availability, and eight worker nodes with differentiated classes. Three workers are general-purpose; two are storage-heavy, running Longhorn distributed block storage on dedicated disks; and three are lighter-weight nodes for less demanding workloads.
The entire cluster lives on VLAN 60 (10.0.60.0/24), isolated from our internal tools and management networks. Traefik serves as the ingress controller, MetalLB handles load-balancer services in L2/ARP mode, and Longhorn provides replicated persistent volumes. We can deploy a client application, expose it via an ingress, and have TLS termination handled automatically — all without leaving our LAN.
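As a concrete example, MetalLB's L2 configuration amounts to two small manifests; the address range below is illustrative rather than our production pool:

```yaml
# MetalLB in L2/ARP mode on the cluster VLAN.
# The range is a placeholder carved out of 10.0.60.0/24, outside any DHCP scope.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: vlan60-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.0.60.200-10.0.60.240
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: vlan60-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - vlan60-pool
```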
What makes this practical rather than precarious is the provisioning pipeline. Terraform creates the VMs across the Proxmox nodes for physical redundancy. Ansible then takes over in six ordered plays: OS baseline, control-plane initialisation, kube-vip deployment (critical — the VIP must exist before the other control planes join), control-plane membership, worker registration, and finally cluster addons. If a node fails, we replace the VM, re-run the playbook, and the cluster heals.
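The top-level playbook makes that ordering explicit. A simplified sketch, with illustrative file names:

```yaml
# site.yml: the six plays in their required order.
- import_playbook: 00-os-baseline.yml         # hardening, packages, time sync on every node
- import_playbook: 01-control-plane-init.yml  # first k3s server with --cluster-init
- import_playbook: 02-kube-vip.yml            # the VIP must exist before peers join
- import_playbook: 03-control-plane-join.yml  # remaining servers join via the VIP
- import_playbook: 04-workers.yml             # agents register against the VIP
- import_playbook: 05-addons.yml              # MetalLB, Longhorn, Traefik configuration
```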
Coolify: Our Internal Platform-as-a-Service
Kubernetes is powerful but opinionated. For simpler applications — internal tools, client prototypes, one-off services — we wanted something closer to Heroku or Render in ergonomics without the per-seat pricing. We found it in Coolify.
Coolify runs on a dedicated VM in VLAN 20 and functions as our self-hosted PaaS. It connects to our Docker VMs via SSH, builds containers from Git repositories, and deploys them with automatic SSL certificates, database provisioning, and environment variable management. For a standard web application with a Postgres database, the workflow is identical to a managed platform: push to Git, Coolify builds and deploys, domain is live.
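Coolify can build from a Dockerfile or deploy a Compose file straight from the repository. A minimal sketch of the latter, with illustrative service names and images:

```yaml
# compose.yaml: hypothetical client prototype. Coolify supplies the
# environment variables and terminates TLS at its proxy.
services:
  web:
    build: .
    environment:
      DATABASE_URL: ${DATABASE_URL}   # injected via Coolify's env management
    expose:
      - "3000"
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
```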
The difference is ownership. We control the build servers, the registry, the runtime, and the data. When a client asks where their data lives, we can point to a specific VM on a specific VLAN in our office. There is no multi-tenant SaaS provider between us and the hardware.
The Experimental Edge: Talos OS and GitOps
Parallel to our k3s cluster, we maintain a smaller Talos OS cluster for experimental and internal platform workloads. Talos is a minimal, immutable Linux distribution designed specifically for Kubernetes. There is no SSH access, no package manager, and no drift — the entire OS is configured via a declarative API.
This cluster runs Argo CD for GitOps: every application, every configuration, every certificate is defined in a Git repository and applied automatically. MetalLB, Traefik, and cert-manager are all managed as Argo CD Applications with sync-wave annotations that guarantee correct ordering. The result is a cluster where the desired state is version-controlled and the actual state converges toward it continuously.
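An addon Application looks roughly like this; the repository URL and path are placeholders, and the sync-wave annotation is what guarantees MetalLB exists before anything that requests a LoadBalancer IP:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: metallb
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "0"   # lower waves sync first
spec:
  project: default
  source:
    repoURL: https://git.example.com/infra/talos-platform.git  # placeholder
    targetRevision: main
    path: addons/metallb
  destination:
    server: https://kubernetes.default.svc
    namespace: metallb-system
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual drift
    syncOptions:
      - CreateNamespace=true
```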
Talos forces discipline. Because you cannot log in and tweak things, every change must travel through the GitOps pipeline. This feels restrictive until you need to rebuild the cluster from scratch. Then it becomes a superpower: apply the bootstrap manifest, sync the root application, and the entire platform reconstitutes itself.
Infrastructure as Code: The Real Competitive Advantage
The hardware is neat, but the code is what makes this sustainable. Our infrastructure repository contains:
- Packer configurations for golden VM images
- Terraform modules for Proxmox and BinaryLane VPS provisioning
- Ansible playbooks for OS hardening, Docker setup, k3s installation, and service configuration
- Kubernetes manifests and Helm values for cluster addons
- Argo CD Application definitions for the Talos cluster
- RackPeek YAML files documenting physical hardware layout
This is not documentation that happens to be executable. It is executable infrastructure that happens to be self-documenting. If our office flooded tomorrow, we could stand up the entire stack on fresh hardware — or on cloud VPSs — by following the recovery order: install Proxmox, restore Terraform state, recreate VMs, run Ansible, restore database backups.
BinaryLane, an Australian VPS provider, serves as our external continuity layer. Critical client workloads have Terraform configurations that target both Proxmox and BinaryLane. The same Ansible playbooks configure both. We practice asymmetric redundancy: primary on-prem, failover in the cloud, same code driving both.
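In practice the inventory, not the playbook, decides where a workload lands. A simplified sketch with placeholder host names:

```yaml
# inventory.yml: the same plays target either group; only the hosts differ.
all:
  children:
    app_servers:
      children:
        onprem:                      # Proxmox VMs, primary
          hosts:
            app-01.lan:
            app-02.lan:
        cloud:                       # BinaryLane VPSs, warm failover
          hosts:
            app-fo-01.example.net:
```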
The AI Multiplier
None of this would be practical without AI-assisted development. Writing a Terraform module for Proxmox VM provisioning, an Ansible role for k3s hardening, or an Argo CD Application manifest is not inherently difficult, but it is detailed. The context surface is enormous: provider APIs, module versions, privilege models, network CIDRs, storage backends, certificate chains.
AI tools — Claude, GitHub Copilot, and the agentic coding workflows we have built around them — collapse that context. They help us generate initial configurations, debug Pending pods, write firewall rules, and even reason about disaster recovery sequences. The result is that a boutique consultancy with two senior developers can operate infrastructure that would have required a mid-sized DevOps team a few years ago.
This is the real shift: self-hosting is no longer the preserve of hobbyists or enterprises with dedicated platform teams. It is now a legitimate strategic option for small businesses that value control, predictability, and data sovereignty.
Results and Lessons
Six months after cutting over, our infrastructure spend is down roughly 70%. The remaining cost is primarily internet bandwidth, hardware depreciation, and a small BinaryLane reserve. We have gained something more valuable than cost savings, though: optionality. We can choose where workloads run. We can spin up production-identical environments for testing. We can tell clients exactly where their data resides and who has access to it.
The lessons are practical, not philosophical:
- Start with the network. VLAN segmentation feels like over-engineering until it saves you from a misconfigured firewall rule.
- One VM per concern. Colocation saves negligible resources and costs enormous debugging time.
- GitOps is worth the learning curve. The ability to rebuild a cluster from a repository is worth every hour spent on Argo CD sync waves.
- AI is a force multiplier, not a replacement. It accelerates writing infrastructure code, but you still need to understand what it generates.
- Document your recovery order. The best infrastructure code is useless if nobody knows which playbook to run first after a power outage.
Self-hosting is not a religion. It is a business decision. For us, the math was simple: we could continue renting our platform month-to-month at escalating prices, or we could invest in owning it. We chose ownership. The tools — Proxmox, k3s, Coolify, Talos, Terraform, Ansible, and the AI that helps us wield them — made it not just possible, but practical.