Overview
This project documents my journey building a multi-node Kubernetes cluster from scratch on budget hardware: Raspberry Pi 4 single-board computers and Intel NUC mini-PCs. The goal was to simulate real-world DevOps and SRE challenges in a homelab environment.
Problem Statement
As an embedded systems engineer transitioning to cloud-native infrastructure, I needed a hands-on environment to:
- Develop and test infrastructure-as-code without expensive cloud bills
- Experience real Kubernetes operational challenges (node failures, resource constraints, networking issues)
- Build observability and monitoring solutions from scratch
- Prototype SRE techniques and incident response procedures
Architecture & Hardware
Hardware Stack:
- 1x control plane node (Intel NUC with 16GB RAM, 512GB NVMe)
- 3x worker nodes (Raspberry Pi 4 with 8GB RAM each)
- 1x storage node (Synology NAS for persistent volumes)
- 10Gbps managed switch for low-latency networking
Key Design Decisions:
- Used kubeadm for cluster bootstrap (closest to how production clusters are typically set up)
- Chose Flannel as the CNI plugin (minimal overhead on resource-constrained nodes)
- Scheduled local etcd snapshot backups every 6 hours
- Placed kube-apiserver behind an HAProxy load balancer
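The HAProxy setup can be sketched roughly as below. The IP address, server name, and port values are illustrative assumptions, not the cluster's actual config; the key idea is TCP-mode pass-through with health checks on the API server port.

```
frontend kube-apiserver
    bind *:6443
    mode tcp
    option tcplog
    default_backend control-plane

backend control-plane
    mode tcp
    option tcp-check
    balance roundrobin
    # Hypothetical address; with a single NUC control plane there is only
    # one backend server for now.
    server cp1 192.168.1.10:6443 check fall 3 rise 2
```

With one control-plane node this mostly buys a stable API endpoint, so additional control-plane nodes can be added later without reconfiguring every kubelet.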
Networking Challenges
One of the biggest lessons was managing network policies and service discovery across a heterogeneous cluster:
- arm64 (Raspberry Pi workers) ↔ amd64 (Intel control plane) incompatibility, which required multi-arch container images
- DNS latency issues with CoreDNS on the resource-constrained nodes
- Network policies initially blocking inter-pod communication
Solution: Implemented network policies that explicitly allow traffic between namespaces, and tuned CoreDNS caching to cut lookup times from 200ms to 15ms.
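A cross-namespace allow rule can look like the following sketch. The namespace names (`apps`, `monitoring`) are hypothetical placeholders; the pattern is selecting the source namespace by its auto-populated `kubernetes.io/metadata.name` label.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-monitoring
  namespace: apps              # hypothetical target namespace
spec:
  podSelector: {}              # applies to every pod in "apps"
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
```

On the DNS side, raising the TTL of the `cache` plugin in the CoreDNS Corefile (e.g. `cache 300` instead of the default) is one way such lookup-time reductions are typically achieved.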
Lessons Learned
- Monitoring from day 1: Set up Prometheus + Grafana immediately to catch performance regressions
- Capacity planning matters: Running on mini-PCs taught me to optimize resource requests/limits aggressively
- Automation saves time: GitOps workflow (Flux CD) prevented manual deployment drift
- Documentation is critical: Maintained runbooks for common failure scenarios
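The aggressive requests/limits tuning mentioned above can be illustrated with a container-spec fragment like this. The numbers are illustrative, not the actual values used; the point is small requests so the scheduler can pack pods onto 8GB Pi workers, with limits tight enough to surface memory leaks early.

```yaml
# Hypothetical fragment of a Deployment's container spec
resources:
  requests:
    cpu: 50m        # scheduler reserves very little per pod
    memory: 64Mi
  limits:
    cpu: 250m       # throttle rather than starve neighbors
    memory: 128Mi   # OOM-kill leaky pods before the node suffers
```

Pairing this with Prometheus alerts on throttling and OOM kills makes it easier to tell when a limit is too tight rather than the workload too heavy.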
Current State & Future Work
The cluster is now running 40+ containerized applications, with 99.7% uptime over 6 months. Future improvements include:
- Upgrade to Kubernetes 1.29 (support for more advanced security policies)
- Implement service mesh (Istio) for advanced traffic management
- Set up multi-cluster replication across different physical locations
- Automate disaster recovery testing
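One small piece of the backup/DR automation can be sketched as a snapshot-rotation helper. The directory path and retention count are assumptions; the `.db` files themselves would come from `etcdctl snapshot save` running on the 6-hour schedule.

```shell
#!/bin/sh
# Sketch: keep only the N most recent etcd snapshot files in a directory.
prune_snapshots() {
    dir="$1"    # directory holding *.db snapshot files
    keep="$2"   # how many recent snapshots to retain
    # List newest-first, skip the first $keep entries, delete the rest.
    ls -1t "$dir"/*.db 2>/dev/null | tail -n +"$((keep + 1))" | while read -r f; do
        rm -f -- "$f"
    done
}
```

For example, `prune_snapshots /var/backups/etcd 12` would retain roughly three days of history at a 6-hour snapshot cadence (path and count are hypothetical).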
Resources & Code
View on GitHub →
Infrastructure-as-Code Repository →
Kubernetes Resource YAML Files →