Open Source / Apache 2.0

PostgreSQL HA
at the Edge

Centralized management for PostgreSQL High Availability clusters across 100s of edge Kubernetes clusters. Automatic failover, split-brain prevention, and deep observability from a single dashboard.

<5s Failover Time
100s of Edge Clusters
0 CRDs
4 Go Binaries
60+ REST Endpoints

The Problem

Edge clusters are behind NAT

You can't reach them directly. Traditional push-based management doesn't work.

Failover must be local

A round-trip to a central control plane during a primary failure adds unacceptable latency.

Split-brain is catastrophic

Two primaries writing to the same cluster corrupts data permanently.

No visibility at scale

Managing hundreds of PG clusters across edge sites with kubectl is impossible.

The Solution

Satellite-initiated connections

Satellites connect outward via persistent gRPC streams. No inbound access required.

Per-pod failover sidecar

Leader election via K8s Leases. Promotion in seconds with no central dependency.

SQL fencing + timeline recovery

Immediate write blocking, automatic pg_rewind, zero-restart container recovery.

Single pane of glass

Web dashboard with real-time health, replication lag, slow queries, and one-click switchover.

Features

Everything you need to run PostgreSQL HA at the edge, built from scratch in Go.

Automatic Failover

Per-pod sidecar with Kubernetes Lease-based leader election. Primary failure detected within 15 seconds (lease expiry), promotion in under 5.
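
The election mechanics are standard Kubernetes: a Lease with a 15-second TTL, renewed by the current primary's sidecar. As a sketch, the stock client-go leaderelection helper reproduces the same behavior; the namespace, Lease name, and identity below are illustrative, not PG-Swarm's actual values.

package main

import (
    "context"
    "log"
    "time"

    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // One Lease per PG cluster; whoever holds it is the primary.
    lock, err := resourcelock.New(resourcelock.LeasesResourceLock,
        "pg-swarm", "pg-demo-leader", // namespace and Lease name (illustrative)
        client.CoreV1(), client.CoordinationV1(),
        resourcelock.ResourceLockConfig{Identity: "pg-demo-0"}) // this pod
    if err != nil {
        log.Fatal(err)
    }

    leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
        Lock:          lock,
        LeaseDuration: 15 * time.Second, // the 15s detection window
        RenewDeadline: 10 * time.Second,
        RetryPeriod:   2 * time.Second,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) {
                log.Println("lease won: run SELECT pg_promote() and relabel the pod")
            },
            OnStoppedLeading: func() {
                log.Println("lease lost: fence writes and demote to standby")
            },
        },
    })
}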

🛡

Split-Brain Prevention

SQL fencing blocks writes and kills connections instantly. Old primary auto-demotes to standby via K8s exec. No data corruption.
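
PG-Swarm's exact fencing statements aren't shown here, but the standard shape of SQL fencing looks like this sketch with pgx/v5: force new transactions read-only, reload, then terminate surviving client sessions. The DSN is illustrative.

package main

import (
    "context"
    "log"

    "github.com/jackc/pgx/v5"
)

// fence makes a suspect primary safe: no new writes, no surviving sessions.
func fence(ctx context.Context, dsn string) error {
    conn, err := pgx.Connect(ctx, dsn)
    if err != nil {
        return err
    }
    defer conn.Close(ctx)

    for _, q := range []string{
        `ALTER SYSTEM SET default_transaction_read_only = on`, // block new writes
        `SELECT pg_reload_conf()`,                             // apply without a restart
        `SELECT pg_terminate_backend(pid)
           FROM pg_stat_activity
          WHERE pid <> pg_backend_pid()
            AND backend_type = 'client backend'`, // kill in-flight sessions
    } {
        if _, err := conn.Exec(ctx, q); err != nil {
            return err
        }
    }
    return nil
}

func main() {
    if err := fence(context.Background(), "postgres://postgres@localhost/postgres"); err != nil {
        log.Fatal(err)
    }
}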

🔄

Timeline Recovery

After failover, replicas auto-detect timeline divergence and recover via pg_rewind or re-basebackup. No manual intervention.
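
A sketch of that rewind-or-reclone decision, assuming the standard tool flags; the path and connection string are illustrative, and the real sidecar runs this inside its supervisor loop.

package main

import (
    "log"
    "os/exec"
)

// recoverTimeline tries the cheap path first: pg_rewind copies only the
// blocks that diverged from the new primary's timeline.
func recoverTimeline(pgdata, sourceConn string) error {
    rewind := exec.Command("pg_rewind",
        "--target-pgdata="+pgdata,
        "--source-server="+sourceConn)
    if out, err := rewind.CombinedOutput(); err == nil {
        log.Printf("pg_rewind succeeded: %s", out)
        return nil
    }
    // Fallback: full re-clone when rewind isn't possible (e.g. wal_log_hints
    // was off, or the required WAL is gone). The data directory must be
    // emptied before pg_basebackup runs.
    clone := exec.Command("pg_basebackup",
        "-D", pgdata,
        "-d", sourceConn,
        "-X", "stream", // stream the WAL needed for consistency
        "-R") // write standby settings so streaming resumes
    out, err := clone.CombinedOutput()
    log.Printf("pg_basebackup: %s", out)
    return err
}

func main() {
    if err := recoverTimeline("/var/lib/postgresql/data",
        "host=pg-demo-rw user=replicator"); err != nil {
        log.Fatal(err)
    }
}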

📈

Deep Observability

Replication lag, connections, disk, WAL stats, per-database cache hit ratios, table stats, and slow queries from pg_stat_statements.
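
For flavor, the replication-lag metric can be derived from pg_stat_replication on the primary. This sketch (not PG-Swarm's actual collector) prints byte lag per connected standby:

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/jackc/pgx/v5"
)

func main() {
    ctx := context.Background()
    conn, err := pgx.Connect(ctx, "postgres://postgres@localhost/postgres")
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close(ctx)

    // Byte lag per connected standby, as seen from the primary.
    rows, err := conn.Query(ctx, `
        SELECT application_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)::bigint
          FROM pg_stat_replication`)
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()
    for rows.Next() {
        var name string
        var lagBytes int64
        if err := rows.Scan(&name, &lagBytes); err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%s: %d bytes behind\n", name, lagBytes)
    }
}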

🛠

Planned Switchover

Satellite-controlled 9-step orchestration with real-time progress tracking via WebSocket: fence with drain, checkpoint, and promote, with a point-of-no-return indicator.

📦

Profiles & Rules

Reusable cluster templates with deployment rules that auto-deploy to satellites matching label selectors. Fleet-scale management.

💾

Managed Backups

Per-pod backup sidecar with GCS and SFTP storage. Physical (base + incremental + WAL archiving) and logical (pg_dump) backups scheduled via internal cron. PITR support. Being reimplemented from scratch.

🌐

No CRDs Required

Builds StatefulSets, Services, ConfigMaps, Secrets, and RBAC from scratch. No Custom Resource Definitions, no operator frameworks, minimal footprint.
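
Concretely, "no CRDs" means the satellite creates plain built-in objects with client-go. A trimmed sketch (names, namespace, and image are illustrative):

package main

import (
    "context"
    "log"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    replicas := int32(3)
    ss := &appsv1.StatefulSet{
        ObjectMeta: metav1.ObjectMeta{Name: "pg-demo", Namespace: "pg-swarm"},
        Spec: appsv1.StatefulSetSpec{
            Replicas:    &replicas,
            ServiceName: "pg-demo-headless",
            Selector: &metav1.LabelSelector{
                MatchLabels: map[string]string{"app": "pg-demo"},
            },
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "pg-demo"}},
                Spec: corev1.PodSpec{Containers: []corev1.Container{
                    {Name: "postgres", Image: "postgres:17"},
                    // failover and backup sidecars would be appended here
                }},
            },
        },
    }
    if _, err := client.AppsV1().StatefulSets("pg-swarm").Create(
        context.Background(), ss, metav1.CreateOptions{}); err != nil {
        log.Fatal(err)
    }
}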

🔒

Secure by Default

Token-based auth with SHA-256 hashing. Constant-time comparison. Create-only secrets. Identity stored in K8s Secrets.
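
The hash-then-compare pattern behind that claim is small enough to show whole. This sketch uses only the standard library and may differ in detail from PG-Swarm's auth code; hashing first also normalizes length, which is what makes the constant-time comparison valid.

package main

import (
    "crypto/sha256"
    "crypto/subtle"
    "fmt"
)

// tokenMatches hashes the presented token and compares it to the stored
// hash in constant time, so timing can't leak how many bytes matched.
func tokenMatches(presented string, storedHash [32]byte) bool {
    h := sha256.Sum256([]byte(presented))
    return subtle.ConstantTimeCompare(h[:], storedHash[:]) == 1
}

func main() {
    stored := sha256.Sum256([]byte("expected-token"))
    fmt.Println(tokenMatches("expected-token", stored)) // true
}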

Zero-Restart Recovery

Postgres container runs in a supervisor loop. Demotion, timeline recovery, and restart happen in-place without incrementing Kubernetes restart counts.
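
The supervisor-loop idea in miniature, as a sketch: the sidecar process stays alive as the container's entrypoint and restarts postgres in place, so Kubernetes never sees the container exit.

package main

import (
    "log"
    "os"
    "os/exec"
    "time"
)

func main() {
    for {
        cmd := exec.Command("postgres", "-D", "/var/lib/postgresql/data")
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        if err := cmd.Run(); err != nil {
            log.Printf("postgres exited: %v", err)
        }
        // Between exit and the next start, the sidecar can demote, run
        // pg_rewind, or rewrite config; the container never exits, so the
        // pod's restart count stays at zero.
        time.Sleep(2 * time.Second)
    }
}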

🔧

Live Config Changes

Three-mode config update system: pg_reload_conf for sighup params, rolling restart for postmaster params, full cluster shutdown for replication-sensitive params. Database-driven parameter classification with admin panel.
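
A sketch of the three-mode dispatch; the classification entries below are illustrative examples of each PostgreSQL parameter context, with the real mapping loaded from central's database.

package main

import "fmt"

type applyMode int

const (
    reloadConf      applyMode = iota // sighup-class: SELECT pg_reload_conf()
    rollingRestart                   // postmaster-class: restart pods one at a time
    clusterShutdown                  // replication-sensitive: full stop, then start
)

// classes would be loaded from central's parameter-classification table.
var classes = map[string]applyMode{
    "work_mem":       reloadConf,      // reloadable
    "shared_buffers": rollingRestart,  // postmaster context
    "wal_level":      clusterShutdown, // replication-sensitive
}

func modeFor(param string) applyMode {
    if m, ok := classes[param]; ok {
        return m
    }
    return clusterShutdown // unknown params take the safest path
}

func main() { fmt.Println(modeFor("shared_buffers")) }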

🗂

Cluster Databases

Dynamic database and user management at the cluster level. CREATE ROLE + CREATE DATABASE via sidecar command. Per-database IP subnet access control (HBA rules). Zero pod restart.

📜

Recovery Rules

Centrally managed log-based recovery rules with pattern matching. 40+ built-in patterns for WAL issues, replication failures, and crash loops. Admin panel with regex sandbox.
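
The matching core is plain regex over log lines. A toy version (these two rules are examples in the same spirit, not the built-in set):

package main

import (
    "fmt"
    "regexp"
)

type rule struct {
    pattern *regexp.Regexp
    action  string
}

var rules = []rule{
    {regexp.MustCompile(`requested WAL segment .* has already been removed`), "re-basebackup"},
    {regexp.MustCompile(`record with incorrect prev-link`), "pg_rewind"},
}

// matchLine returns the recovery action for the first rule that matches.
func matchLine(line string) (string, bool) {
    for _, r := range rules {
        if r.pattern.MatchString(line) {
            return r.action, true
        }
    }
    return "", false
}

func main() {
    action, ok := matchLine(`FATAL: requested WAL segment 000000010000000000000003 has already been removed`)
    fmt.Println(action, ok) // re-basebackup true
}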

Web Dashboard

Real-time visibility into every PostgreSQL instance across your fleet.

Cluster Overview

Instance table with role badges, ready/WAL status dots, connection bars, disk usage, and timeline IDs. Expand to see per-pod details.

Instance Detail

Drill down to see disk vs WAL breakdown, WAL statistics, per-database sizes with cache hit ratios, table stats, and slow queries.

Profile Editor

6-tab configuration editor: General, Volumes, Resources, PostgreSQL parameters (50+), HBA Rules, and Recovery Rules.

Deployment Rules

Map profiles to satellites via label selectors. When a rule matches, clusters are auto-created and pushed. Fleet-scale management.

Switchover Progress

9-step switchover progress modal with real-time WebSocket updates, point-of-no-return indicator, and rollback status for pre-PONR failures.

Update Rules

Admin panel for PostgreSQL parameter classification. Define which params use pg_reload_conf (sighup), rolling restart (postmaster), or full cluster shutdown (replication-sensitive).

Cluster Databases

Per-cluster database management tab. CREATE ROLE + CREATE DATABASE via sidecar command. CIDR-based IP access control (HBA rules) per database. Zero pod restart.

Config Versioning

Version history with revert for profile changes. Per-cluster approval workflow before config updates are applied. Full audit trail of every configuration change.

Event Log

State transitions, failovers, switchovers, backup completions, and errors with severity icons. Per-cluster event filtering.

Satellite Logs

Terminal-style log viewer with SSE streaming, server-side and client-side level filtering, remote log level control, auto-scroll, and clear.

Recovery Rules

Centrally managed recovery rule sets with inline rule editing, pattern sandbox for testing regex against sample log lines, and per-cluster attachment.

Admin Console

5-tab admin page: Storage Tiers with satellite mappings, Image Variants for postgres base images, PG Version registry, Recovery Rules editor, and Update Rules for parameter classification.

Screenshots

The embedded dashboard in action, running against mock data.

Architecture

Bidirectional gRPC streaming across every layer. Continuous log monitoring with pattern-driven recovery rules. Real-time WebSocket state push. Sidecar command dispatch for zero-latency switchover.

PG-Swarm architecture diagram

Central

cmd/central

gRPC server for satellite streams. REST API with 60+ endpoints. WebSocket hub for real-time updates. Embedded React dashboard. PostgreSQL metadata store with auto-migrations.

Satellite

cmd/satellite

Lightweight agent per edge cluster. Persistent gRPC stream with auto-reconnection. Kubernetes operator that builds PG clusters from JSON configs.
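
The connection direction is the key trick: the satellite dials out and keeps the link alive, so central can push commands back down without any inbound firewall rule. A sketch of that outbound loop with gRPC keepalives; the address is illustrative and the generated proto client is stubbed out.

package main

import (
    "context"
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    "google.golang.org/grpc/keepalive"
)

// serveStream stands in for the generated bidirectional stream client:
// send status updates, block on central's commands until the link drops.
func serveStream(ctx context.Context, conn *grpc.ClientConn) error {
    _ = conn
    return nil
}

func main() {
    ctx := context.Background()
    backoff := time.Second
    for {
        conn, err := grpc.DialContext(ctx, "central.example.com:9090",
            grpc.WithTransportCredentials(insecure.NewCredentials()), // TLS in real deployments
            grpc.WithKeepaliveParams(keepalive.ClientParameters{
                Time:    30 * time.Second, // periodic pings keep NAT mappings alive
                Timeout: 10 * time.Second,
            }))
        if err == nil {
            if err := serveStream(ctx, conn); err != nil {
                log.Printf("stream broke: %v", err)
            }
            conn.Close()
            backoff = time.Second // reset after a healthy session
        }
        time.Sleep(backoff)
        if backoff *= 2; backoff > time.Minute {
            backoff = time.Minute
        }
    }
}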

Failover Sidecar

cmd/failover-sidecar

Per-pod sidecar for leader election via K8s Leases. Detects split-brain, fences writes, demotes old primary, recovers timeline divergence. Log watcher with 40+ recovery patterns. Bidirectional gRPC streaming to satellite for command dispatch.

How Failover Works

From primary failure to full recovery in under 20 seconds, with zero manual intervention.

1

Primary Dies

The primary pod crashes or becomes unreachable. The failover sidecar's lease renewal stops.

2

Lease Expires (15s)

After 15 seconds without renewal, the leader lease expires. All replica sidecars detect this on their next tick.

3

Replica Acquires Lease

One replica wins the lease via optimistic locking (resourceVersion). The others see the conflict and back off.

4

pg_promote()

The winning replica calls pg_promote(), transitions to primary, and labels its pod pg-swarm.io/role=primary (see the sketch after these steps).

5

RW Service Routes

The Kubernetes RW Service selector picks up the new primary label. Applications reconnect transparently.

6

Timeline Recovery

Other replicas detect the timeline divergence, run pg_rewind (or re-basebackup), and start streaming from the new primary. No container restarts.
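
Steps 4 and 5 in miniature, as a sketch combining pgx and client-go; the namespace and DSN are illustrative, and the real sidecar's code will differ in detail.

package main

import (
    "context"
    "log"
    "os"

    "github.com/jackc/pgx/v5"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func promote(ctx context.Context, podName string) error {
    // Step 4: ask postgres to leave recovery; pg_promote(true) waits until done.
    conn, err := pgx.Connect(ctx, "postgres://postgres@localhost/postgres")
    if err != nil {
        return err
    }
    defer conn.Close(ctx)
    if _, err := conn.Exec(ctx, `SELECT pg_promote(true)`); err != nil {
        return err
    }

    // Step 5: flip the label the RW Service selects on.
    cfg, err := rest.InClusterConfig()
    if err != nil {
        return err
    }
    client := kubernetes.NewForConfigOrDie(cfg)
    patch := []byte(`{"metadata":{"labels":{"pg-swarm.io/role":"primary"}}}`)
    _, err = client.CoreV1().Pods("pg-swarm").Patch(ctx, podName,
        types.StrategicMergePatchType, patch, metav1.PatchOptions{})
    return err
}

func main() {
    if err := promote(context.Background(), os.Getenv("HOSTNAME")); err != nil {
        log.Fatal(err)
    }
}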

Backup & Restore

Per-pod backup sidecar with role-aware scheduling. Primary handles WAL archiving, replicas run scheduled backups. Internal cron, gzip compression, and point-in-time recovery.

💾

Physical Backups

Full base backups and incremental backups (PG 17+ changed blocks). WAL archiving on the primary with archive_command auto-configured via pg_reload_conf. No pod restart needed.

📂

Logical Backups

pg_dump-based logical backups with gzip compression. Per-database or full cluster. Scheduled independently from physical backups via internal cron.

🔄

Role-Aware Scheduling

Primary pod handles WAL archiving and metadata. Replica pods run scheduled base, incremental, and logical backups. Automatic role detection at sidecar startup.

📋

WAL Auto-Config

Backup configuration auto-sets archive_mode, archive_command, and summarize_wal in postgresql.conf. HBA changes applied via pg_reload_conf with zero downtime.

Point-in-Time Recovery

Restore to any point in time from the dashboard. Select a base backup and target timestamp. The satellite handles StatefulSet scaling and recovery setup.
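
Under the hood, PITR on PG 12+ is a restored base backup plus targeted recovery settings. A sketch of that setup step; the restore_command shown is an invented placeholder for however the sidecar fetches archived WAL.

package main

import (
    "fmt"
    "log"
    "os"
    "path/filepath"
)

func preparePITR(pgdata, target string) error {
    conf := fmt.Sprintf(
        "restore_command = 'backup-sidecar fetch-wal %%f %%p'\n"+ // illustrative command
            "recovery_target_time = '%s'\n"+
            "recovery_target_action = 'promote'\n", target)
    f, err := os.OpenFile(filepath.Join(pgdata, "postgresql.auto.conf"),
        os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
    if err != nil {
        return err
    }
    defer f.Close()
    if _, err := f.WriteString(conf); err != nil {
        return err
    }
    // recovery.signal tells postgres to start in targeted recovery mode.
    return os.WriteFile(filepath.Join(pgdata, "recovery.signal"), nil, 0o600)
}

func main() {
    if err := preparePITR("/var/lib/postgresql/data", "2025-01-01 12:00:00+00"); err != nil {
        log.Fatal(err)
    }
}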

📦

Gzip Compression

All backups are compressed before upload. Base backups use pg_basebackup -z. Logical dumps are gzipped. Retention managed by the sidecar scheduler.
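
Roughly what the scheduled invocation looks like; the flags below are standard pg_basebackup options (gzip requires tar format), though PG-Swarm's actual flags and paths may differ.

package main

import (
    "log"
    "os/exec"
)

func baseBackup(dest string) error {
    cmd := exec.Command("pg_basebackup",
        "-D", dest, // produces base.tar.gz and pg_wal.tar.gz
        "-Ft", "-z", // tar format with built-in gzip
        "-X", "stream", // include the WAL needed for a consistent restore
        "--checkpoint=fast")
    out, err := cmd.CombinedOutput()
    log.Printf("pg_basebackup: %s", out)
    return err
}

func main() {
    if err := baseBackup("/backups/base-20250101"); err != nil {
        log.Fatal(err)
    }
}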

Destinations

gcs / sftp

Google Cloud Storage and SFTP servers. Credentials stored as K8s Secrets on each satellite. Destination configured per cluster.

Backup Sidecar

cmd/backup-sidecar

Per-pod sidecar container injected by the satellite operator. Runs alongside PostgreSQL with shared volume access. Internal cron scheduler for all backup types. HTTP API for status and on-demand triggers.

Tech Stack

Language        Go 1.26
Communication   gRPC + Protobuf v3
Database        PostgreSQL (pgx/v5)
REST API        GoFiber v2
Logging         zerolog
Dashboard       React 19 + Vite + JSX
K8s Client      client-go v0.35

Quick Start

Get PG-Swarm running locally in under a minute.

Build from source

make build        # Compile all binaries
make test         # Run unit tests
make lint         # golangci-lint

Minikube (local dev)

make minikube-build-all
make k8s-deploy-all
make k8s-status

Production / Other K8s

# Push images to your registry
DOCKER_REPO=your.registry/pg-swarm make docker-push-all

# Deploy central (with metadata PG)
kubectl apply -k deploy/k8s/central/base/

# Deploy satellite per edge cluster — edit configmap to set CENTRAL_ADDR, K8S_CLUSTER_NAME, and REGION
kubectl apply -k deploy/k8s/satellite/base/

# Or create a custom kustomize overlay: deploy/k8s/satellite/overlays/prod/kustomization.yaml