PG-Swarm
Central Control Plane
PG-Swarm consists of three binaries: central (cmd/central), satellite (cmd/satellite), and failover sidecar (cmd/failover-sidecar). The central server is the brain of the system, exposing three interfaces:
- gRPC server (:9090) — accepts persistent bidirectional streams from satellites. Pushes cluster configurations, receives health reports, backup statuses, and event notifications.
- REST API (:8080) — 60+ endpoints for managing satellites, cluster profiles, deployment rules, backup profiles, recovery rule sets, storage tiers, image variants, and restore operations. WebSocket endpoint for real-time state push. Used by the dashboard and any external tooling.
- Embedded dashboard — React SPA served from the same binary. No separate web server needed.
Central stores all metadata in a PostgreSQL database with auto-migrations. When it starts, it runs any pending migration files in order, so schema updates are automatic.
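For illustration, a minimal sketch of how such an ordered migration runner can look in Go — the embedded file layout, table name, and helper are assumptions, not PG-Swarm's actual migration code:

```go
package central

import (
	"database/sql"
	"embed"
	"fmt"
	"sort"
)

//go:embed migrations/*.sql
var migrationFS embed.FS

// runPendingMigrations applies any not-yet-applied migration files in
// lexical (== numeric, with zero-padded prefixes) order.
func runPendingMigrations(db *sql.DB) error {
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)`); err != nil {
		return err
	}
	entries, err := migrationFS.ReadDir("migrations")
	if err != nil {
		return err
	}
	var names []string
	for _, e := range entries {
		names = append(names, e.Name())
	}
	sort.Strings(names)
	for _, name := range names {
		var applied bool
		if err := db.QueryRow(
			`SELECT EXISTS(SELECT 1 FROM schema_migrations WHERE name = $1)`, name).Scan(&applied); err != nil {
			return err
		}
		if applied {
			continue // already run on a previous start
		}
		body, err := migrationFS.ReadFile("migrations/" + name)
		if err != nil {
			return err
		}
		if _, err := db.Exec(string(body)); err != nil {
			return fmt.Errorf("migration %s: %w", name, err)
		}
		if _, err := db.Exec(`INSERT INTO schema_migrations (name) VALUES ($1)`, name); err != nil {
			return err
		}
	}
	return nil
}
```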
Satellite Agent
A satellite (cmd/satellite) is a lightweight Go binary that runs on each edge Kubernetes cluster. It does three things:
- Registers with central on first run. Central places it in a "pending" state until an admin approves it via the dashboard.
- Connects a persistent gRPC stream to central. The satellite initiates the connection outward, so no inbound firewall rules or port forwarding are needed. If the connection drops, it reconnects with exponential backoff (max 30 seconds).
- Operates PostgreSQL clusters. When it receives a ClusterConfig message from central, its built-in Kubernetes operator creates or updates StatefulSets, Services, ConfigMaps, Secrets, and RBAC resources.
The satellite's identity (ID + auth token) is stored in a Kubernetes Secret (pg-swarm-satellite-identity) so it survives pod restarts.
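A sketch of how an agent can persist its identity in that Secret with client-go — the data keys ("id", "token") are illustrative:

```go
package identity

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const secretName = "pg-swarm-satellite-identity"

// LoadOrStore returns the persisted identity, creating the Secret on first run
// so the ID and token survive pod restarts.
func LoadOrStore(ctx context.Context, cs kubernetes.Interface, ns, id, token string) (string, string, error) {
	s, err := cs.CoreV1().Secrets(ns).Get(ctx, secretName, metav1.GetOptions{})
	if err == nil {
		return string(s.Data["id"]), string(s.Data["token"]), nil
	}
	if !errors.IsNotFound(err) {
		return "", "", err
	}
	// First run: persist the freshly issued identity.
	_, err = cs.CoreV1().Secrets(ns).Create(ctx, &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: secretName},
		StringData: map[string]string{"id": id, "token": token},
	}, metav1.CreateOptions{})
	return id, token, err
}
```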
Kubernetes Operator
The operator is embedded in the satellite binary — it's not a separate CRD-based controller. Instead, it receives JSON cluster configurations over the gRPC stream and materializes them as Kubernetes resources.
The reconcile loop has 9 steps:
1. Ensure the namespace exists
2. Store the received config as a ConfigMap (for inspection)
3. Create or preserve the Secret (never overwritten — passwords survive updates)
4. Create or update the ConfigMap (postgresql.conf + pg_hba.conf)
5. Create or update Services (headless, read-only, read-write)
6. Create failover RBAC if enabled
7. Create or update the StatefulSet
8. Reconcile PVC finalizers (for deletion protection)
9. Label pods with roles (best-effort)
Each config has a config_version number. The operator tracks the last applied version per cluster and skips duplicate pushes — this makes reconciliation idempotent.
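A minimal sketch of that version-skip check — the Reconciler type and map are assumptions for illustration:

```go
package operator

import "sync"

type ClusterConfig struct {
	Name          string
	ConfigVersion int64
}

type Reconciler struct {
	mu          sync.Mutex
	lastApplied map[string]int64 // cluster name -> last applied config_version
}

// ShouldApply reports whether cfg is newer than what was last reconciled,
// so duplicate pushes of the same version become no-ops.
func (r *Reconciler) ShouldApply(cfg ClusterConfig) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.lastApplied == nil {
		r.lastApplied = make(map[string]int64)
	}
	if cfg.ConfigVersion <= r.lastApplied[cfg.Name] {
		return false // duplicate or stale push; skip
	}
	r.lastApplied[cfg.Name] = cfg.ConfigVersion
	return true
}
```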
Cluster Profiles
A cluster profile is a reusable template that defines everything about a PostgreSQL cluster: replicas, PostgreSQL version, storage, resources, pg_params, HBA rules, failover configuration, and recovery rules.
Profiles can be edited even while clusters deployed from them are running. Every edit creates a new version, and you can revert to any previous version. Changes to a profile do not automatically apply to running clusters — each cluster requires per-cluster approval before the new configuration is pushed. This gives operators full control over when and how changes roll out across edge sites.
The profile editor in the dashboard has 7 tabs: General, Volumes, Resources, PostgreSQL, HBA Rules, Databases, and Backups.
Configuration Management
PG-Swarm provides a structured system for managing PostgreSQL configuration changes across clusters, with safety mechanisms to prevent downtime from misconfigured parameters.
Three-mode config updates:
- Reload — for parameters that take effect via pg_reload_conf() (sighup parameters like work_mem, log_min_duration_statement). Applied live with zero downtime.
- Sequential restart — for postmaster parameters that require a PostgreSQL restart (like shared_buffers, max_connections). Pods are restarted one at a time in a rolling update to maintain availability.
- Full restart — for parameters like wal_level that require all pods to restart simultaneously. The StatefulSet is scaled to 0 and back up.
Config versioning: Every profile change creates a new version. The version history is preserved, and operators can revert to any previous version through the dashboard or API.
Per-cluster approval: When a profile is updated, the new configuration does not automatically apply to running clusters. Each cluster must be individually approved before the change is pushed, giving operators control over staged rollouts.
Database-driven parameter classification: PG-Swarm maintains a database of PostgreSQL parameters and their restart requirements. An admin panel allows managing which parameters need which restart mode, so the system automatically determines the correct update strategy for any configuration change.
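A sketch of how such a classification lookup can drive the update strategy — the mode names mirror the three modes above, but the inline table contents are illustrative (the real data lives in the parameters database):

```go
package config

type UpdateMode int

// Ordered from least to most disruptive, so the strictest mode wins.
const (
	ModeReload            UpdateMode = iota // pg_reload_conf(), zero downtime
	ModeSequentialRestart                   // rolling pod restart
	ModeFullRestart                         // scale to 0 and back up
)

// classification would be loaded from the parameter database managed in the
// admin panel; shown inline here for illustration.
var classification = map[string]UpdateMode{
	"work_mem":                   ModeReload,
	"log_min_duration_statement": ModeReload,
	"shared_buffers":             ModeSequentialRestart,
	"max_connections":            ModeSequentialRestart,
	"wal_level":                  ModeFullRestart,
}

// RequiredMode returns the strictest mode among all changed parameters;
// unknown parameters conservatively force a sequential restart.
func RequiredMode(changed []string) UpdateMode {
	mode := ModeReload
	for _, p := range changed {
		m, ok := classification[p]
		if !ok {
			m = ModeSequentialRestart
		}
		if m > mode {
			mode = m
		}
	}
	return mode
}
```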
Cluster Databases
Databases and users are managed at the cluster level, not in profiles. This allows individual clusters deployed from the same profile to have different application databases and access rules.
How it works:
- An operator creates a database entry for a cluster via the dashboard or API, specifying the database name, owner role, and optional IP subnet access restrictions.
- PG-Swarm sends a command to the failover sidecar on the primary pod.
- The sidecar executes CREATE ROLE and CREATE DATABASE directly against the local PostgreSQL instance.
- If per-database IP subnet access control is specified, the sidecar generates HBA rules for that database and runs pg_reload_conf() to apply them.
Zero pod restart: The entire process — role creation, database creation, and HBA rule application — happens live without any pod restarts. The failover sidecar handles the SQL commands, and pg_reload_conf() applies HBA changes immediately.
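A sketch of the sidecar-side SQL flow, assuming database/sql against the local instance — the exact statements and quoting PG-Swarm uses may differ:

```go
package sidecar

import (
	"context"
	"database/sql"
	"fmt"
)

// createDatabase provisions an owner role and database on the local primary,
// then reloads the config so freshly generated HBA rules take effect —
// all without a pod restart.
func createDatabase(ctx context.Context, db *sql.DB, name, owner, password string) error {
	// CREATE ROLE / CREATE DATABASE don't accept bind parameters, so the
	// real implementation must validate and quote identifiers carefully.
	stmts := []string{
		fmt.Sprintf(`CREATE ROLE %q LOGIN PASSWORD '%s'`, owner, password),
		fmt.Sprintf(`CREATE DATABASE %q OWNER %q`, name, owner),
	}
	for _, s := range stmts {
		if _, err := db.ExecContext(ctx, s); err != nil {
			return err
		}
	}
	// Apply any regenerated pg_hba.conf entries live.
	_, err := db.ExecContext(ctx, `SELECT pg_reload_conf()`)
	return err
}
```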
Deployment Rules
A deployment rule connects a cluster profile to satellites via label selectors. When you create a rule like:
```
Profile:   "production-standard"
Labels:    region=us-east, tier=edge
Name:      "prod-pg"
Namespace: "databases"
```
PG-Swarm finds all approved satellites whose labels contain region=us-east AND tier=edge, and creates a ClusterConfig for each one. If a new satellite is approved later with matching labels, it automatically gets the cluster.
This is how PG-Swarm scales to hundreds of edge sites — you define the intent once, and the system fans it out.
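A minimal sketch of that fan-out logic — the types are illustrative, but the AND semantics of the label match follow the description above:

```go
package rules

type Satellite struct {
	ID       string
	Approved bool
	Labels   map[string]string
}

type DeploymentRule struct {
	Profile  string
	Selector map[string]string // e.g. {"region": "us-east", "tier": "edge"}
}

// matches requires every selector pair to be present (AND semantics).
func matches(sat Satellite, sel map[string]string) bool {
	for k, v := range sel {
		if sat.Labels[k] != v {
			return false
		}
	}
	return true
}

// Targets returns the approved satellites that should receive a ClusterConfig
// for this rule, including any satellite approved after the rule was created.
func Targets(rule DeploymentRule, sats []Satellite) []Satellite {
	var out []Satellite
	for _, s := range sats {
		if s.Approved && matches(s, rule.Selector) {
			out = append(out, s)
		}
	}
	return out
}
```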
Failover Sidecar
The failover sidecar (cmd/failover-sidecar) runs as a container alongside PostgreSQL in every pod of the StatefulSet. It handles:
- Leader election via Kubernetes Leases. The primary pod holds the lease and renews it every few seconds. If the primary dies, the lease expires after 15 seconds.
- Promotion: the first replica to acquire the expired lease calls pg_promote() and relabels its pod as pg-swarm.io/role=primary.
- SQL fencing: if a split-brain is detected (two pods think they're primary), the sidecar immediately blocks writes on the old primary by revoking connection permissions and terminating active sessions.
- Demotion: the old primary is stopped, its timeline is recovered via pg_rewind, and it restarts as a standby — all without a Kubernetes container restart.
- Log watcher: real-time PostgreSQL log monitoring via K8s log API. 40+ recovery patterns across 9 categories (data corruption, OOM, WAL issues, replication failures, etc.) with configurable actions (restart, rewind, rebasebackup, event).
- Sidecar streaming: bidirectional gRPC connection to the satellite agent. Receives commands (fence, checkpoint, promote, unfence, status) during switchover and returns typed results. Persistent connection with exponential backoff.
The entire failover process — detection, promotion, service rerouting — takes under 5 seconds after the lease expires.
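For illustration, a sketch of the lease/promotion path using client-go's leaderelection helper — the lock name, the timings other than the 15-second lease, and the callback bodies are assumptions; the real sidecar adds fencing and demotion:

```go
package failover

import (
	"context"
	"database/sql"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func runElection(ctx context.Context, cs kubernetes.Interface, db *sql.DB, ns, podName string) error {
	lock, err := resourcelock.New(resourcelock.LeasesResourceLock, ns, "pg-swarm-primary",
		cs.CoreV1(), cs.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: podName})
	if err != nil {
		return err
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // lease expires 15s after the last renew
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// First replica to win the expired lease promotes itself...
				_, _ = db.ExecContext(ctx, `SELECT pg_promote()`)
				// ...then relabels its pod pg-swarm.io/role=primary (omitted).
			},
			OnStoppedLeading: func() {
				// Lost the lease: fence writes / demote (omitted).
			},
		},
	})
	return nil
}
```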
Sidecar Streaming
Each failover sidecar maintains a bidirectional gRPC stream to the satellite agent running on the same edge cluster. This enables the satellite to dispatch commands directly to specific pods during switchover and other operations.
How it works:
- On startup, the sidecar connects to the satellite's SidecarStreamService and sends a SidecarIdentity message with its pod name, cluster name, and namespace.
- The satellite's stream manager registers the connection, keyed by pod name.
- The sidecar sends heartbeats every 10 seconds to keep the connection alive.
- During switchover, the satellite looks up the target sidecar by pod name and sends commands (fence, checkpoint, promote, unfence, status).
- The sidecar executes the command locally and returns a typed CommandResult with success/error and output.
If the connection drops, the sidecar reconnects with exponential backoff (1s to 30s). This replaces the previous approach of K8s exec for switchover commands, providing lower latency and a typed command/result protocol.
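A minimal sketch of that reconnect loop — the connect callback stands in for the actual stream setup:

```go
package sidecar

import (
	"context"
	"log"
	"time"
)

// maintainStream reconnects with 1s→30s exponential backoff; connect blocks
// for as long as the stream stays healthy and returns on error.
func maintainStream(ctx context.Context, connect func(context.Context) error) {
	backoff := time.Second
	const maxBackoff = 30 * time.Second
	for ctx.Err() == nil {
		start := time.Now()
		if err := connect(ctx); err != nil {
			log.Printf("stream dropped: %v", err)
		}
		if time.Since(start) > maxBackoff {
			backoff = time.Second // connection was healthy for a while; reset
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```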
Log Watcher
The log watcher (internal/failover/logwatcher.go) monitors PostgreSQL's log output in real-time via the Kubernetes log API. It matches log lines against 40+ recovery patterns across 9 categories and triggers automated recovery actions.
Pattern categories:
- Data corruption — checksum failures, invalid page headers → rebasebackup
- OOM — out of memory, kill process → restart
- WAL issues — WAL segment not found, timeline mismatch → rewind
- Replication failures — replication terminated, primary connection lost → event
- Configuration, connection, storage, tablespace, extension errors → event
Action types: restart (stop/start PG), rewind (pg_rewind), rebasebackup (full re-sync), event (report to central), exec (custom command).
Safety: Cooldown periods prevent action storms, pattern deduplication prevents the same log line from triggering twice, and an action mutex ensures only one recovery action runs at a time.
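A sketch of those three safety mechanisms working together — types and field names are illustrative, not the actual logwatcher.go internals:

```go
package logwatcher

import (
	"regexp"
	"sync"
	"time"
)

type Rule struct {
	Pattern  *regexp.Regexp
	Action   string        // restart | rewind | rebasebackup | event | exec
	Cooldown time.Duration // minimum gap between triggers of this pattern
}

type Watcher struct {
	actionMu  sync.Mutex          // serializes recovery actions
	lastFired map[*Rule]time.Time // per-pattern cooldown bookkeeping
	seen      map[string]struct{} // log lines that already triggered an action
}

func NewWatcher() *Watcher {
	return &Watcher{
		lastFired: make(map[*Rule]time.Time),
		seen:      make(map[string]struct{}),
	}
}

// handleLine applies dedup, then cooldown, then runs at most one action at a time.
func (w *Watcher) handleLine(line string, rules []*Rule, run func(action string)) {
	if _, dup := w.seen[line]; dup {
		return // the same line must not trigger twice
	}
	for _, r := range rules {
		if !r.Pattern.MatchString(line) {
			continue
		}
		if time.Since(w.lastFired[r]) < r.Cooldown {
			return // cooling down: prevents action storms
		}
		w.seen[line] = struct{}{}
		w.lastFired[r] = time.Now()
		w.actionMu.Lock() // only one recovery action runs at a time
		run(r.Action)
		w.actionMu.Unlock()
		return
	}
}
```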
Recovery Rules
Recovery rules are centrally managed sets of log patterns and their associated actions. PG-Swarm ships with 40+ built-in patterns covering data corruption, OOM, WAL issues, replication failures, and more. Administrators can create, edit, and attach recovery rule sets to cluster profiles via the dashboard.
Each rule set contains rules with:
- Pattern — a regex to match against PostgreSQL log lines
- Action — what to do when the pattern matches (restart, rewind, rebasebackup, event, exec)
- Cooldown — minimum time between action triggers for this pattern
Admin panel with regex sandbox: The dashboard's Recovery Rules editor (in the Admin page) provides inline rule editing and a regex sandbox for testing patterns against sample log lines before deploying them. You can paste real PostgreSQL log output and verify that your patterns match the expected lines and only those lines.
Per-profile attachment: Rule sets are attached to cluster profiles, so all clusters deployed from the same profile share the same recovery behavior. Rule sets are managed via REST API (/api/v1/recovery-rule-sets) and pushed to satellites as part of the ClusterConfig.
Storage Tiers
Storage tiers provide an abstraction layer between cluster profiles and Kubernetes StorageClasses. Instead of hardcoding a StorageClass name in a profile (which varies across edge clusters), you reference a storage tier like "fast-ssd" or "standard".
Each satellite has tier mappings that map abstract tier names to actual StorageClasses available on that cluster. When the satellite receives a ClusterConfig referencing a storage tier, the operator resolves it to the mapped StorageClass.
This enables a single profile to work across hundreds of edge clusters, each with different storage providers and class names.
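A minimal sketch of tier resolution on the satellite — the example mappings are illustrative:

```go
package operator

import "fmt"

// tierMappings would come from the satellite's configured tier mappings;
// shown inline here for illustration.
var tierMappings = map[string]string{
	"fast-ssd": "premium-rwo", // e.g. on GKE
	"standard": "standard-rwo",
}

// resolveStorageClass maps an abstract tier name from the profile to the
// StorageClass actually available on this cluster.
func resolveStorageClass(tier string) (string, error) {
	sc, ok := tierMappings[tier]
	if !ok {
		return "", fmt.Errorf("no StorageClass mapped for storage tier %q", tier)
	}
	return sc, nil
}
```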
gRPC Streaming
PG-Swarm uses bidirectional gRPC streaming between central and each satellite. A single persistent stream carries all communication in both directions:
Central → Satellite:
- ClusterConfig — create or update a cluster
- DeleteCluster — remove all resources
- SwitchoverRequest — promote a specific replica
- RestoreCommand — initiate a backup restore
- SetLogLevel — change satellite log verbosity
Satellite → Central:
- Heartbeat — every 10 seconds, keeps the connection alive
- ClusterHealthReport — per-instance health metrics
- EventReport — state transitions and errors
- BackupStatusReport — backup completion/failure
- RestoreStatusReport — restore progress
- ConfigAck — confirms config was applied
Satellites initiate the connection outward, making PG-Swarm work behind NAT, firewalls, and VPNs without any inbound port requirements.
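For illustration, a sketch of a satellite-side receive loop over such a stream — the Envelope type stands in for the real proto oneof, and the handler bodies are placeholders:

```go
package satellite

import (
	"context"
	"io"
	"log"
)

// Envelope is a stand-in for the proto oneof carrying central→satellite messages.
type Envelope struct {
	ClusterConfig     *ClusterConfig
	DeleteCluster     *DeleteCluster
	SwitchoverRequest *SwitchoverRequest
}

type ClusterConfig struct{ Name string }
type DeleteCluster struct{ Name string }
type SwitchoverRequest struct{ Target string }

type stream interface{ Recv() (*Envelope, error) }

func receiveLoop(ctx context.Context, st stream) error {
	for {
		msg, err := st.Recv()
		if err == io.EOF || ctx.Err() != nil {
			return nil
		}
		if err != nil {
			return err // caller reconnects with exponential backoff
		}
		switch { // dispatch on whichever field is populated
		case msg.ClusterConfig != nil:
			log.Printf("reconcile cluster %s", msg.ClusterConfig.Name)
		case msg.DeleteCluster != nil:
			log.Printf("delete cluster %s", msg.DeleteCluster.Name)
		case msg.SwitchoverRequest != nil:
			log.Printf("switchover to %s", msg.SwitchoverRequest.Target)
		}
	}
}
```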
PostgreSQL Internals
Write-Ahead Log (WAL)
The Write-Ahead Log is the foundation of PostgreSQL's durability and replication. Before any change is written to the actual data files (tables, indexes), it is first recorded in the WAL. This guarantees that:
- If PostgreSQL crashes, it can replay the WAL to recover any committed transactions that hadn't been flushed to disk yet.
- Replicas can receive and replay the same WAL records to maintain an exact copy of the primary.
How it works:
- A transaction modifies data (INSERT, UPDATE, DELETE).
- PostgreSQL writes a WAL record describing the change to the WAL buffer in shared memory.
- At commit time, the WAL buffer is flushed to disk (pg_wal/ directory). The transaction is now durable.
- Later, the background writer or checkpointer flushes the actual data pages to the table files.
WAL files are 16 MB segments by default, named like 000000010000000000000001. The name encodes the timeline ID and the segment number.
Key parameters:
- wal_level = replica — PG-Swarm always sets this. It includes enough information in the WAL for streaming replication.
- max_wal_senders = 10 — maximum concurrent WAL sender processes (one per replica + archiver).
- wal_keep_size = 512MB — minimum WAL retained on disk for standby servers to catch up after brief disconnections.
Log Sequence Number (LSN)
An LSN is a pointer to a position in the WAL stream. It's represented as two 32-bit hex numbers separated by a slash, like 0/16B3F80. Every WAL record has an LSN, and they increase monotonically.
Why LSNs matter:
- Replication lag: the difference between the primary's current LSN (pg_current_wal_lsn()) and a replica's replayed LSN (pg_last_wal_replay_lsn()) tells you how far behind the replica is, in bytes.
- Backup tracking: a base backup records the start and end LSN. For PITR, you need all WAL segments covering the LSN range from the backup's start through your target recovery point.
- pg_rewind: uses LSNs to find the exact point where two timelines diverged, so it can rewind only the necessary changes.
PG-Swarm's health monitor reports replication lag in both bytes (LSN difference) and seconds (time since last replayed transaction).
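A sketch of both lag queries using standard PostgreSQL functions (pg_wal_lsn_diff, pg_stat_replication, pg_last_xact_replay_timestamp) — connection setup and the health monitor's actual wiring are omitted:

```go
package health

import (
	"context"
	"database/sql"
	"time"
)

// lagBytes: difference between the primary's current LSN and what the
// replica has replayed, read from pg_stat_replication on the primary.
func lagBytes(ctx context.Context, primary *sql.DB, appName string) (int64, error) {
	var lag int64
	err := primary.QueryRowContext(ctx,
		`SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
		   FROM pg_stat_replication WHERE application_name = $1`, appName).Scan(&lag)
	return lag, err
}

// lagSeconds: time since the replica last replayed a transaction,
// measured on the replica itself.
func lagSeconds(ctx context.Context, replica *sql.DB) (time.Duration, error) {
	var secs float64
	err := replica.QueryRowContext(ctx,
		`SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)`).Scan(&secs)
	return time.Duration(secs * float64(time.Second)), err
}
```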
Timelines
A timeline is a sequence of WAL records that forms a linear history. Every PostgreSQL cluster starts on timeline 1. When a promotion happens (a replica becomes the new primary), PostgreSQL increments the timeline to prevent WAL conflicts.
Example:
```
Timeline 1: primary writes WAL segments 1, 2, 3, 4, 5...
            (primary crashes at segment 5)
Timeline 2: replica promotes, continues from segment 5 on timeline 2
Timeline 2: segments 5, 6, 7, 8...

Old primary restarts — it has segments 5+ on timeline 1,
but the new primary is on timeline 2. This is a "timeline divergence".
```
Why this matters for PG-Swarm:
- After a failover, the old primary has WAL records on the old timeline that the new primary doesn't have. It cannot simply reconnect as a replica — it needs to rewind.
- PG-Swarm's init container and main container both check for timeline divergence on startup. If the local timeline doesn't match the primary's, they run pg_rewind to bring the node back in sync.
- The health monitor reports timeline_id for each instance, making it visible in the dashboard when timelines diverge.
WAL file names encode the timeline: 00000002000000000000000A — the leading 00000002 means timeline 2.
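A small sketch of decoding that name layout (8 hex digits of timeline, 8 of log number, 8 of segment), which is standard PostgreSQL naming:

```go
package walname

import (
	"fmt"
	"strconv"
)

// parse splits a 24-hex-digit WAL segment name into its three fields.
// "00000002000000000000000A" → timeline 2, log 0, segment 10.
func parse(name string) (timeline, logNo, segNo uint64, err error) {
	if len(name) != 24 {
		err = fmt.Errorf("not a WAL segment name: %q", name)
		return
	}
	if timeline, err = strconv.ParseUint(name[0:8], 16, 32); err != nil {
		return
	}
	if logNo, err = strconv.ParseUint(name[8:16], 16, 32); err != nil {
		return
	}
	segNo, err = strconv.ParseUint(name[16:24], 16, 32)
	return
}
```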
Streaming Replication
Streaming replication is how replicas stay in sync with the primary in real time. A replica connects to the primary using a special replication connection and receives WAL records as they are generated — no polling, no delay.
The flow:
- Replica starts with a base backup of the primary's data directory.
- It connects to the primary using primary_conninfo in postgresql.auto.conf with the repl_user role.
- The primary's WAL sender process streams WAL records to the replica's WAL receiver.
- The replica applies (replays) the WAL records against its local data files.
- The replica is always slightly behind — the difference is the replication lag.
PG-Swarm sets up streaming replication automatically. When the operator creates a StatefulSet with replicas > 1, the init container on ordinal 1+ runs pg_basebackup against the primary and configures the standby signal.
Key signals:
- standby.signal — a file in PGDATA that tells PostgreSQL to start as a hot standby (read-only replica).
- primary_conninfo — connection string to the primary, written to postgresql.auto.conf.
- recovery_target_timeline = 'latest' — follow the latest timeline after promotion. PG-Swarm always sets this.
WAL Archiving
WAL archiving is the process of copying completed WAL segments to an external location (GCS, SFTP, etc.) for disaster recovery and point-in-time recovery. It's separate from streaming replication:
| | Streaming Replication | WAL Archiving |
|---|---|---|
| Purpose | Real-time replica sync | Backup + PITR |
| Destination | Other PG instance | External storage |
| Granularity | Per WAL record (bytes) | Per WAL segment (16 MB) |
| Latency | Sub-second | Up to archive_timeout |
| Configured via | primary_conninfo | archive_command |
How PG-Swarm configures it:
When a physical backup profile with wal_archive_enabled: true is attached, the operator auto-sets these parameters in postgresql.conf:
```
archive_mode = on
archive_command = 'pg-swarm-backup wal-push --dest s3 --bucket $BUCKET --prefix $PREFIX %p %f'
archive_timeout = 60
```
archive_timeout forces a WAL switch after 60 seconds of inactivity, ensuring WAL gets archived even during idle periods.
PostgreSQL calls archive_command once for each completed WAL segment. If the command fails (exit code != 0), PostgreSQL retries indefinitely — it never deletes an unarchived segment.
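A sketch of a wal-push entry point honoring that contract — the upload helper is a stand-in for the real GCS/SFTP code:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// PostgreSQL substitutes %p (path) and %f (file name) into archive_command.
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: wal-push <path> <filename>")
		os.Exit(1)
	}
	path, name := os.Args[1], os.Args[2]
	if err := uploadFile(path, name); err != nil {
		fmt.Fprintf(os.Stderr, "archive failed for %s: %v\n", name, err)
		os.Exit(1) // non-zero exit: PostgreSQL keeps the segment and retries
	}
	os.Exit(0) // success: PostgreSQL may now recycle the segment
}

// uploadFile is a stand-in for the actual upload to GCS or SFTP.
func uploadFile(path, name string) error {
	_, err := os.Stat(path) // placeholder: real code streams path to storage
	return err
}
```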
pg_rewind
pg_rewind is a tool that synchronizes a PostgreSQL data directory with another copy that has diverged. It's used after a failover to bring the old primary back in sync with the new primary without a full re-basebackup.
How it works:
- Reads the WAL to find the exact point where the two timelines diverged (the "fork point").
- Copies only the data pages that were modified on the old primary after the fork point.
- Replaces them with the corresponding pages from the new primary.
- The result is a data directory that matches the new primary's state at the fork point, ready to catch up via streaming replication.
Requirements:
- wal_log_hints = on — PG-Swarm always sets this. It makes PostgreSQL log full-page images of hint bit changes, which pg_rewind needs to identify modified pages.
- The old primary must have shut down cleanly (or be run through single-user crash recovery first).
PG-Swarm's container wrapper script automatically runs pg_rewind when it detects timeline divergence. If pg_rewind fails (e.g., too much divergence), it falls back to a full re-basebackup.
recovery.signal and standby.signal
These are empty files placed in PGDATA that control how PostgreSQL starts up:
standby.signal — tells PostgreSQL to start as a hot standby (streaming replica). It stays in this mode indefinitely, continuously receiving WAL from the primary.
recovery.signal — tells PostgreSQL to enter recovery mode, replay WAL up to a target point, and then promote to a read-write primary. Used for PITR.
When recovery.signal is present, PostgreSQL reads recovery parameters from postgresql.auto.conf:
```
restore_command = 'pg-swarm-backup wal-fetch --dest s3 --bucket $BUCKET %f %p'
recovery_target_time = '2025-03-15 14:30:00 UTC'
recovery_target_action = 'promote'
```
After replaying WAL to the target time, PostgreSQL removes recovery.signal and starts accepting writes.
Backups
Note: The backup sidecar (cmd/backup-sidecar) is currently being rebuilt from scratch. The design and concepts described below reflect the target architecture. This section will be updated as the reimplementation progresses.
Base Backup (Full Physical)
A base backup is a complete binary copy of the PostgreSQL data directory, taken using pg_basebackup. It captures everything: tables, indexes, configuration, transaction logs.
PG-Swarm will run this via the backup sidecar on a configured schedule (e.g., daily at 4 AM). The sidecar (running on the replica pod) will:
- Record the backup start in the SQLite metadata DB (backups.db).
- Run pg_basebackup -Ft -z -Xs -P against localhost — tar format, gzip compressed, streaming WAL, with progress.
- Upload the compressed backup to the configured destination (GCS or SFTP).
- Store the backup_manifest (PG 13+) in the metadata DB for future incremental references.
- Record completion (size, status) and notify the primary sidecar for metadata sync.
A full base backup on its own lets you restore to the exact point the backup was taken. To restore to any arbitrary point in time, you also need WAL archives.
Incremental Backup (PG 17+)
Incremental backups, introduced in PostgreSQL 17, use pg_basebackup --incremental to capture only the data blocks that changed since a previous backup. This requires:
- summarize_wal = on — PG-Swarm will auto-set this when an incremental schedule is configured.
- A backup_manifest from a prior backup — used as the reference point.
How the chain works:
```
Day 1, 4:00 AM → Full base backup (manifest stored in metadata DB)
Day 1, 5:00 AM → Incremental (only blocks changed since 4 AM)
Day 1, 6:00 AM → Incremental (only blocks changed since 5 AM)
...
Day 2, 4:00 AM → New full base backup (resets the chain)
Day 2, 5:00 AM → Incremental (relative to Day 2 full)
```
To restore from an incremental, you use pg_combinebackup to merge the full base with all incrementals in the chain.
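A sketch of driving pg_combinebackup from Go — the directory arguments follow the tool's oldest-first ordering (full base first, then each incremental):

```go
package restore

import (
	"context"
	"fmt"
	"os/exec"
)

// combine merges the full base and its incrementals (oldest first) into
// outputDir, producing a data directory ready for recovery.
func combine(ctx context.Context, baseDir string, incrementals []string, outputDir string) error {
	args := append([]string{baseDir}, incrementals...)
	args = append(args, "-o", outputDir)
	out, err := exec.CommandContext(ctx, "pg_combinebackup", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("pg_combinebackup: %v: %s", err, out)
	}
	return nil
}
```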
First-run behavior: If no previous manifest exists in the metadata DB (first run), the backup sidecar will automatically take a full base backup instead.
Logical Backup
Logical backups use pg_dump (per-database) or pg_dumpall (all databases) to export data as SQL or custom binary format. Unlike physical backups, logical backups are:
- Portable — can be restored to a different PostgreSQL version or architecture.
- Selective — you can specify which databases to dump.
- Slower — must read and serialize every row.
- No PITR — a point-in-time snapshot with no WAL continuity.
Compression: Custom format (-Fc) uses built-in zlib. Plain SQL and pg_dumpall are piped through gzip before upload.
Point-in-Time Recovery (PITR)
PITR lets you restore a database to any specific moment — for example, "right before the accidental DROP TABLE at 2:47 PM." It requires:
- A base backup taken before the target time.
- All WAL segments from the base backup's start LSN through the target time.
The restore process (to be automated by PG-Swarm):
- User selects a base backup and target time in the dashboard.
- Central sends a RestoreCommand to the satellite.
- The satellite creates a Kubernetes Job that scales the StatefulSet to 0, downloads and extracts the backup, writes recovery.signal with recovery_target_time, and scales back up.
- PostgreSQL replays WAL to the target time, then promotes.
Retention matters: WAL retention days must cover the span between your oldest base backup and any potential recovery target.
Metadata Store (backups.db)
PG-Swarm stores backup metadata in a SQLite database alongside the backup files at the destination. Each cluster gets its own <cluster>/metadata/backups.db.
Tables:
| Table | Purpose |
|---|---|
| backups | Every backup with type, parent_id (incremental chains), timestamps, size, path, status |
| manifests | Binary backup_manifest blobs for incremental reference |
| wal_segments | Archived WAL segment inventory |
Workflow: The backup sidecar will maintain backups.db locally at each destination, updating it after each backup operation. The metadata is co-located with the data — no external database dependency.
The parent_id column tracks the incremental chain: full-1 ← incr-2 ← incr-3. For restore, walk the chain back to the base.
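A sketch of that chain walk against backups.db — an id primary-key column is assumed alongside parent_id:

```go
package metadata

import "database/sql"

// chainFor returns backup IDs from the full base down to target, in restore
// order (base first, newest incremental last).
func chainFor(db *sql.DB, targetID string) ([]string, error) {
	var chain []string
	id := targetID
	for id != "" {
		chain = append([]string{id}, chain...) // prepend: walking back toward the base
		var parent sql.NullString
		if err := db.QueryRow(`SELECT parent_id FROM backups WHERE id = ?`, id).Scan(&parent); err != nil {
			return nil, err
		}
		id = ""
		if parent.Valid {
			id = parent.String // full base backups have NULL parent_id
		}
	}
	return chain, nil
}
```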
Retention
Retention policies will control how many backups are kept before old ones are deleted, enforced after each successful backup.
| Setting | Default | Applies to |
|---|---|---|
| Base Backup Count | 7 | Full base backups |
| Incremental Backup Count | 6 | Incrementals per full cycle |
| WAL Retention Days | 14 | Archived WAL segments |
| Logical Backup Count | 7 | Logical dumps |
Storage backends: The backup sidecar supports GCS (Google Cloud Storage) and SFTP as storage destinations.