Introduction
A typical fail over method for any application is to have two identical
machines with all data stored on a shared SAN. This falls short on ensuring the
integrity of the database. Do we know that the database was properly shutdown
on the primary node before failing over? If not, data loss can occurs and the
only recovery method will be restoring from backup. This method also
introduces a single point of failure for the database in the shared storage.
Luckily, postgres provides a transaction logging method known as write
ahead logs (WAL). We can exploit these feature to create a live backup
of the production system on the fail over node. We’ll do this by
having the primary node copy it’s wal logs to the secondary node. The
secondary node will be in continuous recovery mode, reading the wal logs
and applying them to it’s database.
Pre-requisites
Before we begin, a few requirements:
1. Both machines must be the same postgresql version
This document assumes postgresql 8.x. Different solutions are available for postgresql 7.x and 9.x.
2. Both machines must be the same architecture (64 or 32 bit).
3. The primary node must be able to access the secondary node via ssh.
4. Install postgresql-contrib on secondary machine
Configure the primary node
1. Generate ssh key and place public key in authorized_hosts on secondary
# ssh-keygen -t dsa
# ssh-copy-id -i ~/.ssh/id_dsa.pub postgres@secondary
2. Enable WAL logging by editing postgresql.conf and setting the following values.
archive_mode = on
archive_command = 'rsync --delete-after -a %p postgres@secondary:/var/lib/pgsql/walfiles/%f'
Configure the secondary node
1. Create a directory for the WAL files
# mkdir /var/lib/pgsql/walfiles
2. Create /var/lib/pgsql/recovery.conf with the following contents:
restore_command = '/usr/bin/pg_standby -l -d -s 2 -t /tmp/pgsql.trigger.5432 \ /var/lib/pgsql/walfiles %f %p %r 2>>standby.log'
Initialize the cluster
1. On the primary run the following:
Please note that ‘dhreplication’ is an arbitrary tag, and can be set to anything useful to your configuration.
# psql -U postgres -c “SELECT pg_start_backup(‘dhreplication’);”
# rsync -avz /var/lib/pgsql/data/* secondary:/var/lib/pgsql/data/
# psql -U postgres -c “SELECT pg_stop_backup();”
2. On secondary run the following:
Edit /var/lib/pgsql/data/postgresql.conf and set ‘archive_mode = off’
# ln -s /var/lib/pgsql/recovery.conf /var/lib/pgsql/data/recovery.conf
A symlink is used above, as /var/lib/pgsql/recovery.conf will be removed when recovery is disabled.
# service postgresql start
3. Monitor for problems by watching /var/pgsql/pgsql.log and /var/lib/pgsql/data/standby.log on secondary
Testing fail over
1. On primary edit /var/lib/pgsql/data/postgresql.conf and set “archive_command = /bin/true”
then reload postgresql to make the changes active.
# service postgresql reload
2. On secondary
# tail /var/pgsql/pgsql.log /var/lib/pgsql/data/standby.log
# touch /tmp/pgsql.trigger.5432
3. You may now connect clients to the secondary server.
To resume running on primary
1. On secondary
# psql -U postgres -c “SELECT pg_start_backup(‘dhreplication’);”
# rsync -avz /var/lib/pgsql/data/* primary:/var/lib/pgsql/data/
# psql -U postgres -c “SELECT pg_stop_backup();”
2. Preform steps above to initialize the secondary server again.
External sources
http://www.xtuple.org/replication-how-to
http://www.postgresql.org/docs/8.4/static/high-availability.html
http://www.postgresql.org/docs/8.4/static/warm-standby.html