EC2, Confluence, S3 and PostgreSQL

Update: EC2 will have persistent storage!

My side projects at the moment are both work-related — that is, they’re related to Confluence, our enterprise Wiki. Even though I don’t work on that team any more, the product still has a lot of mindshare with me.

My latest spare time project is running Confluence on EC2.

Amazon announced SimpleDB recently. I haven’t decided what it’s good for yet — its lack of transactions means that it isn’t a drop-in replacement for a relational database, but rather a component of massively scalable distributed systems.

While looking at SimpleDB I had another look at EC2. EC2 provides virtual machines running Linux (or other operating systems virtualised under Linux, as far as I know), which can be created and destroyed at will and cost 10 cents an hour to run.

Getting Confluence running on EC2 was very easy — I picked one of Amazon’s pre-built Fedora Core 4 images, installed PostgreSQL and a JDK, and that was that. After you’ve customised an image you save it to S3, and start it (or many copies of it) again when you wish.
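
For reference, the save-and-restart cycle uses Amazon’s AMI and API command line tools and goes roughly like this. It is only a sketch: the key and certificate files, account ID, bucket name, AMI ID and key pair name are all placeholders.

# on the running instance: bundle the root filesystem into an image
ec2-bundle-vol -d /mnt -k pk.pem -c cert.pem -u <your-account-id> -p confluence-fc4

# upload the bundle to an S3 bucket
ec2-upload-bundle -b confluence-images -m /mnt/confluence-fc4.manifest.xml -a <access-key> -s <secret-key>

# later, from your own machine: register the image and start an instance of it
ec2-register confluence-images/confluence-fc4.manifest.xml
ec2-run-instances ami-xxxxxxxx -k my-keypair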

Confluence feels as though it’s running on a machine with the specs Amazon specify, with the stated 1.7GB of physical memory. I haven’t yet tried the high-end instances — quad 2GHz Xeon equivalent.

The catch with EC2 is that while a virtual machine has a 150GB disk, this data isn’t persisted when you shut the machine down — or when it dies unexpectedly due to either a software problem in the hosted operating system, or because Amazon shuts down your instance without you asking. It isn’t clear how often instances die unexpectedly, but Amazon do say “We recommend you should not rely on a single instance to provide reliability for your data.”

So simply saving the image to S3 once a day isn’t the way to go:

  • You aren’t guarding against unexpected crashes — you may lose a day’s data.
  • You have to shut your application and DB down to get a consistent image.
  • You are storing your entire OS image in each backup. This is particularly wasteful if you have many instances of the same application, as these should all be able to start from identical virtual machine images, simply being passed a parameter to tell them where to get their data from.

Confluence can be configured to keep essentially all its dynamic data in the database, so what we need to do is keep an ‘up to date’ backup of our PostgreSQL database on S3 at ‘all times’ — that is, a backup which contains our current data minus at most n minutes of transactions. (OK, there’s also a Lucene index. If you deliberately shut down your machine every day, e.g. to save money by keeping the instance up only during business hours, you would need to copy that index to S3 too; but if you are only recovering from rare unexpected restarts you probably don’t mind reindexing.)

Fortunately Amazon provides ample free (as in beer) bandwidth between EC2 and S3, and PostgreSQL provides good on-line backup facilities, so this task is relatively easy.

PostgreSQL’s on-line backup allows you to specify a command to archive each log segment as it is filled — to restore, you roll these logs forward from a full base backup.

To summarise:

  • Use PostgreSQL 8.2 — on-line backups have some significant improvements over 8.1 and 8.0.
  • Read the excellent documentation on continuous archiving and point-in-time recovery.

PostgreSQL must be configured to use Write-Ahead Log (WAL) archiving, so postgresql.conf contains an archive_command like:

archive_command = '/Users/tomd/bin/archivewal %p %f'

where archivewal is:

#!/bin/sh
# $1 is the WAL segment path (%p, relative to the data directory); $2 is its file name (%f)
# Compress the segment to a temporary file, upload it to S3, then clean up.
# The && chain means any failure is reported back to PostgreSQL, which will retry the segment.
bzip2 -c $PGDATA/$1 >/tmp/$$ &&
s3put confluence/local-db-wal-$2.bz2 /tmp/$$ &&
rm /tmp/$$

s3put is a command from Tim Kay’s aws utility.
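
One more postgresql.conf setting is worth a mention: archive_timeout (new in 8.2) forces a switch to a new log segment after a period of inactivity, so the ‘at most n minutes of transactions’ above can be bounded even when the wiki is quiet. A sketch, with a made-up five minute value:

archive_command = '/Users/tomd/bin/archivewal %p %f'
archive_timeout = 300    # seconds; force a segment switch (and hence an archive) at least this often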

The backup procedure is (a rough script sketch follows the list):

  • Tell PostgreSQL that you are starting a backup (pg_start_backup)
  • Copy your PGDATA directory to S3 — a simple tar archive is fine; because we are also archiving the logs, the backup doesn’t have to be consistent.
  • Tell PostgreSQL that you have finished the backup (pg_stop_backup). At this point the log files used during the backup will also be archived.
  • Check that those logs have made it to S3 and mark your backup as OK to restore from.
  • Optionally delete older backups and older log file archives.
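
Roughly, that could be wrapped up in a script like the one below. This is just a sketch: the bucket name and label are made up, it assumes psql is running as the postgres superuser, and it assumes s3put behaves as in the archive script above.

#!/bin/sh
# take a base backup of $PGDATA and push it to S3
STAMP=`date +%Y%m%d%H%M`

# tell PostgreSQL a file-system-level backup is starting
psql -U postgres -c "SELECT pg_start_backup('base-$STAMP');"

# tar up the cluster directory; it doesn't need to be internally consistent,
# because the archived WAL will be replayed over it on restore
tar czf /tmp/base-$STAMP.tar.gz -C $PGDATA .

# tell PostgreSQL the backup is finished; this queues the WAL used during the backup for archiving
psql -U postgres -c "SELECT pg_stop_backup();"

# upload the base backup, then clean up
s3put confluence/local-db-base-$STAMP.tar.gz /tmp/base-$STAMP.tar.gz &&
rm /tmp/base-$STAMP.tar.gz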

The restore procedure (which you would use when restarting an instance) is:

  • Provide the EC2 virtual machine with a startup parameter which tells it which S3 bucket to look in for backups.
  • Unpack PGDATA from your tar file (because we are starting a fresh copy of our EC2 image, PGDATA won’t exist when we start up, and postgres won’t be running)
  • Create a recovery.conf file in PGDATA which tells postgres which command to use to copy log files back from S3 (see the sketch after this list).
  • Start PostgreSQL. PostgreSQL will request all the log files it needs, that is, all the log files which were archived since the backup.
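
For completeness, the recovery.conf only needs a restore_command, mirroring the archive side:

restore_command = '/Users/tomd/bin/restorewal %f %p'

where restorewal could be something like the hypothetical script below. It assumes Tim Kay’s s3get writes the requested object to standard output; check the aws documentation for the exact usage your version expects.

#!/bin/sh
# $1 is the WAL file name PostgreSQL wants (%f); $2 is the path to write it to (%p)
# Fetch the compressed segment from S3 and decompress it into place.
# A non-zero exit tells PostgreSQL there are no more segments to replay.
s3get confluence/local-db-wal-$1.bz2 | bunzip2 >$2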

Once I have all this working and tested I’ll post a followup with the details, but I can tell you now that EC2 is the most fun you can have for 10 cents an hour!

10 Comments

  1. Scott Royston
    Posted January 10, 2008 at 8:42 pm | Permalink

    I’m very interested in hearing how this goes. From poking around google it looks like ‘GoPlan’ is already doing it. http://comments.deasil.com/2007/05/24/goplan-interview-ec2/

  2. Tom Davies
    Posted January 13, 2008 at 3:16 am | Permalink

    Thanks for the link, Scott. Interesting that they have 6 months uptime for their DB server.

  3. Posted February 15, 2008 at 4:54 am | Permalink

    Any news on this project? I’m working on a project now that is PostgreSQL based and needs to be scaled.

    A.

  4. Ram
    Posted March 19, 2008 at 2:47 am | Permalink

    I would be interested in updates too :)

  5. Matthew Arrott
    Posted April 12, 2008 at 7:21 am | Permalink

    Did you run multiple instances of Confluence as a fault-tolerant cluster? Did you consider running a MySQL NDBIO cluster for the fault-tolerant DB?

    Great to see that someone has preceded us. Many thanks,

    Matthew

  6. Tom Davies
    Posted April 12, 2008 at 9:11 pm | Permalink

    I never installed a complete setup on an EC2 node, although I did test Postgres backups to S3 from my computer. I’m happy that Postgres backup/restore to S3 is practical.

    I didn’t set up a Confluence cluster, but EC2 does allow UDP, so Confluence clustering will work.

    I would recommend that you use the largest machine EC2 provides before resorting to clustering, which introduces some overhead.

    I don’t know what the worst-case UDP latency between two EC2 nodes is — you would need to test this (or ask Amazon) before creating a cluster.

  7. Tom Davies
    Posted April 12, 2008 at 9:12 pm | Permalink

    PS: I’m not familiar with MySQL NDBIO…

  8. Patrick
    Posted November 2, 2008 at 9:01 am | Permalink

    A much better idea would be to mount S3 as a drive on your EC2 instance and point the data directory of postgres to that mount point. Voila! No-hassle consistent database.

    This how-to shows you how to do it with MySQL – PostgreSQL is basically the same method.

    http://www.sunsetlakesoftware.com/2008/09/13/running-drupal-website-amazon-ec2

  9. Tom Davies
    Posted November 2, 2008 at 2:33 pm | Permalink

    @Patrick — Yes, EBS makes this sort of complex solution redundant.

  10. Posted November 29, 2008 at 5:35 am | Permalink

    I am a bit undecided on EBS,

    EBS seems to me like another VM with its instance store exposed as an iSCSI block device, but you pay for the allocation regardless of usage ($0.10 / GB / month). The reason I say this is that while data may persist, Amazon says that at any point in time it could fail (AFAIK) … so you would still need to make incremental snapshots out to S3.

    I guess EBS allows you to pull the logic of keeping data consistent etc. away from machine instances. Also, EBS volumes cannot be mounted by many instances at once; I guess the primary instance could export it as NFS.
