[Beowulf] Re: High Performance for Large Database

Andrew Piskorski atp at piskorski.com
Wed Oct 27 10:48:19 PDT 2004


On Tue, Oct 26, 2004 at 01:08:00PM -0600, Joshua Marsh wrote:
> Hi all,
> 
> I'm currently working on a project that will require fast access to
> data stored in a PostgreSQL database server.  I've been told that a
> Beowulf cluster may help increase performance.

Unlikely in general, although possible in certain cases.  For example,
look into Clusgres, memcached, and Backplane.  I've previously given
links and discussion here:

  http://openacs.org/forums/message-view?message_id=128060
  http://openacs.org/forums/message-view?message_id=179348

> Since I'm not very familiar with Beowulf clusters, I was hoping that

It is more important that you are extensively familiar with RDBMSs in
general, and PostgreSQL in particular.  Are you?

> you might have some advice or information on whether a cluster would
> increase performance for a PostgreSQL database.  The major tables
> accessed are around 150-200 million records.  On a stand-alone
> server, it can take several minutes to perform a simple select
> query.

200 million rows is not that big.  What's the approximate total size
of your database on disk?
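
If you don't know offhand, newer PostgreSQL releases can report it
directly via the built-in pg_database_size() function (on older
releases, du -sh on the data directory gives a rough answer):

  SELECT pg_size_pretty(pg_database_size(current_database()));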

Your "several minutes for a simple select" query performance is
abysmal, and this is unlikely to be because of your hardware.  Most
likely, your queries just suck, and you need to do some serious SQL
tuning work before even considering big huge fancy hardware.

Once you have tables with tens or hundreds of millions of rows, ANY
full scan of such a table sucks really badly, so you MUST profile
and tune your queries.  Eliminating full table scans of large tables
is just the first and most obvious step; it is not unusual to still
have very sucky queries after that.
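
One round of profiling and tuning might look like this (the table
and column names here are made up for illustration):

  -- Find where the time goes; a "Seq Scan" over a 200-million-row
  -- table in the plan is the red flag:
  EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

  -- Index the column you filter on, then refresh the planner's
  -- statistics:
  CREATE INDEX orders_customer_id_idx ON orders (customer_id);
  ANALYZE orders;

  -- Re-run the EXPLAIN ANALYZE and confirm an "Index Scan" appears.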

> It seems like once we start pricing for servers with 16+ processors
> and 64+ GB of RAM, the prices skyrocket.  If I can achieve high
> performance with a cluster, using 15-20 dual processor machines, that
> would be great.

If you are even thinking about buying an 8-way or larger box, then you
are certainly a candidate for several 2-way boxes with an (expensive)
SCI interconnect, so see Clusgres in my links above.  Or, if you want
to spend a lot less money, your access is read-mostly, and you DON'T
need full ACID transactional support for your read-only queries, look
into using memcached to cache query results in other machines' RAM
(sketched below).
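
A read-through cache is only a few lines of client code.  This
sketch assumes Python with the python-memcached and psycopg2 client
libraries; the host names, database, and TTL are all hypothetical:

  import hashlib
  import memcache
  import psycopg2

  # Query results get spread across the RAM of the cache nodes.
  mc = memcache.Client(['node1:11211', 'node2:11211'])
  db = psycopg2.connect('dbname=mydb')

  def cached_query(sql, ttl=300):
      key = hashlib.md5(sql.encode()).hexdigest()  # keys must be short
      rows = mc.get(key)
      if rows is None:                 # miss: hit PostgreSQL once,
          cur = db.cursor()
          cur.execute(sql)
          rows = cur.fetchall()
          mc.set(key, rows, time=ttl)  # then serve from RAM for ttl sec
      return rows

The obvious caveat is staleness: until the TTL expires, readers may
see results from before the most recent write, which is exactly why
this only fits read-mostly workloads that can tolerate non-ACID
reads.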

However, VERY few people need such large RDBMS boxes.  What makes you
think you do?  What exactly is your application doing, and what sort
of load do you need it to sustain?

Have you profiled and tuned all your SQL?  Tuned your PostgreSQL and
Linux kernel settings?  Have you read and worked through all the
PostgreSQL docs on tuning?  (You didn't leave PostgreSQL at its
DEFAULT settings, did you?  Those are intended just to get it up and
running on ALL the platforms PostgreSQL supports, not to give good
performance.)
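
For illustration only, the usual first suspects in postgresql.conf
are below; the parameter names and units shift between versions
(sort_mem is called work_mem in 8.0 and later), and the right values
depend entirely on your RAM and workload:

  shared_buffers = 25000          # shared cache, 8 kB pages (~200 MB)
  effective_cache_size = 375000   # OS cache estimate, 8 kB pages (~3 GB)
  sort_mem = 65536                # per-sort working memory, in kB (64 MB)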

Investing hundreds of thousands of dollars in fancy server hardware
without first doing your basic RDBMS homework makes no sense at all.
If your database is dog slow because of poor data modeling or grossly
untuned queries, throwing $300k of hardware at the problem may not
help much.

-- 
Andrew Piskorski <atp at piskorski.com>
http://www.piskorski.com/


