[Beowulf] High Performance for Large Database

Tue Oct 26 14:21:57 PDT 2004

On Tue, 26 Oct 2004, Joshua Marsh wrote:

> Hi all,
> 
> I'm currently working on a project that will require fast access to
> data stored in a PostgreSQL database server.  I've been told that a
> Beowulf cluster may help increase performance.  Since I'm not very
> familiar with Beowulf clusters, I was hoping that you might have some
> advice or information on whether a cluster would increase performance
> for a PostgreSQL database.  The major tables accessed are around
> 150-200 million records.  On a standalone server, it can take several
> minutes to perform a simple select query.
> 
> It seems like once we start pricing for servers with 16+ processors
> and 64+ GB of RAM, the prices skyrocket.  If I can achieve high
> performance with a cluster, using 15-20 dual processor machines, that
> would be great.

This sort of cluster isn't a "beowulf" cluster; rather it is a variant
of a high availability cluster.  It's Extreme Linux, just not beowulf.
The beowulf design (and focus of this list) is "high performance
computing" clusters, aka supercomputing clusters.

With that said, there may be some resources out there that can help you,
and listening in on this list and learning how HPC clusters work will
certainly help you with other kinds, as the issues are in many cases
similar.

The first/best place to look is the September issue of Cluster World
Magazine (www.clusterworld.com/issues.html).  Its cover focus is on
"Database Clusters".  My copy is at Duke (and I'm at home:-) so although
I'm pretty sure it covers MySQL used in a cluster environment, I cannot
recall whether it discusses alternatives such as Oracle or PostgreSQL.

Other CWM issues will also be pertinent, regardless.  One major issue
associated with any kind of file access is assembling a large, shared
file store that avoids the file and communications bottlenecks that are
as much an issue in HPC as they are in HA.  A series of articles just
begun by Jeff Layton deals with SANs and massively scalable storage in
general -- he's only done a couple of articles so far, so if there are
still September/October issues around you'd be in great shape.  CWM also
abounds with ads for large and scalable and blindingly fast storage
solutions.  We just had an extensive discussion on this very list on
storage (I kicked it off as we have a big proposal out that had a very
large storage component and I needed to learn -- fast!).  The recent
list archives should show you the thread.  Finally, there are some
companies out there that make their bread and butter by assembling
custom clusters to accomplish very specific tasks at a cost (as you
note) far less than the cost of a big multiprocessor machine even though
they make a healthy (and well earned) profit on the deal.  Some of them
have employees or owners on this list -- if any of them can help you I
expect they'll talk to you offline.

That's about all the help I personally can offer; I haven't built a
large database cluster and have only listened halfheartedly when they
were discussed on the list in the past (there have been previous
discussions that you can google for in the list archives, I think).  The
problem is a fairly complex one -- beyond the various file latency and
bandwidth issues (these are likely the "easy part"), sharing the
underlying DB brings up locking.  It is one thing to provide
lots of nodes read-only access to a DB on a SAN engineered for fast,
cached, read-only access; it is another to provide all the nodes with
read AND write access, as writing requires a lock, and a lock
effectively serializes access.  This and related problems are serious
issues with speeding up databases through parallelism.  I vaguely recall
that big companies like Oracle have dumped pretty serious money into
this kind of thing looking for solutions that scale well.
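
To put a rough number on the serialization point, here is a minimal
sketch using Amdahl's law.  It is a back-of-the-envelope bound, not a
model of PostgreSQL or of any particular cluster; the function name
amdahl_speedup and the lock-held fractions are made up for illustration,
and the 20 nodes just echo the upper end of the machine count in the
question.

```python
def amdahl_speedup(serial_fraction: float, nodes: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the work is serialized.

    `serial_fraction` stands in for the share of time spent holding a lock
    that only one node can own at a time (an illustrative assumption).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    # Amdahl's law: the serialized part gains nothing from extra nodes,
    # while the rest divides evenly across them in the ideal case.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)


if __name__ == "__main__":
    nodes = 20  # upper end of the 15-20 machines mentioned in the question
    for locked in (0.0, 0.05, 0.25, 0.50):  # hypothetical lock-held fractions
        bound = amdahl_speedup(locked, nodes)
        print(f"{locked:4.0%} under lock: at most {bound:5.2f}x on {nodes} nodes")
```

Even when only 5% of the work is done under a lock, the bound on 20
nodes drops to roughly 10x, and at 50% it is below 2x, which is why
read-mostly workloads are so much easier to spread over a cluster than
write-heavy ones.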

Maybe somebody else on list knows more than I do, though, and maybe
they'll tell all of us!

   rgb

> 
> Thanks for any help you may have!
> 
> -Josh

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu