[Beowulf] zfs tuning for HPC/cluster workloads?
landman at scalableinformatics.com
Sun Jul 6 13:47:59 PDT 2008
Loic Tortay wrote:
> Joe Landman wrote:
> We have seen the same issue on (non Sun) high density storage servers
> which performed correctly with RHEL5 & XFS but comparatively poorly with
> Solaris 10 & ZFS.
> ZFS seems to be extremely sensitive to the quality/behaviour of the
> driver for the HBA or RAID/disk controller, especially with SATA disks
> (for NCQ support). Having a driver is not enough, a good one is required.
> Another point is that ZFS requires a different configuration "mindset" than
> "ordinary" RAID.
Here is what I like. Setting up a raid is painless. Really painless.
Here is what I don't like. I can't tune that raid. Well, I can, by
tearing it down and starting again. I tried turning off checksum,
compression, even zil.
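For readers who want to repeat the property changes described above, here is a minimal sketch that only builds and prints the `zfs set` command lines for turning off checksum and compression; it does not run them. The dataset name `tank/bench` and the helper `zfs_set_commands` are placeholders invented for illustration, not names from this thread. The ZIL is left out on purpose, since the message does not say how it was disabled.

```python
import shlex


def zfs_set_commands(dataset, properties):
    """Return one `zfs set` argv list per property, without executing anything."""
    return [["zfs", "set", f"{name}={value}", dataset]
            for name, value in properties.items()]


# "tank/bench" is a hypothetical dataset name; substitute your own.
commands = zfs_set_commands(
    "tank/bench",
    {"checksum": "off", "compression": "off"},
)

# Dry run: print the command lines so they can be reviewed before use.
for argv in commands:
    print(shlex.join(argv))
```

Both properties apply to data written after the change, so a benchmark file should be rewritten after changing them for the new settings to matter.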
The thing I wanted to do was to put the log onto another device, and
following the man pages on this resulted in errors. zpool would not
> Have you noticed the "small vdev" advice on the Solaris Internals Wiki ?
Yeah, they mention 10 drives or less. I tried it with two 8-drive
vdevs, 1x 16-drive vdev, and a few other things.
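To make the two layouts above concrete, the following sketch builds the `zpool create` command line for 16 drives arranged as one 16-drive vdev or as two 8-drive vdevs, and counts the drives left for data in each case. It assumes single-parity `raidz` vdevs, which the message does not state, and it uses placeholder device names (`disk0` to `disk15`) and a placeholder pool name (`tank`). Nothing is executed; the commands are only printed.

```python
def split_into_vdevs(disks, vdev_count):
    """Split the disk list into vdev_count equally sized groups."""
    if len(disks) % vdev_count != 0:
        raise ValueError("disk count must divide evenly into vdevs")
    size = len(disks) // vdev_count
    return [disks[i:i + size] for i in range(0, len(disks), size)]


def zpool_create_argv(pool, vdevs, vdev_type="raidz"):
    """Build the argv for `zpool create`, one vdev_type group per vdev."""
    argv = ["zpool", "create", pool]
    for vdev in vdevs:
        argv.append(vdev_type)
        argv.extend(vdev)
    return argv


def data_disk_count(vdevs, parity=1):
    """Drives left for data once each vdev gives up `parity` drives."""
    return sum(len(vdev) - parity for vdev in vdevs)


# Placeholder device names; real Solaris names look different.
disks = [f"disk{i}" for i in range(16)]

for vdev_count in (1, 2):
    vdevs = split_into_vdevs(disks, vdev_count)
    print(f"{vdev_count} vdev(s), {data_disk_count(vdevs)} data drives:")
    print("  " + " ".join(zpool_create_argv("tank", vdevs)))
```

Under these assumptions the single 16-drive vdev keeps 15 drives for data and the two 8-drive vdevs keep 14, which is the kind of space-for-performance trade the "small vdev" advice implies.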
> This is probably the single most important hint for ZFS configuration.
> IOW, most of the time you can't just use the same underlying
> configuration with ZFS as the one you (would) use with Linux.
> This means that you may need to trade usable space for performance,
> sometimes in more drastic ways than with ordinary RAID.
Tried a few methods. Understand, we have a preference to show the
fastest possible speed on our units. So we want to figure out how to
tune/tweak zfs for these systems.
> Finally, like it or not, ZFS is often more happy/efficient when it does
> the RAID itself (no "hardware" RAID controller or LVM involved).
The performance on pure zfs sw-only raid was lower (significantly) than
the hardware RAID running solaris. I tried several variations on this.
That and the crashing (driver related I believe) concern me. I would
like to be able to get the performance that some imply I can get out of it.
I certainly would like to be able to tune it.
> PS: regarding your other message in this thread (and your blog), you
> seem confused: the "open source" OS is OpenSolaris, not Solaris 10.
Hmmm .... we keep hearing that "Solaris is open source" without
providing any distinction between Sun Solaris and Open Solaris. Maybe
it is marketing not being precise on this. Ask your Sun sales rep if
Solaris is open source, without specifying which one. The answer will
be "yes". Ambiguity? Yes. On purpose? I dunno.
> The benchmark publishing restriction only applies to Solaris 10 (see
Yup. Will eventually try OpenSolaris on this gear.
> PPS: while I dislike Sun's policy, I specifically remember being told by
> someone from a DOE lab (who did actually evaluate your product about 18
> months ago) that you didn't want their unfavorable benchmarks results to
> be published. You can't have it both ways.
Owie ... no one is having it "both ways", Loic. Everything we are doing
in testing is in the open, and we invite both comment and criticism ...
like "Hey buddy, turn up read-ahead" or "luser, turn off compression."
Our tests and results are open. Others can run them, and report
back results. If they give me permission to publish them, I will. If
they publish them, I may critique them (we reserve the right to respond).
As a note also, you just dragged an external group into this discussion,
and I am guessing that they really didn't want to be. So I am going to
tread carefully here.
We published a critique of the published "evaluation", pointing to the
faults, and doing a thorough job of analyzing the same. We didn't deny
them the right to publish their results. As a result, we got a rather
nasty email/blog post trail in return. I still have it in my
mail archives, and it is hidden in the blog archives. I won't rehash
it, other than to point out that some on this list would take issue with
I removed my critique after they asked me to, with them promising in
return to amend and address my criticisms. As far as I can tell, they
withdrew their report, and did not amend or address my criticisms.
More curious are the reports that the group responsible for this report
has run away from their (formerly) preferred platform towards a BlueArc
platform. There was a nice quote to this effect (moving forward with
BlueArc) from the principal author of the report last year in HPCWire,
in the role they had been considering the other unit (thumper) for.
This said, they were free to use the unit and publish benchmark results,
which they did. We criticized the benchmark they did for its flaws in
analysis, in execution, and setup, as we were free to do.
Nobody is having it "both ways", Loic. We reserve the right to respond,
and we did. We did not ask them to take down the report. They did ask
us to take our criticisms of their report down.
FWIW: I will not name or divulge the group's name in public or private.
I ask that anyone with knowledge of this group also keep their
names/affiliation out of the discussion. Loic dragged them in here, and
I would like to accord them some measure of privacy, no matter whether I
agree or disagree with them.
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615