[Beowulf] I/O bound simulation
landman at scalableinformatics.com
Fri Nov 30 15:33:40 PST 2007
Mark Kozikowski wrote:
> When I approach the higher fidelity levels, the simulation starts
> to choke on the quantity of data being processed.
> It appears that the system is failing on I/O: transferring large
> amounts of time-critical data between process elements.
Could you describe what you mean by "choke" and what you mean by
"failing on I/O"? This would help enormously.
Also, could you tell us what
reports, as well as what type of network you are using, and the NIC and
switch type for laughs? (Intel, Broadcom, SMC, ...)
> I am running on a mostly standard Red Hat distro, no special
> compiling or running architectures are in place.
Is this something you built from source? Using MPI?
> Do any of you have suggestions as to how I might start
> getting control of this I/O problem?
The first step is problem identification, on which you may already have
a good start. It would help to know the details I asked about above.
Also, it might be worth grabbing a copy of dstat
(http://dag.wieers.com/rpm/packages/dstat/) and atop
(http://dag.wieers.com/rpm/packages/atop/) and install them. Dstat is
your friend (though it does make mistakes on aggregate I/O calculations,
it is useful for figuring out other relevant information). Atop is your
friend on your file server node.
Run atop on the head node, and dstat on the compute nodes while running
your job. Try to capture some of this output ... simple cut and paste
is fine. If you can show a "choked" versus "non-choked" run, this would
help immensely in diagnostics.
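
If dstat is not yet installed on a compute node, a rough stand-in for its
network throughput columns can be sketched in Python. This is only a minimal
sketch under stated assumptions, not how dstat itself works: it assumes a
Linux /proc/net/dev, and the function names (parse_net_dev, rates, log_rates)
are made up for illustration.

```python
import time


def parse_net_dev(text):
    """Turn the text of /proc/net/dev into {interface: (rx_bytes, tx_bytes)}."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 9:
            continue
        # Field 0 is bytes received, field 8 is bytes transmitted.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def rates(before, after, seconds):
    """Bytes per second (rx, tx) for each interface seen in both snapshots."""
    result = {}
    for iface, (rx1, tx1) in after.items():
        if iface in before:
            rx0, tx0 = before[iface]
            result[iface] = ((rx1 - rx0) / seconds, (tx1 - tx0) / seconds)
    return result


def log_rates(samples=12, interval=5.0, path="/proc/net/dev"):
    """Print one line per interface per interval, suitable for cut and paste."""
    with open(path) as f:
        previous = parse_net_dev(f.read())
    for _ in range(samples):
        time.sleep(interval)
        with open(path) as f:
            current = parse_net_dev(f.read())
        stamp = time.strftime("%H:%M:%S")
        for iface, (rx, tx) in sorted(rates(previous, current, interval).items()):
            print(f"{stamp} {iface:>8} rx={rx:12.0f} B/s tx={tx:12.0f} B/s")
        previous = current
```

Calling log_rates() once during a "choked" run and once during a
"non-choked" run gives two short logs that can be pasted side by side. The
defaults (12 samples, 5 second interval) are arbitrary choices.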
Once we are sure where the point of pain is, the next steps would be
planning for remediation of the same.
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615