[Beowulf] Partial OT: CPU grouping control for Windows 2008 R2 x64 server for big calcs

Joe Landman landman at scalableinformatics.com
Thu Jan 12 12:45:16 PST 2012


Ok, this one is fun.  For some definitions of fun.  Unusual definitions 
of fun...  And there is a question towards the end.  This is for folks 
who've been administering clusters and HPC systems with big Windows 
machines (32+ CPUs and large RAM).

Imagine you have a machine as part of a very loose computing cluster. 
End user wants to run Windows (2008 R2 x64 Enterprise) on it.  This 
machine has 32 processor cores (real ones, no hyperthreading) and 1 TB 
of RAM.

Yeah, it's a fun machine to work on.  I won't discuss the OS choice here. 
You can see some of my playing with it here: 
http://scalability.org/?p=3541 and http://scalability.org/?p=3515

Windows lets up to 64 logical processors be part of a "group".  A group 
is a scheduling artifice, and not necessarily directly related to the 
NUMA system ... think of it as an abstraction layer above it.
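
The addressing this implies can be sketched in a few lines of Python. 
This is only an illustrative model: the function name and the list of 
group sizes are hypothetical, supplied by the caller rather than queried 
from Windows.  It shows how a flat CPU number turns into a (group, mask) 
pair once processors are split into groups of at most 64.

```python
# Illustrative model only: group sizes are passed in, not read from Windows.
MAX_GROUP_SIZE = 64  # Windows limit on logical processors per group


def to_group_affinity(cpu_index, group_sizes):
    """Map a flat logical CPU index to a (group, mask) pair."""
    if cpu_index < 0:
        raise IndexError("cpu_index must be non-negative")
    if any(size < 1 or size > MAX_GROUP_SIZE for size in group_sizes):
        raise ValueError("each group must hold 1..64 logical processors")
    base = 0
    for group, size in enumerate(group_sizes):
        if cpu_index < base + size:
            # The mask bit is relative to the start of the group.
            return group, 1 << (cpu_index - base)
        base += size
    raise IndexError("cpu_index is beyond the last group")


# A 32-core machine split into two groups of 16: CPU 20 is bit 4 of group 1.
print(to_group_affinity(20, [16, 16]))
# The same CPU when all 32 cores sit in a single group: bit 20 of group 0.
print(to_group_affinity(20, [32]))
```

A program that only ever looks at the mask, and never at the group 
number, sees just 16 of the 32 cores in the first layout.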

Ok, still with me?

This scheduling artifice, these groups, require at minimum a 
recompilation to work with properly.  It's actually more than that: the 
code has to handle some additional processor affinity bits (a group 
number on top of the usual mask).  If you have a code which doesn't 
handle this correctly, it will probably crash.  Or not work well.  Or 
both.
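
For a sense of what "group aware" means at the API level, here is a 
minimal sketch using Python's ctypes, assuming a 64-bit Windows 7 / 
2008 R2 or later system.  GROUP_AFFINITY and SetThreadGroupAffinity are 
the Win32 names; the helper functions make_group_affinity and 
pin_current_thread are hypothetical and exist only for illustration.  On 
anything other than Windows the pinning call is skipped.

```python
import ctypes
import sys


class GROUP_AFFINITY(ctypes.Structure):
    # Mirrors the Win32 layout: KAFFINITY Mask, WORD Group, WORD Reserved[3].
    _fields_ = [
        ("Mask", ctypes.c_size_t),
        ("Group", ctypes.c_ushort),
        ("Reserved", ctypes.c_ushort * 3),
    ]


def make_group_affinity(group, cpus):
    """Build a GROUP_AFFINITY selecting CPUs (numbered 0..63) in one group."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 64:
            raise ValueError("CPU numbers are relative to the group: 0..63")
        mask |= 1 << cpu
    return GROUP_AFFINITY(Mask=mask, Group=group)


def pin_current_thread(group, cpus):
    """Pin the calling thread; returns None when not running on Windows."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentThread.restype = ctypes.c_void_p
    kernel32.SetThreadGroupAffinity.argtypes = [
        ctypes.c_void_p,
        ctypes.POINTER(GROUP_AFFINITY),
        ctypes.POINTER(GROUP_AFFINITY),
    ]
    kernel32.SetThreadGroupAffinity.restype = ctypes.c_int
    wanted = make_group_affinity(group, cpus)
    ok = kernel32.SetThreadGroupAffinity(
        kernel32.GetCurrentThread(), ctypes.byref(wanted), None)
    return bool(ok)


if __name__ == "__main__":
    # Ask for the first two CPUs of group 1 (the second group).
    print(pin_current_thread(1, [0, 1]))
```

The older mask-only calls carry no group field, so a code built around 
them has no way to name a processor outside the group it started in.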

Matlab appears to be such a beast.  This isn't necessarily a Matlab 
issue per se, it appears to be something of a design compromise issue in 
Windows.  Windows wasn't designed with large processor counts in mind. 
The changes Microsoft would need to make in order to enable a single 
large spanning entity across all CPUs at once are quite likely not in 
the company's best interests, as there are very few customers with such 
machines.

Still with me?  Here's the problem.

Matlab seems to crash (according to the user) if run on a unit with more 
than one processor group.  I've not been able to verify this on the 
machine myself yet, but I have no reason to disbelieve it.  That is the 
symptom as it's been stated to me.

When the unit boots by default, we have two 16-processor groups.  So 
looking at bcdedit examples, I see how to turn off groups.

One minor problem.

It doesn't work.

I can do a

	bcdedit /set groupaware off

and reboot.  That should completely disable groups, so that all 32 
processors are in one group.  Still 2 groups.

I can do a

	bcdedit /set groupsize 64

and reboot.  Still 2 groups.
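
One way to check the group layout after each reboot, sketched with 
Python's ctypes.  GetActiveProcessorGroupCount and 
GetActiveProcessorCount are kernel32 calls available from Windows 7 / 
2008 R2 onward; the wrapper names processor_group_layout and describe 
are hypothetical.  Off Windows the query returns None.

```python
import ctypes
import sys


def processor_group_layout():
    """Return active logical processor counts, one entry per group.

    Returns None when not running on Windows.
    """
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.WinDLL("kernel32")
    kernel32.GetActiveProcessorGroupCount.restype = ctypes.c_ushort
    kernel32.GetActiveProcessorCount.argtypes = [ctypes.c_ushort]
    kernel32.GetActiveProcessorCount.restype = ctypes.c_uint32
    groups = kernel32.GetActiveProcessorGroupCount()
    return [kernel32.GetActiveProcessorCount(g) for g in range(groups)]


def describe(layout):
    """Format a layout such as [16, 16] as a one-line summary."""
    parts = " + ".join(str(n) for n in layout)
    return "%d group(s): %s = %d logical processors" % (
        len(layout), parts, sum(layout))


if __name__ == "__main__":
    layout = processor_group_layout()
    print(describe(layout) if layout is not None else "not running on Windows")
```

On the machine described here, a layout of [16, 16] would mean the 
setting was ignored, and [32] would mean it took effect.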

So far, the only thing that seems to change this is installing the 
Hyper-V role.  With that, there is now 1 group.

Looking at all the boot options with

	bcdedit /enum

there's only one config for boot, and it's the default.

So ... my questions:

1) Does Windows really ignore its own boot options, the approximate 
equivalent of options on a grub kernel line?

2) Is there any way to compel Windows to do the right thing?

As noted, this is for a computing cluster.  Our recommended OS isn't 
feasible right now for them and their application.

Definitely annoying.  I'd love there to be a BIOS setting to help 
Windows past its desire to ignore my requested number of groups.  Not 
sure if adding in Hyper-V will impact performance (I did some basic 
testing with Scilab, and I didn't see anything I'd call significant).

Will be bugging Microsoft about this as well (pretty obviously a bug in 
2008R2 x64).

And related to this, I read something about limits in the different 
Windows editions.  Is anyone using Windows HPC cluster on big memory 
machines with lots of cores?  The Microsoft docs indicate some 
relatively low limits on RAM and processor count.  So does this mean 
that they won't be supporting 4-socket Interlagos machines (16 cores 
per socket, 1/2 TB RAM) as compute nodes for Windows HPC?  I am just 
imagining someone buying a few of those nodes and being required to buy 
Enterprise or Datacenter licenses for those machines (which clearly 
would not be used for anything more than HPC).




-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


