[Beowulf] Nvidia K20 + Supermicro mobo

Fri Jul 19 10:40:51 PDT 2013

I'd like to chime in and say that I could not get the NVIDIA drivers working
with RHEL 6.x until I added rdblacklist=nouveau to my kernel args, too.

Prentice
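
A quick way to confirm that both pieces are in place on a running node is
to look at /proc/cmdline and /proc/modules. Below is a minimal sketch in
Python, not a tested tool: the function names are made up for illustration,
and it only looks for the two arguments mentioned in this thread
(rdblacklist=nouveau and nouveau.modeset=0).

```python
def read_text(path):
    """Return the contents of a file such as /proc/cmdline or /proc/modules."""
    with open(path) as f:
        return f.read()


def nouveau_status(cmdline, modules_text):
    """Report whether nouveau is blacklisted on the kernel line and loaded.

    cmdline      -- contents of /proc/cmdline
    modules_text -- contents of /proc/modules (one module per line,
                    module name in the first column)
    """
    args = cmdline.split()
    loaded = {line.split()[0] for line in modules_text.splitlines() if line.strip()}
    return {
        "rdblacklist": "rdblacklist=nouveau" in args,
        "modeset_off": "nouveau.modeset=0" in args,
        "nouveau_loaded": "nouveau" in loaded,
    }
```

On a node you would call it as
nouveau_status(read_text("/proc/cmdline"), read_text("/proc/modules")).
If nouveau_loaded comes back True, the blacklist did not take effect.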

On 07/16/2013 02:43 PM, Alex Chekholko wrote:
> I see on our GPU compute nodes, configured by a colleague, we use this
> kernel line during install:
>
> # rocks list bootaction | grep gpu
> gpuinstall:       vmlinuz-6.0-x86_64    initrd.img-6.0-x86_64 ks
> ramdisk_size=150000 lang= devfs=nomount pxe kssendmac selinux=0 noipv6
> ksdevice=bootif xdriver=vesa rdblacklist=nouveau nouveau.modeset=0
>
> This is RHEL6 (Rocks 6.0) on HP SL250s hardware.  I think they didn't
> boot correctly without blacklisting nouveau.
>
> Hope that helps.
>
> Regards,
> Alex
>
> On Tue, Jul 16, 2013 at 10:44 AM, Adam DeConinck <ajdecon at ajdecon.org> wrote:
>> Hi Mikhail,
>>
>> I've seen similar messages on CentOS when the Nouveau drivers are
>> loaded and a Tesla K20 is installed. You should make sure that nouveau
>> is blacklisted so the kernel won't load it.
>>
>> Note that it hasn't always been enough for me to have nouveau listed
>> in /etc/modprobe.d/blacklist; sometimes I've had to actually put
>> "rdblacklist=nouveau" on the kernel line.
>>
>> Disclaimer: I work at NVIDIA, but I haven't touched OpenSUSE in forever.
>>
>> Cheers,
>> Adam
>>
>> On Tue, Jul 16, 2013 at 10:29 AM, Mikhail Kuzminsky <mikky_m at mail.ru> wrote:
>>> I want to test an NVIDIA GPU (PNY Tesla K20c) with our own application, for future use in our cluster. But I ran into problems installing the NVIDIA driver (v. 319.32) on OpenSUSE 12.3, kernel 3.7.10-1.1.
>>>
>>> First of all, before starting the driver installation, I see messages about BAR registers that look strange to me:
>>> -----------------------from /var/log/messages------
>>> 2013-07-04T01:43:43.666022+04:00 c6ws4 kernel: [ 0.421559] pci 0000:00:01.0: BAR 15: can't assign mem pref (size 0x18000000)
>>> 2013-07-04T01:43:43.666024+04:00 c6ws4 kernel: [ 0.421563] pci 0000:00:01.0: BAR 14: assigned [mem 0xe1000000-0xe1ffffff]
>>> 2013-07-04T01:43:43.666025+04:00 c6ws4 kernel: [ 0.421566] pci 0000:00:16.1: BAR 0: assigned [mem 0xe0001000-0xe000100f 64bit]
>>> 2013-07-04T01:43:43.666026+04:00 c6ws4 kernel: [ 0.421576] pci 0000:01:00.0: BAR 1: can't assign mem pref (size 0x10000000)
>>> 2013-07-04T01:43:43.666027+04:00 c6ws4 kernel: [ 0.421579] pci 0000:01:00.0: BAR 3: can't assign mem pref (size 0x2000000)
>>> 2013-07-04T01:43:43.666027+04:00 c6ws4 kernel: [ 0.421581] pci 0000:01:00.0: BAR 0: assigned [mem 0xe1000000-0xe1ffffff]
>>> 2013-07-04T01:43:43.666028+04:00 c6ws4 kernel: [ 0.421584] pci 0000:01:00.0: BAR 6: can't assign mem pref (size 0x80000)
>>> 2013-07-04T01:43:43.666029+04:00 c6ws4 kernel: [ 0.421586] pci 0000:00:01.0: PCI bridge to [bus 01]
>>> -----------------------------------------------------------------------------------------------
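
For scale: the sizes in the "can't assign" lines above are hexadecimal
byte counts. Here is a small sketch in Python (the names are mine, not
from any existing tool) that pulls them out of a log like this one and
converts them to binary units. It only matches the "can't assign mem pref"
wording shown above.

```python
import re

# Matches lines such as:
#   pci 0000:01:00.0: BAR 1: can't assign mem pref (size 0x10000000)
UNASSIGNED = re.compile(
    r"pci (\S+): BAR (\d+): can't assign mem pref \(size (0x[0-9a-fA-F]+)\)"
)


def human_size(nbytes):
    """Format a byte count using the largest binary unit that divides it."""
    for unit, factor in (("GiB", 1 << 30), ("MiB", 1 << 20), ("KiB", 1 << 10)):
        if nbytes >= factor and nbytes % factor == 0:
            return f"{nbytes // factor} {unit}"
    return f"{nbytes} B"


def unassigned_bars(log_text):
    """Return (device, BAR number, readable size) for each unassigned BAR."""
    return [
        (m.group(1), int(m.group(2)), human_size(int(m.group(3), 16)))
        for m in UNASSIGNED.finditer(log_text)
    ]
```

Fed the log above, this gives 384 MiB for BAR 15 of the bridge at
0000:00:01.0, and 256 MiB, 32 MiB and 512 KiB for BARs 1, 3 and 6 of the
card at 0000:01:00.0. So the kernel could not place the GPU's 256 MiB
prefetchable BAR 1, which is the same BAR that the NVRM message further
down reports as "0M @ 0x0".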
>>>
>>> Maybe these are symptoms of a hardware/BIOS error (Supermicro X9SCA-F, latest BIOS v. 2.0b)? I tried both BIOS modes, with "Above 4G Decoding" enabled and disabled.
>>>
>>> It looks to me as though the NVIDIA driver uses BAR 1 (see below). nvidia-installer.log also had some messages that were unclear to me, but the installer reported that the kernel interface of nvidia.ko was compiled. After that, however, nvidia-installer.log contains:
>>>
>>> --------------------------from nvidia-installer.log ----------------------------------
>>> -> Kernel module load error: No such device
>>> -> Kernel messages:
>>> ...[ 25.286079] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
>>> [ 1379.760532] nvidia: module license 'NVIDIA' taints kernel.
>>> [ 1379.760536] Disabling lock debugging due to kernel taint
>>> [ 1379.765158] nvidia 0000:01:00.0: enabling device (0140 -> 0142)
>>> [ 1379.765165] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
>>> [ 1379.765165] NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
>>> [ 1379.765166] NVRM: The system BIOS may have misconfigured your GPU.
>>> [ 1379.765169] nvidia: probe of 0000:01:00.0 failed with error -1
>>> [ 1379.765177] NVRM: The NVIDIA probe routine failed for 1 device(s).
>>> [ 1379.765178] NVRM: None of the NVIDIA graphics adapters were initialized!
>>> ---------------------------------------------------------------------------------------------
>>>
>>> I am also including an extract from lspci -v:
>>>
>>> 01:00.0 3D controller: NVIDIA Corporation GK107 [Tesla K20c] (rev a1)
>>>          Subsystem: NVIDIA Corporation Device 0982
>>>          Flags: fast devsel, IRQ 11
>>>          Memory at e1000000 (32-bit, non-prefetchable) [disabled] [size=16M]
>>>          Memory at <unassigned> (64-bit, prefetchable) [disabled]
>>>          Memory at <unassigned> (64-bit, prefetchable) [disabled]
>>>
>>> Do the kernel messages above mean that I have a hardware/BIOS problem, or could it be an NVIDIA driver problem?
>>>
>>> Mikhail Kuzminsky
>>> Computer Assistance to Chemical Research Center
>>> Zelinsky Institute of Organic Chemistry
>>> Moscow
>>>
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf