[Beowulf] AMD is looking for expert HPC/AI sysadmins/SRE

Joe Landman joe.landman at gmail.com
Thu Jun 12 01:00:46 UTC 2025


Hi folks:

    Quick post for the day job.  AMD (my employer) is looking for expert 
systems administrators to run a mix of our internal HPC systems and to 
help customers stand up their AI and HPC clusters.

    AMD systems include a small version of Frontier, some El Cap-adjacent 
nodes, and a variety of large GPU-accelerator-based nodes.  
Customer systems range from smaller 64-node systems to systems multiple 
orders of magnitude larger.

    Needed skills/attributes include:

  * 5+ years in an HPC systems admin/HPC SRE role
  * expert Linux knowledge, debugging, problem resolution
  * strong hardware debugging experience
  * Slurm management, setup, configuration
  * development experience in Python, Bash, C/C++
  * RDMA network setup/config/testing
  * Benchmarking and performance measurement
  * Monitoring systems
  * Storage systems, including Lustre, NFS, BeeGFS, etc.
  * Installing and configuring device drivers for advanced hardware:
    GPUs and networks
  * Modules and configuration (HPE/Cray and Lmod)
  * capability to work in/around AMD and customer data centers, and
    occasional travel to those DCs

     Desired experience/attributes include:

  * Proximity to the Austin, TX or Santa Clara/San Jose offices, though
    remote is possible
  * CUDA and/or ROCm experience
  * HPE/Cray programming environment and modules
  * familiarity with AI frameworks
  * US Citizenship or green card

   I don't have a job req to point to yet, but should have one soon.  
You can reach me here, or on https://linkedin.com/in/joelandman .  I am 
the hiring manager.

   Regards

Joe