[Beowulf] AMD is looking for expert HPC/AI sysadmins/SRE
Joe Landman
joe.landman at gmail.com
Thu Jun 12 01:00:46 UTC 2025
Hi folks:
Quick post for the day job. AMD (my employer) is looking for expert
systems administrators to work on our internal HPC systems and to
help customers stand up their AI and HPC clusters.
AMD systems include a small version of Frontier, some El Cap-adjacent
nodes, and a variety of large GPU-accelerated nodes.
Customer systems range from small 64-node systems to systems multiple
orders of magnitude larger.
Needed skills/attributes include:
* 5+ years in an HPC systems admin/HPC SRE role
* expert Linux knowledge, debugging, problem resolution
* strong hardware debugging experience
* Slurm management, setup, configuration
* development experience in Python, Bash, C/C++
* RDMA network setup/config/testing
* Benchmarking and performance measurement
* Monitoring systems
* Storage systems, including Lustre, NFS, BeeGFS, etc.
* Installing and configuring device drivers for advanced hardware:
GPUs and networks
* Modules and configuration (HPE/Cray and Lmod)
* capability to work in/around AMD and customer data centers, and
occasional travel to those DCs
Desired experience/attributes include:
* Proximity to the Austin, TX or Santa Clara/San Jose offices, though
remote is possible
* CUDA and/or ROCm experience
* HPE/Cray programming environment and modules
* familiarity with AI frameworks
* US Citizenship or green card
I don't have a job req to point to yet, but should have one soon.
You can reach me here, or on https://linkedin.com/in/joelandman . I am
the hiring manager.
Regards
Joe