What are the responsibilities and job description for the Senior HPC Systems Engineer position at Corvid Technologies LLC?
Corvid Technologies is seeking a Sr. HPC Systems Engineer with a strong background in and enthusiasm for Linux to support our Linux-based high-performance computer consisting of 80,000 processor cores. If you enjoy learning, playing with hardware, optimizing performance and efficiency, and spending most of your time on the command line, this is the job for you.
This candidate will be responsible for the following:
- Compile (with icc optimizations), install, and test HPC software for internal and external customers
- Manage Linux license servers and licensed applications for internal and external HPC customers
- Write and troubleshoot custom job scheduler submission scripts for internal and external HPC customers
- Troubleshoot slow, hanging, or failing HPC jobs on internal or customer HPC clusters
- Automate repetitive tasks and implement custom solutions using scripting/programming languages such as Bash or Python
- Configure and troubleshoot a heterogeneous (FDR, EDR, HDR, NDR) InfiniBand network and associated subnet manager
- Provide guidance and support on HPC best practices and solutions for internal and external customers
- Design, test, and implement an HPC environment consisting of a provisioner (e.g. xCAT, Warewulf), a scheduler (e.g. Slurm, SGE, PBS), RDMA connections (e.g. InfiniBand), a subnet manager, and 5 compute nodes
- Troubleshoot and monitor resource utilization/availability on Linux servers
- Configure, maintain, and troubleshoot HPC scheduler issues
- Install and configure cluster nodes on internal HPC cluster
- Troubleshoot hardware and software issues on HPC cluster nodes
Requirements:
- Bachelor’s degree in engineering or a related STEM field (master’s preferred)
- Hands-on experience with at least one distributed file system (Spectrum Scale/GPFS, Lustre, BeeGFS, Gluster, IMRIX, PVFS, etc.)
- Experience installing, configuring, and maintaining job management tools (such as Slurm, Moab, TORQUE, PBS)
- Experience with configuration management tools such as Ansible or Puppet
- Experience with operating system deployment tools (e.g. xCAT, Rocks)
- Direct experience working with InfiniBand
- 8 years of scripting experience
- 8 years of professional experience using command-line Linux (RHEL derivatives preferred)
- Experience with one or more engineering computational codes OR 2 years of IT-related experience (e.g., user support, basic networking, Linux server administration, a home Linux environment)
- Ability to obtain and maintain a U.S. security clearance
Preferred Skills:
- Past experience with facility and system architecture work within a large HPC environment
- Past experience managing information systems within a classified environment
- Experience configuring, installing, and troubleshooting MPI and OpenMP applications
- Experience configuring, installing, tuning, and maintaining scientific software on large-scale systems
- Experience supporting HPC compilers and libraries
- Experience configuring, installing, maintaining, and using performance monitoring and optimization tools
- Active TS/SCI U.S. security clearance
- Active CompTIA Security+ certification
Why Corvid:
Founded in 2004, we are a group of over 300 engineers and scientists, about three-quarters with master’s degrees or Ph.D.s, who provide end-to-end solutions including concept development, design and optimization, prototype build, test, and manufacture. We leverage the predictive capability of our high-fidelity computational physics solvers, indigenous massively parallel supercomputer system, prototyping plant, and ballistics and mechanics lab to investigate a variety of high-rate physics phenomena.
The results are complex engineering solutions for a variety of applications: aircraft, ballistic missile defense, cybersecurity, motorsports, armor development, biological systems, and missile and warhead design and development. These results are achieved with optimal design and cost efficiency due to the predictive capability of Corvid’s tools and our in-house, end-to-end integrated approach, which differentiates Corvid from the rest of the market.
We value our people and offer employees a broad range of benefits. Benefits for full-time employees include:
- Paid gym membership
- Flexible schedules
- Blue Cross Blue Shield insurance including Medical, Dental and Vision
- 401k match up to 6%
- PTO based on years of experience (3 weeks minimum starting)
- Continued education and training opportunities
- Uncapped incentive opportunities