What are the responsibilities and job description for the HPC Engineer/Architect position at TekWissen LLC?
Overview:
TekWissen is a global workforce management provider headquartered in Ann Arbor, Michigan that offers strategic talent solutions to our clients worldwide. Our client is an American multinational information technology services and consulting company and is a leading provider of information technology, consulting, and business process outsourcing services, dedicated to helping the world's leading companies build stronger businesses.
Job Title: HPC Engineer/Architect
Work Location: New York, NY 10001
Job Type: Contract
Work Type: Hybrid
Duration: 6 Months
Job Summary:
- Support day-to-day operations of large-scale parallel file systems
- Deploy and maintain Linux HPC infrastructure across multiple data centers
- Assist HPC engineers and architects with day-to-day operations and tickets
Experience:
- 16 to 20 years
Required Skills:
- Linux operating systems (RHEL/CentOS), parallel file system (GPFS), job scheduler (LSF/Slurm)
- Ansible, Python, Shell scripting
- GPU-based compute infrastructure (including CUDA)
- CentOS 4.5
- HPCC
Responsibilities:
- Design, architect and oversee implementation of Linux-based HPC clusters and storage
- Deploy physical hardware using HPC deployment tools and configuration and orchestration tools (Ansible)
- Parallel file system (GPFS) performance tuning, monitoring and troubleshooting
- Perform systems benchmarking and develop automated tests for the HPC environment to ensure the reliability and efficiency of the computational infrastructure
- InfiniBand network maintenance and troubleshooting
- Automate and monitor the HPC user lifecycle process
- Slurm installation, configuration, performance tuning and troubleshooting
- Plan, design and implement a transition from the LSF scheduler to Slurm
- Manage the Slurm scheduler and translate research policies into scheduler configurations
- Consult with faculty and students to develop research pipelines for use on the HPC cluster
- Develop and maintain a user lifecycle software suite in Python and implement a CI/CD pipeline
- Test and automate upgrades of critical system applications using Ansible and shell scripts
- The ability to communicate effectively with clinicians, researchers, and other team members to develop technological solutions is key
Qualifications:
- Experience working in a large-scale, research-based HPC environment
- Proven experience working with distributed file storage solutions (e.g., GPFS)
- Experience with deploying and troubleshooting Linux Operating Systems (RHEL/CentOS)
- Experience with scripting and automation (Ansible, Python, Shell scripting)
- Solid understanding of job schedulers (LSF/Slurm)
- Experience with GPU-based compute infrastructure (including CUDA)
TekWissen Group is an equal opportunity employer supporting workforce diversity.