What are the responsibilities and job description for the Senior System Engineer (HPC) position at Laine Recruiting, LLC?
Must have experience in a High-Performance Computing (HPC) environment!
Experience should include:
- InfiniBand
- GPUs
- Parallel File Systems
- SLURM
Overview of Responsibilities
- Leads the analysis of research use-case requirements, the design of solutions, and the deployment of on-premises or cloud-based advanced infrastructure for organization-wide research computing across existing and emerging areas of high-performance computing, including artificial intelligence, big data, modeling, and simulation.
- Leads the design, development, deployment, and maintenance of systems that require creative assembly of specialized computing (e.g. GPUs and accelerators), advanced network (e.g. InfiniBand) and parallel file system (e.g. GPFS) infrastructure. Leverages specialized infrastructure to deploy research computing solutions that are typically unconventional or unorthodox in traditional information technology environments.
- Leads, advises, and completes proactive performance monitoring of high-performance computing and supporting resources, including the analysis, alerting, reporting, and tuning of computational accelerators, high-bandwidth and low-latency networks, parallel file systems, and scheduling / resource management software. Provides capacity analysis, maintenance and troubleshooting activities for advanced infrastructure in the research computing environment. Thinks creatively to respond to performance issues, system errors, and maintenance that are outside of standard information technology process controls and procedures.
- Leads the creation, review, and maintenance of technical documentation, including solution designs and reference guides for institution-wide research computing infrastructure. Prepares necessary paperwork and documentation to ensure compliance with federal funding agency standards and procedures. Leads discussions of new products and services to enhance the delivery of research computing infrastructure; engages vendors as appropriate.
- Maintains a broad knowledge of advanced technology, specialized equipment, and security and research compliance requirements. Remains mindful and vigilant of risks to the research enterprise while consulting with staff, performing work, and planning activities.
Qualifications
- 7 years of technical experience
- 2 years of experience leading a team of System Engineers
- Knowledge of high-performance computing hardware and software
- 1 year of experience with InfiniBand, GPUs, Parallel File Systems and SLURM
- Experience with research networking and research storage solutions
- Expertise with virtualization technology and cloud providers such as Amazon Web Services and Microsoft Azure
- Excellent verbal and written communication skills, exemplary speaking and presentation skills, and the ability to interact with staff, as appropriate, both to communicate and to take in others' input on technical changes and research computing solution recommendations
- Ability to provide on-call support as required, as well as ability to perform after-hours and weekend maintenance and implementation activities
- Ability to travel to and from Data Center facilities
- Strong scripting skills
- Expertise with server operating systems and orchestration and automation tools
This is a remote role, but the person will have to travel to Rochester, NY twice per year for 4 business days. Candidates must be located in the United States.
If you don't meet the above requirements, you will not be considered for the role. Please only read on / apply if you meet these criteria.
Laine Recruiting has been engaged by one of the most respected higher education institutions and largest employers in the Rochester area! We’ve partnered to fill several roles within their research center. This team provides hardware, software, training and support to over 110 departments across the organization.
About the Role
The Sr. Systems Engineer manages and administers advanced on-premises and cloud-based computing, networking, and storage for research. In addition to administering servers and virtual systems, the position requires specialized skills for managing advanced computing architectures (e.g. high-performance computing systems and accelerators such as GPUs), specialized high-speed network technology (e.g. low-latency InfiniBand networks and research cluster topologies), and massive parallel file systems engineered for high-volume and high-velocity data for research.
Responsible for deploying and managing specialized tools for configuring and controlling research computing environments (e.g. SLURM, service nodes, etc.). Responsible for the design, setup and maintenance of an organization-wide research computing infrastructure with monitoring and security, and for communicating about advanced research technology solutions with staff from several departments.