What are the responsibilities and job description for the High-Performance Computing System Administrator position at Yale School of Medicine?
Position Focus :
The Yale Center for Research Computing (YCRC) seeks a High-Performance Computing System Administrator to join the center’s Engineering team to provide hardware and software administration for a growing number of high-performance computing (HPC) clusters used in faculty research. The center is a computational core facility under the Office of the Provost created to support the advanced computing needs of the research community. The YCRC provides support that spans the Yale School of Medicine and Faculty of Arts & Sciences and encompasses Yale’s HPC clusters, multiple petabytes of high-performance storage, and technologies for computational science and the analysis, sharing, and management of large-scale research data.
The successful candidate will support the infrastructure behind all of the above, including hardware, system and resource-management software, networking, storage, monitoring, and security measures. This is a highly collaborative effort, so frequent interaction with other system administrators, research-support staff, management, vendors, and researchers is a regular part of the role. The successful candidate will also participate in designing, recommending, and vetting architectures, specifications, and configurations of new systems, especially those using computational accelerators such as Graphics Processing Units (GPUs) to support Artificial Intelligence (AI) and Machine Learning (ML). To support this, the candidate will research developments in HPC architectures and new technologies, processes, and methodologies, particularly those involving accelerators. This position also involves on-site maintenance at the data centers where the equipment is located. The equipment is currently in two data centers, one in Holyoke, MA and the other in West Haven, CT.
Essential Duties
Configure, deploy, and support HPC clusters used for university research.
Install, administer and maintain hardware, system software, networking, accounts, and security measures to maintain performance, stability, and security.
Troubleshoot and fix issues with HPC hardware.
Deploy and support large-scale data storage and backup for critical research data.
Diagnose and correct system issues, whether they affect correct operation or performance.
Restore system integrity as quickly as possible following an outage to minimize downtime.
Manage end-user accounts.
Triage and solve user-submitted tickets related to HPC infrastructure.
Track system health and resource usage using monitoring software, and respond to issues.
Develop and maintain documentation for team members and occasionally for end users.
Research developments in HPC architectures and new technologies, processes, and methodologies.
Update and patch system software and firmware as needed to maintain performance and security.
Participate in determination of specifications for new systems, and tailor these to meet research needs.
Perform on-site installations and maintenance at data centers.
Apply technical expertise to identify and resolve system deficiencies.
Provide system services and analyze system performance for stakeholders and intended end users.
Perform other duties as assigned.
Required Education and Experience
Bachelor's Degree in a related field and a minimum of four years of related work experience or an equivalent combination of education and experience.
Required Skill / Ability 1 :
Proven expertise with Linux operating system distributions.
Required Skill / Ability 2 :
Expertise with bash and at least one other scripting language. Demonstrated expertise with Linux system administration, including OS, networking, storage, and security.
Required Skill / Ability 3 :
Proven ability to work in a team environment in a fast-moving technology field.
Required Skill / Ability 4 :
Excellent verbal and written communication skills. Ability to interact well with team members and end users. Ability to work independently and across units.
Required Skill / Ability 5 :
Attention to detail with the proven ability to take the care necessary to be entrusted with a system that hundreds of users depend on for research computation and the storage of research data.
Preferred Education, Experience and Skills :
- HPC clusters, preferably including their administration
- Computational accelerators such as GPUs
- Cluster provisioning and management tools
- Batch schedulers
- Technology in a research environment
- High-speed networking, such as InfiniBand
- Large storage systems and parallel file systems such as GPFS and Lustre
- Server hardware component replacement
- Working in a data-center or server-room environment
Weekend Hours Required?
Occasional
Evening Hours Required?
Occasional
Drug Screen
Health Screening
Background Check Requirements
All candidates for employment will be subject to pre-employment background screening for this position, which may include motor vehicle, DOT certification, drug testing, and credit checks based on the position description and job requirements. All offers are contingent upon the successful completion of the background check. For additional information on the background check requirements and process, visit "Learn about background checks" under the Applicant Support Resources section of Careers on the It's Your Yale website.
Posting Disclaimer
The intent of this job description is to provide a representative summary of the essential functions that will be required of the position and should not be construed as a declaration of specific duties and responsibilities of the particular position. Employees will be assigned specific job-related duties through their hiring departments.