What are the responsibilities and job description for the HPC Engineer - Hybrid position at Caris Life Sciences?
Position Summary
An HPC (High Performance Computing) Engineer is responsible for implementing and maintaining HPC systems that run primarily on Linux. The work involves installing, configuring, optimizing, and troubleshooting hardware and software components within a complex cluster environment. It often requires expertise in parallel processing, network architecture, and job scheduling tools such as LSF, along with ensuring optimal system performance and providing user support.
Job Responsibilities
- Installing and configuring Linux operating systems on HPC clusters, including network settings, storage systems, and parallel file systems like GPFS.
- Monitoring system performance, identifying bottlenecks, and tuning system parameters to maximize computational efficiency.
- Managing user job submissions and queues using tools like LSF or Slurm, ensuring fair allocation of computing resources.
- Implementing security measures to protect HPC systems and data from unauthorized access.
- Diagnosing and resolving hardware and software issues, applying updates and patches, and performing routine system maintenance.
- Providing technical assistance to researchers and other users on the HPC system, including account management and application support.
- Forecasting future computing needs and planning for system upgrades or expansions
Requirements
- 4 years of experience managing Linux servers; direct experience managing HPC clusters preferred.
- Technical experience with system configuration, implementation, management and user support.
- Strong understanding of Linux system administration
- Expertise in parallel computing concepts and programming paradigms.
- Knowledge of high-performance networking technologies
- Familiarity with cluster management tools (e.g., LSF, Slurm, PBS)
- Experience with distributed file systems (Lustre, Ceph, GPFS)
- Scripting languages such as Python and shell (e.g., bash, ksh) for automation
- Understanding of computer architecture and performance optimization techniques
- Strong Linux system administration skills: Expertise in Linux commands, system configuration, and troubleshooting.
- HPC cluster knowledge: Understanding of cluster architectures, network topologies (like InfiniBand), and parallel processing concepts.
- Job scheduling tools: Proficiency with job scheduling systems like LSF or Slurm
- Performance analysis tools: Familiarity with tools to monitor and analyze system performance
- Scripting languages: Ability to write scripts (e.g., Bash, Python) for automation and system management
- Networking expertise: Understanding of network protocols, network troubleshooting, and high-speed networking technologies
- Storage management: Knowledge of parallel file systems and data management strategies
- Experience with HPC schedulers and resource managers
- Experience writing user documentation
- Experience developing and delivering training for users
- Strong technical and analytical skills
- Strong verbal and written communication skills
- Always maintains the highest level of professionalism when interacting with internal and external customers
- Demonstrates a high-energy, positive attitude and commitment to quality customer service
- Contributes to a positive team environment within the center by demonstrating a strong work ethic, effectively communicating with others, and proactively anticipating center and user needs
- Experience coordinating and running support teams
- Ability to lift, move and install HPC data center hardware and supplies.
- Standing for extended periods while performing data center related tasks.
- All job-specific, safety, and compliance training is assigned based on the job functions associated with this employee.
- This position requires periodic travel and some evenings, weekends, and/or holidays.
- Job may require after-hours response to emergency issues.
- Periodically scheduled on-call duty may require after-hours response to technical emergencies not explicitly related to assigned job responsibilities.
This job description reflects management’s assignment of essential functions. Nothing in this job description restricts management’s right to assign or reassign duties and responsibilities to this job at any time.
Caris Life Sciences is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, religion, color, national origin, gender, gender identity, sexual orientation, age, status as a protected veteran, among other things, or status as a qualified individual with disability.