Sr. HPC Architect - Hybrid at Caris Life Sciences
Position Summary
A Senior HPC Architect designs and optimizes high-performance computing (HPC) systems, applying expertise in parallel programming, performance analysis, and hardware architecture to build scalable, efficient solutions for demanding computational workloads. The role often involves collaborating with software developers and hardware engineers to achieve optimal performance across complex scientific or data-intensive applications.
Job Responsibilities
- System Design and Implementation:
  - Architecting and designing high-performance computing clusters, selecting appropriate hardware components like CPUs, GPUs, storage systems, and networking infrastructure.
  - Installing and configuring operating systems (typically Linux) on cluster nodes.
  - Setting up and managing distributed file systems (like Lustre, Ceph, GPFS) for large data storage and access.
  - Implementing job scheduling systems (e.g., LSF, Slurm, PBS) to manage workload distribution across the cluster.
- Performance Optimization:
  - Monitoring system performance metrics (CPU utilization, memory usage, network bandwidth) to identify bottlenecks and optimize resource allocation.
  - Benchmarking applications and performing performance analysis to identify areas for improvement.
  - Tuning application code for parallel processing to leverage the power of the HPC cluster.
- User Support:
  - Providing technical support to researchers and users on how to access and utilize the HPC system.
  - Training users on best practices for submitting jobs and optimizing their applications for the HPC environment.
  - Troubleshooting user issues related to application execution, data management, and system access.
- System Administration:
  - Managing system updates, patching, and security configurations to maintain a stable and secure HPC environment.
  - Implementing backup and disaster recovery procedures for critical data and system configurations.
  - Monitoring system health and proactively addressing potential issues through alerts and logging systems.
- Minimum of five years’ experience in Linux systems administration.
- Bachelor's degree in computer science, engineering, math, or a scientific discipline with 2 years of systems engineering experience; or 6 years’ experience in HPC architecture.
- Hands-on HPC architecture design experience, including storage, file systems, InfiniBand, security, authentication, and compute architecture
- Experience using Git to manage shared software configuration code bases
- Hands-on experience with cloud-based services (e.g., Azure, AWS, GCP).
- Good understanding of storage administration and optimization, such as performing upgrades and defining RAID configurations.
- Deep understanding of parallel computing concepts and programming paradigms (MPI, OpenMP, CUDA).
- Expertise in performance analysis tools and techniques to identify and address performance bottlenecks.
- Knowledge of HPC hardware architectures, including processors, memory subsystems, network fabrics, and interconnects
- Familiarity with HPC software stack components like compilers, runtime systems, job schedulers, and scientific libraries
- Strong programming skills in languages commonly used in HPC (C, C++, Fortran)
- Strong scripting skills for automation (e.g., bash, ksh, Perl, Python)
- Experience with system administration and cluster management tools (e.g., LSF, Slurm, PBS)
- Experience with distributed file systems (Lustre, Ceph, GPFS)
- Excellent communication and problem-solving abilities to effectively collaborate with cross-functional teams
- Experience in life sciences, healthcare and/or research institutions highly preferred
- Experience building and installing scientific software and other third-party software applications on HPC systems
- Experience with HPC schedulers and resource managers
- Experience executing scientific software on HPC systems
- Experience writing user documentation
- Strong technical and analytical skills
- Strong verbal and written communication skills
- Always maintains the highest level of professionalism when interacting with internal and external customers
- Demonstrates a high-energy, positive attitude and commitment to quality customer service
- Contributes to a positive team environment within the center by demonstrating a strong work ethic, effectively communicating with others, and proactively anticipating center and user needs
- Experience coordinating and running support teams
- Related industry certifications preferred.
- Ability to lift, move and install HPC data center hardware and supplies.
- Standing for extended periods while performing data center related tasks.
- All job-specific, safety, and compliance training is assigned based on the job functions associated with this position.
- This position requires periodic travel and some evenings, weekends, and/or holidays.
- Job may require after-hours response to emergency issues.
- Periodically scheduled on-call duty may require after-hours response to technical emergencies not explicitly related to assigned job responsibilities.
This job description reflects management’s assignment of essential functions. Nothing in this job description restricts management’s right to assign or reassign duties and responsibilities to this job at any time.
Caris Life Sciences is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, religion, color, national origin, gender, gender identity, sexual orientation, age, status as a protected veteran, among other things, or status as a qualified individual with disability.