Staff Machine Learning Engineer | Large Scale AI Infrastructure

Glocomms
Palo Alto, CA Full Time
POSTED ON 1/13/2025
AVAILABLE BEFORE 4/2/2025

This position will sit within a company that is pioneering a new era of Biomedicine!

Role Overview :

  • GPU Cluster Management : Architect, deploy, and sustain high-performance GPU clusters, ensuring they are stable, reliable, and scalable. Oversee and manage cluster resources to maximize efficiency and utilization.
  • Distributed / Parallel Training : Apply distributed computing techniques to facilitate parallel training of extensive deep learning models across multiple GPUs and nodes. Optimize data distribution and synchronization for faster convergence and reduced training times.
  • Performance Optimization : Enhance GPU clusters and deep learning frameworks to achieve peak performance for specific workloads. Identify and resolve performance bottlenecks through profiling and system analysis.
  • Deep Learning Framework Integration : Work closely with data scientists and machine learning engineers to incorporate distributed training capabilities into the company's model development and deployment frameworks.
  • Scalability and Resource Management : Ensure GPU clusters can scale effectively to meet growing computational demands. Develop strategies for resource management to prioritize and allocate computing resources based on project needs.
  • Troubleshooting and Support : Diagnose and resolve issues related to GPU clusters, distributed training, and performance anomalies. Provide technical support to users and efficiently resolve technical challenges.
  • Documentation : Develop and maintain documentation on GPU cluster configuration, distributed training workflows, and best practices to facilitate knowledge sharing and smooth onboarding of new team members.

Qualifications :

  • Master's or Ph.D. in computer science or a related field, with a focus on High-Performance Computing, Distributed Systems, or Deep Learning.
  • Over 2 years of proven experience in managing GPU clusters, including installation, configuration, and optimization.
  • Strong expertise in distributed deep learning and parallel training techniques.
  • Proficiency in popular deep learning frameworks such as PyTorch, Megatron-LM, and DeepSpeed.
  • Programming skills in Python and experience with GPU-accelerated libraries (e.g., CUDA, cuDNN).
  • Knowledge of performance profiling and optimization tools for HPC and deep learning.
  • Familiarity with resource management and scheduling systems (e.g., SLURM, Kubernetes).
  • Solid background in distributed systems, cloud computing (AWS, GCP), and containerization (Docker, Kubernetes).
  • Currently holds or has previously held a Staff or equivalent title, or has spent 3 years in a Senior-level title.
  • The company will provide a relocation package for candidates open to relocating!




