What are the responsibilities and job description for the Staff Machine Learning Engineer | Large Scale AI Infrastructure position at Glocomms?

This position sits within a company that is pioneering a new era of biomedicine.

Role Overview :

GPU Cluster Management : Architect, deploy, and sustain high-performance GPU clusters, ensuring they are stable, reliable, and scalable. Oversee and manage cluster resources to maximize efficiency and utilization.
Distributed / Parallel Training : Apply distributed computing techniques to facilitate parallel training of extensive deep learning models across multiple GPUs and nodes. Optimize data distribution and synchronization for faster convergence and reduced training times.
Performance Optimization : Enhance GPU clusters and deep learning frameworks to achieve peak performance for specific workloads. Identify and resolve performance bottlenecks through profiling and system analysis.
Deep Learning Framework Integration : Work closely with data scientists and machine learning engineers to incorporate distributed training capabilities into the company's model development and deployment frameworks.
Scalability and Resource Management : Ensure GPU clusters can scale effectively to meet growing computational demands. Develop strategies for resource management to prioritize and allocate computing resources based on project needs.
Troubleshooting and Support : Diagnose and resolve issues related to GPU clusters, distributed training, and performance anomalies. Provide technical support to users and efficiently resolve technical challenges.
Documentation : Develop and maintain documentation on GPU cluster configuration, distributed training workflows, and best practices to facilitate knowledge sharing and smooth onboarding of new team members.

Qualifications :

Master's or Ph.D. in computer science or a related field, with a focus on High-Performance Computing, Distributed Systems, or Deep Learning.

Over 2 years of proven experience in managing GPU clusters, including installation, configuration, and optimization.

Strong expertise in distributed deep learning and parallel training techniques.

Proficiency in popular deep learning frameworks such as PyTorch, Megatron-LM, and DeepSpeed.

Programming skills in Python and experience with GPU-accelerated libraries (e.g., CUDA, cuDNN).

Knowledge of performance profiling and optimization tools for HPC and deep learning.

Familiarity with resource management and scheduling systems (e.g., SLURM, Kubernetes).

Solid background in distributed systems, cloud computing (AWS, GCP), and containerization (Docker, Kubernetes).

Current or previous Staff (or equivalent) title, or 3 years in a Senior-level title.

The company will provide a relocation package for candidates open to relocating!
