What are the responsibilities and job description for the Staff Machine Learning Engineer | Large Scale AI Infrastructure position at Glocomms?
This position sits within a company that is pioneering a new era of biomedicine!
Role Overview :
- GPU Cluster Management : Architect, deploy, and maintain high-performance GPU clusters, ensuring they are stable, reliable, and scalable. Manage cluster resources to maximize efficiency and utilization.
- Distributed / Parallel Training : Apply distributed computing techniques to train large deep learning models in parallel across multiple GPUs and nodes. Optimize data distribution and synchronization for faster convergence and reduced training times.
- Performance Optimization : Tune GPU clusters and deep learning frameworks to achieve peak performance for specific workloads. Identify and resolve performance bottlenecks through profiling and system analysis.
- Deep Learning Framework Integration : Work closely with data scientists and machine learning engineers to incorporate distributed training capabilities into the company's model development and deployment frameworks.
- Scalability and Resource Management : Ensure GPU clusters can scale effectively to meet growing computational demands. Develop strategies for resource management to prioritize and allocate computing resources based on project needs.
- Troubleshooting and Support : Diagnose and resolve issues related to GPU clusters, distributed training, and performance anomalies. Provide prompt technical support to users.
- Documentation : Develop and maintain documentation on GPU cluster configuration, distributed training workflows, and best practices to facilitate knowledge sharing and smooth onboarding of new team members.
Qualifications :
The company will provide a relocation package for candidates who are open to relocating!