What are the responsibilities and job description for the Member of Technical Staff position at Impax Recruitment?
We are partnering with a startup that is building physics foundation models for climate prediction and control.
Key Responsibilities:
- Build and Manage Large-Scale ML Infrastructure: Architect and maintain distributed systems to support training and inference of large machine learning models, ensuring optimal performance across all stages.
- Design Scalable Pipelines: Develop and implement end-to-end data processing pipelines capable of handling massive datasets, from ingestion and transformation to model training and deployment.
- Explore and Test New Training Techniques: Research cutting-edge training methods, including parallelization strategies and precision trade-offs, to improve the performance and scalability of model training.
- Optimize GPU Performance: Analyze and enhance low-level GPU operations to improve efficiency, reduce latency, and maximize hardware utilization in complex ML tasks.
- Stay Updated on Industry Trends: Continuously monitor advancements in ML research to incorporate new ideas and techniques into our systems.
What We're Looking For:
- Strong Problem-Solving and Fast Execution: You should thrive on tackling complex problems with speed and creativity, and adapt quickly to new technologies or challenges.
- Expertise in Optimizing ML Workloads: Proven experience in optimizing training and inference for large models, including leveraging advanced techniques like mixed-precision training and hardware optimization.
- Experience with Distributed Training Frameworks: Deep familiarity with frameworks for distributed training of large models, such as FSDP or DeepSpeed.
- Cloud Platform Knowledge: Hands-on experience with major cloud services (e.g., GCP, AWS, or Azure) and their AI/ML offerings for deploying and scaling models.
- Containerization and Orchestration Skills: Proficiency with tools like Docker and Kubernetes for deploying and managing containerized machine learning workloads in cloud environments.
- Distributed Systems and Scalable Serving Expertise: Experience in building scalable task management systems and deploying machine learning models in production environments.
- Monitoring and Observability Practices: Knowledge of best practices for monitoring, logging, and performance tracking in machine learning systems to ensure reliability and support sound version control.
This position requires working fully onsite in San Francisco with startup hours.