What are the responsibilities and job description for the Large Scale ML Infrastructure Architect position at Impax Recruitment?
About the Position
This is an exciting opportunity to join our team at Impax Recruitment as a Large Scale ML Infrastructure Architect. You will be responsible for building and managing large-scale ML infrastructure, designing scalable pipelines, and exploring new training techniques.
Key Responsibilities:
- Architect and maintain distributed systems for training and inference of large machine learning models, ensuring optimal performance across all stages.
- Develop and implement end-to-end data processing pipelines capable of handling massive datasets, from ingestion and transformation to model training and deployment.
- Research and implement cutting-edge training methods, including parallelization strategies and precision trade-offs, to improve the performance and scalability of model training.
- Analyze and enhance low-level GPU operations to improve efficiency, reduce latency, and maximize hardware utilization in complex ML tasks.
- Stay updated on industry trends and advancements in ML research to incorporate new ideas and techniques into our systems.
What We're Looking For:
- A strong problem-solver who executes quickly and adapts creatively when tackling complex problems.
- Expertise in optimizing ML workloads with techniques such as mixed-precision training and hardware-level optimization.
- Experience with distributed training frameworks, such as FSDP or DeepSpeed, and cloud platforms, including GCP, AWS, or Azure.
- Hands-on experience with containerization and orchestration tools like Docker and Kubernetes.
- Expertise in distributed systems and scalable serving, including building task management systems and deploying ML models in production environments.
- Knowledge of monitoring and observability practices, including logging and tracking performance in ML systems.
Other Benefits:
- Fully onsite position in San Francisco (SF) with startup hours.