What are the responsibilities and job description for the AI Engineer - Distributed Systems position at Luma AI?
Responsibilities
- Work closely with the rest of the research team on experiment tracking and tooling to ensure large-scale training runs can be logged and analyzed with low overhead
- Automate & evolve the handling of Python environments using tools such as Docker and uv, and handle the compilation of custom packages, so that experiments and training runs can be reproduced
- Set up & maintain CI/CD pipelines to automatically test large codebases, optimizing for test coverage
- Work on documentation & type correctness
- Define test suites to automatically test cluster stability & performance for distributed ML workloads
- Debug and resolve systems issues, ensuring that they are triaged & handled in a timely manner
Qualifications
- Excellent software engineering skills, particularly experience maintaining & working on typed & tested PyTorch codebases
- Experience with PyTorch
- Experience with Slurm
- Experience with GitHub CI/CD
- Experience with Docker
Salary: $220,000 - $300,000