What are the responsibilities and job description for the Staff Software Engineer, ML Training position at Stack AV?
About the Role:
The ML Training Team’s core mandate is to train the company’s models as fast as possible. The team’s main focus is ensuring that our models reach 100% GPU utilization and scale linearly from 8 to 256 GPUs. We also invest in tooling to empower our MLEs by building profiling and debugging tools, setting up efficiency monitoring, and integrating our trainer into our experiment management system.
Responsibilities:
- Set up efficiency monitoring for all our training jobs to identify models that need improvement
- Work with customer teams to benchmark/profile their jobs and make improvements
- Create standardized APIs for stack-wide abstractions such as training datasets, bulk inference jobs, and evaluation metrics
- Optimize dataloaders / training data formats to ensure high GPU utilization
- Optimize distributed training configurations (network topologies, sharding strategies, pipelines, etc.)
Qualifications:
- Experience: 5 years as a SWE, ideally building infrastructure or customer-facing products; experience in AV or robotics is also great.
- Ideal Skills:
- Experience with both ML platforms and building ML-based applications (bonus points if you have modeling experience).
- Experience building scalable, reliable infra in a fast-paced environment.
- Experience building or using ML infra built for a large number of customer teams.
- A deep understanding of design tradeoffs, plus the ability to articulate them and work with others to reach alignment.
- Experience with building ML models or ML infra in the domains of autonomous vehicles, perception, and decision making (desirable but not required).
- Experience with model training, model optimization, or large data processing pipelines.
- Machine learning expertise is preferred but not required.
- Knows how to push the GPU to its limit, from Python down to the CUDA kernel level.
- Built the inference or training loop for a large model (ideally with an LLM flavor).
- Shipped ML products (NLP, computer vision, recommender systems, etc.) at scale with business impact.
- Knows how to build low-latency / high-throughput batch or stream processing pipelines.
- Knows how to write (readable) high-performance C++.
- Prior AV experience.
- Desired Attributes:
- High customer empathy and the ability to communicate well with customers
- Comfortable reading papers / keeping up with SOTA ML literature