What are the responsibilities and job description for the Staff Software Engineer, ML Training position at Stack AV?

About the Role:

The ML Training Team’s core mandate is training models as fast as possible for the company. The team’s main focus is ensuring our models reach 100% GPU utilization and scale linearly from 8 GPUs to 256 GPUs. We also invest in tooling to empower our MLEs by building profiling/debugging tools, setting up efficiency monitoring, and integrating our trainer into our experiment management system.

Responsibilities:

Set up efficiency monitoring for all our training jobs to identify models that need improvement
Work with customer teams to benchmark/profile their jobs and make improvements
Create standardized APIs for stack-wide abstractions such as training datasets, bulk inference jobs, and evaluation metrics
Optimize dataloaders / training data formats to ensure high GPU utilization
Optimize distributed training configurations (network topologies, sharding strategies, pipelines, etc.)

Qualifications:

Experience: 5 years as a SWE, ideally building infrastructure or customer-facing products; experience in AV or robotics is also a plus.
Ideal Skills:
- Experience with both ML platforms and building ML-based applications (bonus points if you have modeling experience).
- Experience building scalable, reliable infra in a fast-paced environment.
- Experience building or using ML infra built for a large number of customer teams.
- A deep understanding of design tradeoffs, the ability to articulate them, and the ability to work with others to reach alignment.
- Experience with building ML models or ML infra in the domains of autonomous vehicles, perception, and decision making (desirable but not required).
- Experience with model training, model optimization, or large data processing pipelines.
- Machine learning expertise is preferred but not necessary.
- Knows how to push the GPU to its limit from Python to CUDA kernel level.
- Built the inference or training loop for a large model (ideally with LLM flavor).
- Shipped ML products (NLP, computer vision, recommender systems, etc.) at scale to make business impact.
- Knows how to build low-latency / high-throughput batch or stream processing pipelines.
- Knows how to write (readable) high-performance C++.
- Prior AV experience.
Desired Attributes:
- High customer empathy and the ability to communicate well with customers
- Comfortable reading papers and keeping up with SOTA ML literature

