Staff Machine Learning Infrastructure Engineer

Dyna Robotics
Redwood, CA Full Time
POSTED ON 3/3/2025
AVAILABLE BEFORE 5/28/2025

Company Overview:

Dyna Robotics is at the forefront of revolutionizing robotic manipulation with cutting-edge foundation models. Our mission is to empower businesses by automating repetitive, stationary tasks with affordable, intelligent robotic arms. Leveraging the latest advancements in foundation models, we're driving the future of general-purpose robotics, one manipulation skill at a time.

Dyna Robotics was founded by industry leaders who previously achieved a $350 million exit in grocery deep tech, together with top robotics researchers from DeepMind and Nvidia. Our team blends world-class research, engineering, and product innovation to drive the future of robotic manipulation. With $20 million in funding, we're positioned to redefine the landscape of robotic automation. Join us to shape the next frontier of AI-driven robotics.

Position Overview:

We are seeking an experienced Machine Learning Infrastructure Engineer to join our team and help scale our ML training platform. In this role, you will be responsible for designing, implementing, and maintaining large-scale ML infrastructure to accelerate model iteration and improve training performance across an expanding GPU ecosystem. You will work on cutting-edge high-performance computing systems, optimizing distributed training environments, and ensuring system reliability as we scale.

Key Responsibilities:

Infrastructure Design & Scalability:

  • Architect and implement large-scale ML training pipelines that leverage parallel GPU processing on platforms like GCP or AWS.
  • Enhance our existing infrastructure to fully exploit parallelism and design for future expansion, ensuring that our system is ready to support growth.

High-Performance ML Computing & Distributed Systems:

  • Manage and optimize high-performance computing resources.
  • Develop robust distributed computing solutions, addressing challenges like race conditions, memory optimization, and resource allocation.
  • Optimize model training with techniques like mixed precision, ZeRO, LoRA, etc.

Job Scheduling & Reliability:

  • Design systems for job rescheduling, automated retries, and failure recovery to maximize uptime and training efficiency.
  • Implement intelligent job queuing mechanisms to optimize training workloads and resource utilization.

Storage & Data Handling:

  • Evaluate the tradeoffs between local and networked storage solutions, and implement the options that improve data throughput and access.
  • Develop strategies for caching training data to optimize performance.

Collaboration & Continuous Improvement:

  • Work closely with ML researchers and data scientists to understand training requirements and bottlenecks.
  • Continuously monitor system performance, identify areas for improvement, and implement best practices to enhance scalability and reliability.

Required Qualifications:

  • Bachelor's degree or higher in Computer Science or a related field.
  • At least 7 years of professional experience in the software industry, with a minimum of 2 years in a tech lead role.
  • Proven experience with high-performance computing environments and distributed systems.
  • Demonstrated ability to scale ML training systems and optimize resource utilization.
  • Hands-on experience with job scheduling systems and managing cloud GPU environments (GCP, AWS, etc.).
  • Deep understanding of distributed computing concepts, including race conditions, memory optimization, and parallel processing.
  • Hands-on experience in ML model tuning for performance.
  • Experience with common ML training and inference tools including PyTorch, TensorRT, Triton, Accelerate, etc.
  • Strong analytical and problem-solving skills with the ability to troubleshoot complex system issues.
  • Excellent communication skills to collaborate effectively with cross-functional teams.

Preferred Qualifications:

  • Experience with container orchestration tools (e.g., Kubernetes) and infrastructure-as-code frameworks.

Benefits:

  • Competitive salary and equity in a seed-stage venture-backed startup
  • Comprehensive health, dental, and vision insurance
  • Flexible work arrangements
  • Professional growth and development through training, mentorship, and challenging projects
  • Daily catered lunches and dinners with a fully stocked kitchen

If you're passionate about building scalable ML systems and optimizing high-performance computing infrastructures, we'd love to hear from you.
