
Senior Machine Learning Engineer - Hardware Abstractions & Performance Optimization

Luma AI
Palo Alto, CA Full Time
POSTED ON 4/17/2025
AVAILABLE BEFORE 5/15/2025
Luma’s mission is to build multimodal AI to expand human imagination and capabilities. We believe that multimodality is critical for intelligence. To go beyond language models and build more aware, capable and useful systems, the next step function change will come from vision. So, we are working on training and scaling up multimodal foundation models for systems that can see and understand, show and explain, and eventually interact with our world to effect change.

We are looking for engineers with significant experience designing and maintaining highly efficient systems and code that can be optimized to run on multiple hardware platforms, bringing our state-of-the-art models to as many people as possible at the best performance per dollar.

Responsibilities

  • Ensure efficient implementation of models & systems with a focus on designing, maintaining, and writing abstractions that scale beyond NVIDIA/CUDA hardware.
  • Identify and remedy efficiency bottlenecks (memory, speed, utilization, communication) by profiling and implementing high-performance PyTorch code, deferring to Triton or similar kernel-level languages as necessary.
  • Benchmark our products across a variety of hardware & software to help the product team understand the optimal tradeoffs between latency, throughput and cost at various degrees of parallelism.
  • Work together with our partners to help them identify bottlenecks and push forward new iterations of hardware and software.
  • Work closely together with the rest of the research team to ensure systems are planned to be as efficient as possible from start to finish and raise potential issues for hardware integration.

Must have experience

  • Experience optimizing for memory, latency and throughput in PyTorch.
    • Bonus: experience with non-NVIDIA systems.
  • Experience using torch.compile / PyTorch/XLA (torch_xla).
  • Experience benchmarking and profiling GPU & CPU code in PyTorch for optimal device utilization (examples: torch profiler, memory profilers, trace viewers, custom tooling).
  • Experience building tools & abstractions to ensure models run optimally on different hardware and software stacks.
  • Experience working with transformer models and attention implementations.
  • Experience with parallel inference, particularly tensor parallelism and pipeline parallelism.

Good to have experience

  • Experience with high-performance Triton/CUDA and writing custom PyTorch kernels and ops. Top candidates will be able to write fused kernels for common hot paths, understand when to make use of lower level features like tensor cores or warp intrinsics, and will understand where these tools can be most impactful.
  • Experience writing high-performance parallel C++. Bonus if done within an ML context with PyTorch, such as data loading, data processing, or inference code.
  • Experience building inference / demo prototype code (incl. Gradio, Docker, etc.).

Compensation Range: $220K - $300K

