Senior Distributed Training Research Engineer (AI Labs)

Krutrim
Palo Alto, CA · Full Time
Posted on 2/21/2025
Available before 5/19/2025

Senior Distributed Training Research Engineer (Frontier LLMs)

Location: Palo Alto (CA, US)

Type of Job: Full-time

About Krutrim:

Krutrim is building AI computing for the future. Our envisioned AI computing stack spans AI computing infrastructure, the AI Cloud, multilingual and multimodal foundation models, and AI-powered end applications. We are India's first AI unicorn and built the country's first foundation model.

Our AI stack empowers consumers, startups, enterprises, and scientists across India and the world to build their own AI applications and models. While we build foundation models across text, voice, and vision for our focus markets, we are also developing AI training and inference platforms that enable AI research and development across industry domains. The platforms Krutrim is building have the potential to impact millions of lives in India, across income and education strata and across languages.

The team at Krutrim represents a convergence of talent across AI research, Applied AI, Cloud Engineering, and semiconductor design. Our teams operate from three locations: Bangalore, Singapore, and San Francisco.

Job Description:

We are seeking an experienced Senior Generative AI Model Research Engineer to efficiently train frontier and foundation multimodal large language models. In this critical, hands-on role, you will be responsible for the scalable training methodologies used to develop a variety of generative AI models, such as large language models, voice/speech foundation models, and vision and multimodal foundation models, using cutting-edge techniques and frameworks. You will implement and optimize state-of-the-art neural architectures and robust training and inference infrastructure to efficiently take complex models with hundreds of billions to trillions of parameters to production, while optimizing for low latency, high throughput, and cost efficiency.

Key Responsibilities:

  • Architect Distributed Training Systems: Design and implement highly scalable distributed training pipelines for LLMs and frontier models, leveraging model parallelism (tensor, pipeline, expert) and data parallelism techniques (a brief illustrative sketch follows this list).
  • Optimize Performance: Use deep knowledge of CUDA, C++, and low-level optimizations to improve model training speed and efficiency across diverse hardware configurations.
  • Implement Novel Techniques: Research and apply cutting-edge efficiency techniques such as Flash Attention to accelerate model training and reduce computational cost.
  • Framework Expertise: Demonstrate proficiency in deep learning frameworks such as PyTorch, TensorFlow, and JAX, and tailor them for distributed training scenarios.
  • Scale to Hundreds of Billions of Parameters: Work with massive models, ensuring stable and efficient training across distributed resources.
  • Evaluate Scaling Laws: Design and conduct experiments to analyze the impact of model size, data, and computational resources on model performance.
  • Collaborate: Partner closely with research scientists and engineers to integrate research findings into production-ready training systems.
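
For concreteness, below is a minimal sketch of the kind of data-parallel training loop the first responsibility refers to, using PyTorch's DistributedDataParallel. It is purely illustrative, not Krutrim's actual training stack: the model, synthetic dataset, and hyperparameters are hypothetical placeholders, and a real frontier-scale pipeline would layer tensor, pipeline, and expert parallelism (for example via FSDP, Megatron-LM, or DeepSpeed) on top of this pattern rather than use plain DDP.

    # Minimal, illustrative data-parallel training loop (PyTorch DDP).
    # Launch with: torchrun --nproc_per_node=<num_gpus> train_ddp.py
    import os

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


    def main():
        # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Hypothetical placeholder model and synthetic data; a real frontier
        # model would be far larger and sharded with tensor/pipeline/expert
        # parallelism in addition to the data parallelism shown here.
        model = nn.Sequential(
            nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)
        ).cuda(local_rank)
        model = DDP(model, device_ids=[local_rank])

        dataset = TensorDataset(torch.randn(8192, 1024), torch.randn(8192, 1024))
        sampler = DistributedSampler(dataset)  # shards the data across ranks
        loader = DataLoader(dataset, batch_size=32, sampler=sampler)

        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
        loss_fn = nn.MSELoss()

        for epoch in range(2):
            sampler.set_epoch(epoch)  # reshuffle shards each epoch
            for x, y in loader:
                x, y = x.cuda(local_rank), y.cuda(local_rank)
                optimizer.zero_grad(set_to_none=True)
                loss = loss_fn(model(x), y)
                loss.backward()  # DDP all-reduces gradients during backward
                optimizer.step()

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()

Note that DDP overlaps the gradient all-reduce with the backward pass; that kind of communication/computation overlap is one of the efficiency concerns this role focuses on.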

Qualifications:

  • Advanced Degree: Ph.D. or Master's degree in Computer Science, Machine Learning, or a related field.
  • Proven Experience: 5+ years of experience in distributed training of large-scale deep learning models, preferably LLMs or similar models.
  • Deep Learning Expertise: Strong theoretical and practical understanding of deep learning algorithms, architectures, and optimization techniques.
  • Parallelism Mastery: Extensive experience with various model and data parallelism techniques, including tensor parallelism, pipeline parallelism, and expert parallelism.
  • Framework Proficiency: Expert-level knowledge of PyTorch, TensorFlow, or JAX, with a demonstrated ability to extend and customize these frameworks.
  • Performance Optimization: Proven track record of optimizing deep learning models for speed and efficiency using CUDA, C++, and other performance-enhancing tools.
  • Research Acumen: Familiarity with current research trends in large model training and the ability to apply new techniques to real-world problems.

Join Krutrim to shape the future of AI and make a significant impact on hundreds of millions of lives across India and the world. If you're passionate about pushing the boundaries of AI and want to work with a team at the forefront of innovation, we want to hear from you!
