
Multimodal Research Engineer (AI Labs)

Krutrim
Palo Alto, CA Full Time
POSTED ON 2/21/2025
AVAILABLE BEFORE 5/19/2025

Multimodal and Vision AI Research Engineer / Scientist

Location: Palo Alto (US)

Type of Job: Full-time

About Krutrim:

Krutrim is building AI computing for the future. Our envisioned AI computing stack encompasses AI infrastructure, AI Cloud, multilingual and multimodal foundational models, and AI-powered applications. As India’s first AI unicorn, we built the country’s first foundation models in LLM and VLM domains, empowering consumers, startups, enterprises, and researchers to develop AI applications. We focus on foundational models across text, voice, and vision while developing AI training and inference platforms to drive innovation. Our teams, spanning Bangalore, Singapore, and San Francisco, bring expertise across AI research, applied AI, cloud engineering, and semiconductor design.

Job Description: We are seeking experienced Multimodal and Vision AI Engineers/Scientists to research, develop, optimize, and deploy Vision-Language Models (VLMs), multimodal generative models, diffusion models, and traditional computer vision techniques. You will work on foundational models integrating vision, language, and audio, optimize AI architectures, and push the boundaries of multimodal AI research.

Responsibilities:

  • Research, design, and train multimodal vision-language models (VLMs), integrating deep learning, transformers, and attention mechanisms.
  • Develop and optimize distillation of VLMs into small-scale models for efficient deployment on resource-constrained devices.
  • Implement state-of-the-art object detection (YOLO, Faster R-CNN), segmentation (Panoptic Segmentation), classification (ResNets, Vision Transformers), and image generation (Stable Diffusion, Stable Cascade).
  • Train or fine-tune vision models for representation (e.g., Vision Transformers, Q-Former, CLIP, SigLIP), generation, and video representation (e.g., Video Swin Transformer).
  • Work with diffusion models and other generative models for conditional image generation and multimodal applications.
  • Optimize CNN-based architectures for computer vision tasks such as recognition, tracking, and feature extraction.
  • Implement and optimize audio models for representation (e.g., W2V-BERT) and generation (e.g., HiFi-GAN, SeamlessM4T).
  • Innovate with multimodal fusion techniques such as early fusion, deep fusion, and Mixture-of-Experts (MoE), along with efficient attention and transformer architectures such as FlashAttention, MQA, GQA, and MLA.
  • Advance video analysis, video summarization, and video question-answering models to enhance multimedia understanding.
  • Implement optimization techniques such as quantization, distillation, sparsity, streaming, and caching for scalable model deployment.
  • Integrate and tailor deep learning frameworks and tools such as PyTorch, TensorFlow, DeepSpeed, Lightning, Habana, and FSDP.
  • Deploy large-scale distributed AI models using MLOps frameworks such as Airflow, MosaicML, Anyscale, Kubeflow, and Terraform.
  • Publish research at top-tier conferences (NeurIPS, CVPR, ICCV, ICLR, ICML) and contribute to open-source AI projects.
  • Collaborate with engineering teams to turn research advancements into scalable services and products.

Qualifications:

  • Ph.D. or Master's degree with 2 years of experience in Vision-Language Models (VLMs), multimodal AI, diffusion models, CNNs, ResNets, computer vision, and generative models.
  • Demonstrated expertise in high-performance computing, and proficiency in Python, C/C++, CUDA, and kernel-level programming for AI applications.
  • Experience optimizing training and inference of large-scale AI models, with knowledge of quantization, distillation, and LLMOps.
  • Hands-on experience with object detection (YOLO, Faster R-CNN), image segmentation (Panoptic Segmentation), and video understanding (Swin Transformer, TimeSformer).
  • Experience with generative models, including diffusion models (Stable Diffusion, Stable Cascade) and conditional image generation.
  • Familiarity with audio models for representation and generation is a plus.
  • Research contributions in multimodal AI, vision-language integration, NLP, or generative modeling, demonstrated through publications and products.
  • Proficiency in AI toolkits such as PyTorch, TensorFlow, and OpenCV, and familiarity with MLOps frameworks.
  • Strong programming skills and practical experience with distributed AI model deployment.
  • Excellent communication and collaboration skills for working across interdisciplinary teams.