
Platform System Engineer

Krutrim
Palo Alto, CA Full Time
POSTED ON 2/28/2025
AVAILABLE BEFORE 3/26/2025

Job Title: AI Cloud Platform System Engineer

Location: US-San Francisco Bay Area

Position Type: Full-Time


Job Summary

We seek an AI Cloud Platform System Engineer to build, scale, and optimize our LLM training, inference, and data platforms. This role spans distributed training systems, GPU/CPU compute optimization, inference framework optimization, and the data platform that supports training and inference. You will ensure a resilient, cost-efficient platform for both training and production inference workloads, leveraging Kubernetes-native solutions.


Key Responsibilities

Distributed Training/Inference Platform Development

  • Design and maintain scalable platforms for distributed AI/ML training and serverless inference.
  • Optimize workload distribution across GPU clusters (e.g., model parallelism, mixed-precision training) for performance and cost.
  • Integrate frameworks like PyTorch, DeepSpeed, Triton, vLLM, and NVIDIA NeMo.
  • Collaborate with AI researchers to optimize model architectures for training/inference latency and throughput.


Platform & System Optimization

  • Compute: Profile and debug bottlenecks using tools like PyTorch Profiler and NVIDIA Nsight.
  • Storage/Caching: Build high-throughput data pipelines using S3, PVC, or distributed streaming (e.g., Kafka).
  • Networking: Reduce bottlenecks via RDMA/InfiniBand, NCCL, and TCP/IP tuning.
  • GPU Utilization: Implement kernel fusion, memory optimization, and auto-scaling.


Kubernetes-Centric Development

  • Develop Kubernetes Custom Resource Definitions (CRDs) to automate deployment, scaling, fault recovery, and monitoring of AI workloads.
  • Build operators for intelligent resource scheduling, autoscaling (HPA/VPA), and fault tolerance for distributed training/inference jobs.
  • Build observability tools for GPU utilization, model latency, and system health.
  • Leverage tools like Kubeflow, KServe, KubeRay, or Skylab for workflow orchestration.


Preferred Qualifications

Technical Skills

  • 2 years of experience in ML infrastructure (LLM training/inference platforms preferred).
  • Proficiency in Kubernetes (CRDs, Operators, Helm, Knative, KServe), PyTorch, and cloud-native systems (AWS/GCP/Azure).
  • Expertise in distributed training optimizations (e.g., NeMo, PyTorch, DeepSpeed) and inference frameworks (e.g., Triton, vLLM, SGLang).
  • Experience with LLM-specific optimizations (e.g., MoE architectures, speculative decoding).
  • Knowledge of networking (InfiniBand, NCCL) and storage solutions (S3, Ceph/MinIO, PVC).

Education & Soft Skills

  • MS/PhD in Computer Science, AI/ML, or equivalent hands-on experience.
  • Strong collaboration skills to interface with research and engineering teams.
  • Problem-solving agility to balance performance, cost, and scalability.
