What are the responsibilities and job description for the Platform System Engineer position at Krutrim?
Job Title: AI Cloud Platform System Engineer
Location: US-San Francisco Bay Area
Position Type: Full-Time
Job Summary
We seek an AI Cloud Platform System Engineer to build, scale, and optimize our LLM training, inference, and data platforms. The role spans distributed training systems, GPU/CPU compute optimization, inference framework optimization, and the data platform backing training and inference. You will ensure a resilient, cost-efficient platform for both training and production inference workloads, leveraging Kubernetes-native solutions.
Key Responsibilities
Distributed Training/Inference Platform Development
- Design and maintain scalable platforms for distributed AI/ML training and serverless inference.
- Optimize workload distribution across GPU clusters for performance and cost (e.g., model parallelism, mixed-precision training; a minimal sketch follows this list).
- Integrate frameworks like PyTorch, DeepSpeed, Triton, vLLM, and NVIDIA NeMo.
- Collaborate with AI researchers to optimize model architectures for training/inference latency and throughput.
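As a concrete illustration of the mixed-precision work referenced above, here is a minimal PyTorch sketch using the torch.cuda.amp API; the model, tensor shapes, and learning rate are placeholders for illustration, not part of the role description.

```python
# Minimal sketch: one mixed-precision training step with torch.cuda.amp.
# The model and hyperparameters below are illustrative placeholders.
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for a real LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = GradScaler()                        # loss scaling avoids fp16 underflow

def train_step(inputs, targets):
    optimizer.zero_grad(set_to_none=True)
    with autocast():                         # forward pass runs in mixed precision
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()            # backward on the scaled loss
    scaler.step(optimizer)                   # unscale gradients, then step
    scaler.update()                          # adapt the scale factor for next step
    return loss.item()
```

The same pattern composes with DistributedDataParallel or DeepSpeed for multi-GPU runs; loss scaling is what keeps fp16 gradients from underflowing at scale.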
Platform & System Optimization
- Compute: Profile and debug bottlenecks using tools like PyTorch Profiler and NVIDIA Nsight (a profiling sketch follows this list).
- Storage/Caching: Build high-throughput data pipelines using S3, PVC, or distributed streaming (e.g., Kafka).
- Networking: Reduce bottlenecks via RDMA/InfiniBand, NCCL, and TCP/IP tuning.
- GPU Utilization: Implement kernel fusion, memory optimization, and auto-scaling.
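To make the profiling bullet concrete, here is a minimal sketch using torch.profiler; the model and batch shape are illustrative assumptions.

```python
# Illustrative sketch: profiling a forward pass with torch.profiler to
# surface CPU/GPU bottlenecks. The model and input below are placeholders.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(64, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    for _ in range(10):
        y = model(x)
torch.cuda.synchronize()

# Rank ops by GPU time to decide where kernel fusion or memory
# optimization effort pays off.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```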
Kubernetes-Centric Development
- Develop Kubernetes Custom Resource Definitions (CRDs) to automate deployment, scaling, fault recovery, and monitoring of AI workloads (a minimal operator sketch follows this list).
- Build operators for intelligent resource scheduling, Auto-Scaling (HPA/VPA), and fault tolerance for distributed training/inference jobs.
- Build observability tools for GPU utilization, model latency, and system health.
- Leverage tools like Kubeflow, KServe, KubeRay, or Skylab for workflow orchestration.
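As a sketch of the CRD/operator work above, here is a minimal Kubernetes operator built with the kopf framework; the ai.example.com API group and the TrainingJob spec fields (gpus, image) are hypothetical, not an existing CRD.

```python
# Hypothetical sketch: an operator reacting to a custom "TrainingJob"
# resource via kopf. Group/version/plural and spec fields are assumptions.
import kopf

@kopf.on.create("ai.example.com", "v1", "trainingjobs")
def on_trainingjob_create(spec, name, namespace, logger, **kwargs):
    gpus = spec.get("gpus", 1)
    image = spec.get("image", "pytorch/pytorch:latest")
    logger.info(f"Scheduling training job {namespace}/{name}: "
                f"{gpus} GPU(s), image {image}")
    # A production operator would create Pods/Jobs through the Kubernetes
    # API here and write status (phase, retries) back onto the resource.
    return {"phase": "Scheduled"}

# Run with: kopf run operator.py --namespace=ml-jobs
```

In a real operator, fault tolerance and auto-scaling come from watching the same resource for updates and deletions and reconciling the cluster toward the declared spec.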
Preferred Qualifications
Technical Skills
- 2 years of experience in ML infrastructure (LLM training/inference platforms preferred).
- Proficiency in Kubernetes (CRDs, Operators, Helm, Knative, KServe), PyTorch, and cloud-native systems (AWS/GCP/Azure).
- Expertise in distributed training optimizations (e.g., NeMo, PyTorch, DeepSpeed) and inference frameworks (e.g., Triton, vLLM, SGLang).
- Familiarity with LLM-specific optimizations (e.g., MoE architectures, speculative decoding).
- Networking (InfiniBand, NCCL) and storage solutions (S3, Ceph/MinIO, PVC).
Education & Soft Skills
- MS/PhD in Computer Science or AI/ML, or equivalent hands-on experience.
- Strong collaboration skills to interface with research and engineering teams.
- Problem-solving agility to balance performance, cost, and scalability.