What are the responsibilities and job description for the Platform System Engineer position at Krutrim?
Job Title: AI Cloud Platform System Engineer
Location: US-San Francisco Bay Area
Position Type: Full-Time
Job Summary
We seek an AI Cloud Platform System Engineer to build, scale, and optimize our LLM training, inference, and data platforms. The role spans distributed training systems, GPU/CPU compute optimization, inference framework optimization, and the data platform backing training and inference. You will ensure a resilient, cost-efficient platform for both training and production inference workloads, leveraging Kubernetes-native solutions.
Key Responsibilities
Distributed Training/Inference Platform Development
- Design and maintain scalable platforms for distributed AI/ML training and serverless inference.
- Optimize workload distribution across GPU clusters for performance and cost (e.g., model parallelism, mixed-precision training; a minimal sketch follows this list).
- Integrate frameworks like PyTorch, DeepSpeed, Triton, vLLM, and NVIDIA NeMo.
- Collaborate with AI researchers to optimize model architectures for training/inference latency and throughput.
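As a concrete illustration of the mixed-precision work referenced above, here is a minimal PyTorch sketch using the torch.cuda.amp API; the model, tensor shapes, and learning rate are placeholders for illustration, not part of the role description.

```python
# Minimal sketch: one mixed-precision training step with torch.cuda.amp.
# The model and hyperparameters below are illustrative placeholders.
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for a real LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = GradScaler()                        # loss scaling avoids fp16 underflow

def train_step(inputs, targets):
    optimizer.zero_grad(set_to_none=True)
    with autocast():                         # forward pass runs in mixed precision
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()            # backward on the scaled loss
    scaler.step(optimizer)                   # unscale gradients, then step
    scaler.update()                          # adapt the scale factor for next step
    return loss.item()
```

The same pattern composes with DistributedDataParallel or DeepSpeed for multi-GPU runs; loss scaling is what keeps fp16 gradients from underflowing at scale.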
Platform & System Optimization
- Compute: Profile and debug bottlenecks using tools like PyTorch Profiler and NVIDIA Nsight (a profiling sketch follows this list).
- Storage/Caching: Build high-throughput data pipelines using S3, PVC, or distributed streaming (e.g., Kafka).
- Networking: Reduce bottlenecks via RDMA/InfiniBand, NCCL, and TCP/IP tuning.
- GPU Utilization: Implement kernel fusion, memory optimization, and auto-scaling.
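To make the profiling bullet concrete, here is a minimal sketch using torch.profiler; the model and batch shape are illustrative assumptions.

```python
# Illustrative sketch: profiling a forward pass with torch.profiler to
# surface CPU/GPU bottlenecks. The model and input below are placeholders.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(64, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    for _ in range(10):
        y = model(x)
torch.cuda.synchronize()

# Rank ops by GPU time to decide where kernel fusion or memory
# optimization effort pays off.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```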
Kubernetes-Centric Development
- Develop Kubernetes Custom Resource Definitions (CRDs) to automate deployment, scaling, fault recovery, and monitoring of AI workloads (a minimal operator sketch follows this list).
- Build operators for intelligent resource scheduling, Auto-Scaling (HPA/VPA), and fault tolerance for distributed training/inference jobs.
- Build observability tools for GPU utilization, model latency, and system health.
- Leverage tools like Kubeflow, KServe, KubeRay, or Skylab for workflow orchestration.
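As a sketch of the CRD/operator work above, here is a minimal Kubernetes operator built with the kopf framework; the ai.example.com API group and the TrainingJob spec fields (gpus, image) are hypothetical, not an existing CRD.

```python
# Hypothetical sketch: an operator reacting to a custom "TrainingJob"
# resource via kopf. Group/version/plural and spec fields are assumptions.
import kopf

@kopf.on.create("ai.example.com", "v1", "trainingjobs")
def on_trainingjob_create(spec, name, namespace, logger, **kwargs):
    gpus = spec.get("gpus", 1)
    image = spec.get("image", "pytorch/pytorch:latest")
    logger.info(f"Scheduling training job {namespace}/{name}: "
                f"{gpus} GPU(s), image {image}")
    # A production operator would create Pods/Jobs through the Kubernetes
    # API here and write status (phase, retries) back onto the resource.
    return {"phase": "Scheduled"}

# Run with: kopf run operator.py --namespace=ml-jobs
```

In a real operator, fault tolerance and auto-scaling come from watching the same resource for updates and deletions and reconciling the cluster toward the declared spec.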
Preferred Qualifications
Technical Skills
- 2 years of experience in ML infrastructure (LLM training/inference platforms preferred).
- Proficiency in Kubernetes (CRDs, Operators, Helm, Knative, KServe), PyTorch, and cloud-native systems (AWS/GCP/Azure).
- Expertise in distributed training optimizations (e.g., NeMo, PyTorch, DeepSpeed) and inference frameworks (e.g., Triton, vLLM, SGLang).
- Familiarity with LLM-specific optimizations (e.g., MoE architectures, speculative decoding).
- Networking (InfiniBand, NCCL) and storage solutions (S3, Ceph/MinIO, PVC).
Education & Soft Skills
- MS/PhD in Computer Science or AI/ML, or equivalent hands-on experience.
- Strong collaboration skills to interface with research and engineering teams.
- Problem-solving agility to balance performance, cost, and scalability.