
Site Reliability Engineer (SRE) / Production Engineer (PE) - Kubernetes & Cloud Infrastructure

Fireworks
Redwood City, CA Full Time
POSTED ON 2/21/2025
AVAILABLE BEFORE 5/19/2025

About Us :

Here at Fireworks, we’re building the future of generative AI infrastructure. Fireworks offers the generative AI platform with the highest-quality models and the fastest, most scalable inference. We’ve been independently benchmarked as having the fastest LLM inference and have gained strong traction with innovative research projects, such as our own function-calling and multi-modal models. Fireworks is funded by top investors, including Benchmark and Sequoia, and we’re an ambitious, fun team composed primarily of veterans from PyTorch and Google Vertex AI.

The Role :

We’re seeking a highly skilled SRE / PE with deep expertise in Kubernetes (k8s), cloud networking, and infrastructure automation. This role will focus on reducing incident response time, implementing auto-remediation, optimizing auto-scaling, and improving cluster efficiency and service health. You’ll design systems that balance performance, cost, and reliability while working onsite with our Redwood City team.

Key Responsibilities :

Incident Response & Reliability Engineering :

Drive initiatives to reduce incident response time through improved monitoring, alerting, and automated remediation.

Build self-healing systems and playbooks for common failure scenarios.

Lead blameless post-mortems and implement preventative measures.

Kubernetes & GPU Cluster Optimization :

Manage and optimize GPU-enabled Kubernetes clusters for AI / ML workloads, focusing on cost-performance efficiency, auto-scaling, and resource utilization.

Debug performance bottlenecks in distributed systems (e.g., network, storage, GPU scheduling).

Cloud Networking & Service Health :

Strengthen service health by refining cloud networking stacks (VPCs, load balancers, service meshes) and ensuring low-latency communication.

Design fault-tolerant architectures to minimize downtime.

Monitoring & Observability :

Enhance service monitoring with tools like Prometheus, Grafana, and custom metrics pipelines.

Implement predictive analytics to proactively address system health risks.

Automation & Infrastructure-as-Code (IaC) :

Build automation for cluster provisioning, scaling, and recovery using Terraform, Argo, and CI / CD pipelines.

Develop tools to streamline operational workflows (e.g., automated rollbacks, canary deployments).

Minimum Qualifications :

At least 3 years in SRE / PE / DevOps roles with production-grade Kubernetes experience.

Proficiency in cloud networking (AWS / GCP / Azure VPCs, firewalls, DNS) and service monitoring (Prometheus, Alertmanager, Grafana).

Hands-on experience with incident management and improving system reliability / SLOs.

Strong scripting / coding skills (Python / Go / Bash) for automation and tooling.

Familiarity with object storage (S3, GCS) and data pipeline integration.

Preferred Qualifications :

Experience with GPU clusters (NVIDIA GPUs, MIG, CUDA) and AI / ML workloads.

Knowledge of auto-scaling technologies (K8s HPA / VPA) and auto-remediation frameworks.

Expertise in service meshes (Istio).

Why Fireworks AI?

Solve Hard Problems : Tackle challenges at the forefront of AI infrastructure, from low-latency inference to scalable model serving.

Build What’s Next : Work with bleeding-edge technology that impacts how businesses and developers harness AI globally.

Ownership & Impact : Join a fast-growing, passionate team where your work directly shapes the future of AI—no bureaucracy, just results.

Learn from the Best : Collaborate with world-class engineers and AI researchers who thrive on curiosity and innovation.



