Senior Cloud Operations Engineer - PyTorch

Linux Foundation
San Francisco, CA Full Time
POSTED ON 2/16/2025
AVAILABLE BEFORE 4/15/2025

Company Description

The Linux Foundation is a driving force in fostering open-source collaboration and supporting communities across a range of projects, including PyTorch. We're dedicated to enhancing and expanding our infrastructure to meet the growing demands of PyTorch and related AI projects. We are seeking a Senior Cloud Operations Engineer who will focus on the infrastructure operations of the PyTorch project, automating processes, optimizing cloud-native tools, and ensuring a robust and scalable cloud environment.

Job Description

The Senior Cloud Operations Engineer will play a crucial role in managing and optimizing our multi-cloud infrastructure and DevOps practices. This position is essential for maintaining and scaling our cloud operations across multiple cloud provider platforms and accelerator technologies. The ideal candidate will combine deep expertise in cloud technologies, hardware accelerators, and DevOps methodologies to ensure our infrastructure remains robust, efficient, and future-proof.

Responsibilities:

Cloud Infrastructure Management

  • Design and manage multi-cloud environments across AWS, GCP, and Azure
  • Optimize instance selection and utilization across various compute types including AMD and Intel CPU-based instances
  • Configure and manage GPU-accelerated instances (AMD and NVIDIA) and specialized accelerators (TPUs, NPUs)
  • Implement and maintain infrastructure-as-code using Terraform and other IaC tools
  • Optimize cloud resource utilization and implement FinOps practices for cost management
  • Design and implement high-availability solutions across multiple cloud providers

CI/CD and DevOps

  • Design, implement, and maintain CI/CD pipelines using GitHub Actions
  • Configure and manage both GitHub-hosted and self-hosted runners
  • Implement and maintain non-blocking and out-of-tree CI jobs
  • Design and implement matrix testing strategies across different hardware configurations
  • Develop and maintain automated testing frameworks for various testing types (unit, integration, performance)
  • Implement best practices for version control management and branching strategies
  • Apply agile methodologies and scrum practices

Performance Optimization and Testing

  • Develop and implement performance testing frameworks for various hardware accelerators
  • Optimize workload distribution across different types of compute instances
  • Implement automated performance regression testing
  • Design and maintain benchmarking systems for various hardware configurations

Infrastructure Security and Monitoring

  • Implement security best practices across multi-cloud environments
  • Develop comprehensive monitoring solutions using cloud-native tools
  • Participate in on-call rotations supporting operations and incident response
  • Establish and maintain escalation procedures and resolution processes
  • Manage access control and security policies across cloud platforms

Qualifications

Required:

  • Bachelor's degree in Computer Science, Engineering, or related field
  • 7 years of experience in cloud operations with extensive multi-cloud expertise (AWS, GCP, Azure)
  • Demonstrated experience with GPU computing (AMD and NVIDIA) and specialized accelerators (TPUs, NPUs)
  • Strong knowledge of CPU architectures and instance type optimization (AMD, Intel)
  • Advanced experience with GitHub Actions, including custom runner configuration and management
  • Expertise in implementing non-blocking and out-of-tree CI jobs
  • Strong background in version control systems and branching strategies
  • Experience with agile methodologies and scrum practices
  • Proficiency in infrastructure-as-code tools, particularly Terraform
  • Strong scripting abilities (Python, Bash, PowerShell, TypeScript)
  • Experience with containerization and orchestration (Docker, Kubernetes)
  • Demonstrated experience in implementing automated testing frameworks

Preferred:

  • Experience optimizing workloads across different hardware accelerators
  • Background in performance testing and optimization
  • Contributions to open-source projects
  • Experience mentoring other engineers
  • Background in machine learning infrastructure
  • Experience with Datadog

Benefits:

  • Competitive salary
  • Comprehensive health, dental, and vision insurance
  • Flexible PTO policy
  • Remote work environment
  • Professional development opportunities
  • 401(k) matching
  • Home office stipend

Additional Information

Open to US-based employees only. Preference for West Coast candidates.

Salary $125,000 - $165,000 USD

About Us:

We maintain a predominantly remote workforce and are committed to hiring top-notch talent. We are passionate about providing a flexible and supportive work culture. Our team values collaboration, innovation, and continuous learning. We embrace diversity and believe in creating an inclusive environment where all team members can thrive.

The Linux Foundation is an Equal Opportunity Employer.


