
Senior Site Reliability Engineer

San Francisco Compute Co.
San Francisco, CA Full Time
POSTED ON 2/21/2025
AVAILABLE BEFORE 5/18/2025

About

We're the San Francisco Compute Company. We're building the first real-time trading platform for compute. Everyone from startups to enterprises to research labs and individuals can buy and sell compute, from 1 to 1000 nodes, for an hour to over a year. With our liquid market to resell unused compute hours, buyers no longer need to worry about contract lock-in and providers have fewer idle nodes. Over the next decade, we anticipate thousands of enterprises, governments, startups, and labs will be training and serving large models, and we’re building a team to scale our market.

About the Role

ML training clusters are some of the highest-performance computers on the planet. Even relatively small clusters would have been in the TOP500 five years ago. Our supercomputing team is responsible for keeping our compute clusters running smoothly, monitoring hardware health, and fixing things when they go wrong. We believe strongly in automation: code is the only reliable way to manage hardware at scale. As we scale, this will become a more data-driven role, predicting failures before they happen. We’re a small team, so you’ll be spending time talking to customers as well.

About You

- You’ve managed at least one GPU training cluster in the past (ideally a cluster with more than 1k GPUs, but not required)
- You appreciate and value good documentation
- You have experience provisioning and managing Kubernetes clusters
- You deeply understand Linux, networking fundamentals, CUDA, NCCL, and InfiniBand
- You enjoy creating large self-correcting systems that keep hardware humming
- You meet at least two of the nice-to-haves below

Some Nice to Haves

- Experience with Go or Rust (>2 years)
- Experience with distributed storage systems (Weka, VAST, Ceph, etc.)
- Experience with HPC network architectures (eBGP, fat-tree, VXLAN, MCLAG, etc.)
- Experience with Linux virtualization (KVM, QEMU, libvirt, etc.)
- Experience with performance optimization of machine learning kernels

Benefits

- Unlimited office book budget: You can buy as many books for the office as you want. You’re encouraged to spend time during the workday reading!
- Generous equity grant: Team members are offered a competitive salary along with equity in the company
- Retirement matching: We match 401(k) plans up to 4%
- Medical, dental & vision: We offer competitive medical, dental, and vision insurance for employees and dependents and cover 100% of premiums
- Time off: We offer unlimited paid time off as well as 10 observed holidays
- Parental leave: We offer biological, adoptive, and foster parents paid time off to spend quality time with family
- Daily lunch: We cover lunch daily for employees
- Visa sponsorships: Yes, we sponsor visas and work permits

The San Francisco Compute Company is committed to maintaining a workplace free from discrimination and harassment.

We make employment decisions based on business needs, job requirements, and individual qualifications, without regard to race, color, religion, belief, national origin, social or ethical origin, age, physical, mental, or sensory disability, sexual orientation, gender identity or expression, marital status, civil union or domestic partnership status, past or present military service, HIV status, family medical history or genetic information, family or parental status including pregnancy, or any other status protected by law.

We welcome the opportunity to consider qualified applicants with prior arrest or conviction records. Our commitment to diversity includes hiring talented individuals regardless of their criminal history, in accordance with local, state, and federal laws, including San Francisco’s Fair Chance Ordinance and California’s ban-the-box laws.

If you require reasonable accommodation for any reason, please reach out to us at team@sfcompute.com.




