What are the responsibilities and job description for the Site Reliability Engineering (SRE) Specialist position at Alibaba Cloud?
Elastic Compute Service (ECS) is a core product of Alibaba Cloud. The Elastic Compute team is dedicated to building world-leading cloud computing infrastructure. As a key component of Alibaba Cloud's self-developed Apsara operating system, ECS provides full-stack computing resources covering virtual machine instances, container services, and heterogeneous computing clusters.
Through technological innovation and product optimization, the Alibaba Cloud Elastic Compute team continuously drives advancements in cloud computing technologies, delivering high-quality computing services to users worldwide.
Our goal is not only to support enterprises in achieving elastic scalability but also to deeply empower infrastructure innovation in the new era. Our mission is to build an intelligent foundation of "Computing as a Service," enabling developers and businesses to concentrate on breakthroughs without worrying about the complex engineering implementations from chips to clusters.
SRE Team:
The Alibaba Cloud Elastic Compute Service (ECS) SRE (Site Reliability Engineering) team is a critical force in ensuring system stability and reliability. The SRE team focuses on guaranteeing the high availability, high performance, and robust stability of ECS products through technical expertise and innovation.
The Alibaba Cloud ECS SRE team is not only a core technical safeguard but also a driver of technological innovation and continuous optimization. By leveraging technical capabilities and collaborative teamwork, we ensure the stability and reliability of ECS products, safeguarding global customers' businesses. Additionally, we are committed to advancing cloud computing technologies through knowledge sharing and industry collaboration.
Joining the Alibaba Cloud ECS SRE team offers the opportunity to engage in the development and optimization of world-leading cloud computing technologies, while growing alongside a passionate and creative team.
- Responsible for the delivery and operation/maintenance of various clusters, and participate in the architecture design and construction of the infrastructure operation platform.
- Establish and optimize operation/maintenance service systems to achieve product stability and SLA goals.
- Develop delivery standards, document maintenance specifications, and enhance daily work efficiency through tool platforms.
- This position involves on-call responsibilities, requiring timely customer response within Service Level Agreement (SLA) timeframes, driving issue resolution and improving customer experience.
Qualification:
- 5 years of operation and maintenance (O&M) experience in IT, internet, or cloud computing industries.
- Proficient in Linux operating systems and mainstream protocols (e.g., TCP/IP), with solid hands-on experience in troubleshooting OS and network issues.
- Familiar with containerization and orchestration technologies such as Kubernetes, Slurm, and LSF.
- Ability to analyze and document technical issues systematically, develop tools/systems to optimize workflows, and improve operational efficiency through automation and platform-based solutions.
- Strong self-driven learning capabilities, excellent communication skills, and experience leading cross-team projects. Results-driven and action-oriented, with a commitment to excellence.
The pay range for this position at commencement of employment is expected to be between $133,200/year and $219,600/year. However, base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience.
If hired, employee will be in an “at-will position” and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.
Alibaba U.S. based full time regular employees have access to medical, dental, and vision insurance, a 401(k) plan and basic life insurance, and wellbeing benefits like FSA, subject to the terms and conditions of the applicable plans then in effect. U.S. based employees are also eligible to receive up to 12 paid holidays, accrue up to 15 paid vacation days for this position, and receive up to 72 hours paid sick time (front-loaded) per ca
Salary: $133,200 - $219,600