What are the responsibilities and job description for the Senior Site Reliability Engineer position at Microsoft Power Platform Community?

Overview

Are you looking to be at the forefront of Microsoft’s cloud computing transformation? Are you looking to work in an agile environment that ships frequently while maintaining a focus on long-term bets? Do you want to work with state-of-the-art distributed systems that run near-real-time detections on petabyte-scale telemetry, using machine learning and traditional software to deliver on cloud availability and safety goals? Do you want to make an impact on a team of talented engineers delivering world-class software solutions?

Microsoft Cloud Operations & Innovation (CO+I) is the engine that powers Microsoft cloud services through the operation of our unified global datacenters, enabling ~30% of Microsoft revenue through Commercial Cloud ($38 billion in FY20 Q1). The Cloud Infrastructure Health (CIH) team in CO+IE is focused on improving customer availability, datacenter safety, and capacity, and on helping optimize the utilization of datacenter resources using telemetry and insights. Our systems analyze petabyte-scale telemetry data from datacenter critical environments and secondary signals, both in near real time and offline, to produce time-sensitive insights that directly impact cloud operations.

Our team is looking for an experienced, competent, and motivated Senior SRE. The Site Reliability Engineering (SRE) team provides leadership, direction, and accountability for application architecture, system design, and end-to-end implementation. As a Senior Site Reliability Engineer, you will identify and deliver software improvements using your expertise in software development, complexity analysis, and scalable system design. Collaboration skills will be required to work closely with other engineering teams to ensure services and systems are highly stable and performant, meeting the expectations of our customers and users.

As a Site Reliability Engineer, you will be primarily responsible for keeping our data services reliable and scalable and for participating in design reviews. You will also take responsibility for developing code, scripts, systems, and/or tools that reduce operational burden by automating complex and repetitive tasks, such as onboarding system capabilities to newer datacenters and maintaining system capabilities at existing sites. The SRE enables feature teams to increase the velocity at which they can safely deploy changes to production and monitor the effects of those changes across the footprint. The SRE analyzes telemetry data to develop capacity planning models, identify patterns and trends that drive continuous improvement, and highlight opportunities to deploy automation to monitor and manage CIH services across sites. The SRE also participates in on-call rotations to resolve live site incidents, minimize customer impact, and document solutions and insights that inform ongoing improvements to infrastructure, code, tools, and/or processes and prevent the recurrence of similar issues.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond. In alignment with our Microsoft values, we are committed to cultivating an inclusive work environment for all employees to positively impact our culture every day.

Responsibilities

Own deployment, availability, reliability, performance, and customer escalation targets for Critical Environment Telemetry solutions.
Design, develop, and maintain data pipelines and back-end services for real-time decisioning, reporting, optimization, data collection, and related functions.
Write high-quality, maintainable, and high-performance code following demonstrated development principles. Manage automated unit and integration test suites.
Work with Project Managers and business stakeholders to design and deliver new features, collaborating with partner teams across the org to ensure successful launches.
Identify opportunities and drive the implementation of monitoring, self-healing, and automation capabilities to improve service manageability and reliability.
Investigate and resolve Customer Reported Incidents, continually looking for ways to minimize or eliminate future incidents and improve customer experiences.

Qualifications

Required Qualifications:

6 years technical experience in software engineering, network engineering, or systems administration

OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 3 years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 2 years technical experience in software engineering, network engineering, or systems administration.

2 years of experience working on system uptime, performance, service monitoring, and capacity planning.

Other Requirements

Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:

Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Additional Qualifications

7 years technical experience in software engineering, network engineering, or systems administration

OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4 years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 3 years technical experience in software engineering, network engineering, or systems administration
OR Doctorate Degree in Computer Science, Information Technology, or related field.

Site Reliability Engineering IC4 - The typical base pay range for this role across the U.S. is USD $117,200 - $229,200 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $153,600 - $250,200 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

Microsoft will accept applications for the role until February 10, 2025.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.

Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.

#COICareers

#COIEngCareers

#COIE_DPXEcareers

Salary: $117,200 - $250,200
