What are the responsibilities and job description for the Principal, Site Reliability Engineer position at Sysco LABS?
Job Summary
Drives impactful changes across the platform and holds a sustained leadership role. Responsible for the design and future direction of high-availability, performant web and mobile applications, resilient and scalable systems, and metrics and monitoring. Defines best practices and collaborates with development, product, architecture, and leadership teams, mentoring them on reliability across the platform. Stays ahead of issues before they occur through forward thinking, automation, and careful analysis. Applies critical thinking and debugging skills to highly complex environments, including network packet analysis, Kubernetes, NGINX, streaming (Kafka), edge networks, and caching, as an application-layer generalist. Fully accountable for overall system reliability and performance.
Duties And Responsibilities
- Develop and refine strategy and process for all reliability tracking across the platform in conjunction with senior members of the team.
- Lead strategic discussions to continue the evolution of flexibility and sustainability of the entire product suite.
- Partner with support teams, DevOps, Engineering, and customers to inform decisions and implement improvements.
- Ensure that RCA findings related to reliability are addressed at the point of initial injection to prevent regression.
- Look broadly across the platform for latent reliability issues and address them before they surface.
- Provide orchestration for the production environment by monitoring availability and taking a holistic view of system health.
- Architect the software and systems to manage platform infrastructure and applications.
- Document tribal knowledge and best practices, and review them annually.
- Define the objectives for system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve.
- Gather and analyze metrics for trending, performance tuning, and fault finding.
- Partner with development teams to improve services through rigorous testing and release procedures.
- Provide leadership for system design, platform management, and capacity planning.
- Balance feature development speed and reliability with well-defined service level objectives.
- Actively maintain a thorough understanding of system architecture, applications, and related integrations. Partner with the Platform team to understand and improve system monitoring and alerting.
- Drive active-active multisite reliability targets.
- Drive performance and reliability in a multi-cloud environment.
- Implement enterprise-level procedures and processes.
- Apply hands-on experience with the top cloud providers.
Education Required
Bachelor’s degree in computer science, computer engineering, or a related field, or relevant training.
Education Preferred
Or equivalent combination of experience and education.
Experience Required
8 years’ experience in a Site Reliability role.
8 years’ experience with enterprise cloud platforms.
Availability to work extended or off-cycle hours and participate in a 24/7 Site Reliability on-call rotation.
Experience Preferred
8 years’ experience in a cloud operations / DevOps role.
Experience with AWS.
Experience with APM and monitoring tools such as Datadog, New Relic, Nagios, or Splunk.
Experience in an agile development environment.
Physical Demands
Reasonable accommodations will be made to enable individuals with disabilities to perform the essential functions of this job.