What are the responsibilities and job description for the Sr Site Reliability Engineer position at PeopleConnect?

Job Description

Do you aspire to take on a strategic, leadership-oriented role where you design and guide infrastructure at an architectural level? Are you passionate about identifying and solving complex operational challenges, improving system reliability, and driving modernization? Do you thrive on designing scalable, fault-tolerant systems and implementing automation that transforms on-prem applications into cloud-native solutions? If so, this role is the perfect next step in your journey!

As a Senior Site Reliability Engineer at Classmates.com, you will be responsible for designing, implementing, and maintaining the infrastructure and systems that power our applications and services. Collaborating closely with cross-functional teams, you will drive operational excellence, automate processes, and continuously improve system reliability. You’ll be a trusted specialist on complex technical and business challenges, leveraging your expertise in cloud technologies, automation, and performance optimization to shape the future of our platform.

In this role, you will work collaboratively with the team, often multitasking, and consistently drive projects to completion. Success in this position requires steadfast persistence, innovative thinking, the ability to interpret performance data effectively, and strong interpersonal skills. And if you can achieve all this while having fun, even better!

Location and Logistics

This is a hybrid role requiring 2-3 days per week in our Bellevue, WA office.
Local candidates are encouraged to apply.
Please note, we are unable to offer visa sponsorship, visa transfer, or corp-to-corp arrangements for this position.

Key Responsibilities:

Cloud Strategy and Architecture

Provide strategic leadership, mentorship, and a technical vision to advance site reliability engineering, DevOps, and a ‘cloud-first’ culture across the organization.

Define and implement scalable, secure, and cost-optimized cloud strategies that align with business goals and future growth.

Lead architectural decisions, establishing and enforcing best practices for cloud infrastructure design and operational excellence.

Drive modernization initiatives, transitioning legacy on-premise applications to cloud-native architectures using containerization, microservices, and serverless technologies.

Stay ahead of emerging cloud technologies, evaluating new tools and services to enhance performance, reliability, and developer self-service capabilities.

Infrastructure Automation and Design

Collaborate on designing, building, and maintaining scalable infrastructure across cloud and on-prem environments.

Architect and implement automated solutions to provision, monitor, and scale complex infrastructures, leveraging IaC tools like Terraform and Puppet, with a focus on modular, reusable designs.

Develop automation scripts, maintain CI/CD pipelines, and plan for scalability and capacity, conducting load testing as needed.

Reliability and Performance Engineering

Ensure system reliability, availability, and performance through monitoring, alerting, and incident response.

Implement and manage SLOs/SLIs to meet and exceed reliability standards.

Identify and address performance bottlenecks across the infrastructure and application stack.

Build and maintain observability solutions (e.g., monitoring, logging, and tracing) and improve system health dashboards.

Define and enforce best practices for reliability engineering, including failure injection and chaos engineering.

Security and Compliance

Implement security measures for cloud-native applications and ensure compliance with industry standards (SOC2, PCI, etc.).

Collaborate with security teams to respond to active threats, audit systems, and continuously update configurations.

Monitor security configurations and dashboards, ensuring proactive responses to potential vulnerabilities.

Incident Management and Root Cause Analysis

Participate in on-call rotations to provide 24/7 support for production environments.

Lead post-incident reviews, collaborating with cross-functional teams to identify systemic improvements.

Establish metrics for tracking incident frequency, response times, and resolution effectiveness.

Proactively test system resilience through Chaos Engineering experiments and failure injection.

Create and maintain runbooks and operational documentation to drive continuous improvement.

Disaster Recovery and Business Continuity

Design and test disaster recovery (DR) and business continuity strategies, ensuring backup and failover mechanisms are effective.

Develop and implement testing schedules for DR strategies to validate readiness and compliance.

Cost Management and Financial Optimization

Monitor cloud usage and lead FinOps initiatives to control and optimize infrastructure costs.

Collaborate with stakeholders to drive financial accountability and efficiency across engineering teams.

Collaboration, Knowledge Sharing, and Communication

Collaborate across teams to ensure alignment and effective project implementation.

Communicate during incidents and changes, providing transparency to stakeholders.

Mentor and share knowledge with team members to foster a culture of continuous learning and innovation.

Facilitate the evaluation and adoption of tools and technologies that enhance team productivity.

Qualifications:

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent experience.

5 years of experience as a Site Reliability Engineer or in a similar role, working with highly available production environments.

Proficiency in AWS and containerization technologies like Kubernetes and Docker.

Strong experience with Infrastructure as Code (IaC) using Terraform, with automation scripting skills in Python, Bash/Shell, or Go.

Deep knowledge of Linux/Unix systems and networking fundamentals (e.g., TCP/IP, DNS, HTTP, VPN).

Experience with monitoring and observability tools (e.g., Datadog, Prometheus, Grafana) and incident management.

Familiarity with CI/CD pipelines, preferably using tools like GitLab, and strong knowledge of DevOps practices.

Excellent troubleshooting skills, with experience in performance optimization and root cause analysis.

Strong communication and collaboration skills.

Bonus skills: experience with Rundeck, Java, Spring Framework, Terragrunt, Puppet, Vector, Loki, VictoriaMetrics, and additional cloud platforms (e.g., GCP, Azure), as well as relevant certifications such as AWS Solutions Architect or Certified Kubernetes Administrator (CKA).

Classmates

Classmates is the premier online, social, and mobile destination for reconnecting with the people from your high school years. Classmates offers the largest digitized collection of high school yearbooks online, with over 450,000 available to view, tag, sign, and share, and has the most comprehensive directory of high schools and class lists from the 1940s to today.

Salary Range: Min: $152,700; Mid: $170,800; Max: $190,600

The pay range reflects the salary amount the Company reasonably expects to pay for the position. It is not a guarantee of actual compensation or a specific payment amount to any candidate. The actual compensation will depend on numerous factors including, without limitation, a particular candidate’s experience and qualifications.

The Company's Applicant and Worker Privacy Notice can be found here.

PeopleConnect is an equal opportunity employer.

Note for Principal Agencies - Principal agents should not forward resumes to PeopleConnect, as we will not be responsible for any fees arising from the use of resumes submitted from agencies without a prior written and signed agreement and authorized job order for this position in place.
