What are the responsibilities and job description for the Site Reliability Engineer - Assistant Vice President - Hybrid - 5439 position at Benchmark IT - Technology Talent?
About the Role
The Site Reliability Engineering (SRE) team plays a critical role in ensuring the firm's platform delivers consistent and reliable service to its clients. This position sits at the intersection of software engineering and operations, applying engineering principles to infrastructure challenges. The ideal candidate will design and implement scalable systems, develop observability solutions that provide actionable insights, and automate processes to enhance platform reliability. The company is seeking a Site Reliability Engineer who takes a systematic approach to reliability, can translate business requirements into technical solutions, and excels at strengthening complex systems.
This position is based in either the Greenwich, CT or NYC (Midtown) office, with an expectation of being on-site 4 days per week. Employees in this role will work in the office Monday-Thursday, with the flexibility to work remotely on Friday.
Responsibilities
- Design, implement, and maintain service level objectives (SLOs) that align with business goals and customer expectations.
- Develop observability strategies, focusing on meaningful metrics that drive actionable insights.
- Architect and implement scalable infrastructure solutions using cloud-native technologies and infrastructure as code.
- Drive automation initiatives to eliminate toil and improve system reliability.
- Champion reliability best practices across development teams through consultation and tooling.
- Design and operate a Kubernetes environment for container management and orchestration.
- Lead incident response, conduct thorough postmortems, and drive systematic improvements.
- Participate in on-call rotations with a focus on continuous service improvement.
Qualifications
- 5 years of SRE or related experience, including 3 years with AWS
- Strong experience with container orchestration platforms like Kubernetes and related ecosystem tools
- Working knowledge of databases such as MongoDB, Postgres, and DynamoDB
- Strong foundation in reliability engineering principles and distributed systems behavior
- Experience defining and implementing SLOs/SLIs and using them to drive system improvements
- Demonstrated ability to design and implement observability solutions that provide actionable insights while minimizing alert fatigue
- Coding ability in at least one IaC language (Terraform strongly preferred) and one programming language such as Python, Ruby, or Java, with a focus on maintainable, tested code
- Understanding of modern observability practices and experience implementing and maintaining cloud monitoring solutions such as Prometheus/Grafana, Splunk, New Relic, CloudWatch, and ELK
- Strong incident response skills with experience leading incident retrospectives and driving improvements
- Excellent problem-solving abilities and experience debugging distributed systems
- Track record of successfully automating operations and reducing toil
- Strong communication skills, with the ability to explain complex technical concepts to diverse audiences
Benefits
The base salary for this role ranges from $120,000 to $160,000, depending on experience. The firm provides a competitive compensation package, which includes salary, equity for all full-time employees, and an annual performance bonus. Employees also receive a comprehensive benefits package, featuring an employer-matched retirement plan, heavily subsidized healthcare, 100% employer-paid dental and vision coverage, telemedicine and virtual mental health counseling, parental leave, and unlimited paid time off (PTO).
If this aligns with your expertise, apply today for immediate consideration!
Salary: $120,000 - $160,000