What are the responsibilities and job description for the Site Reliability Manager/SRE || W2 || Plano, TX or Richmond, VA (hybrid) position at 1 Point System?
Job Details
Job Title: Site Reliability Manager
Location: Plano, TX or Richmond, VA (Hybrid)
Employment Type: W2
Overview:
Our client is seeking a Site Reliability Manager with a strong background in cloud-based solutions (preferably AWS) and a passion for automation, self-healing systems, and Site Reliability Engineering (SRE) principles. You will provide technical leadership to ensure the stability, scalability, and performance of the client's applications, and identify opportunities for automation and proactive monitoring.
Key Responsibilities:
Cloud Expertise: Apply a deep understanding of cloud-based solutions and services, with a focus on AWS (EC2, DynamoDB).
Automation & Scripting: Lead automation efforts using scripting, machine learning, and self-healing systems.
DevOps Best Practices: Provide technical leadership around DevOps tools (Git/GitHub, Jenkins, Docker) and best practices.
Production Support: Keep systems highly reliable, drawing on experience with production support and monitoring tools (Splunk, New Relic).
Technology Stack: Work proficiently in Python, NodeJS (New Relic Synthetics), ReactJS, and Java, and integrate with REST APIs.
Monitoring & Alerting: Develop and implement automated monitoring and alerting solutions to minimize manual interventions.
Zero-Touch Automation: Identify opportunities to reduce manual validation and promote zero-touch automation and self-healing systems.
Requirements:
Experience with AWS cloud services (EC2, DynamoDB).
Proficiency with common DevOps tools (Git/GitHub, Jenkins, Docker).
Strong skills in Python, NodeJS, ReactJS, and Java.
Familiarity with production support processes and REST APIs.
Solid understanding of SRE principles, including proactive monitoring and self-healing system design.