What are the responsibilities and job description for the Site Reliability Engineer (SRE) position at SSTech LLC?
Job Details
SRE
Note: We are not looking for a DevOps Engineer; this role requires a pure SRE.
Title: Site Reliability Engineer (SRE)
Duration: 12 months with possible conversion
Location: Richmond, VA or McLean, VA - Hybrid onsite role.
Job Description:
We are looking for a highly skilled and motivated Site Reliability Engineer (SRE) to join our team. In this role, you will be responsible for building and maintaining reliable, scalable, and efficient systems that ensure the high availability and performance of our applications. You will work closely with development and operations teams to implement SRE practices, including dashboard building, monitoring, and performance optimization.
Key Responsibilities:
Design, build, and maintain SRE dashboards to provide real-time visibility into the health and performance of our applications.
Implement and maintain SLAs, SLOs, and SLIs to ensure service reliability and alignment with business requirements.
Leverage DevOps principles to improve CI/CD pipelines, enabling faster and more reliable deployment cycles.
Support and optimize microservices development to ensure scalability, reliability, and performance across distributed systems.
Build and manage AWS infrastructure for efficient resource provisioning, scaling, and monitoring.
Collaborate with cross-functional teams to identify and resolve production issues in a timely manner.
Automate monitoring, alerting, and remediation processes to reduce manual intervention and increase uptime.
Participate in on-call rotations to ensure prompt resolution of incidents and service disruptions.
Conduct post-mortems on incidents, identify root causes, and implement preventive measures to avoid recurrence.
Foster a culture of continuous improvement, reliability, and resilience in the software development lifecycle.
Required Skills & Qualifications:
Proven experience in SRE practices, including dashboard building, monitoring, and alerting.
In-depth understanding of SLA/SLO/SLI concepts and how they apply to service reliability.
Strong experience with DevOps, including CI/CD pipelines, version control systems, and automated testing.
Solid background in microservices development, containerization (Docker, Kubernetes), and distributed systems.
Proficient in cloud infrastructure management, particularly AWS services (EC2, S3, Lambda, CloudWatch, etc.).
Expertise in scripting and automation tools (e.g., Python, Bash, Terraform).
Strong troubleshooting and incident response skills, with a focus on improving system reliability.
Experience with monitoring tools such as Prometheus, Grafana, and Datadog.
Strong collaboration and communication skills to work across teams and support business goals.
Join our team and play a key role in developing high-performance, scalable solutions that drive innovation and success.