What are the responsibilities and job description for the Site Reliability Engineer position at Chabez Tech LLC?
Job Details
Job Title: Site Reliability Engineer (No DevOps Profile)
Location: Atlanta, GA (Hybrid, 3 Days Onsite)
Experience: 13 Years
Job Summary:
We are seeking a highly skilled Site Reliability Engineer (SRE) with 6-10 years of experience to manage and enhance the reliability, performance, and scalability of enterprise infrastructure. The ideal candidate will have expertise in Kubernetes (K8s), Envoy, REST/gRPC/HTTP, OTEL, Networking, Python, Observability, RAG, and LLM. This role requires a proactive approach to troubleshooting, process improvement, and infrastructure automation.
Key Responsibilities:
- Resolve incidents within SLA and escalate critical issues.
- Perform alert analysis and enhance monitoring.
- Troubleshoot networking, Kubernetes, and application issues.
- Strengthen observability (Prometheus, Datadog, Grafana).
- Document and improve SOPs.
- Mentor team members and assist with migrations.
- Ensure ITSM compliance and automate infrastructure using Terraform, Ansible, and CloudFormation.
Key Skills & Experience:
- 8 years in SRE.
- 5 years with Kubernetes (K8s), Envoy, OTEL, Networking, Python, Observability, RAG, and LLM.
- Strong REST/gRPC/HTTP API and container orchestration experience.
- Proficiency in Terraform, Ansible, CloudFormation, and CI/CD pipelines (Jenkins, GitHub Actions, GitLab CI, ArgoCD).
- Expertise in troubleshooting distributed systems and ITIL-based incident management.
- Cloud experience (AWS, Azure, Google Cloud Platform).
- Strong communication and collaboration skills.
Client Needs:
Experienced SRE with expertise in Kubernetes, networking, observability, and automation (Terraform, Ansible). Strong troubleshooting and process improvement skills.