What are the responsibilities and job description for the SRE Lead position at ExaTech Inc?
We are seeking a highly skilled Site Reliability Engineering (SRE) Lead to drive transformational initiatives within IT operations and development. This position involves spearheading innovative SRE solutions and fostering an engineering-centric approach to IT operations. The ideal candidate will be a technical leader passionate about designing and implementing reliable, scalable, and high-performing systems with a focus on operational excellence.
Key Responsibilities:
System Design and Architecture:
· Design and architect reliable, scalable systems and services emphasizing operational excellence, availability, and performance.
Observability and Monitoring:
· Apply expertise in telemetry and observability tools, including Dynatrace APM, SolarWinds, Prometheus, Grafana, Kibana, Splunk, and other AIOps tools.
· Develop and maintain correlation mechanisms and dashboards for comprehensive visibility across internal and external application requests.
Authentication Mechanisms:
· Deep understanding of login authentication technologies such as Ping, ForgeRock, and SiteMinder, including session and cookie management.
SRE Practices and Evangelism:
· Promote SRE principles, including incident management, monitoring, alerting, and automation.
· Collaborate with development teams to ensure operational reliability and resilience.
· Define and implement best practices for SRE within the organization.
Automation and Operational Excellence:
· Automate operational tasks to streamline workflows and improve efficiency.
· Lead efforts to optimize resource utilization and capacity planning.
Incident and Problem Management:
· Drive incident response processes, perform root cause analyses, and implement preventive measures to improve system reliability.
Security and Compliance:
· Ensure alignment of SRE practices with security and compliance requirements.
· Implement measures to safeguard systems and data integrity.
Team Leadership and Collaboration:
· Provide mentorship to SRE teams and foster skill development.
· Build strong relationships with operational teams to drive improvements across the enterprise.
Continuous Learning:
· Stay updated with industry trends, technologies, and SRE advancements to continuously enhance organizational capabilities.
Qualifications:
· 10–12 years of hands-on SRE experience with cloud technologies, development, observability tools, and automation.
Technical Expertise:
· Observability tools: Dynatrace, SolarWinds, Prometheus, Grafana, Kibana, Splunk, and AIOps tools.
· Authentication technologies: Ping, ForgeRock, and SiteMinder.
· Cloud platforms: AWS (Control Tower, project setup, account creation, RDS, SSO).
· Containerization: Docker and Kubernetes.
· Automation: GitLab CI/CD, Terraform, Ansible, or equivalent scripting tools.
· Monitoring and alerting: Splunk, Prometheus, Grafana, ELK stack, etc.
· Programming and configuration languages: Groovy DSL, Java, Python, and YAML; familiarity with microservices architecture.
· Messaging systems: MQ, Kafka.
· Databases: Oracle, MySQL.
Additional Skills:
· Observability framework implementation with programmatic SLI/SLO blueprints.
· Strong knowledge of Linux commands and systems.
· Hands-on experience with APM tools like Datadog, AppDynamics, or Dynatrace.
Certifications (Preferred):
· Certified Site Reliability Engineer (CSRE).
· Certified Kubernetes Administrator (CKA).
· AWS Certified DevOps Engineer – Professional.
· Google Cloud Professional DevOps Engineer.