What are the responsibilities and job description for the SRE Lead (Austin, Texas Metropolitan Area, Hybrid, Full-time) position at Exatech Inc?
Job Details
Position : SRE Lead with AWS Expertise
Location : Austin, Texas Metropolitan Area (Hybrid)
Duration : Full-time
We are seeking a highly skilled Site Reliability Engineering (SRE) Lead to drive transformational initiatives within IT operations and development. This position involves spearheading innovative SRE solutions and fostering an engineering-centric approach to IT operations. The ideal candidate will be a technical leader passionate about designing and implementing reliable, scalable, and high-performing systems with a focus on operational excellence.
Skills
o Docker Products
o Amazon Web Services (AWS)
Key Responsibilities:
System Design and Architecture:
Design and architect reliable, scalable systems and services emphasizing operational excellence, availability, and performance.
Observability and Monitoring:
Apply expertise in telemetry and observability tools, including Dynatrace APM, SolarWinds, Prometheus, Grafana, Kibana, Splunk, and other AIOps tools.
Develop and maintain correlation mechanisms and dashboards for comprehensive visibility across internal and external application requests.
Authentication Mechanisms:
Deep understanding of login authentication technologies such as Ping, ForgeRock, and SiteMinder, including session and cookie management.
SRE Practices and Evangelism:
Promote SRE principles, including incident management, monitoring, alerting, and automation.
Collaborate with development teams to ensure operational reliability and resilience.
Define and implement best practices for SRE within the organization.
Automation and Operational Excellence:
Automate operational tasks to streamline workflows and improve efficiency.
Lead efforts to optimize resource utilization and capacity planning.
Incident and Problem Management:
Drive incident response processes, perform root cause analyses, and implement preventive measures to improve system reliability.
Security and Compliance:
Ensure alignment of SRE practices with security and compliance requirements.
Implement measures to safeguard systems and data integrity.
Team Leadership and Collaboration:
Provide mentorship to SRE teams and foster skill development.
Build strong relationships with operational teams to drive improvements across the enterprise.
Continuous Learning:
Stay updated with industry trends, technologies, and SRE advancements to continuously enhance organizational capabilities.
Qualifications:
10-12 years of hands-on SRE experience with cloud technologies, development, observability tools, and automation.
Technical Expertise:
Observability tools: Dynatrace, SolarWinds, Prometheus, Grafana, Kibana, Splunk, and AIOps tools.
Authentication technologies: Ping, ForgeRock, and SiteMinder.
Cloud platforms: AWS (Control Tower, project setup, account creation, RDS, SSO).
Containerization: Docker and Kubernetes.
Automation: GitLab CI/CD, Terraform, Ansible, or equivalent scripting tools.
Monitoring and alerting: Splunk, Prometheus, Grafana, ELK stack, etc.
Programming and configuration languages: Groovy DSL, Java, Python, and YAML, plus experience with microservices architecture.
Messaging systems: MQ, Kafka.
Databases: Oracle, MySQL.
Additional Skills:
Observability framework implementation with programmatic SLI/SLO blueprints.
Strong knowledge of Linux commands and systems.
Hands-on experience with APM tools like Datadog, AppDynamics, or Dynatrace.
Certifications (Preferred):
Certified Site Reliability Engineer (CSRE).
Certified Kubernetes Administrator (CKA).
AWS Certified DevOps Engineer - Professional.
Google Cloud Professional Cloud DevOps Engineer.