What are the responsibilities and job description for the Principal Observability and Reliability Tooling Engineer position at Diligente Technologies?
Principal Observability and Reliability Tooling Engineer
Employment Type: Full time
City: Alpharetta
State: Georgia
Onsite / Hybrid role
About the Role:
Key Responsibilities:
- Develop and implement best-in-class observability strategies that align with business objectives while staying current with industry trends and best practices.
- Design and maintain secure, scalable, and highly available platforms for metrics, logging, tracing, real user monitoring, and synthetic monitoring.
- Lead the organization's adoption of OpenTelemetry, focusing on instrumentation and best practices.
- Collaborate with cross-functional teams to gather requirements and ensure observability is integrated into the development lifecycle with an emphasis on quality and governance.
- Research, analyze, and recommend technical solutions to address complex observability challenges and improve system performance.
- Build and enhance self-service capabilities for monitoring, alerting, and self-healing frameworks to empower engineering teams.
- Implement and manage observability infrastructure and configuration as code (IaC) using Terraform and Ansible.
- Troubleshoot and resolve issues related to observability pipelines and platforms.
- Create and maintain comprehensive documentation for observability practices and tools.
- Mentor and guide teams on observability principles and best practices.
About You:
Basic Requirements:
- Bachelor's degree in Engineering, a related technical discipline, or equivalent work experience.
- At least 8 years of hands-on experience in observability, with a minimum of 4 years dedicated to developing comprehensive observability strategies and leading their practical implementation.
- At least 5 years of experience with observability platforms such as Grafana, Splunk, PagerDuty, Datadog, and SolarWinds.
- Strong expertise in the adoption and implementation of OpenTelemetry.
- Experience creating automation scripts and tools, including Terraform and Ansible for infrastructure management and configuration as code, along with familiarity with GitHub for version control.
- Experience with public cloud providers, preferably Google Cloud.
- Excellent problem-solving skills and the ability to thrive in a fast-paced environment.
- Strong communication and collaboration skills to work effectively with cross-functional teams.