What are the responsibilities and job description for the Senior Site Reliability Engineer position at Ethires LLC?
Job Details
Job Title: Senior Site Reliability Engineer
Location: Hartford, CT (Hybrid)
Contract: 6-month contract-to-hire
Requirements
- The Hartford's RE&A Observability team is looking for a highly motivated and experienced Senior Observability Engineer Lead with expertise in Splunk, Dynatrace, and other industry observability tools.
- The Senior Observability Engineer Lead will ensure the reliability of IT services, focusing on the developer experience. This role requires a build-to-manage mindset, strong problem-solving skills, and innovative thinking applied to designing, building, testing, deploying, changing, and maintaining services, leveraging deep engineering expertise. Experience in managing AI-based systems is a plus.
- Key success metrics include service stability, effective software delivery, utilization of customer-based observability practices, and achieving industry-leading operating norms within tools.
- The Senior Observability Engineer Lead will also contribute to the ongoing advancement of the RE Insights practice within and beyond their area of responsibility. This includes assisting with road map development, vision, strategy, RE practices post-mortem, and vendor management tasks such as MOR, QBR, and contracts.
Duties:
- Guide the use of best-in-class software engineering standards and design practices for instrumenting code and application technology stacks for Splunk, Dynatrace, and Akamai. Enable the generation of relevant metrics on overall technology health, including availability, performance, quality, technical debt, and resiliency.
- Function as the go-to technical expert for the applications supported, requiring depth and breadth of knowledge in Splunk, Dynatrace and related technologies. Provide expertise in applications, integration, interfaces, and the business domain to drive insights and improvements.
- Leverage Splunk and Dynatrace Gen AI capabilities to enhance predictive analytics and automated incident response. Utilize AI-driven insights to proactively identify and address potential issues, ensuring optimal performance and reliability of IT services.
- Enable alerting, monitoring, service intelligence, noise reduction, self-healing, dashboards (critical user journeys), and overall insights using Splunk ITSI and Dynatrace to support all lines of business (LOBs) within the organization.
- Enhance the delivery flow by engineering solutions with Splunk ITSI and Dynatrace to increase delivery speed while adhering to technology standards for sustained reliability.
- Progressively implement preventative controls and drive increased automation and self-healing capabilities using Splunk ITSI and Dynatrace. Continue to improve cost efficiency baselines.
- Promote and implement innovative solutions leveraging the capabilities of Splunk ITSI and Dynatrace.
- Ensure operational excellence. Independently drive the triaging and service restoration of all high-impact incidents to minimize the mean time to service restoration and the impact on the business. Demonstrate end-to-end ownership.
- Partner with infrastructure teams to design and implement intelligent incident routing, enhanced monitoring/alerting capabilities, and automated service restoration processes. Take proactive measures to prevent high-impact incidents.
- Achieve and maintain the continuity of Hartford and third-party assets that support a business function. Accountable for keeping the IT application and infrastructure metadata repositories current.
- Systems thinking: an end-to-end, broad understanding of enterprise architectures and distributed systems.
- Highly collaborative; partners with peers and stakeholders, with a passion for delighting customers.
Required Experience:
- Hands-on experience with performance and observability tools such as Splunk ITSI (IT Service Intelligence), Dynatrace, Splunk, CloudWatch, CloudTrail, and related tools.
- Strong solution architecture orientation to enable expedient troubleshooting, issue-resolution and root-cause removal in a hybrid cloud environment.
- Experience with continuous integration and DevOps methodologies; preferred tools include GitHub, Jenkins, Nexus, Rally, SonarQube, and Akamai.
- Keeps abreast of new market technologies and is adept at learning and adopting new models. Promotes and applies continuous learning.
- Knowledge of complex traditional and modern enterprise architectures and systems. Strong hybrid cloud experience (private and public) across various service delivery models (SRE, IaaS, PaaS, SaaS).
- Effective communication (verbal and written), collaboration, and negotiation skills, working in a diverse team across business units.
- Python, Terraform, CloudFormation, AWS SDK, Bash, Golang / Rust / Haskell (optional): 5+ years
- CLI, API