What are the responsibilities and job description for the Senior Observability Lead Engineer position at Corporate Systems Associates?
Job Details
Job ID: H#12751 - Senior Observability Lead Engineer
PLEASE NOTE: This is a 6-month contract-to-hire position, and candidates must meet Client full-time conversion policies. Those dependent on work permit sponsorship now or at any time in the future (e.g., H1B, OPT, CPT) do not meet Client requirements for this opening.
**Hybrid in Hartford, CT preferred (Charlotte, NC or Chicago, IL as a last resort) - Candidates willing to relocate in order to start on day one are acceptable
***TOP MUST HAVES:
- 1. Splunk / Dynatrace (5 years)
- 2. Python (5 years)
- 3. AWS (5 years)
- 4. Experience in managing AI-based systems is HIGHLY DESIRED
What does the structure/makeup of the team currently look like?
- Small agile team within a larger SAFe value stream
What will this contractor be accomplishing for your team?
- Software development with various DevOps tasks related to observability.
What will this contractor be working on / are there specific projects?
- Increasing self-service and usability, and shipping new capabilities, for our cloud-focused observability suite (Splunk, Dynatrace, AWS/Google Cloud Platform Cloud Monitoring tools)
What makes this role stand out / what are some unique selling points?
- Part of a larger enterprise migration to the cloud over 5 years
- Ties into a transformation program to roll out SRE across the company and empower application development teams to self-serve best-in-breed developer tools that enhance the developer experience.
Describe your expectations for training once a worker starts and expected ramp up.
- Understand the core DevOps approach to infrastructure engineering. Become familiar with SRE principles and the Hartford approach to SWE. Learn the details of intermediate to advanced instrumentation features. Ensure delivery of work through automation, pipelines, and documentation.
What type of development will this contractor be doing? (Back end, front end, full stack?) and what will they be responsible for developing?
- Light full stack and back-end infrastructure engineering.
What type of application will they be developing? Will this contractor primarily be working on new development or maintaining the current environment?
- New development: new capabilities, automation, and improvements to resiliency, efficiency, and security.
Will this contractor be testing the application/software they are developing?
- They will be expected to exercise full ownership of the SDLC, including documentation, testing, observability, and safe release practices.
What programming languages do you want this contractor to have experience with? (How many years of experience with each language)
- Python, Terraform, CloudFormation, AWS SDK, Bash, Golang / Rust / Haskell (Optional) -- 5+ years
What other tools or technologies do you want this contractor to have experience with/how will they be interacting with these tools in this role?
- CLI, API
What type of background should this contractor have in order to meet the needs of your team? Do they need to hold a specific degree or certification?
- Cloud Certifications, CNCF, or general Linux / DevOps / SRE certs
The Hartford's RE&A Observability team is looking for a highly motivated and experienced Senior Observability Engineer Lead with expertise in Splunk, Dynatrace, and other industry observability tools.
The Senior Observability Engineer Lead will ensure the reliability of IT services, focusing on the developer experience. This role requires a build-to-manage mindset, strong problem-solving skills, and innovative thinking applied to designing, building, testing, deploying, changing, and maintaining services, leveraging deep engineering expertise. Experience in managing AI-based systems is HIGHLY DESIRED.
Key success metrics include service stability, effective software delivery, utilization of customer-based observability practices, and achieving top industry-leading operating norms within tools.
The Senior Observability Engineer Lead will also contribute to the ongoing advancement of the RE Insights practice within and beyond their area of responsibility. This includes assisting with road map development, vision, strategy, RE practices post-mortem, and vendor management tasks such as MOR, QBR, and contracts.
Responsibilities:
- Guide the use of best-in-class software engineering standards and design practices for instrumenting code and application technology stacks for Splunk, Dynatrace, and Akamai. Enable the generation of relevant metrics on overall technology health, including availability, performance, quality, technical debt, and resiliency.
- Function as the go-to technical expert for the applications supported, requiring depth and breadth of knowledge in Splunk, Dynatrace and related technologies. Provide expertise in applications, integration, interfaces, and the business domain to drive insights and improvements.
- Leverage Splunk and Dynatrace Gen AI capabilities to enhance predictive analytics and automated incident response. Utilize AI-driven insights to proactively identify and address potential issues, ensuring optimal performance and reliability of IT services.
Insights Solution Responsibilities:
- Enable alerting, monitoring, service intelligence, noise reduction, self-healing, dashboards (critical user journeys), and overall insights using Splunk ITSI and Dynatrace to support all LOBs within the organization.
- Enhance the delivery flow by engineering solutions with Splunk ITSI and Dynatrace to increase delivery speed while adhering to technology standards for sustained reliability.
- Progressively implement preventative controls and drive increased automation and self-healing capabilities using Splunk ITSI and Dynatrace. Continue to improve cost efficiency baselines.
- Promote and implement innovative solutions leveraging the capabilities of Splunk ITSI and Dynatrace.
IT Ops Responsibilities:
- Ensure operational excellence. Independently drive the triaging and service restoration of all high-impact incidents to minimize the mean time to service restoration and the impact on the business. Demonstrate end-to-end ownership.
- Partner with infrastructure teams to design and implement intelligent incident routing, enhanced monitoring/alerting capabilities, and automated service restoration processes. Take proactive measures to prevent high-impact incidents.
- Achieve and maintain the continuity of Hartford and third-party assets that support a business function. Accountable for keeping the IT application and infrastructure metadata repositories current.
- Systems thinking: an end-to-end, broad understanding of enterprise architectures and distributed systems.
- Highly collaborative; partners with peers and stakeholders, with a passion for delighting customers.
- Hands-on experience with performance and observability tools such as Splunk ITSI (IT Service Intelligence), Dynatrace, Splunk, CloudWatch, CloudTrail, and related tools.
- Strong solution architecture orientation to enable expedient troubleshooting, issue resolution, and root-cause removal in a hybrid cloud environment.
- Experience with continuous integration and DevOps methodologies; preferred tools include GitHub, Jenkins, Nexus, Rally, SonarQube, Akamai, etc.
- Keeps abreast of new market technologies and is adept at learning and adopting new models. Promotes and applies continuous learning.
- Knowledge of complex traditional and modern enterprise architectures and systems. Strong hybrid cloud experience (private and public) across various service delivery models: SRE, IaaS, PaaS, SaaS.
- Effective communication (verbal and written), collaboration, and negotiation skills, working in a diverse team across business units.