What are the responsibilities and job description for the Site Reliability Engineer position at CereCore?
Classification: Contract
Contract Length: 12 months
Position Summary
As a Senior Site Reliability Engineer (SRE), you will apply SRE best practices to mission-critical applications across the enterprise. When these applications fail, you’ll have the skills and decision-making capabilities to quickly restore services, investigate the root cause, and develop a plan that mitigates future failures. You will spend time analyzing system performance and identifying ways to enhance the reliability of our environments by developing dashboards, performing configuration changes, building robust monitoring systems, and leveraging automation to drive efficiencies. You will help drive uptime and reliability across the enterprise.
Responsibilities
- Support system upgrades, architecture design, implementations, and deployments.
- Work in a complex organization, navigate multiple verticals of expertise, and negotiate with, guide, direct, and influence your peers to deliver real solutions.
- Maintain industry knowledge in software development, architecture, and development products, such as databases, security, and automation products.
- Promote a collaborative team environment and work closely with colleagues to achieve business objectives.
- Collaborate with stakeholders (e.g., business stakeholders, product owners, project managers, and end users) to understand functional and non-functional requirements.
- Lead investigations into development and design problems and propose solutions.
- Participate with team members in scope of work estimation and forecasting.
- Improve performance of existing software by diagnosing and resolving critical issues.
- Prepare technical documentation, including software & architectural design evaluation plans, data flow diagrams, test results, and technical manuals.
- Adhere to and influence established development practices and processes.
- Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding.
- Conduct ongoing reviews of technology, infrastructure, and code to enhance the applications and build resiliency into them.
- Create sustainable systems and services through automation and uplifts.
- Balance feature development & deployments with speed, reliability, and well-defined service-level objectives.
- Partner with development teams and vendors of third-party applications to improve services through rigorous testing and release procedures.
- Build automations that “self-heal” applications and reduce the toil of manual operational tasks, in pursuit of operational excellence, uptime, and reliability of our applications.
- Participate in, lead, and drive postmortem analyses of why services broke or degraded, including recommendations for long-term fixes. This may require working across multiple teams and organizations within the enterprise. Determine the root cause of all production-level incidents and write corresponding high-quality RCA reports.
Requirements