What are the responsibilities and job description for the Cloud Site Reliability Engineer position at Strategic Staffing Solutions?
STRATEGIC STAFFING SOLUTIONS (S3) HAS AN OPENING!
Job Title : Cloud Site Reliability Engineer
Location : Detroit, MI (hybrid: 3 days / week on site in Detroit, 2 days remote)
Duration : 2-year contract
Role Type : W2 only, no corp-to-corp
Highly competitive rate, with benefits available
Job Summary
The Cloud Site Reliability Engineer (SRE) works closely with the cloud development team, the IT operations team, and business partners to streamline and implement enhanced monitoring and alerting capabilities across the infrastructure and application layers. By leveraging automation tools, SREs address and resolve issues, minimizing manual workload and enhancing system scalability and reliability. Their core focus is standardization and automation to build and run fault-tolerant systems. Typically, SREs have a background in software engineering, systems engineering, or system administration, coupled with substantial IT operations experience. SREs oversee availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
Key Accountabilities
- Write and develop code to automate processes, such as analyzing logs, testing production environments, and responding to issues
- Collaborate with agile teams and business partners to develop specifications that address problems and enhancement needs, with a focus on monitoring and metrics for operational readiness
- Identify bottlenecks in development and deployment processes and design automation solutions to mitigate them
- Develop new capabilities for displaying, monitoring, and alerting on key performance indicators by tracking business transactions in real time
- Maintain and grow knowledge of platform configuration management, monitoring of established metrics, and troubleshooting
- Provide continuous feedback to development teams on system stability, defect analysis, and system enhancements
- Design and develop alert escalation and incident response automation
- Provide production support for cloud service outages and incidents and work on both tactical and strategic plans for outage prevention
- Provide feedback on resiliency and maintainability of solutions to Cloud and App architects
- Conduct disaster recovery scenario generation and testing
- Implement sustainable, audit-ready processes that support information technology controls, including deployment execution, access management, audits, incident management, and related requirements
Must-have technical skills :
The S3 Difference
The global mission of S3 is to build trusting relationships and deliver solutions that positively impact our customers, our consultants, and our communities. The four pillars of our company are to :
Job ID : JOB-238417
Publish Date : 10 Oct 2024