What are the responsibilities and job description for the Lead SRE Engineer position at Cloudious?
Job Details
Top Qualifications:
5 years of leadership experience guiding SRE engineers
Experience with digital web products
Containers (Docker) and container orchestration (Kubernetes)
Required Skills:
Docker Containers
Grafana
SRE
Responsibilities:
System Reliability and Performance: Lead and drive the end-to-end (Supply Chain) reliability, availability, and performance of applications in Digital Experience.
Monitoring and Alerting: Design, implement, and maintain robust monitoring and alerting systems to proactively identify and resolve issues.
Infra Capacity Planning: Drive capacity planning, ensuring that systems can handle current and future workloads.
Incident Response: Lead and guide org-level application teams in incident response efforts, ensuring quick and effective resolution of issues.
Performance Tuning: Drive and implement best practices and controls to identify bottlenecks and support performance tuning before production rollout.
Post-Incident Reviews: Drive and support post-incident (P1/P2) reviews to identify root causes and prevent future incidents.
Security: Lead application teams in adopting industry-standard best practices for managing security certificates, secrets, and non-user IDs to avoid issues and outages.
Change Management: Implement robust change management processes to ensure that changes to the system are deployed safely and reliably.
Peak Season Readiness: Help Digital teams prepare for peak season by ensuring overall end-to-end system resiliency and redundancy to handle expected peak usage volumes.
War Room Playbooks: Support teams in preparing playbooks that cover war room scenarios.
Auto Failover & Auto Scaling: Lead and support application teams in adopting effective auto failover and auto scaling strategies to maintain overall system resiliency.
Collaboration with Engineers: Work with application development teams to understand their needs, identify potential reliability issues, and improve the software development lifecycle.
Cloud: Define and develop the cloud strategy for the enterprise, focusing on AWS and aligned with IT requirements.
Qualifications:
Extensive experience with Docker Containers for application deployment and management.
Strong skills in using Grafana for system monitoring and visualization.
Solid understanding of SRE principles and practices.
Proven track record in geospatial analysis and data management.
Experience with hybrid work models and day shift operations is nice to have.
Ability to work collaboratively in a cross-functional team environment.
Excellent communication and reporting skills.
Detail-oriented and capable of handling complex data sets.
Proactive in identifying and solving technical challenges.
Strong commitment to quality and continuous improvement.
Ability to mentor and guide junior team members effectively.
Adaptable to new technologies and industry trends.