What are the responsibilities and job description for the Storage SRE Engineer position at IBM?
Introduction
As a Site Reliability Engineer (SRE) in the IBM Cloud Infrastructure organization, you will be responsible for ensuring the reliability, scalability, and operational efficiency of IBM Cloud's storage services. You will work closely with development teams, SRE peers and engineering managers to automate infrastructure management, optimize system performance, and enhance monitoring capabilities. This role involves writing code, building automation, troubleshooting production issues, and improving overall service reliability.
Your Role And Responsibilities
Reliability & Scalability
- Design, build, and maintain highly available, distributed storage services with a focus on reliability, scalability, and security.
- Implement auto-scaling, load balancing, and failover strategies to ensure seamless service availability.
- Analyze performance bottlenecks, optimize system efficiency, and contribute to capacity planning efforts.
- Develop infrastructure automation using PHP, Go, Kubernetes, and other cloud-native technologies.
- Implement self-healing mechanisms and automated remediation processes to minimize manual intervention.
- Respond to production incidents, participate in root cause analyses (RCA), and implement long-term fixes to improve system resilience.
- Collaborate on observability solutions, including monitoring, logging, and alerting, using tools like Prometheus, Grafana, Splunk, and IBM Cloud Monitoring.
- Ensure compliance with security best practices and regulatory requirements.
- Implement secret management, encryption, and access control for sensitive infrastructure components.
- Participate in security audits, vulnerability assessments, and compliance automation efforts.
- Work closely with development, operations, and security teams to design and implement resilient architectures.
- Advocate for DevOps/SRE best practices, including blameless postmortems, incident retrospectives, and operational readiness reviews.
Master's Degree
Required Technical And Professional Expertise
- 2 years of experience in SRE, DevOps, or Software Engineering roles.
- An understanding of cloud infrastructure and operations is a must.
- Comfortable in a Unix/Linux shell, able to write shell scripts, and familiar with Linux internals.
- Experience with the software development life cycle, test-driven development, continuous integration, and continuous delivery.
- Experience with containers and container platforms such as Docker, Kubernetes, and OpenShift.
- Familiarity with Linux systems administration, networking, and distributed systems.
- Experience with troubleshooting production incidents and implementing permanent fixes.
- Ability to write clean, maintainable, and efficient automation code.
- Familiarity with Ansible, Bash, core Python development, and deployments in production environments.
- Familiarity with one of C, C++, Go, Python, or Java.
- PHP and Perl development experience.
- Experience with monitoring applications such as Grafana, ELK stack, Prometheus, Nagios, and Sysdig.
- Familiarity with cloud deployment tooling.