Senior Site Reliability Engineer (SRE)
Remote Position
6 Month Contract
Our client is seeking a highly skilled Senior Site Reliability Engineer (SRE) to join their team and help ensure the reliability, availability, and scalability of their systems. As a Senior SRE, you will work closely with development, operations, and security teams to build, monitor, and improve infrastructure and application performance while implementing best practices in automation and incident management.
Key Responsibilities:
System Reliability & Performance
- Ensure high availability and reliability of applications and infrastructure.
- Design and implement robust monitoring, logging, and alerting systems.
- Conduct performance tuning and capacity planning to optimize system efficiency.
Automation & Infrastructure as Code (IaC)
- Develop and maintain automation tools to manage deployments and configurations.
- Implement Infrastructure as Code (IaC) using tools like Terraform, Ansible, or CloudFormation.
- Automate manual operational tasks to improve efficiency and reduce downtime.
Incident Management & Troubleshooting
- Participate in on-call rotations to quickly resolve incidents and prevent recurrence.
- Perform root cause analysis (RCA) for production incidents and drive post-mortem reviews.
- Develop and document runbooks to standardize response procedures.
DevOps & CI/CD
- Work closely with development teams to implement CI/CD pipelines for faster and safer deployments.
- Optimize build and deployment workflows using Jenkins, GitHub Actions, or similar tools.
- Ensure security and compliance best practices are embedded in the deployment process.
Cloud & Infrastructure Management
- Manage and optimize cloud-based infrastructure (AWS, Azure, GCP).
- Implement container orchestration solutions using Kubernetes and Docker.
- Ensure security best practices for cloud-based environments, including IAM and network security.
Required Skills & Qualifications:
Technical Expertise
- Strong experience in Linux/Unix system administration.
- Hands-on experience with Kubernetes, Docker, and cloud platforms (AWS, Azure, or GCP).
- Proficiency in Terraform, Ansible, or CloudFormation for Infrastructure as Code.
Monitoring & Observability
- Experience with monitoring and logging tools such as Prometheus, Grafana, ELK Stack, Datadog, or Splunk.
Automation & Scripting
- Strong scripting skills in Python, Bash, or Go.
- Expertise in automating operational tasks and workflows.
Incident Management & Troubleshooting
- Ability to analyze system failures and implement preventive solutions.
- Experience with incident response and root cause analysis.
CI/CD & DevOps Practices
- Experience with CI/CD tools such as Jenkins, GitLab CI/CD, or GitHub Actions.
- Familiarity with GitOps methodologies and release automation.
Security & Compliance
- Knowledge of network security, IAM, and compliance frameworks such as SOC 2 and ISO 27001.
Preferred Qualifications:
- Experience in SaaS, fintech, or high-scale distributed systems.
- Certifications in AWS, Kubernetes (CKA/CKAD), or Terraform.
- Familiarity with service mesh technologies like Istio or Linkerd.
Metasys Technologies is an equal opportunity employer. All applicants will be considered for employment without attention to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran or disability status.