What are the responsibilities and job description for the Site Reliability Engineer position at Prama?
Job Title: Site Reliability Engineer (SRE) - Kubernetes and Systems Automation Specialist
Location: Remote
Job Type: Contract
Position Overview:
We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) with a specialization in Kubernetes and Systems Automation. In this role, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based infrastructure. You will work closely with engineering, DevOps, and operations teams to design and implement systems that automate operations, reduce manual intervention, and enhance our platform's overall reliability.
Key Responsibilities:
1. Kubernetes Management:
- Deploy, manage, and maintain Kubernetes clusters in cloud and on-premises environments.
- Optimize Kubernetes configurations, including namespaces, pods, services, and networking.
- Implement robust CI/CD pipelines for containerized applications.
- Monitor and troubleshoot Kubernetes workloads to ensure high availability.
2. Systems Automation:
- Develop and manage Infrastructure as Code (IaC) using tools such as Terraform or Ansible.
- Automate routine tasks, including monitoring, deployments, scaling, and incident response.
- Write efficient scripts for task automation using Python, Bash, or similar languages.
- Collaborate on the design and implementation of automated disaster recovery and failover strategies.
3. Performance and Reliability:
- Set up monitoring and alerting systems using tools like Prometheus, Grafana, or Datadog.
- Perform root cause analysis for incidents and implement solutions to prevent future occurrences.
- Establish Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure and improve system reliability.
4. Collaboration and Culture:
- Partner with development teams to design and implement scalable and fault-tolerant systems.
- Drive a culture of reliability through postmortem analysis and a blameless incident review process.
- Provide guidance and training to teams on Kubernetes best practices and systems automation.
Qualifications:
Technical Expertise:
- Proven experience managing Kubernetes in production environments.
- Strong knowledge of containerization technologies (e.g., Docker).
- Proficiency in Infrastructure as Code (IaC) tools like Terraform, CloudFormation, or Ansible.
- Expertise in programming and scripting languages (e.g., Python, Go, Bash).
Cloud Experience:
- Hands-on experience with at least one major cloud platform (AWS, GCP, or Azure).
- Knowledge of hybrid and multi-cloud setups is a plus.
Systems Engineering:
- Strong understanding of Linux/Unix systems and networking fundamentals.
- Familiarity with logging and monitoring tools such as the ELK Stack, Prometheus, or Grafana.
Soft Skills:
- Strong problem-solving and analytical skills.
- Excellent communication and teamwork abilities.
- Ability to document technical solutions clearly and effectively.
Preferred Qualifications:
- Certified Kubernetes Administrator (CKA) or a similar certification.
- Experience with service mesh technologies like Istio or Linkerd.
- Familiarity with security best practices in Kubernetes and cloud environments.