What are the responsibilities and job description for the SRE with Kubernetes position in Atlanta, GA (onsite 3 days a week) at Amaze Systems Inc?
Job Details
Role: SRE with Kubernetes
Location: Atlanta, GA (onsite 3 days a week)
Duration: Long Term
JD:
Looking for a Kubernetes SRE with 5-10 years of experience.
Must Have Skills
- Skill 1: 7 years of experience with Kubernetes, GitLab, and Splunk o11y
- Skill 2: 7 years of experience with Prometheus and scripting in Python or Go
- Skill 3: 5 years of experience with Java troubleshooting and Cisco Observability
Responsibilities
- Infrastructure Management: Design, implement, and manage Kubernetes clusters in production environments to ensure high availability and reliability.
- Automation: Build and manage automation tools and scripts for continuous deployment, scaling, and self-healing of applications using Kubernetes and associated tooling (Helm, kubectl, Kustomize).
- Monitoring and Metrics: Implement robust monitoring solutions using Prometheus, Grafana, and other observability tools to track the health of Kubernetes clusters, applications, and services.
- Incident Management: Work with cross-functional teams to respond to incidents, identify root causes, and implement solutions to prevent recurrence.
- CI/CD Pipeline Optimization: Design and maintain continuous integration and deployment pipelines to improve the release cycle and reduce downtime.
- Capacity Planning: Forecast resource needs, scale systems efficiently, and optimize cloud infrastructure to meet growing demand.
- Disaster Recovery: Define and implement strategies for backup, recovery, and failover to ensure data integrity and uptime.
- Collaboration: Partner closely with development teams to help design scalable, resilient, and performant architectures on Kubernetes.
- Security: Ensure that the Kubernetes infrastructure follows security best practices, including network policies, RBAC, and Pod Security Admission (the successor to Pod security policies).
Required Skills & Qualifications
- Experience with Kubernetes: Hands-on experience in deploying and managing Kubernetes clusters (preferably in production environments).
- Cloud Platforms: Strong experience with cloud platforms such as AWS, Google Cloud Platform, or Azure, with a focus on managed Kubernetes services (e.g., EKS, GKE, AKS).
- Containerization: Expertise in container technologies like Docker, container orchestration with Kubernetes, and Helm charts.
- Automation Tools: Familiarity with Infrastructure-as-Code tools such as Terraform, Ansible, or CloudFormation.
- Monitoring & Observability: Knowledge of monitoring tools such as Prometheus, Grafana, ELK stack, or similar.
- Networking: Understanding of networking concepts (DNS, Load Balancers, etc.) and how they apply to Kubernetes.
- CI/CD Pipelines: Strong knowledge of CI/CD tools like Jenkins, GitLab CI, or CircleCI.
- Scripting: Proficiency in scripting languages such as Bash, Python, or Go.
- Incident Response & Root Cause Analysis: Experience in managing and resolving production incidents, with a focus on improving systems after the event.
- Collaboration & Communication: Excellent communication skills to work in cross-functional teams and interact with stakeholders across the company.
Thanks & Regards
Rahul Sharma | Talent Acquisition Specialist
Amaze Systems Inc