What are the responsibilities and job description for the Site Reliability Engineer position at CEI?
Overview:
We are seeking an experienced Site Reliability Engineer (SRE) with expertise in Chaos Engineering to join a leading company in the Entertainment & Theme Parks industry. This role focuses on designing, implementing, and executing chaos experiments to proactively identify weaknesses and improve system resilience. You will play a key role in ensuring the reliability, scalability, and observability of mission-critical systems that support high-traffic digital experiences.
Job at a Glance:
Location: Hybrid – Orlando, FL
Contract Type: 12-month contract with potential for extension or conversion
Pay Rate: $75-$80/hr (W2)
Primary Responsibilities:
- Design and implement chaos experiments to simulate failures and measure system resilience.
- Collaborate with product teams to improve observability, reliability, and scalability.
- Automate infrastructure provisioning and service management.
- Develop self-healing mechanisms to mitigate failures dynamically.
- Define and enforce Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
- Participate in architecture reviews to ensure fault-tolerant design principles are applied.
- Run chaos experiments proactively to identify failure points before they cause major incidents.
- Break down tasks and manage priorities to meet project objectives.
- Complete assigned tickets and tasks on time.
- Represent the SRE team in cross-functional incidents and technical discussions.
- Mentor and guide engineers in reliability best practices and chaos testing methodologies.
Basic Qualifications:
- 5 years of experience in Site Reliability Engineering, DevOps, or a related field.
- Strong background in Chaos Engineering frameworks (e.g., Gremlin, Chaos Monkey, LitmusChaos).
- Experience with cloud platforms (AWS, GCP, or Azure) and infrastructure-as-code tools (Terraform, CloudFormation).
- Proficiency in one or more programming/scripting languages (Python, Go, Bash, etc.).
- Hands-on experience with Kubernetes and container orchestration.
- Strong troubleshooting and problem-solving skills, particularly in high-pressure situations.
- Experience with incident response, post-mortem analysis, and root cause identification.
- Familiarity with monitoring, logging, and observability tools (Prometheus, Grafana, ELK, Datadog, etc.).
- Experience with compliance, security, and vulnerability management.
- Excellent communication and stakeholder management skills.
Preferred Qualifications:
- Experience with large-scale distributed systems and microservices architecture.
- Knowledge of ITIL processes for incident, problem, and change management.
- Familiarity with message queueing systems (Kafka, RabbitMQ, etc.).
- Knowledge of advanced networking concepts, DNS, load balancing, and traffic management.
- Experience in developing automated remediation and self-healing solutions.
- Active participation in the Chaos Engineering community and open-source contributions.
Salary: $75-$80/hr