What are the responsibilities and job description for the Chaos Engineering Specialist position at EITAcies, Inc.?
Job Details
We are seeking a highly skilled and proactive Chaos Engineering Specialist to join our team. This individual will play a crucial role in ensuring system resilience and reliability by executing chaos engineering experiments, reviewing prior outages, and collaborating with stakeholders. The ideal candidate will bring expertise in observability platforms and chaos engineering tools, and will either help build a tool that meets the client's needs or customize and automate existing commercial tools to run chaos tests at scale.
Technical Skills
Chaos Engineering:
Hands-on experience with chaos engineering tools like AWS FIS, Gremlin, Chaos Monkey, Chaos Toolkit, or similar platforms.
Experience conducting experiments in production or controlled staging environments.
Ability to design, execute, and analyze chaos experiments to improve system reliability.
Ability to document findings and observations and to propose solutions that mitigate future issues.
Tool Development Expertise:
Proficiency in programming languages like Python, Java, or Go for building scalable applications.
Ability to customize and extend existing frameworks for specific testing needs.
Knowledge of frameworks for automation, such as Selenium, Robot Framework, or custom-built solutions.
Familiarity with REST APIs and the ability to integrate chaos testing tools with other platforms.
Architectural Knowledge:
Experience preparing architectural flow diagrams to illustrate system designs and processes.
Ability to identify and document potential failure points in complex systems.
Observability Platforms:
Proficiency in New Relic and Grafana for monitoring, logging, and visualization.
Ability to create custom dashboards and trace system performance.
Infrastructure as Code (IaC):
Expertise in tools like Terraform, Ansible, or CloudFormation.
Strong understanding of infrastructure automation principles.
Familiarity with scripting languages like Python, Bash, or Groovy.
CI/CD Pipelines:
Deep understanding of Jenkins, including writing custom jobs and automating chaos engineering experiments.
Familiarity with containerization and orchestration (e.g., Docker, Kubernetes) is a plus.
Professional Experience
6 years of experience in Chaos Engineering, SRE, or similar roles.
Proven track record of identifying system failures and implementing solutions.
Experience reviewing outages, gathering data, and presenting findings.
Soft Skills
Excellent communication and collaboration skills for working with application stakeholders and SRE engineers.
Strong organizational and project management abilities for scheduling and leading meetings.
Analytical mindset to gather data, assess system behavior, and propose recommendations.