What are the responsibilities and job description for the Automation Engineer position at Intraedge?
Job Details
Job Title: Automation Engineer (Python and SRE)
Location: Austin, TX
Duration: Long term contract
Responsibilities
Develop Python-based automation solutions to streamline management of on-prem infrastructure (Pivotal Cloud Foundry, Windows- and Linux-based VMs) and cloud infrastructure on Google Cloud Platform and Kubernetes.
Continuously identify and implement opportunities to improve operational excellence.
Build proactive and innovative solutions that can scale.
Implement and manage configuration automation using Ansible (desirable).
Integrate various tools and services via APIs and client libraries, enabling seamless interoperability across systems.
Enhance deployment reliability by implementing automated chaos engineering strategies, failover mechanisms, and self-healing infrastructure.
Develop proactive monitoring and alerting solutions using tools like Splunk, Google Cloud Platform Operations Suite, Grafana, and Prometheus.
Perform deep root cause analysis (RCA) and incident management for complex system failures, and develop automation to prevent recurrence.
Work on system resilience and performance tuning, ensuring mission-critical applications run efficiently.
Apply AI/ML techniques to automation workflows, enhancing anomaly detection, predictive scaling, and intelligent alerting.
Identify and develop AIOps opportunities, reducing operational overhead through intelligent automation.
Experiment with machine learning models to optimize log analysis, monitoring insights, and failure predictions.
Required Skills & Experience
Strong background in Systems Engineering with a focus on automation and reliability.
Proficiency in Python (intermediate to expert level) for developing automation and integrations.
Hands-on expertise with Kubernetes and cloud platforms (Google Cloud Platform or any major cloud).
Experience integrating various tools and platforms via APIs and client libraries.
Deep understanding of monitoring and alerting using Splunk, Google Cloud Platform Operations Suite, Grafana, and Prometheus.
Ability to work in fast-paced, high-stakes environments where reliability and uptime are critical.
Strong problem-solving skills, capable of navigating uncertainty and handling complex challenges.
Experience with Ansible for infrastructure automation (desirable).
Prior experience working in mission-critical teams handling large-scale, high-availability systems is a plus.
Enthusiasm for AI/ML and AIOps, with a desire to apply them in automation and operations.