What are the responsibilities and job description for the SRE Architect position at Galent?
We are seeking a highly skilled and experienced Site Reliability Engineer (SRE) Architect to join our team. In this role, you will be responsible for designing, implementing, and optimizing the infrastructure that supports our services and applications. You will work closely with engineering teams to ensure the availability, scalability, and performance of our systems, while also mentoring junior team members and driving operational excellence across the company.
Key Responsibilities:
- Architecture Design & Strategy:
- Lead the design and implementation of resilient, scalable, and high-performing systems.
- Collaborate with engineering teams to define architecture principles that drive reliability, scalability, and performance.
- Drive infrastructure as code (IaC) adoption and ensure best practices in cloud infrastructure management.
- Design and implement monitoring, alerting, and logging solutions that ensure the health and stability of production systems.
- Operational Excellence:
- Establish and maintain SRE best practices and standards for monitoring, incident response, and capacity planning.
- Own and lead the incident management process, ensuring timely resolution and post-incident reviews.
- Conduct risk assessments and proactive failure analysis to minimize system downtime and customer impact.
- Collaboration & Mentoring:
- Work with cross-functional teams including software engineering, operations, and product management to ensure reliability is built into every stage of development.
- Provide technical leadership and mentorship to junior and senior engineers in the SRE team.
- Contribute to the hiring process by interviewing and assessing potential candidates for the SRE team.
- Automation & Optimization:
- Drive automation initiatives to eliminate manual processes and reduce operational overhead.
- Continuously improve existing tools and frameworks for monitoring, alerting, and incident response.
- Implement capacity planning and optimization strategies to ensure resource usage aligns with business needs.
- Metrics & Reporting:
- Define and track key reliability metrics (e.g., availability, latency, change failure rate) in collaboration with other engineering teams.
- Use data-driven approaches to drive decision-making and identify opportunities for system improvements.
- Security & Compliance:
- Ensure that systems are secure, compliant, and aligned with industry best practices.
- Work closely with security teams to identify and mitigate security risks in production environments.
Requirements:
- Education & Experience:
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
- 8 years of experience in Site Reliability Engineering, DevOps, or infrastructure roles.
- 3 years of experience in an architecture or leadership position within an SRE team.
- Technical Skills:
- Expertise in cloud platforms such as AWS, Azure, or Google Cloud.
- Strong experience with containerization and orchestration technologies such as Docker, Kubernetes, and Helm.
- Deep understanding of infrastructure as code (IaC) tools like Terraform, CloudFormation, or Ansible.
- Proficiency in monitoring and observability tools such as Prometheus, Grafana, New Relic, and Datadog.
- Advanced knowledge of automation, continuous integration/continuous deployment (CI/CD), and version control (Git).
- Strong programming and scripting skills in languages such as Python, Go, or Bash.
- In-depth knowledge of system architecture, distributed systems, and microservices.
- Soft Skills:
- Strong leadership and communication skills, with the ability to interact with engineers, product managers, and business stakeholders.
- Ability to manage complex, multi-faceted projects while delivering high-quality outcomes.
- Strong problem-solving and troubleshooting abilities in large-scale distributed systems.
Preferred Qualifications:
- Experience in designing and managing high-availability systems and fault-tolerant architectures.
- Familiarity with machine learning or big data systems and their operational needs.
- Cloud platform certifications (e.g., AWS Certified Solutions Architect, Google Professional Cloud Architect).
- Experience with incident management tools (e.g., PagerDuty, Opsgenie) and postmortem processes.
- Knowledge of Agile or Scrum methodologies.