What are the responsibilities and job description for the Site Reliability Engineer (SRE) position at DTEL Engineering & Consultants Inc?
Job Details
Job Summary:
We are seeking a highly motivated and experienced Site Reliability Engineer (SRE) to join our team. In this role, you will be responsible for the reliability, performance, and scalability of our critical systems and applications. You will apply your expertise in Kubernetes-based platforms (specifically OpenShift), Google Cloud services, and automation tools to maintain and improve our infrastructure. You will also collaborate with development teams to ensure smooth deployments, monitor system health, and implement proactive measures that prevent issues.
Responsibilities:
- OpenShift Management:
  - Maintain and update OpenShift deployment configurations to adhere to Ford's standards and best practices.
  - Monitor resource utilization of services deployed within OpenShift, identifying areas for optimization and efficiency.
- Tekton Pipeline Development and Maintenance:
  - Design, develop, and implement Tekton pipelines for automated deployments, including deploying files to Amazon S3 storage.
  - Maintain and enhance existing Tekton pipelines to meet Ford's requirements and improve overall CI/CD processes.
- Google Cloud Services Expertise:
  - Utilize Google Cloud Storage, Cloud Run, and Pub/Sub services to build and maintain scalable and reliable solutions.
  - Implement best practices for security, cost optimization, and performance within the Google Cloud environment.
- Terraform Infrastructure as Code:
  - Use Terraform to manage and modify infrastructure and services deployed in Google Cloud, ensuring consistency and repeatability.
- Site Reliability Metrics and Monitoring:
  - Design and implement site reliability metrics to proactively identify and address potential issues.
  - Establish monitoring and alerting systems to ensure system health and performance.
- Security and Access Management:
  - Manage password changes and enforce security policies.
- API Coordination:
  - Coordinate API changes with downstream applications, ensuring seamless integration and minimal disruption.
- Collaboration and Communication:
  - Work closely with development, operations, and security teams to ensure alignment and effective communication.
  - Participate in on-call rotations to address critical incidents.
Qualifications:
- Required:
  - Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
  - Proven experience with Kubernetes-based systems, such as GKE or OpenShift.
  - Strong experience with Tekton for CI/CD automation.
  - Hands-on experience with Google Cloud Services, including Storage, Cloud Run, and Pub/Sub.
  - Proficiency in Terraform for infrastructure as code.
  - Experience in designing and implementing site reliability metrics.
  - Solid understanding of networking, security, and cloud computing principles.
  - Excellent problem-solving and communication skills.
- Preferred (Optional):
  - Experience with code reviews for Angular or Java applications.
  - Experience working in an Agile/Scrum environment.
  - Relevant certifications (e.g., Google Cloud Certified, Kubernetes certifications).