What are the responsibilities and job description for the Platform / Site Reliability Engineer position at Axiom Software Solutions Limited?
We are looking for a skilled Platform Engineer / SRE to design, implement, and maintain our cloud infrastructure and platforms. The ideal candidate will have a strong background in Kubernetes administration, Azure cloud services, infrastructure as code, and automation. You will play a crucial role in ensuring the scalability, reliability, and security of our systems while supporting our AI/ML initiatives.
* Design, deploy, and manage infrastructure solutions using Terraform, ensuring scalability, security, and reliability.
* Develop and maintain infrastructure-as-code (IaC) scripts to automate the provisioning and configuration of resources.
* Ensure version-controlled, repeatable deployments using IaC best practices.
* Implement and manage Kubernetes clusters for containerized applications.
* Collaborate with development teams to deploy, scale, and optimize applications in Kubernetes environments.
* Use scripting languages (e.g., Python) to automate routine tasks and streamline workflows.
* Implement continuous integration and continuous deployment (CI/CD) pipelines for efficient software delivery.
* Ensure seamless integration of infrastructure components with CI/CD pipelines.
* Design, deploy, and maintain scalable and reliable infrastructure for AI/ML platforms.
* Implement containerization (Docker) and orchestration (Kubernetes) solutions for deploying and managing AI/ML applications.
* Ensure containerized applications are secure, scalable, and easily deployable.
* Enable seamless integration of AI/ML models into the platform, ensuring data pipelines are efficient and reliable.
* Establish monitoring and alerting systems to ensure the health and performance of AI/ML platforms.
* Implement security best practices for AI/ML platforms, ensuring data privacy and compliance with industry standards.
* Bachelor's degree in Computer Science, Engineering, or a related field
* Proven experience in Kubernetes administration, specifically with Azure Kubernetes Service (AKS)
* Strong proficiency in Azure cloud services and Azure Resource Manager (ARM) templates
* Expert-level scripting skills in PowerShell and Python
* Hands-on experience with Terraform for infrastructure as code
* Solid understanding of CI/CD principles and experience with Azure DevOps
* Experience with containerization technologies, particularly Docker
* Strong problem-solving skills and ability to work in a fast-paced environment
* Excellent communication and collaboration skills