What are the responsibilities and job description for the MLOps Engineer position at Cynet Systems?
Job Description:
Responsibilities:
- Work with the AI/ML Platform Enablement team within the eCommerce Analytics team.
- The broader team is currently on a transformation path, and this role will be instrumental in enabling that vision.
- Work closely with data scientists to productionize models and maintain them in production.
- Deploy and configure Kubernetes components for the production cluster, including API Gateway, Ingress, model serving, logging, monitoring, CronJobs, etc.
- Improve the model deployment process for machine learning engineers (MLEs), enabling faster builds and simplified workflows.
- Be a technical leader on various projects across platforms and a hands-on contributor to the entire platform's architecture.
- Handle system administration, security compliance, and internal tech audits.
- Lead operational excellence initiatives in the AI/ML space, including efficient use of resources, identifying optimization opportunities, forecasting capacity, etc.
- Design and implement different flavors of architecture to deliver better system performance and resiliency.
- Develop capability requirements and a transition plan for the next generation of AI/ML enablement technology, tools, and processes to enable Walmart to efficiently improve performance at scale.
- Ability to create, maintain, scale, and debug production Kubernetes clusters as a Kubernetes administrator, with in-depth knowledge of Docker.
- Ability to transform designs from the ground up and lead innovation in system design.
- Deep understanding of data center architectures, networking, storage solutions, and system performance at scale.
- Have worked on at least one Kubernetes cloud offering (EKS/GKE/AKS) or on-prem Kubernetes (native Kubernetes, Gravity, MetalK8s).
- Programming experience in Python, Node.js, Go, or Bash.
- Ability to use observability tools (e.g., Prometheus and Grafana) to inspect logs and metrics and diagnose issues within the system.
- Experience with Seldon Core, MLflow, Istio, Jaeger, Ambassador, Triton, PyTorch, or TensorFlow/TF Serving is a plus.
- Experience with distributed computing and deep learning technologies such as Apache MXNet, CUDA, cuDNN, TensorRT.
- Experience hardening a production-level Kubernetes environment (memory/CPU/GPU limits, node taints, annotations/labels, etc.).
- Experience with Kubernetes cluster networking and Linux host networking.
- Experience scaling infrastructure to support high-throughput data-intensive applications.
- Background with automation and monitoring platforms, MLOps, and configuration management platforms (e.g., Flask, Pod profiles).
- 5 years of relevant experience in roles with responsibility for data platforms and data operations, dealing with large volumes of data in cloud-based distributed computing environments.
- Graduate degree preferred in a quantitative discipline (e.g., computer engineering, computer science, economics, math, operations research).
- Proven ability to solve enterprise-level data operations problems at scale that require cross-functional collaboration for solution development, implementation, and adoption.
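
The Kubernetes hardening items listed above (memory/CPU/GPU limits, node taints, annotations/labels) can be illustrated in a single pod spec. This is a minimal sketch, not part of the posting; all names (model-server, the `workload=ml` taint, the registry URL) are hypothetical:

```yaml
# Sketch of a hardened pod spec: resource limits, a toleration for a node
# taint, and labels/annotations. All names here are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: model-server              # hypothetical workload name
  labels:
    app: model-server             # label used for service selection/monitoring
  annotations:
    prometheus.io/scrape: "true"  # common convention for metrics scraping
spec:
  tolerations:
  - key: "workload"               # matches a taint applied with, e.g.:
    operator: "Equal"             #   kubectl taint nodes <node> workload=ml:NoSchedule
    value: "ml"
    effect: "NoSchedule"
  containers:
  - name: server
    image: registry.example.com/model-server:latest  # hypothetical image
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "2"
        memory: "4Gi"
        nvidia.com/gpu: "1"       # GPU limit via the NVIDIA device plugin
```

Setting both requests and limits keeps pods in a predictable QoS class, and the taint/toleration pair reserves GPU nodes for ML workloads only.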