What are the responsibilities and job description for the AI/ML expertise position at CAYS Inc?
Job Details
Hello People,
AI/ML expertise
Location : Charlotte, NC
Duration : 12 Months
Job Details:
Join a cutting-edge Operational Intelligence team revolutionizing incident management through AI-driven observability solutions. As an AI/ML Engineer, you'll develop systems that optimize monitoring, anomaly detection, and root cause analysis, empowering teams to predict and prevent production issues. This role offers the chance to innovate at scale, working with state-of-the-art technology and contributing to mission-critical systems.
Key Responsibilities:
AI-Powered Observability:
Design and implement AI-driven tools to enhance real-time monitoring, reduce false positives, and provide actionable insights.
Integrate machine learning workflows with telemetry data sources (e.g., Splunk, Dynatrace, data lakes).
Automation and Innovation:
Automate root cause analysis (RCA) workflows using AI to reduce Mean Time to Identify (MTTI) and Mean Time to Restore (MTTR).
Collaborate with cross-functional teams to identify AI use cases in production and translate them into scalable solutions.
Model Development and Deployment:
Build and fine-tune machine learning models / LLMs (e.g., anomaly detection, platform restoral, predictive maintenance) tailored to observability and incident management.
Deploy models in production environments using tools like NVIDIA Triton, TensorFlow Serving, or Kubernetes.
Data Engineering and Optimization:
Build efficient pipelines for ingesting and preprocessing data (e.g., logs, metrics, traces) from observability platforms.
Leverage vector databases for hybrid search, retrieval-augmented generation (RAG), and agentic workflows.
Experience with SLURM, LangChain, LlamaIndex, or similar tools.
Knowledge of NVIDIA Triton or TensorRT for inference optimization.
Experience with vector databases and Retrieval-Augmented Generation (RAG).
Experience with AI agents and agentic workflows.
Technical Skills:
Full-stack development experience.
Proficiency in Python, TensorFlow, PyTorch, or similar ML frameworks.
Proficiency in LangChain, LlamaIndex, or similar AI frameworks.
Experience with observability tools like Splunk, Dynatrace, Prometheus, or similar.
Familiarity with vector databases and integration into AI workflows.
AI/ML Expertise:
Proven ability to develop and deploy ML models in production environments.
Hands-on experience with anomaly detection, predictive analytics, and NLP-based tools.
DevOps:
Familiarity with CI/CD pipelines and modern DevOps practices.
SRE:
Familiarity with Site Reliability Engineering (SRE) and modern SRE practices.
Collaboration:
Strong problem-solving skills and the ability to work in cross-functional teams, communicating technical concepts to non-technical stakeholders.
Vijay Bhaskar.
Lead Delivery Manager.