What are the responsibilities and job description for the AI/MLOps Architect - AI Platform Architect position at Tredence Inc.?
Dear Candidate,
We have an opening for an AI/MLOps Architect in Dallas, TX, USA (onsite).
Job Type: Full-time with Tredence Inc.
Location: Dallas, TX (onsite)
This position requires a candidate who can bridge the gap between theoretical knowledge and practical implementation, with a demonstrated ability to solve complex observability challenges that cross infrastructure, data engineering, and AI domains. The successful candidate will have encountered and overcome the nuanced challenges of monitoring AI systems at scale in production environments.
AI Platform Architect
We are seeking an exceptionally skilled AI Platform Architect to design and implement an enterprise-grade monitoring solution on Azure and Kubernetes that provides comprehensive visibility across our diverse AI portfolio. The ideal candidate will bring extensive hands-on experience architecting distributed systems that handle complex observability challenges unique to modern AI workloads.
Key Responsibilities:
Architecture & System Design:
- Architect a multi-tenant observability platform leveraging Azure managed services (AKS, Event Hubs, Azure Monitor) with custom components for AI-specific telemetry.
- Design scalable data ingestion pipelines capable of handling high-throughput telemetry from distributed AI systems.
- Implement sampling strategies and aggregation techniques to manage observability data volume while preserving statistical significance.
- Create resilient integration patterns between the platform and Arize AI, ensuring graceful degradation during outages.
- Develop schema evolution strategies to accommodate changing metrics requirements across AI workloads.
Technical Implementation:
- Design and implement custom instrumentation libraries for capturing domain-specific metrics across different AI paradigms (accuracy drift in CV models, token usage in GenAI, inference latency at edge devices).
- Architect patterns for cold-path analytics versus hot-path alerting, with appropriate data storage strategies for each.
- Develop advanced correlation mechanisms to link model performance metrics with infrastructure telemetry.
- Create visualization layers that expose actionable insights rather than raw metrics.
- Implement anomaly detection systems that understand AI-specific failure modes.
Domain-Specific Expertise:
- Design monitoring solutions for edge AI deployments addressing intermittent connectivity, battery usage, and on-device performance degradation.
- Create specialized observability patterns for generative AI systems including prompt tracking, token economics, and hallucination detection.
- Implement embedding drift detection for NLP models and visual quality degradation tracking for computer vision systems.
- Design monitoring systems for reinforcement learning feedback loops and online learning environments.
- Develop systems to track model version lineage and A/B experiment outcomes.
Integration & Operations:
- Implement advanced authentication and authorization patterns between observability components.
- Design network architecture that enables secure telemetry collection from air-gapped environments.
- Create backup and disaster recovery strategies specific to high-volume observability data.
- Develop custom Kubernetes operators to automate observability infrastructure management.
- Design and implement advanced alerting systems with noise reduction techniques and contextual notifications.
Required Qualifications:
- 10 years of software architecture experience with at least 3 years focused on AI platforms
- Deep expertise with Azure services including AKS, Container Apps, Event Hubs, Azure Monitor, Application Insights, and Azure Log Analytics
- Hands-on experience implementing observability for at least two distinct AI domains (CV, NLP, GenAI, etc.)
- Demonstrated experience with high-scale telemetry ingestion (500 events/second) and retention strategies.
- Practical experience integrating and extending third-party observability tools like Arize AI, Weights & Biases, or similar platforms
- Expertise in Kubernetes networking, custom resources, and operators relevant to observability
- Strong programming proficiency in at least two languages commonly used in observability (Python, Go, Java)
- Experience implementing distributed tracing solutions spanning multiple services and protocols
- Demonstrated success designing intuitive dashboards that provide actionable insights from complex data.
Preferred Qualifications:
- Experience implementing observability for models deployed across public cloud and edge devices simultaneously
- Hands-on work with ML feature stores and feature monitoring in production
- Experience developing custom Prometheus exporters or OpenTelemetry plugins
- Implementation of explainability tracking for AI models in production
- Experience with model governance and regulatory compliance monitoring
- Knowledge of dimensionality reduction techniques applied to observability data visualization
- Background designing systems that handle PII/sensitive data within observability platforms
- Practical experience with cost optimization for observability at scale (100 TB of telemetry data)
Technical Proficiencies:
- Kubernetes Ecosystem: Helm, Istio, Prometheus, Grafana, Jaeger, custom operators.
- Azure Platform: RBAC, Private Link, Managed Identities, Key Vault integration, AKS networking.
- Data Processing: Real-time stream processing, time-series databases, dimensionality reduction.
- API Design: RESTful API design, gRPC, GraphQL, API versioning strategies.
- AI Systems: Inference optimization, model drift detection, feature importance tracking.
- Security: Zero-trust architecture, secure telemetry collection, audit logging.