AI/MLOps Architect - AI Platform Architect

Tredence Inc.
Dallas, TX Full Time
POSTED ON 4/15/2025
AVAILABLE BEFORE 5/14/2025

Dear Candidate,


We have an opening for an AI/MLOps Architect in Dallas, TX, USA (onsite).


Job Type: Full-time with Tredence Inc.

Location: Dallas, TX (onsite)


This position requires a candidate who can bridge the gap between theoretical knowledge and practical implementation, with a demonstrated ability to solve complex observability challenges that cross infrastructure, data engineering, and AI domains. The successful candidate will have encountered and overcome the nuanced challenges of monitoring AI systems at scale in production environments.


AI Platform Architect


  • We are seeking an exceptionally skilled AI Platform Architect to design and implement an enterprise-grade monitoring solution on Azure and Kubernetes that provides comprehensive visibility across our diverse AI portfolio. The ideal candidate will bring extensive hands-on experience architecting distributed systems that handle complex observability challenges unique to modern AI workloads.


Key Responsibilities:


Architecture & System Design:
  • Architect a multi-tenant observability platform leveraging Azure managed services (AKS, Event Hubs, Azure Monitor) with custom components for AI-specific telemetry.
  • Design scalable data ingestion pipelines capable of handling high-throughput telemetry from distributed AI systems.
  • Implement sampling strategies and aggregation techniques to manage observability data volume while preserving statistical significance.
  • Create resilient integration patterns between the platform and Arize AI, ensuring graceful degradation during outages.
  • Develop schema evolution strategies to accommodate changing metrics requirements across AI workloads.

Technical Implementation:
  • Design and implement custom instrumentation libraries for capturing domain-specific metrics across different AI paradigms (accuracy drift in CV models, token usage in GenAI, inference latency at edge devices).
  • Architect patterns that separate cold-path analytics from hot-path alerting, with an appropriate data storage strategy for each.
  • Develop advanced correlation mechanisms to link model performance metrics with infrastructure telemetry.
  • Create visualization layers that expose actionable insights rather than raw metrics.
  • Implement anomaly detection systems that understand AI-specific failure modes.


Domain-Specific Expertise:


  • Design monitoring solutions for edge AI deployments addressing intermittent connectivity, battery usage, and on-device performance degradation.
  • Create specialized observability patterns for generative AI systems including prompt tracking, token economics, and hallucination detection.
  • Implement embedding drift detection for NLP models and visual quality degradation tracking for computer vision systems.
  • Design monitoring systems for reinforcement learning feedback loops and online learning environments.
  • Develop systems to track model version lineage and A/B experiment outcomes.


Integration & Operations:


  • Implement advanced authentication and authorization patterns between observability components
  • Design network architecture that enables secure telemetry collection from air-gapped environments
  • Create backup and disaster recovery strategies specific to high-volume observability data
  • Develop custom Kubernetes operators to automate observability infrastructure management
  • Design and implement advanced alerting systems with noise reduction techniques and contextual notifications


Required Qualifications:


  • 10 years of software architecture experience with at least 3 years focused on AI platforms
  • Deep expertise with Azure services including AKS, Container Apps, Event Hubs, Azure Monitor, Application Insights, and Azure Log Analytics
  • Hands-on experience implementing observability for at least two distinct AI domains (CV, NLP, GenAI, etc.)
  • Demonstrated experience with high-scale telemetry ingestion (500 events/second) and retention strategies.
  • Practical experience integrating and extending third-party observability tools like Arize AI, Weights & Biases, or similar platforms
  • Expertise in Kubernetes networking, custom resources, and operators relevant to observability
  • Strong programming proficiency in at least two languages commonly used in observability (Python, Go, Java)
  • Experience implementing distributed tracing solutions spanning multiple services and protocols
  • Demonstrated success designing intuitive dashboards that provide actionable insights from complex data.


Preferred Qualifications:


  • Experience implementing observability for models deployed across public cloud and edge devices simultaneously
  • Hands-on work with ML feature stores and feature monitoring in production
  • Experience developing custom Prometheus exporters or OpenTelemetry plugins
  • Implementation of explainability tracking for AI models in production
  • Experience with model governance and regulatory compliance monitoring
  • Knowledge of dimensionality reduction techniques applied to observability data visualization
  • Background designing systems that handle PII/sensitive data within observability platforms
  • Practical experience with cost optimization for observability at scale (100 TB of telemetry data)


Technical Proficiencies:


  • Kubernetes Ecosystem: Helm, Istio, Prometheus, Grafana, Jaeger, custom operators.
  • Azure Platform: RBAC, Private Link, Managed Identities, Key Vault integration, AKS networking.
  • Data Processing: Real-time stream processing, time-series databases, dimensionality reduction.
  • API Design: RESTful API design, gRPC, GraphQL, API versioning strategies.
  • AI Systems: Inference optimization, model drift detection, feature importance tracking.
  • Security: Zero-trust architecture, secure telemetry collection, audit logging.
