Observability Lead
Location: Wayne, NJ
Job Type: [Full-time/Contract]
Visa: USC, GC
Job Summary:
Aptimized is seeking a highly skilled Observability Lead to spearhead our monitoring, logging and analytics initiatives. The ideal candidate will have expertise in Grafana, Vector, Power BI and Fabric Resource, and will use these tools to deliver comprehensive system visibility, performance optimization and data-driven insights. This role involves designing and implementing observability solutions that improve operational efficiency and support proactive incident management.
Key Responsibilities:
- Develop and Implement Observability Strategies: Design and maintain end-to-end observability frameworks leveraging Grafana, Vector, Power BI and Fabric Resource.
- Monitoring & Dashboards: Create and optimize dashboards, alerts and visualizations to provide real-time system performance insights.
- Log Management & Aggregation: Configure and maintain Vector for efficient log collection, transformation and shipping across distributed environments.
- Performance Analytics & Reporting: Utilize Power BI and Fabric Resource to analyze system performance metrics and generate actionable insights for stakeholders.
- Incident Detection & Resolution: Implement automated alerts and anomaly detection mechanisms to ensure proactive issue resolution.
- Collaboration & Stakeholder Engagement: Work with DevOps, SRE and IT teams to define observability best practices and integrate monitoring solutions into CI/CD pipelines.
- Continuous Improvement: Stay current with industry best practices and emerging observability technologies to enhance system monitoring capabilities.
Required Qualifications:
- Proficiency in Grafana: Experience in building real-time dashboards, configuring alerts and integrating with various data sources (e.g., Prometheus, Loki, InfluxDB).
- Vector Expertise: Strong understanding of log collection, processing and routing using Vector in cloud or on-prem environments.
- Power BI & Fabric Resource Knowledge: Ability to transform system telemetry data into meaningful insights using Microsoft’s Power BI and Fabric Resource.
- Scripting & Automation: Hands-on experience with scripting (Python, Bash, or PowerShell) for automating monitoring tasks.
- Cloud & Infrastructure Monitoring: Experience with observability solutions for AWS, Azure or Google Cloud environments.
- Strong Analytical Skills: Ability to interpret performance data, identify trends and recommend optimizations.
- Excellent Communication Skills: Ability to present insights and recommendations to technical and non-technical stakeholders.
Preferred Qualifications:
- Experience with additional monitoring tools such as Prometheus, OpenTelemetry, or Datadog.
- Familiarity with infrastructure as code (Terraform, Ansible) for deploying monitoring configurations.
- Knowledge of distributed systems and microservices architectures.