What are the responsibilities and job description for the Grafana SRE Architect position at VRK IT Vision Inc?
Job Description
Job Summary
The Grafana SRE Architect will lead the design, implementation, and management of scalable, reliable, and performant Grafana-based observability solutions. This role bridges Site Reliability Engineering (SRE) practices with Grafana's ecosystem (Loki, Mimir, Tempo, etc.) to ensure robust monitoring, logging, tracing, and alerting for mission-critical systems. You will collaborate with DevOps, engineering, and infrastructure teams to align technical strategies with business objectives, driving automation, resilience, and cost efficiency across cloud and on-premises environments.
Key Responsibilities
- Architecture & Design
  - Design end-to-end Grafana solutions for metrics, logs, traces, and dashboards, ensuring scalability, security, and compliance.
  - Architect integrations with Prometheus, Loki, Mimir, Tempo, and third-party tools (e.g., AWS CloudWatch, Datadog).
  - Define best practices for Grafana deployment (self-managed vs. Grafana Cloud) and optimize data storage/retention strategies.
- SRE Leadership
  - Implement SRE principles: SLAs/SLOs/SLIs, error budgets, and blameless post-mortems.
  - Build automated monitoring/alerting systems to preemptively identify system bottlenecks and failures.
  - Lead incident response, root cause analysis, and remediation for observability-related outages.
- Collaboration & Integration
  - Partner with DevOps teams to embed Grafana into CI/CD pipelines and automate provisioning via IaC (Terraform, Ansible).
  - Work with developers to instrument applications for observability (OpenTelemetry, custom exporters).
  - Advise stakeholders on cost-effective monitoring strategies and resource optimization.
- Performance Optimization
  - Tune Grafana dashboards, queries, and data sources for high-performance environments.
  - Optimize PromQL/Loki LogQL queries and manage large-scale time-series databases (Mimir).
  - Conduct capacity planning and disaster recovery testing for Grafana ecosystems.
- Governance & Security
  - Ensure compliance with security policies (RBAC, SSO, encryption) and audit requirements.
  - Monitor Grafana stack health, perform upgrades, and enforce version control.
- Mentorship & Innovation
  - Mentor SRE/engineering teams on Grafana best practices and SRE culture.
  - Stay ahead of Grafana/observability trends and pilot new tools (e.g., AI-driven anomaly detection).
Education & Experience
Technical Skills
Certifications (Preferred)
Soft Skills
Preferred Qualifications