Job Posting for Site Reliability Engineer (m/f/d) at Ververica | Original creators of Apache Flink®
About Ververica
Ververica, founded by the original creators of Apache Flink®, empowers businesses to unlock the full potential of real-time data processing and analytics. Our platform provides cutting-edge stream processing and event-driven applications, enabling companies worldwide to build scalable and reliable data-driven solutions.
Role Overview
As a Site Reliability Engineer (SRE) at Ververica, you will design, provision, and maintain the infrastructure for Ververica's Unified Streaming Data Platform across multiple cloud providers, including AWS, GCP, and Azure. You will collaborate with software engineering teams to develop solutions that enhance feature delivery, optimize performance, and address security vulnerabilities. Your role will involve architectural improvements, implementation ownership, and driving reliability best practices.
Key Responsibilities
Build and maintain the infrastructure for Ververica's Unified Streaming Data Platform across AWS, GCP, and Azure
Design and manage Infrastructure as Code (IaC) using Terraform, ensuring modularity, reusability, and adherence to best practices
Implement and enhance observability tooling (Grafana, Prometheus, logging systems) covering metrics, logs, traces, dashboards, and alerts
Ensure system reliability through SRE best practices, including defining SLIs, SLOs, and error budgets
Improve infrastructure architecture and engineering efficiency through continuous evaluation and optimization
Enhance CI/CD pipelines to automate development workflows
Monitor for, identify, and resolve security vulnerabilities, including applying CVE updates and security enhancements
Contribute to the successful development and launch of new products, features, and services
Participate periodically in on-call rotations to manage incidents on live infrastructure that runs 24/7
Maintain and update documentation, including architectural designs and changes
Requirements
Bachelor's degree in Computer Science, Information Technology, or a related field
Minimum 2 years of hands-on experience with Kubernetes clusters, Helm charts, controllers, and operators
Proficiency in designing and maintaining Terraform code that follows best practices
Strong knowledge of observability tools and practices, including metrics, logging, and alerting systems
Experience implementing SRE principles such as SLIs, SLOs, and error budgets
Solid understanding of Linux systems and networking in cloud environments