What are the responsibilities and job description for the Site Reliability Engineer position at The Judge Group?
Job Title: Site Reliability Engineer
Duration: Direct hire
Location: Hybrid role; must be able to commit to 3 days per week in our Bloomington office
What you’ll be doing:
- Collaborate with development and operations teams to design, implement, and maintain observability frameworks that provide deep insights into system performance, particularly for data and ML pipelines.
- Lead the establishment of Service Level Objectives (SLOs) and Service Level Indicators (SLIs), ensuring they align with business goals and drive continuous performance improvements.
- Partner with stakeholders to understand system performance requirements and translate them into actionable performance engineering strategies.
- Proactively identify performance bottlenecks and collaborate with teams to implement solutions that enhance system scalability and reliability.
- Design and execute performance regression test suites, focusing on data-intensive and ML workloads, to ensure continuous performance optimization.
- Own the reliability and performance metrics of our systems, driving a culture of performance excellence and proactive issue resolution.
- Collaborate with subject matter experts to gain a deep understanding of domain-specific performance challenges, particularly in data and ML pipelines.
- Use tools such as Datadog, Jira, and GitHub to monitor system performance, manage projects, and track issues, with a strong emphasis on performance-related metrics.
- Define and monitor success metrics, ensuring our systems consistently meet or exceed performance and reliability targets.
- Actively contribute to the continuous improvement of performance engineering practices across the team, fostering a culture of excellence in observability and system performance.
- Perform other duties as assigned.
What you’ll bring to us: