Site Reliability Engineer (The Reliability Guardian)

Credible
Austin, TX Full Time
POSTED ON 12/27/2024
AVAILABLE BEFORE 1/24/2025
Are you passionate about building and maintaining resilient systems that ensure high availability and performance? Do you excel at automating processes, troubleshooting complex issues, and creating systems that scale smoothly? If you’re ready to take on the challenge of ensuring reliable, efficient, and secure system operations, our client has the perfect role for you. We’re looking for a Site Reliability Engineer (aka The Reliability Guardian) to enhance system reliability, implement automation, and support a seamless user experience.

As a Site Reliability Engineer with our client, you’ll collaborate with developers, DevOps engineers, and IT specialists to build infrastructure that is both resilient and scalable. Your expertise in monitoring, automation, and performance optimization will be crucial for maintaining system uptime and supporting continuous improvement. You’ll play a vital role in making sure that services are reliable, efficient, and prepared to handle future demand.

Key Responsibilities

  • Ensure System Reliability and Performance: Design and implement strategies to enhance system reliability and performance, focusing on scalability and redundancy. You’ll ensure high availability across distributed systems and proactively address potential issues.
  • Automate Processes and Improve Efficiency: Develop automation scripts and tools to reduce manual interventions and improve deployment, monitoring, and maintenance processes. You’ll leverage tools like Ansible, Puppet, or custom scripts to enhance automation.
  • Monitor and Respond to System Health: Implement and manage monitoring solutions such as Prometheus, Grafana, or Datadog to track system health. You’ll set up alerts, dashboards, and automated responses to maintain optimal performance and detect potential failures early.
  • Incident Management and Troubleshooting: Lead incident response efforts to quickly address and resolve service disruptions. You’ll document incidents and contribute to post-mortem analysis to prevent future occurrences and refine operational procedures.
  • Collaborate on System Architecture and Scalability: Work with engineering and development teams to design and scale infrastructure. You’ll contribute to decisions on architectural improvements and provide input on capacity planning and load testing.
  • Implement Security and Compliance Standards: Integrate security practices into the reliability workflow, ensuring that all automated processes, monitoring solutions, and operational systems meet compliance and security standards.
  • Develop and Maintain CI/CD Pipelines: Support and improve continuous integration and deployment pipelines to facilitate smooth code releases. You’ll ensure that pipelines are optimized for speed, reliability, and scalability.

Required Skills

  • Reliability and Performance Expertise: Strong experience in ensuring system reliability and performance in complex, distributed environments. You understand how to design systems that recover gracefully from failures.
  • Automation and Scripting: Proficiency in automating tasks using scripting languages such as Python, Bash, or PowerShell. You have experience with automation tools like Ansible, Chef, or Puppet.
  • Monitoring and Incident Management: Familiarity with monitoring tools such as Prometheus, Grafana, ELK Stack, or Datadog. You’re skilled at setting up monitoring dashboards, alerts, and automated incident responses.
  • CI/CD Pipeline Knowledge: Experience in maintaining and optimizing CI/CD pipelines using tools like Jenkins, GitLab CI/CD, or CircleCI. You can integrate reliability practices into the deployment process.
  • Security and Compliance Awareness: Knowledge of integrating security standards and practices into site reliability processes, ensuring that compliance is maintained throughout operational workflows.

Educational Requirements

  • Bachelor’s or Master’s degree in Computer Science, IT, or a related field. Equivalent experience in reliability engineering or systems engineering may be considered.
  • Certifications related to cloud platforms or DevOps (e.g., AWS Certified DevOps Engineer, Google Professional Cloud DevOps Engineer) are a plus.

Experience Requirements

  • 5 years of experience in site reliability engineering, DevOps, or a similar field, with a strong background in system monitoring and automation.
  • Hands-on experience in building and managing high-availability and distributed systems.
  • Familiarity with cloud platforms (AWS, GCP, Azure) and container orchestration tools such as Kubernetes is highly desirable.

Benefits

  • Health and Wellness: Comprehensive medical, dental, and vision insurance plans with low co-pays and premiums.
  • Paid Time Off: Competitive vacation, sick leave, and 20 paid holidays per year.
  • Work-Life Balance: Flexible work schedules and telecommuting options.
  • Professional Development: Opportunities for training, certification reimbursement, and career advancement programs.
  • Wellness Programs: Access to wellness programs, including gym memberships, health screenings, and mental health resources.
  • Life and Disability Insurance: Life insurance and short-term/long-term disability coverage.
  • Employee Assistance Program (EAP): Confidential counseling and support services for personal and professional challenges.
  • Tuition Reimbursement: Financial assistance for continuing education and professional development.
  • Community Engagement: Opportunities to participate in community service and volunteer activities.
  • Recognition Programs: Employee recognition programs to celebrate achievements and milestones.



