Site Reliability Engineer (SRE) (The Uptime Guardian)

Credible
Austin, TX Full Time
POSTED ON 12/27/2024 CLOSED ON 1/25/2025

What are the responsibilities and job description for the Site Reliability Engineer (SRE) (The Uptime Guardian) position at Credible?

Introduction

Are you a systems expert who thrives on maintaining high availability, scalability, and performance in complex, distributed environments? Do you enjoy solving infrastructure challenges and automating everything in sight? If you're passionate about building resilient systems and ensuring 24/7 uptime, then our client has the perfect role for you. We’re looking for a Site Reliability Engineer (SRE) (aka The Uptime Guardian) to drive system reliability, automate operations, and ensure our services stay available even under pressure.

As a Site Reliability Engineer with our client, you’ll focus on building and maintaining highly reliable, scalable infrastructure that supports our products and services. You’ll be responsible for ensuring that our systems are optimized, automated, and robust enough to handle the demands of modern applications. This role blends software engineering, operations, and problem-solving, making it perfect for someone who enjoys working across multiple areas of the tech stack.

Key Responsibilities

  • System Monitoring and Incident Management: Set up and manage monitoring, logging, and alerting systems using tools like Prometheus, Grafana, or ELK Stack. You’ll proactively identify and resolve issues before they impact users and be responsible for managing incidents when they arise.
  • Automation and Infrastructure as Code (IaC): Automate everything! From infrastructure provisioning to deployments and scaling, you’ll use tools like Terraform, Ansible, or Puppet to manage infrastructure as code. You’ll ensure that systems are built to scale and adapt automatically to load.
  • High Availability and Performance Optimization: Ensure services and applications are always available and optimized for performance. You’ll design and implement strategies to improve uptime, reduce latency, and scale services efficiently, using techniques such as load balancing, failover systems, and clustering.
  • Disaster Recovery and Backup Solutions: Design, implement, and test disaster recovery strategies and backup solutions. You’ll ensure that systems and data are recoverable in the event of an outage or failure, minimizing downtime and impact on users.
  • Collaboration with Development and DevOps Teams: Work closely with developers and DevOps engineers to ensure that new features are reliable and scalable. You’ll collaborate to implement reliability engineering practices such as service level indicators (SLIs) and service level objectives (SLOs) and enforce best practices for system reliability.
  • On-Call Responsibilities and Incident Response: Participate in on-call rotations to respond to incidents, troubleshoot problems, and bring systems back to normal operation. You’ll ensure smooth communication during outages and post-mortems to improve future reliability.
  • Capacity Planning and Scalability: Perform capacity planning to ensure systems can handle traffic increases and growth. You’ll predict future demand and ensure that infrastructure scales smoothly to accommodate it.

Required Skills

  • System Reliability and Automation Expertise: Experience with building and maintaining highly reliable systems and automating infrastructure management using tools like Terraform, Ansible, or Puppet. You’re skilled at optimizing systems for uptime and performance.
  • Monitoring and Incident Management: Proficiency in setting up and managing monitoring, logging, and alerting systems like Prometheus, Grafana, or ELK Stack. You have experience with incident management and problem resolution.
  • Cloud Infrastructure Management: Hands-on experience managing cloud infrastructure on platforms such as AWS, GCP, or Azure. You’re skilled at deploying and maintaining scalable systems in the cloud.
  • Performance Optimization: Expertise in optimizing systems for low latency, high throughput, and minimal downtime. You understand load balancing, caching strategies, and database performance optimization.
  • Security and Compliance: Understanding of security best practices, encryption, and compliance frameworks such as SOC2 or GDPR. You ensure that systems are secure while maintaining reliability.

Educational Requirements

  • Bachelor’s degree in Computer Science, Systems Engineering, or a related field. Equivalent experience in site reliability engineering, systems administration, or DevOps is also valued.
  • Certifications such as AWS Certified Solutions Architect, Kubernetes Administrator, or SRE Practitioner are a plus.

Experience Requirements

  • 3 years of experience in site reliability engineering or a similar role, with a focus on system automation, performance optimization, and cloud infrastructure management.
  • Proven experience managing large-scale, distributed systems with a focus on maintaining uptime, monitoring, and incident resolution.
  • Hands-on experience with containerization (Docker) and orchestration (Kubernetes) in a production environment.

Benefits

  • Health and Wellness: Comprehensive medical, dental, and vision insurance plans with low co-pays and premiums.
  • Paid Time Off: Competitive vacation, sick leave, and 20 paid holidays per year.
  • Work-Life Balance: Flexible work schedules and telecommuting options.
  • Professional Development: Opportunities for training, certification reimbursement, and career advancement programs.
  • Wellness Programs: Access to wellness programs, including gym memberships, health screenings, and mental health resources.
  • Life and Disability Insurance: Life insurance and short-term/long-term disability coverage.
  • Employee Assistance Program (EAP): Confidential counseling and support services for personal and professional challenges.
  • Tuition Reimbursement: Financial assistance for continuing education and professional development.
  • Community Engagement: Opportunities to participate in community service and volunteer activities.
  • Recognition Programs: Employee recognition programs to celebrate achievements and milestones.
This job has expired.