
Site Reliability Engineer

Interesting Engineering, Inc.
San Diego, CA Full Time
POSTED ON 1/23/2025
AVAILABLE BEFORE 4/22/2025

Overview

The Site Reliability Engineer is a member of the Cloud Team, supporting software development, operations, and maintenance across complex infrastructure and using automated solutions to improve performance, visibility, stability, availability, and reliability. This role provides Tier 3 support, either directly or by engaging with other stakeholders, for applications and platforms residing in the Cloud. The ideal candidate has hands-on experience with and understanding of the software development lifecycle from inception to implementation. The successful candidate should understand how software is maintained and will be responsible for ensuring its reliability and speed.

Responsibilities

  • Set up and maintain Azure-native monitoring tools like Azure Monitor, Log Analytics, and Application Insights to oversee system performance, resource health, and workload behavior across AKS environments.
  • Build tailored dashboards that provide clear visualizations of key metrics and configure proactive alerting mechanisms to detect anomalies early and trigger appropriate responses.
  • Utilize Azure Sentinel to enhance security incident detection and response for AKS environments, maintaining compliance and minimizing risks.
  • Implement end-to-end observability practices by combining metrics, logs, and traces for comprehensive insights into containerized applications and their underlying infrastructure.
  • Design and maintain automation scripts using Python, PowerShell, or Bash to streamline repetitive tasks, such as automated scaling, backup processes, and system health checks.
  • Develop runbooks and automated workflows that trigger predefined remediation steps for commonly encountered issues, minimizing manual intervention and response time.
  • Create scripts that enable automatic system adjustments and recovery actions when performance thresholds are crossed or errors are detected.
  • Utilize tools such as Terraform or ARM templates to automate and manage the provisioning of cloud resources, ensuring consistency and repeatability.
  • Rapidly diagnose issues: Lead the identification and troubleshooting of issues impacting system performance, leveraging data from monitoring tools and logs for swift resolution.
  • Conduct thorough post-incident analyses to document root causes, identify areas for improvement, and implement preventive measures to reduce recurrence.
  • Keep incident response runbooks up to date with the latest information and best practices to ensure readiness and consistency during unexpected events.
  • Continuously monitor key performance indicators (KPIs) across cloud resources and workloads to spot trends, potential bottlenecks, and opportunities for enhancement.
  • Propose and implement strategies to improve the cost-efficiency and performance of cloud services, such as right-sizing resources or enhancing load-balancing configurations.
  • Work closely with architecture and development teams to provide input on designing robust, scalable, and resilient cloud solutions.
  • Implement best practices for optimizing container performance within AKS clusters, ensuring optimal CPU and memory usage without compromising application availability.
  • Provide feedback and support to development teams to ensure applications are designed with reliability and scalability in mind.
  • Advocate for and help implement best practices in reliability, incident management, and proactive monitoring across teams.
  • Collaborate with security teams to identify and mitigate vulnerabilities in cloud infrastructure, integrating security monitoring and automated compliance checks.
  • Create comprehensive documentation covering monitoring configurations, incident response protocols, and remediation procedures to ensure team alignment and knowledge retention.
  • Contribute to the creation of internal training resources to help team members familiarize themselves with new tools, techniques, and processes.
  • Regularly share insights, lessons learned, and new approaches to improve the team’s response capabilities and the overall reliability of cloud services.
  • Regularly analyze usage data and performance metrics to identify opportunities for cost optimization, such as right-sizing virtual machines, optimizing storage solutions, and scheduling non-critical resources to shut down during off-peak hours.
  • Use Azure Cost Management + Billing to monitor expenses and track actual versus predicted costs.
  • Work with architecture teams to design solutions that maintain performance while minimizing costs, including the use of reserved instances, spot instances, and optimizing data transfer methods.
  • Develop automation scripts that dynamically manage resource allocation based on load, reducing unnecessary expenditure.
  • Proficiency in Service Level Objectives, Service Level Indicators, and error budgeting to balance system reliability with development velocity.
  • Expertise in chaos engineering practices to test and improve system resiliency under controlled conditions.
  • Deep knowledge of monitoring and observability tools, such as Prometheus, Grafana, and Azure Monitor.
  • Strong troubleshooting abilities for distributed systems with proficiency in identifying root causes.
  • Experience implementing incident management frameworks, ensuring smooth communication, documentation, and follow-up for service interruptions.

Qualifications

Bachelor’s Degree in Information Technology or the equivalent combination of training, education, and experience.

Solid hands-on experience in a Site Reliability Engineer, DevOps Engineer, or similar role with a strong focus on Azure cloud services.

Technical Skills

  • Proficiency in scripting languages such as Python, PowerShell, or Bash.
  • Extensive experience with Azure monitoring tools like Azure Monitor, Log Analytics, Application Insights, and Azure Sentinel.
  • Familiarity with AKS and best practices for monitoring containerized applications.
  • Problem-Solving: Proven track record of effective troubleshooting and resolution of cloud infrastructure issues.
  • Automation Expertise: Hands-on experience creating automated solutions using IaC tools like Terraform or ARM templates.
  • Collaboration and Communication: Strong interpersonal skills to work effectively within cross-functional teams.

Desired Qualifications

  • Certifications: Azure certifications such as Microsoft Certified: Azure Administrator Associate or Azure Solutions Architect Expert.
  • Advanced Knowledge: Experience with Kusto Query Language (KQL) for in-depth data analysis and complex queries.
  • Security Acumen: Familiarity with integrating security best practices into monitoring and incident response.
  • Dynatrace experience a plus.
  • Knowledge and understanding of, and experience with, DevOps and Agile methodologies.
  • Experience in Microsoft Azure Technologies.
  • Experience in Tanzu Application/Container Services (TAS/TKS) (previously Pivotal Cloud Foundry) or equivalent container-based platforms and products such as OpenShift, Azure Kubernetes Service, Google Container Services, etc.
  • Experience using ServiceNow ITOM and ITSM to create catalogs or to automate processes by integrating with other systems.
  • Knowledge and understanding of how software is built and managed.

    Hours: Monday - Friday, 8:00 AM - 4:30 PM

    Location: 820 Follin Lane, Vienna, VA 22180 | 5510 Heritage Oaks Drive, Pensacola, FL 32526 | 141 Security Drive, Winchester, VA 22602 | 9999 Willow Creek Road, San Diego, CA 92131 | 295 Bendix Road, Suite 250, Virginia Beach, VA 23452 | 11270 Saint Johns Industrial Parkway South, Jacksonville, FL 32246 | 9001 Airport Freeway, Suite 925, North Richland Hills, TX 76180 | 4 Concourse Parkway, #100, Sandy Springs, GA 30328

    About Us

    Navy Federal provides much more than a job. We provide a meaningful career experience, including a culture that is energized, engaged, and committed; and fierce appreciation for our teams, who are rewarded with highly competitive pay and generous benefits and perks.

  • Best Companies for Latinos to Work for 2024
  • Computerworld Best Places to Work in IT
  • Forbes 2024 America’s Best Large Employers
  • Forbes 2023 The Best Employers for New Grads
  • Fortune Best Workplaces for Millennials 2023
  • Fortune Best Workplaces for Women 2023
  • Fortune 100 Best Companies to Work For 2024
  • Military Times 2023 Best for Vets Employers
  • Newsweek Most Loved Workplaces
  • Ripplematch Campus Forward Award - Excellence in Early Career Hiring
  • Yello and WayUp Top 100 Internship Programs

    From Fortune. © 2024 Fortune Media IP Limited. All rights reserved. Used under license. Fortune and Fortune Media IP Limited are not affiliated with, and do not endorse products or services of Navy Federal Credit Union.

    Equal Employment Opportunity: Navy Federal values, celebrates, and enacts diversity in the workplace. Navy Federal takes affirmative action to employ and advance in employment qualified individuals with disabilities, disabled veterans, Armed Forces service medal veterans, recently separated veterans, and other protected veterans. EOE/AA/M/F/Veteran/Disability

    Hybrid Workplace: Navy Federal Credit Union is a hybrid workplace, and details will be discussed during your interview process.

    Disclaimers: Navy Federal reserves the right to fill this role at a higher or lower grade level based on business need. An assessment may be required to compete for this position. Job postings may close earlier or remain open longer than the anticipated closing date, at the hiring team’s discretion, based on qualified applicant volume. Navy Federal Credit Union assesses market data to establish salary ranges that enable us to remain competitive. You are paid within the salary range, based on your experience, location, and market position.

    Bank Secrecy Act: Remains cognizant of and adheres to Navy Federal policies and procedures, and regulations pertaining to the Bank Secrecy Act.



