What are the responsibilities and job description for the Site Reliability Engineer position at Interesting Engineering, Inc.?

Overview

The Site Reliability Engineer is a member of the Cloud Team who supports software development, operations, and maintenance while working with complex infrastructure to improve performance, visibility, stability, availability, and reliability through automated solutions. This role will provide Tier 3 support, either directly or by engaging with other stakeholders, for applications and platforms residing in the Cloud. The ideal candidate has hands-on experience with, and an understanding of, the software development lifecycle from inception to implementation. The successful candidate should have knowledge and understanding of maintaining software and will be responsible for ensuring its reliability and speed.

Responsibilities

Set up and maintain Azure-native monitoring tools like Azure Monitor, Log Analytics, and Application Insights to oversee system performance, resource health, and workload behavior across AKS environments.
Build tailored dashboards that provide clear visualizations of key metrics and configure proactive alerting mechanisms to detect anomalies early and trigger appropriate responses.
Utilize Azure Sentinel to enhance security incident detection and response for AKS environments, maintaining compliance and minimizing risks.
Implement end-to-end observability practices by combining metrics, logs, and traces for comprehensive insights into containerized applications and their underlying infrastructure.
Design and maintain automation scripts using Python, PowerShell, or Bash to streamline repetitive tasks, such as automated scaling, backup processes, and system health checks.
Develop runbooks and automated workflows that trigger predefined remediation steps for commonly encountered issues, minimizing manual intervention and response time.
Create scripts that enable automatic system adjustments and recovery actions when performance thresholds are crossed or errors are detected.
Utilize tools such as Terraform or ARM templates to automate and manage the provisioning of cloud resources, ensuring consistency and repeatability.
Rapidly diagnose issues: Lead the identification and troubleshooting of issues impacting system performance, leveraging data from monitoring tools and logs for swift resolution.
Conduct thorough post-incident analyses to document root causes, identify areas for improvement, and implement preventive measures to reduce recurrence.
Keep incident response runbooks up to date with the latest information and best practices to ensure readiness and consistency during unexpected events.
Continuously monitor key performance indicators (KPIs) across cloud resources and workloads to spot trends, potential bottlenecks, and opportunities for enhancement.
Propose and implement strategies to improve the cost-efficiency and performance of cloud services, such as right-sizing resources or enhancing load-balancing configurations.
Work closely with architecture and development teams to provide input on designing robust, scalable, and resilient cloud solutions.
Implement best practices for optimizing container performance within AKS clusters, ensuring optimal CPU and memory usage without compromising application availability.
Provide feedback and support to development teams to ensure applications are designed with reliability and scalability in mind.
Advocate for and help implement best practices in reliability, incident management, and proactive monitoring across teams.
Collaborate with security teams to identify and mitigate vulnerabilities in cloud infrastructure, integrating security monitoring and automated compliance checks.
Create comprehensive documentation covering monitoring configurations, incident response protocols, and remediation procedures to ensure team alignment and knowledge retention.
Contribute to the creation of internal training resources to help team members familiarize themselves with new tools, techniques, and processes.
Regularly share insights, lessons learned, and new approaches to improve the team’s response capabilities and the overall reliability of cloud services.
Regularly analyze usage data and performance metrics to identify opportunities for cost optimization, such as right-sizing virtual machines, optimizing storage solutions, and scheduling non-critical resources to shut down during off-peak hours.
Use Azure Cost Management + Billing to monitor expenses and track actual versus predicted costs.
Work with architecture teams to design solutions that maintain performance while minimizing costs, including the use of reserved instances, spot instances, and optimizing data transfer methods.
Develop automation scripts that dynamically manage resource allocation based on load, reducing unnecessary expenditure.
Proficiency in Service Level Objectives, Service Level Indicators, and error budgeting to balance system reliability with development velocity.
Expertise in chaos engineering practices to test and improve system resiliency under controlled conditions.
Deep knowledge of monitoring and observability tools, such as Prometheus, Grafana, and Azure Monitor.
Strong troubleshooting abilities for distributed systems with proficiency in identifying root causes.
Experience implementing incident management frameworks, ensuring smooth communication, documentation, and follow-up for service interruptions.

Qualifications

Bachelor’s Degree in Information Technology or the equivalent combination of training, education, and experience.

Solid hands-on experience in a Site Reliability Engineer, DevOps Engineer, or similar role with a strong focus on Azure cloud services.

Technical Skills

Proficiency in scripting languages such as Python, PowerShell, or Bash.

Extensive experience with Azure monitoring tools like Azure Monitor, Log Analytics, Application Insights, and Azure Sentinel.

Familiarity with AKS and best practices for monitoring containerized applications.

Problem-Solving: Proven track record of effective troubleshooting and resolution of cloud infrastructure issues.

Automation Expertise: Hands-on experience creating automated solutions using IaC tools like Terraform or ARM templates.

Collaboration and Communication: Strong interpersonal skills to work effectively within cross-functional teams.

Desired Qualifications

Certifications: Azure certifications such as Microsoft Certified: Azure Administrator Associate or Azure Solutions Architect Expert.

Advanced Knowledge: Experience with Kusto Query Language (KQL) for in-depth data analysis and complex queries.

Security Acumen: Familiarity with integrating security best practices into monitoring and incident response.

Dynatrace experience a plus.

Knowledge and understanding of, and experience with, DevOps and Agile methodologies.

Experience in Microsoft Azure Technologies.

Experience in Tanzu Application / Container Services (TAS / TKS) (previously Pivotal Cloud Foundry) or equivalent container-based platforms / products like OpenShift, Azure Kubernetes Service, Google Container Services, etc.

Experience using ServiceNow ITOM and ITSM to create catalogs or to automate processes by integrating with other systems.

Knowledge and understanding of how software is built and managed.

Hours: Monday - Friday, 8:00 AM - 4:30 PM

Location: 820 Follin Lane, Vienna, VA 22180 | 5510 Heritage Oaks Drive, Pensacola, FL 32526 | 141 Security Drive, Winchester, VA 22602 | 9999 Willow Creek Road, San Diego, CA 92131 | 295 Bendix Road, Suite 250, Virginia Beach, VA 23452 | 11270 Saint Johns Industrial Parkway South, Jacksonville, FL 32246 | 9001 Airport Freeway, Suite 925, North Richland Hills, TX 76180 | 4 Concourse Parkway, #100, Sandy Springs, GA 30328

About Us

Navy Federal provides much more than a job. We provide a meaningful career experience, including a culture that is energized, engaged, and committed; and fierce appreciation for our teams, who are rewarded with highly competitive pay and generous benefits and perks.

Best Companies for Latinos to Work for 2024

Computerworld Best Places to Work in IT

Forbes 2024 America’s Best Large Employers

Forbes 2023 The Best Employers for New Grads

Fortune Best Workplaces for Millennials 2023

Fortune Best Workplaces for Women 2023

Fortune 100 Best Companies to Work For 2024

Military Times 2023 Best for Vets Employers

Newsweek Most Loved Workplaces

Ripplematch Campus Forward Award - Excellence in Early Career Hiring

Yello and WayUp Top 100 Internship Programs

From Fortune. © 2024 Fortune Media IP Limited. All rights reserved. Used under license. Fortune and Fortune Media IP Limited are not affiliated with, and do not endorse products or services of, Navy Federal Credit Union.

Equal Employment Opportunity: Navy Federal values, celebrates, and enacts diversity in the workplace. Navy Federal takes affirmative action to employ and advance in employment qualified individuals with disabilities, disabled veterans, Armed Forces service medal veterans, recently separated veterans, and other protected veterans. EOE / AA / M / F / Veteran / Disability

Hybrid Workplace: Navy Federal Credit Union is a hybrid workplace, and details will be discussed during your interview process.

Disclaimers: Navy Federal reserves the right to fill this role at a higher / lower grade level based on business need. An assessment may be required to compete for this position. Job postings are subject to close early or extend out longer than the anticipated closing date at the hiring team’s discretion based on qualified applicant volume. Navy Federal Credit Union assesses market data to establish salary ranges that enable us to remain competitive. You are paid within the salary range, based on your experience, location, and market position.

Bank Secrecy Act: Remains cognizant of and adheres to Navy Federal policies and procedures, and regulations pertaining to the Bank Secrecy Act.
