Director, Site Reliability Engineering

Berkley
Manassas, VA Other
POSTED ON 7/11/2024 CLOSED ON 1/22/2025

What are the responsibilities and job description for the Director, Site Reliability Engineering position at Berkley?

Company Details

Company URL: https://www.berkleytechnologyservices.com

Berkley Technology Services (BTS) is the dynamic technology solution for W. R. Berkley Corporation, a Fortune 500 Commercial Lines Insurance Company. With key locations in Urbandale, IA and Wilmington, DE, BTS provides innovative and customer-focused IT solutions to the majority of WRBC’s 60 operating units across the globe. BTS’s wide reach ensures that ideas and opinions are considered at every level of the organization to guarantee we find the best solutions possible.

 

Driven by a commitment to collaboration, BTS acts as a consultant to our customers and Operating Units by providing comprehensive solutions that not only address the challenge at hand, but proactively plan for the “What’s Next” in our industry and beyond.

With a culture centered on innovation and entrepreneurial spirit, BTS stands as a community of technology leaders with eyes toward the future -- leaders who truly care about growing not only their team members but also themselves, and who take pride in employees who shine. BTS offers many ways to get involved and the chance to grow your career into a wide range of roles you never knew existed. Come join us as we push forward into the future of industry-leading technological solutions.

Berkley Technology Services: Right Team, Right Technology, Simple and Secure.

Responsibilities

The Sr Director, Site Reliability Engineering (SRE) is responsible for developing and implementing a comprehensive strategy for site reliability, encompassing scalability, performance, and reliability improvements. The role will align SRE objectives with overall business goals and technology roadmaps. It will foster a spirit of continuous improvement within the SRE function and position it to support organizational objectives across the Berkley Corporation.

The person in this role is responsible for overseeing SRE team operations, ensuring the reliability and availability of key applications and supporting infrastructure. This role will work effectively with Service Management to enforce best practices for system reliability, monitoring, capacity planning, incident response, problem management, disaster recovery, change management, and workflow automation. They will also own and administer the tools and technologies necessary to generate a complete view of SRE metrics and improvement areas, including (but not limited to) monitoring, logging, notification, dashboarding, and AIOps.

Team Performance Management:
  • Establish and build a robust SRE team over time and integrate SRE into Berkley’s product development and operational processes.
  • Recruit, mentor, and develop a high-performing team of SRE professionals.
  • Monitor ongoing staff performance; identify and communicate opportunities for improvement.
  • Provide leadership and support to ensure projects are staffed appropriately and timelines are met.

Collaboration and Relationship Building:
  • Collaborate with the BTS IT Leadership Teams and other groups across the IT organization to drive a unified approach to site reliability that reduces downtime and minimizes outage business impact.
  • Foster strong relationships with delivery organization leadership to align SRE efforts with organizational goals. Work collaboratively with other business and IT leaders to ensure cross-functional problems are addressed cohesively across the organization.
  • Work cross-functionally in partnership with software development teams to guide product development in creating resilient and durable software systems.
  • Collaborate with EA to institute design patterns for resilient systems and mechanisms for scoring applications against industry-recognized configurations (including active-active, active-passive, recover-from-scratch, and data replication scenarios).

Execution, Project, and Work Management:
  • Define and track reliability and observability OKRs for infrastructure and key systems.
  • Implement robust monitoring and alerting systems to proactively identify potential issues, analyze system performance, and facilitate quick response to incidents.
  • Implement AIOps functionality to enable auto-response, self-healing, and anomaly trend analysis.
  • Drive the development and implementation of automation solutions to remove “toil”, streamline processes, reduce manual interventions, and enhance the overall efficiency of the product engineering and SRE teams.
  • Work closely with product, development, infrastructure, and architecture teams to conduct capacity planning, ensuring that systems can handle current and future demand. Anticipate growth and scalability requirements.
  • Establish and oversee effective high-severity incident response processes, ensure timely incident resolution, and conduct post-mortems to identify root causes and implement preventive measures.
  • Improve reliability by identifying and addressing gaps in our architecture, services, and tooling.
  • Oversee the disaster recovery program for both on-premise and cloud-based Berkley solutions.
  • Perform other duties as assigned.

Qualifications

  • A passion for technology and innovation in the end-user computing space.
  • 8 years of experience in building/leading strong and flexible teams, managing large scale systems consumed by tens/hundreds of thousands of users.
  • 8 years of experience in Site Reliability Engineering and DevOps.
  • 4 years of experience in Disaster Recovery and/or Business Continuity.
  • Strong understanding of cloud computing platforms (Azure preferred), including lift-and-shift environments (VMs, etc.) and cloud-native setups (AKS, serverless, etc.).
  • Strong understanding of and experience with automation tools and programming/scripting languages to develop and implement automated system reliability and performance solutions, including infrastructure automation and configuration management tools (Ansible, Chef, Puppet).
  • Strong understanding of observability, monitoring, alerting, and logging tools and ability to design and implement effective monitoring and logging strategies.
  • Experience in designing and implementing on-premise, cloud, and hybrid resiliency solutions, disaster recovery, and business continuity planning.
  • Ability to drive critical issues and system design discussions and moderate between multiple technology teams.
  • Solid understanding of security best practices in on-premise, cloud, and hybrid environments along with Network technologies.
  • Working knowledge of CI/CD - preferably GitHub workflows and Actions.
  • Working knowledge of IaC automation tools (Terraform, Ansible, etc.).
  • Experience with Kubernetes and other auto-scaling tools and technologies.
  • Skilled at assessing and developing IT talent across multiple time zones and multiple business domains.
  • Exceptional written and verbal communication skills.
  • Ability to work independently in a fast-paced environment.
  • Bachelor's degree and 8 years of experience, or an associate degree with 11 years of experience.
  • Travel Requirement: Up to 25%

BTS Leadership Values:

  • Agile
  • Customer Centric
  • Ownership Mindset
  • Sense of Urgency
  • Servant Leadership
  • 1BTS

 

Leadership Behavioral Attributes:

  • Flexibility
  • Customer Service Oriented
  • Operational Effectiveness
  • Personal Ownership
  • Quick Decision Making
  • Team Builder
  • Transformational Leadership

 

This company is an equal employment opportunity employer



This job has expired.

