What are the responsibilities and job description for the Site Reliability Engineering (SRE) Manager position at Massachusetts Medical Society?

Site Reliability Engineering (SRE) Manager

Category: Information Technology
Job Location: Waltham, Massachusetts
Tracking Code: 1119
Position Type: Full-Time/Regular

The Massachusetts Medical Society (MMS) is the statewide professional association for physicians and medical students, supporting 25,000 members. We are dedicated to educating and advocating for the physicians of Massachusetts and patients locally and nationally. A leadership voice in health care, the MMS contributes physician and patient perspectives to influence health-related legislation at the state and federal levels, works in support of public health, provides expert advice on physician practice management, and addresses issues of physician well-being. Under the auspices of NEJM Group, the MMS extends our mission globally by advancing medical knowledge from research to patient care through the New England Journal of Medicine, NEJM Evidence, NEJM AI, NEJM Catalyst, NEJM Journal Watch, and through our accredited and comprehensive continuing medical education programs.

The world has changed, and so has the way we work. The MMS has adopted a flexible work model that allows most employees to choose where they work - at home, onsite in our Waltham office, or a combination of the two - based on their preferences and our business needs. Because what matters is the work we do, not where we do it.

We are seeking a skilled and motivated Site Reliability Engineering (SRE) Manager to lead our growing SRE team and play a critical role in driving operational excellence, technical innovation, and strategic alignment within our hybrid infrastructure. This position balances hands-on technical work with leadership responsibilities, focusing on people management, project execution, and cloud infrastructure architecture. If you are passionate about designing resilient, self-healing systems and empowering development teams through scalable frameworks, we want to hear from you!

Responsibilities:

People and Project Management (50%)

Provide leadership, mentorship, and guidance to a team of SREs/DevOps professionals.
Collaborate with stakeholders to define and prioritize objectives, ensuring alignment with business goals.
Oversee project execution, ensuring timely delivery of operational support deliverables and of initiatives such as CI/CD pipeline enhancements, observability improvements, and security hardening.
Foster a culture of collaboration, accountability, and continuous improvement across the team.
Support career development through regular feedback, technical mentoring, and training opportunities.

Hands-On Technical Contributions (50%)

Develop self-service frameworks that empower development teams while maintaining operational standards.
Strengthen security practices by establishing robust guardrails and aligning with industry best practices.
Enhance system observability by integrating tools like Datadog for monitoring, alerting, and analytics.
Evangelize scalable operational practices and play an active role in automating and enforcing them.
Architect and implement cloud infrastructure solutions to meet scalability and resilience requirements for delivering and testing highly available platforms for our complex multi-tier applications.
Support and improve on-premises and hybrid infrastructure solutions, balancing legacy design with our move to the cloud.
Design and improve CI/CD pipelines to optimize deployment speed, reliability, and security.
Write and maintain technical documentation.
Develop release plans and service level agreements and foster the migration of legacy applications to modern CI/CD pipelines.
Own production incidents/issues and provide application support during and - on occasion - outside of normal business hours, responding to infrastructure incidents and alerts and escalating to other subject matter experts as necessary.
Work with third-party vendors to resolve infrastructure issues.
Other responsibilities as assigned.

Strategic Responsibilities

Drive the adoption of resilient, self-healing design patterns across the infrastructure.
Partner with development teams to create scalable solutions that streamline workflows and reduce toil.
Advocate for operational excellence by implementing and enhancing frameworks for reliability, incident response, and continuous learning.

Qualifications:

Required Skills and Experience

Bachelor's degree in a related field and 6 years of experience in software development, SRE, or DevOps, or an equivalent combination of education and experience, is required.
Proven experience in a leadership role managing and scaling SRE or DevOps teams.
Hands-on expertise with hybrid cloud architectures, particularly transitioning from on-premises to modern cloud platforms.
Excellent knowledge of Linux systems (Amazon Linux) and Windows systems.
Understanding of AWS VPC, network management, and datacenter operations.
Proficiency in CI/CD pipeline design and tools such as GitHub Actions, Jenkins, or similar.
Strong knowledge of observability tools like Datadog, Grafana, or Prometheus.
Solid understanding of infrastructure as code (IaC) practices and tools (e.g., Terraform, CloudFormation).
Experience with security best practices, including compliance, vulnerability management, and identity/access management.
Working knowledge of databases and system performance.
Excellent communication and project management skills, with experience using tools like Jira and Confluence.
Ability to mentor team members technically and strategically.
Must be an excellent and creative problem solver. (You don't need to know everything, but you need to know how to find the solution.)
Demonstrated cooperative work style with strong communication, interpersonal and teamwork skills in an Agile environment.
Must be self-motivated, with the ability to work with minimal supervision.

Preferred Qualifications

Experience with containerization and orchestration tools like Docker and Kubernetes.
Previous experience with an API management tool (MuleSoft preferred).
Experience with self-healing system design and automated failure recovery strategies.
Hands-on experience with scripting languages such as Python, Bash, or PowerShell.
Familiarity with Agile methodologies and practices.

Benefits:

Our generous benefits offerings include: 3 weeks of paid vacation, 6 personal days, 12 sick days, 13 paid holidays, medical and dental plans, 401(k) plans with company match, backup childcare assistance, tuition assistance and more!

The MMS has earned praise as one of the Top Places to Work in Massachusetts by The Boston Globe for the past 15 years in a row! The Globe surveys employees regarding their opinions about company leadership, benefits, ethics, values and culture, and recognizes those companies who receive high marks from their employees.

Massachusetts Medical Society is an Equal Opportunity Employer: Min/Fem/Vet/Disabled

