What are the responsibilities and job description for the Operations Lead Engineer position at Paytronix?
We are seeking an Operations Lead Engineer to provide day-to-day leadership for our Network Operations Center (NOC) functions. This role will serve as a key point of contact during incidents, outages, and escalations, while also guiding NOC analysts and engineers toward operational excellence. The Operations Lead Engineer will be instrumental in maintaining network uptime, optimizing monitoring strategies, and developing and implementing standard operating procedures.
As a core part of the larger IT Operations team, you will collaborate with cloud operations engineers, software developers, and enterprise architects on the design, development, provisioning, installation, configuration, operation, and maintenance of systems and software, as well as related infrastructure environments. You will also work in close partnership with our customer support teams and business stakeholders.
The ideal candidate will have a technical background, be comfortable partnering closely with software developers, cloud engineers, and customer support representatives, and be able to take ownership, lead, and collaborate across multidisciplinary groups while integrating automation technologies to reduce reliance on 24/7 human monitoring and repetitive, manual tasks.
Our company has an open, relaxed, and friendly environment where you’ll get to work with people who are serious about the work they do but always appreciate a great sense of humor. We trust our employees, so you’ll be given a fair amount of latitude with, and ownership of, your own time.
The kind of stuff you’ll be doing:
- Establishing new foundations for and reshaping existing NOC functions & ownership areas
- Establishing automated standard operating procedures (SOPs), incident response plans, and escalation protocols for the NOC
- Leading the development and implementation of tools and dashboards that increase visibility and observability
- Developing automated alerting and remediation processes to detect, diagnose, and respond to network and system issues without direct human intervention
- Implementing predictive systems monitoring to proactively identify and resolve potential issues before they impact operations
- Partnering with IT Compliance teams to ensure all processes and procedures are aligned with standard compliance frameworks, notably PCI and SOC
- Driving the root cause analysis and post-incident review (PIR) processes, and ensuring that ownership of follow-up actions is clear, visible, and managed
- Applying structured problem management approaches to our recurring, high-severity incidents and championing the problem management lifecycle
- Managing NOC KPIs and producing regular monthly reporting in support of those operational KPIs
- This is an on-call position with an expectation to participate in a 24x7 on-call rotation
- This is a hybrid position that requires the ability to work from our Newton, MA office multiple days per week
The kind of experience you’ll need:
- 5 years’ experience working in a corporate IT environment and a thorough understanding of service management, with at least 3 years of demonstrable success in an Incident Management and/or Problem Management role
- Solid experience with the Incident, Problem, and Change disciplines within ITIL
- Experience handling large customer/business IT incidents
- Experience working in high-pressure IT environments where customer and commercial impact is high
- Experience supporting a 24x7x365 Software-as-a-Service computing platform
- Excellent communication skills, including delivering formal presentations
The extra stuff that would be nice:
- A curious mindset with an aptitude to figure things out and get things done
- An eagerness to learn, apply, and teach new skills within a highly collaborative and communicative environment
- Project management skills
- ITIL Certifications
- Ability to work independently as well as within a team environment
- Familiarity with common security frameworks such as CIS, PCI, NIST, SOC
- The theory and practice of Agile, CI/CD and DevOps methodologies
- Infrastructure-as-code tools such as Terraform and CloudFormation
- Configuration management tools such as Ansible and Puppet
- Deployment tools and technologies including Jenkins, GitHub, Docker, and Kubernetes
- Using Bash, PowerShell, and Python scripting to automate and streamline application and infrastructure-as-code deployments
- System and application monitoring using tools such as Datadog, New Relic, and SolarWinds
- Infrastructure management within the Microsoft Azure and Amazon Web Services (AWS) cloud computing platforms, including design, maintenance, performance, scalability, and security
Salary range: $110k-$140k