What are the responsibilities and job description for the Lead Site Reliability Engineer position at CEI?

One of CEI's largest Financial Services & Banking clients is seeking a Lead SRE to join their growing organization!

Client/Industry: Financial Services & Banking

Job Title: Lead Site Reliability Engineer

Location: Hybrid - 3 Days On-Site / 2 Days Remote | Pittsburgh, PA 15222; Cleveland, OH 44136; Dallas, TX 75234; Birmingham, AL 35233; or Phoenix, AZ 85016

Work Schedule/Shift:

Shift 1 (Priority): Saturday & Sunday: 7pm – 7am (EST) / Tuesday & Thursday: 8pm – 5am (EST)

Shift 2: Monday - Friday: 11pm – 7am (EST)

Duration/Length of Assignment: 5 Month Contract to Hire

*Must be able to convert to a full-time employee without sponsorship, restrictions, or an additional employer*

W2 Employment Only – No Corp to Corp / C2C arrangements.
Expected potential for contract extension(s) and/or conversion to Full-Time/Permanent Employment.
Optional benefits available during contract (Medical, Dental, Vision, and 401k)

Position Overview:

This role is a critical leadership position within the Site Reliability Center (SRC), responsible for overseeing a team of global contractors supporting enterprise technology and security applications. The SRC Lead will focus on maintaining system reliability, availability, and performance through proactive monitoring, troubleshooting, and escalation. The team plays a key role in ensuring system health, reducing downtime, and optimizing performance across multiple business applications. Reporting to senior leadership, the SRC Lead will work closely with application support teams, internal stakeholders, and global engineers to coordinate efforts and resolve critical system issues.

The primary function of this role is production support, requiring a highly technical approach to troubleshooting system issues, driving incident resolution, and implementing process improvements. The SRC Lead is expected to take ownership of escalated problems, guide discussions with key stakeholders, and update documentation for continued operational success. With 185 combined applications and platforms in scope, the SRC Lead will develop expertise in high-priority critical systems and drive technical conversations to resolution. This position requires strong analytical skills, leadership in a fast-paced production environment, and the ability to coordinate effectively across global teams.

Required Skills/Experience/Qualifications:

Bachelor's degree in Computer Science, Information Technology, or related field
8 years of experience in Site Reliability Engineering (SRE), DevOps, or technical production support
Strong background in monitoring and debugging tools, including LogScale, Splunk, and Dynatrace
Hands-on experience with DevOps pipelines using Git, Jenkins, and Artifactory
Proficiency in Red Hat Linux, OpenShift, and Windows infrastructure
Strong understanding of networking concepts, including DNS, load balancing, network tracing, and firewalls
Experience working with relational databases such as Oracle and SQL
Ability to troubleshoot and support APIs and web services technologies, including SOAP, JSON, and REST
Familiarity with directory services, including LDAP and Active Directory
Proficiency in Java for troubleshooting and debugging system issues
Experience in operational incident management, root cause analysis, and production system monitoring
Ability to drive problem resolution, manage impact assessments, and escalate issues appropriately
Strong leadership and mentorship skills to guide global engineering teams

Preferred Skills (Not Required):

Experience with Python/Java scripting, Ansible, and PowerShell for automation
Knowledge of modern development tools and methodologies, including Agile, CI/CD, Git, and Jenkins
Experience with Kafka event streaming and ETL tools like Informatica
Familiarity with NoSQL databases such as MongoDB and Cassandra
Experience with Evolven for system analysis
Prior experience in a 24x7 production support environment

Day to Day/Responsibilities:

Monitor system health and analyze metrics using LogScale, Splunk, and Dynatrace to proactively identify issues and potential system failures
Lead troubleshooting efforts on production issues, coordinating with DevOps teams and escalating when necessary to system SMEs
Maintain and support DevOps pipelines using Git, Jenkins, and Artifactory, ensuring reliability of automated deployments
Troubleshoot and resolve infrastructure-related issues across Red Hat Linux, OpenShift, and Windows environments
Analyze and resolve network-related issues including DNS, load balancing, network tracing, and firewall configurations
Support database operations by managing Oracle and SQL databases, optimizing performance, and identifying system inefficiencies
Lead system impact assessments and provide technical guidance on API integrations, SOAP/REST web services, and JSON data handling
Ensure compliance with LDAP and Active Directory configurations for authentication and access control
Participate in incident and problem management, identifying recurring issues and implementing long-term fixes
Update and maintain runbooks and operational documentation, ensuring clear guidelines for handling recurring incidents
Act as the escalation point for global L1.5 engineers, providing technical mentorship and driving a culture of continuous learning
Collaborate with cross-functional teams, stakeholders, and internal/external business partners to resolve technical challenges
Ensure timely escalation of critical issues and provide post-incident analysis for process improvements
Work closely with leadership to reduce Level 3 escalations by improving knowledge-sharing and process automation
Provide operational support for large-scale distributed applications, ensuring high availability and reliability
Communicate with senior leadership, including directors and CIOs, to report on system status, major incidents, and process improvements
Identify and implement automation solutions using Python, Java, Ansible, or PowerShell to improve efficiency and reduce manual tasks
Drive performance optimization efforts by analyzing system bottlenecks and recommending improvements for stability and uptime

Salary: $75 - $90
