What are the responsibilities and job description for the Lead Site Reliability Engineer position at CEI?
One of CEI's largest Financial Services & Banking clients is seeking a Lead SRE to join their growing organization!
Client/Industry: Financial Services & Banking
Job Title: Lead Site Reliability Engineer
Location: Hybrid - 3 Days On-Site / 2 Days Remote | Pittsburgh, PA 15222; Cleveland, OH 44136; Dallas, TX 75234; Birmingham, AL 35233; or Phoenix, AZ 85016
Work Schedule/Shift:
Shift 1 (Priority): Saturday & Sunday: 7pm – 7am (EST) / Tuesday & Thursday: 8pm – 5am (EST)
Shift 2: Monday – Friday: 11pm – 7am (EST)
Duration/Length of Assignment: 5-Month Contract-to-Hire
*Must be able to convert to a full-time employee without sponsorship, restrictions, or an additional employer*
- W2 Employment Only – No Corp to Corp / C2C arrangements.
- Expected potential for contract extension(s) and/or conversion to Full-Time/Permanent Employment.
- Optional benefits available during contract (Medical, Dental, Vision, and 401k)
Position Overview:
This role is a critical leadership position within the Site Reliability Center (SRC), responsible for overseeing a team of global contractors supporting enterprise technology and security applications. The SRC Lead focuses on maintaining system reliability, availability, and performance through proactive monitoring, troubleshooting, and escalation. The team plays a key role in ensuring system health, reducing downtime, and optimizing performance across multiple business applications.
Reporting to senior leadership, the SRC Lead will work closely with application support teams, internal stakeholders, and global engineers to coordinate efforts and resolve critical system issues. The primary function of this role is production support, requiring a highly technical approach to troubleshooting system issues, driving incident resolution, and implementing process improvements. The SRC Lead is expected to take ownership of escalated problems, guide discussions with key stakeholders, and update documentation for continued operational success.
With 185 combined applications and platforms in scope, the SRC Lead will develop expertise in high-priority critical systems and drive technical conversations to resolution. This position requires strong analytical skills, leadership in a fast-paced production environment, and the ability to coordinate effectively across global teams.
Required Skills/Experience/Qualifications:
- Bachelor's degree in Computer Science, Information Technology, or related field
- 8 years of experience in Site Reliability Engineering (SRE), DevOps, or technical production support
- Strong background in monitoring and debugging tools, including LogScale, Splunk, and Dynatrace
- Hands-on experience with DevOps pipelines using Git, Jenkins, and Artifactory
- Proficiency in Red Hat Linux, OpenShift, and Windows infrastructure
- Strong understanding of networking concepts, including DNS, load balancing, network tracing, and firewalls
- Experience working with relational databases such as Oracle and SQL
- Ability to troubleshoot and support APIs and web services technologies, including SOAP, JSON, and REST
- Familiarity with directory services, including LDAP and Active Directory
- Proficiency in Java for troubleshooting and debugging system issues
- Experience in operational incident management, root cause analysis, and production system monitoring
- Ability to drive problem resolution, manage impact assessments, and escalate issues appropriately
- Strong leadership and mentorship skills to guide global engineering teams
Preferred Skills (Not Required):
- Experience with Python/Java scripting, Ansible, and PowerShell for automation
- Knowledge of modern development tools and methodologies, including Agile, CI/CD, Git, and Jenkins
- Experience with Kafka event streaming and ETL tools like Informatica
- Familiarity with NoSQL databases such as MongoDB and Cassandra
- Experience with Evolven for system analysis
- Prior experience in a 24x7 production support environment
Day to Day/Responsibilities:
- Monitor system health and analyze metrics using LogScale, Splunk, and Dynatrace to proactively identify issues and potential system failures
- Lead troubleshooting efforts on production issues, coordinating with DevOps teams and escalating when necessary to system SMEs
- Maintain and support DevOps pipelines using Git, Jenkins, and Artifactory, ensuring reliability of automated deployments
- Troubleshoot and resolve infrastructure-related issues across Red Hat Linux, OpenShift, and Windows environments
- Analyze and resolve network-related issues including DNS, load balancing, network tracing, and firewall configurations
- Support database operations by managing Oracle and SQL databases, optimizing performance, and identifying system inefficiencies
- Lead system impact assessments and provide technical guidance on API integrations, SOAP/REST web services, and JSON data handling
- Ensure compliance with LDAP and Active Directory configurations for authentication and access control
- Participate in incident and problem management, identifying recurring issues and implementing long-term fixes
- Update and maintain runbooks and operational documentation, ensuring clear guidelines for handling recurring incidents
- Act as the escalation point for global L1.5 engineers, providing technical mentorship and driving a culture of continuous learning
- Collaborate with cross-functional teams, stakeholders, and internal/external business partners to resolve technical challenges
- Ensure timely escalation of critical issues and provide post-incident analysis for process improvements
- Work closely with leadership to reduce Level 3 escalations by improving knowledge-sharing and process automation
- Provide operational support for large-scale distributed applications, ensuring high availability and reliability
- Communicate with senior leadership, including directors and CIOs, to report on system status, major incidents, and process improvements
- Identify and implement automation solutions using Python, Java, Ansible, or PowerShell to improve efficiency and reduce manual tasks
- Drive performance optimization efforts by analyzing system bottlenecks and recommending improvements for stability and uptime
Salary: $75 - $90