Goldman Sachs is Hiring a Site Reliability Engineer, VP - Scheduling Platform Near Dallas, TX
Job Description
What We Do
At Goldman Sachs, our Engineers don’t just make things – we make things possible. Change the world by connecting people and capital with ideas. Solve the most challenging and pressing engineering problems for our clients. Join our engineering teams that build massively scalable software and systems, architect low latency infrastructure solutions, proactively guard against cyber threats, and leverage machine learning alongside financial engineering to continuously turn data into action. Create new businesses, transform finance, and explore a world of opportunity at the speed of markets. Engineering, which is comprised of our Technology Division and global strategists groups, is at the critical center of our business, and our dynamic environment requires innovative strategic thinking and immediate, real solutions. Want to push the limit of digital possibilities? Start here.
Who We Look For
Goldman Sachs Engineers are innovators and problem-solvers, building solutions in risk management, big data, mobile and more. We look for creative collaborators who evolve, adapt to change and thrive in a fast-paced global environment.
Who We Are
Procmon Platform delivers a highly scalable and reliable ecosystem for scheduling business critical jobs across Goldman Sachs. Our platform is responsible for scheduling tens of millions of daily jobs for Global Banking & Markets, Asset & Wealth Management, Risk and other business and engineering functions. The ecosystem includes a number of high availability, very large scale systems including:
Job scheduling
Event streaming
Log shipping
Data warehouses
Security infrastructure
Responsibilities
Own technical operations for systems that manage hundreds of thousands of compute cores
Build observability for new deployments to ensure robustness from day one, as well as mature deployments to identify and implement improvements
Troubleshoot and resolve issues with block devices, file descriptors, and packet loss
Lead real-time outage investigations and present postmortems to senior management
Define SLIs and SLOs and partner with development teams to ensure systems are sufficiently well designed and instrumented
Partner with our development team throughout development and operations
Plan and manage deployments and migrations (including end-of-life programs)
Plan and implement robust business continuity and security programs
Provide regional coverage for the Procmon platform and participate in on-call support
Requirements
5 years of relevant professional experience
3 years of experience with Linux fundamentals and system administration
3 years of networking experience (familiarity with TCP/IP, IP routing, firewalls, secure tunneling protocols)
3 years of experience working with distributed computing systems and cloud computing environments
Excellent problem-solving and automation skills
Proficiency in at least one programming language; the team uses a mix of Go, Python and Erlang
Able to operate effectively in a mission critical, highly regulated financial services environment