What are the responsibilities and job description for the Site Reliability Engineer, Public Sector position at OpenAI?
About the Team
Join the engineering teams that bring OpenAI’s ideas safely to the world!
The Applied Engineering team works across research, engineering, product, and design to bring OpenAI’s technology to consumers and businesses. We seek to learn from deployment and distribute the benefits of AI, while ensuring that this powerful tool is used responsibly and safely. Safety is more important to us than unfettered growth.
About the Role
We’re seeking a Site Reliability Engineer with experience managing systems and infrastructure at scale. You’ll join a nimble team where you’ll help drive deployment of OpenAI’s technology into new environments and infrastructure to enable critical missions in the public sector. This role engages cross-functionally with internal product, security, and compliance teams to build required functionality and ensure we’re delivering a scalable, reliable platform. The proximity to customers provides a unique opportunity to see the impact of your work first-hand.
This role is based in Washington, D.C. and San Francisco, CA. Travel to, and work from, customer sites is required for this role.
In this role, you will:
Design and build performant, reliable, and scalable infrastructure, both on-premises and in the cloud, for our public sector customers.
Administer the systems from the hardware up to Kubernetes, ensuring our teams have standardized infrastructure on which to deploy OpenAI’s technology.
Own the reliability of these systems by being on-site with the customer, utilizing observability tooling, and directly troubleshooting issues that arise as the first line of support.
Partner with teams across engineering and security to ensure the product supports the unique needs of the infrastructure and use cases.
Automate routine tasks and standardize our infrastructure offerings to allow our team to scale as we continue to grow.
Partner with teams across the business, including engineering, security, and compliance, to enable our products to work within the unique constraints of new environments.
You might thrive in this role if you:
Hold an active US security clearance.
Have 5 years of experience operating infrastructure and systems at scale.
Have worked in secure environments, collaborating closely with both on-site clients and remote colleagues.
Have hands-on experience with containers (Docker) and orchestration platforms (Kubernetes).
Have scripting experience with Python or an equivalent language for automating routine tasks.
Own problems end-to-end and are willing to pick up whatever knowledge you’re missing to get the job done, so that both your team and our customers succeed.
Have strong troubleshooting skills across the entire stack (infrastructure, systems, and applications).
Thrive in dynamic environments and can navigate ambiguity with ease.