What are the responsibilities and job description for the Head of GPU Infrastructure position at Talented Hires?
Head of GPU Infrastructure
Location: US Remote
Department: Engineering
About Us
AI technology is driving an unprecedented demand for computing power. Global investments in AI infrastructure are expected to reach hundreds of billions in the coming years.
Our company is at the forefront of this transformation, building and operating GPU supercomputers for some of the world’s most advanced AI labs, governments, and enterprises.
Our customers include leading names in the AI ecosystem, and our mission is to deliver the best supercomputing experience available.
We are a lean, high-performance team, dedicated to excellence in everything we do. We expect our team members to share our commitment to high standards and our customer-first philosophy.
About the Role
We are looking for a Head of GPU Infrastructure to lead global deployments of GPU supercomputers at an unmatched scale. Reporting directly to senior leadership, you will be the linchpin for sourcing, procurement, and deployment, ensuring timely delivery of some of the world's largest GPU clusters.
You will also have the unique opportunity to build and lead a world-class deployment team, shaping the infrastructure function in a high-growth, fast-paced environment. This role requires extensive technical expertise and exceptional communication skills to collaborate effectively with internal and external stakeholders.
Responsibilities:
- Manage the entire supply chain, from sourcing components to delivering operational GPU clusters.
- Build and maintain relationships with OEMs, optimizing delivery timelines and costs.
- Design and build AI clusters tailored to customer needs and informed by deployment learnings.
- Secure additional data center capacity to support rapid scaling.
- Recruit and manage a specialized team of deployment engineers to deliver clusters with speed and reliability.
- Collaborate across engineering, sales, finance, and legal to anticipate and fulfill customer needs.
- Represent the company at conferences, trade shows, and site visits with customers, OEMs, and partners.
Qualifications
Required:
- At least 3 years of experience deploying GPU clusters, or at least 5 years in large-scale infrastructure deployment.
- Hands-on experience with data center hardware installations.
- Strong relationships with compute and storage OEMs, data centers, and ISPs.
- Expertise in InfiniBand or RoCE networking.
- Familiarity with tools like Kubernetes, Slurm, PyTorch, and JAX.
- Strong attention to detail and the ability to prioritize and thrive in a fast-paced environment.
- Advanced degree in a technical field such as Computer Engineering, Computer Science, or similar.
Preferred:
- Experience designing and operating 4,000-GPU clusters.
- Proficiency in managing bare metal hardware with tools like MAAS or NetBox.
- Experience with large-scale storage systems like DDN, VAST, or Ceph.
Perks & Benefits
- Compensation: Competitive salary and equity package.
- Retirement: Aligned with local standards.
- Insurance: Comprehensive health, dental, and vision coverage.
- PTO: Generous leave policies tailored to local norms.
- Travel: Fully paid business travel to conferences and trade shows.
This is a career-defining opportunity to make a tangible impact on the future of AI infrastructure.
If you are passionate about pushing the boundaries of technology, we want to hear from you.
Apply now to join our cutting-edge team!