What are the responsibilities and job description for the GPU Platform Engineer, AI & HPC position at AMD?
WHAT YOU DO AT AMD CHANGES EVERYTHING
We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences - the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world’s most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.
AMD together we advance_
The Team
AMD's Data Center GPU organization is transforming the industry with our AI-based graphics processors. Our primary objective is to design exceptional products that drive the evolution of computing experiences, serving as the cornerstone for enterprise data centers, Artificial Intelligence (AI), HPC, and embedded systems. If this resonates with you, come and join our Data Center GPU organization, where we are building amazing AI-powered products with amazing people.
The Role
The Software Platform Architecture (SPA) team has an open position for a GPU Platform Engineer, AI & HPC. SPA is the hardware-accelerated, software-focused wing of the newly formed Cluster Platform Engineering (CPE) team at AMD and rolls up through the Data Center GPU (DCGPU) business unit. This role is responsible for helping to select, curate, design, automate, and document all software underpinning a full-stack, AI-focused platform. The work is not net-new code development; it focuses on choosing the right software properties and defining how data and operations flow through them, to ease the adoption and operation of large-scale GPU-accelerated AI (Artificial Intelligence) and HPC (High Performance Computing) cluster systems within AMD.
SPA works closely with the Site Reliability Engineering (SRE) and Data Center Operations (DCOps) teams who tackle day-to-day commissioning and operations of the clusters under CPE’s control. SPA’s work is measured by how much we reduce the operational toil while increasing the rigor and repeatability of processes for the SRE and DCOps teams. SPA has design responsibility for the full Day 0 - Day 2 software platform.
This position is an exciting opportunity to help build a platform that leverages AMD’s industry-leading infrastructure and to choose a world-class software stack in support of this critical growth area for AMD, its engineering teams, its customers, and the industry.
The Platform Engineer role in SPA cuts across all hardware and software infrastructure, up through platform software, consumption portals, and ultimately the real goal: having the AI application software experience be optimized for AMD. The focus is on AI applications that best leverage the AMD Instinct GPU and AMD EPYC CPU in cluster systems.
The Person
- Excellent communication and interpersonal skills
- The ability to interact with various teams in order to account for their needs in platform design
- Technology Orientation - an affinity for spotting application and platform trends, and for testing and validating those trends so that AMD can take the best and earliest advantage of them
- Outstanding Integrity - a thoroughly honest and forthright individual who is upfront and direct with subordinates, peers, and the management executives to whom they report
- Effective at working in a culturally diverse organization
- The Platform Engineer role in SPA cuts across all hardware and software infrastructure, up through platform software, consumption portals, and ultimately the real goal: having the AI application software experience be optimized for AMD. The focus is on AI applications that best leverage the AMD Instinct GPU and AMD EPYC CPU in cluster systems
- Work with all CPE teams to validate that SPA’s platform designs are Day 0 - Day 2 ready and able to integrate with other teams’ workflows
- Work with the Release Engineering team to automate the application of updates and system configuration management tools
- Maintain tight interaction with the SRE team to continually improve how SPA’s designs are integrated into an operational change process and cadence
- Ensure that all applications and infrastructure elements expose/export telemetry that is centrally managed and used to guide the management of the entire platform
- Write the glue-code necessary to connect systems to each other if no native mechanisms exist
- Ensure all platform designs reflect security as a core principle, provide input to policy and guidelines, and participate in platform and project retrospectives and blameless post-mortems
- Experience in full-stack (infra, platform, application) multi-site, multi-region solutions at scale
- Strong multi-distro Linux knowledge across deployment, configuration, and management
- Cloud Native platform implementation
- Kubernetes as application dial-tone all the way up through Service Mesh and multi-tenant application deployment and management
- Strong knowledge of multiple virtualization and containerization technologies such as KVM, Xen, and Kubernetes; OpenShift or OpenStack is a bonus
- Experience with automation platforms at scale using Ansible, Terraform / OpenTofu
- Some experience with application and platform telemetry frameworks, such as OpenTelemetry
- Strong networking knowledge with a primary focus on L3 and path-vector routing protocols
- Experience with RDMA/RoCE and InfiniBand a plus
- Demonstrated track record of building and delivering complex operational solutions at scale, with the ability to learn new systems quickly in a rapidly changing environment
- Python and Golang experience a plus
- Platform message bus (such as Kafka) experience
- Remote position with ability to travel when required (up to 10%)
- BSEE or relevant technical degree; MSEE or MBA is desirable and preferred
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.
Salary: $153,280 - $229,920