What are the responsibilities and job description for the GPU Platform Engineer, AI & HPC position at AMD?

WHAT YOU DO AT AMD CHANGES EVERYTHING

We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences - the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world’s most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.

AMD together we advance_

The Team

AMD's Data Center GPU organization is transforming the industry with our AI-based graphics processors. Our primary objective is to design exceptional products that drive the evolution of computing experiences, serving as the cornerstone for enterprise data centers, Artificial Intelligence (AI), HPC, and embedded systems. If this resonates with you, come and join our Data Center GPU organization, where we are building amazing AI-powered products with amazing people.

The Role

The Software Platform Architecture (SPA) team has an open position for a GPU Platform Engineer, AI & HPC. SPA is the hardware-accelerated, software-focused wing of the newly formed Cluster Platform Engineering (CPE) team at AMD and rolls up through the Data Center GPU (DCGPU) business unit. This role will be responsible for helping to select, curate, design, automate, and document all software underpinning an entire full-stack, AI-focused platform. This work is not net-new code development; instead, it focuses on choosing the right software properties and on how data and operations flow through them, to ease the adoption and operation of large-scale, GPU-accelerated AI (Artificial Intelligence) and HPC (High Performance Computing) cluster systems within AMD.

SPA works closely with the Site Reliability Engineering (SRE) and Data Center Operations (DCOps) teams who tackle day-to-day commissioning and operations of the clusters under CPE’s control. SPA’s work is measured by how much we reduce the operational toil while increasing the rigor and repeatability of processes for the SRE and DCOps teams. SPA has design responsibility for the full Day 0 - Day 2 software platform.

This position is an exciting opportunity to help build a platform that leverages AMD’s industry-leading infrastructure and to choose a world-class software stack in support of this critical growth area for AMD, its engineering teams, its customers, and the industry.

The Platform Engineer role in SPA cuts across all hardware and software infrastructure, up through platform software, consumption portals, and ultimately the real goal: having the AI application software experience be optimized for AMD. The focus is on AI applications that best leverage the AMD Instinct GPU and AMD EPYC CPU in cluster systems.

The Person

Excellent communication and interpersonal skills
The ability to interact with various teams in order to account for their needs in platform design
Technology Orientation - an affinity for spotting application and platform trends, and for testing and validating those trends so that AMD can take the best and earliest advantage of them
Outstanding Integrity - a thoroughly honest and forthright individual, who is upfront and direct with subordinates, peers, and the management executives to whom they report
Effective working in a culturally diverse organization

Key Responsibilities

The Platform Engineer role in SPA cuts across all hardware and software infrastructure, up through platform software, consumption portals, and ultimately the real goal: having the AI application software experience be optimized for AMD. The focus is on AI applications that best leverage the AMD Instinct GPU and AMD EPYC CPU in cluster systems
Work with all CPE teams to validate that SPA’s platform designs are Day 0 - Day 2 ready and able to integrate with other teams’ workflows
Work with the Release Engineering team to automate the application of updates and system configuration management tools
Maintain tight interaction with the SRE team to continually improve how SPA’s designs are integrated into the operational change process and cadence
Ensure that all applications and infrastructure elements expose/export telemetry that is centrally managed and used to guide the management of the entire platform
Write the glue-code necessary to connect systems to each other if no native mechanisms exist
Ensure all platform designs reflect security as a core principle, provide input to policy and guidelines, and participate in platform and project retrospectives and blameless post-mortems

Preferred Experience

Experience in full-stack (infra, platform, application) multi-site, multi-region solutions at scale
Strong multi-distro Linux knowledge across deployment, configuration, and management
Cloud Native platform implementation
Kubernetes as application dial-tone all the way up through Service Mesh and multi-tenant application deployment and management
Strong knowledge of multiple virtualization and containerization technologies such as KVM, Xen, and Kubernetes; OpenShift or OpenStack experience is a bonus
Experience with automation platforms at scale using Ansible, Terraform / OpenTofu
Some experience with application and platform telemetry frameworks, such as OpenTelemetry
Strong networking knowledge with a primary focus on L3 and path-vector routing protocols
Experience with RDMA/RoCE and InfiniBand a plus
Demonstrated track record of building and delivering complex operational solutions at scale, with the ability to learn new systems quickly in a rapidly changing environment
Python and Golang experience a plus
Platform message bus (such as Kafka) experience
Remote position with ability to travel when required (up to 10%)

Academic Credentials

BSEE or relevant technical degree; MSEE or MBA is desirable and preferred

At AMD, your base pay is one part of your total rewards package. Your base pay will depend on where your skills, qualifications, experience, and location fit into the hiring range for the position. You may be eligible for incentives based upon your role such as either an annual bonus or sales incentive. Many AMD employees have the opportunity to own shares of AMD stock, as well as a discount when purchasing AMD stock if voluntarily participating in AMD’s Employee Stock Purchase Plan. You’ll also be eligible for competitive benefits described in more detail here.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.

Salary: $153,280 - $229,920
