What are the responsibilities and job description for the RDMA Ops Engineer - Computing Infrastructure Networking position at Alibaba Cloud?

We're seeking a skilled RDMA Ops Engineer to optimize and maintain high-performance networking infrastructure for our computing clusters. This role focuses on building and operating ultra-low-latency, high-throughput networks using RDMA technologies to power next-generation computing workloads.

Key Responsibilities:

• Deploy, operate, and maintain RDMA-based network architectures (RoCE/InfiniBand) for clusters with thousands of nodes

• Optimize network performance for distributed collective communication workloads (NCCL, MPI, etc.)

• Solve complex network issues in distributed collective communication (e.g., NCCL/MPI communication bottlenecks)

• Use automation tools for network provisioning, monitoring, diagnostics, and network performance profiling (latency/throughput analysis)

• Implement CI/CD pipelines for network infrastructure-as-code

• Manage end-to-end network lifecycle: deployment, configuration, monitoring, upgrades

• Collaborate with computing algorithm engineers to troubleshoot network-related bottlenecks in training/inference pipelines

• Bridge Computing framework requirements with underlying network infrastructure capabilities

• Ensure compliance with security and scalability requirements

Minimum qualifications:

- Strong scripting skills (Python/Go/Bash) for operational automation

- Expert-level RDMA operational experience (RoCEv2/InfiniBand)

- Understanding of Linux internals (kernel bypass, syscall optimization, etc.), and proficiency in Linux network stack tuning (irqbalance, NUMA, hugepages)

- Hands-on experience with RDMA/DPDK performance tuning

- Strong knowledge of network protocols (TCP/IP, RoCEv2) and NIC architecture principles

- Ability to abstract complex technical concepts into architectural diagrams

- Proven track record of translating R&D innovations into production solutions

- Strong communication skills for cross-functional collaboration with Computing researchers and SRE teams

Preferred qualifications:

- Experience managing production Computing networks

- Familiarity with Kubernetes networking (CNI, Multus, SR-IOV) and GPU-aware scheduling

- Background in Computing system optimization (NVIDIA collective libraries, MPI tuning)

- Deep understanding of Computing workload patterns and their network implications

The pay range for this position at commencement of employment is expected to be between $104,400 and $171,000/year. However, base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience.

If hired, employee will be in an “at-will position” and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.

Salary: $104,400 - $171,000


