
DL Communications Collectives SW Engineer

Rivos
Santa Clara, CA Remote Full Time
POSTED ON 2/4/2025
AVAILABLE BEFORE 3/3/2025

We are working on software to improve the Deep Learning ecosystem and help hardware engineers build great Deep Learning parallel systems.

We are looking for a strong candidate with a background in writing systems software for networking devices (and, optionally, the Linux kernel networking stack or network drivers), such as someone who has implemented network protocols or worked on OpenMPI.

This role involves designing and implementing highly optimized communication collectives libraries similar to UCC (Unified Collective Communication) and NCCL (NVIDIA Collective Communications Library). The ideal candidate will work closely with hardware and software teams to ensure efficient data communication and synchronization across multiple AI accelerators in a distributed system, enabling scalable deep learning and high-performance computing applications.

You will be learning technical and organizational skills from industry veterans: how to write performant and readable code; how to structure and communicate projects, ideas, and progress; how to work effectively with the Open Source community.

We are big proponents of Open Source and Free software and contribute back our improvements to all the great projects we use.


We prefer candidates who work out of one of our offices, but will consider remote candidates as well.



Responsibilities
  • Build up the communication components of an AI software stack.
  • Port AI software to run on a new hardware platform.
  • Profile and tune communications within AI applications.
  • Design, develop, and optimize communication collectives (e.g., AllReduce, AllGather, Broadcast, ReduceScatter) for large-scale distributed computing and machine learning frameworks.
  • Implement and optimize communication algorithms (ring, tree, butterfly, etc.) tailored for our architectures and multi-node clusters.
  • Ensure low-latency, high-bandwidth communication across multi-GPU setups, supporting interconnects such as PCIe and InfiniBand.
  • Collaborate with hardware engineers and other software teams to optimize performance.
  • Implement fault tolerance and scalability mechanisms in distributed systems to handle large-scale workloads.
  • Write unit tests and benchmark tools to validate the performance and correctness of collective operations.
  • Stay current with advancements in hardware and networking technologies to continuously improve the library's performance.


Requirements
  • Strong understanding of GPU architectures (CUDA, AMD ROCm) and experience in GPU programming (CUDA, HIP, or similar).
  • Proficiency in designing and implementing parallel and distributed algorithms, particularly communication collectives.
  • Experience with network interconnects (NVLink, PCIe, InfiniBand, RDMA) and understanding of their performance implications.
  • Hands-on experience with communication collectives libraries like UCC, NCCL, or MPI.
  • Strong knowledge of concurrency, synchronization, and memory consistency models in multi-threaded and distributed environments.
  • Experience with profiling and optimizing low-level performance (memory bandwidth, latency, throughput) on GPU architectures.
  • Familiarity with deep learning frameworks (TensorFlow, PyTorch, etc.) and their use of communication collectives.
  • Strong problem-solving skills and ability to work in a fast-paced, collaborative environment.
  • Network driver experience is recommended.
  • Excellent written and verbal communication skills.
  • Strong organizational skills and a high degree of self-motivation.
  • Ability to work well in a team and be productive under aggressive schedules.


Optional Requirements
  • Experience with NumPy, PyTorch, TensorFlow or JAX
  • Experience with Rust
  • Experience with CUDA, OpenCL, OpenGL, or SYCL
  • Coursework or experience with Machine Learning algorithms


Education and Experience
  • Bachelor’s, Master’s, or PhD in Computer Engineering, Software Engineering or Computer Science





