Production Engineer

Karkidi
Berkeley, CA Full Time
POSTED ON 1/17/2025
AVAILABLE BEFORE 4/17/2025

Production Engineers at Covariant play a mission-critical role in ensuring our services' seamless operation and future scalability. Embedded within our production and research teams, you'll be at the forefront of every significant engineering endeavor. As a production engineer, you will drive innovation and efficiency in our projects by applying your expertise in AWS, Docker, Kubernetes, Puppet, and Terraform to architect scalable and resilient infrastructure for our AI robotics systems.

AREAS OF FOCUS

  • Own and orchestrate large GPU clusters across different cloud providers using infrastructure as code (IaC) and scripts to provide researchers with a single cohesive interface
  • Help other teammates architect and build scalable tooling for our edge robot fleet
  • Collaborate with brilliant researchers to evolve our training and inference tooling to be state-of-the-art

YOU WILL

  • Design, build, manage and monitor the infrastructure we use to deploy our AI software and robotics solutions
  • Develop and evolve software engineering and operational practices for the unique needs of distributed AI-powered cyber-physical systems
  • Identify and establish healthy engineering and operational culture and processes
  • Deliver previously impossible robotics capabilities that solve real needs for our partners and customers
  • Collaborate with, learn from, and support a diverse and cross-functional team, including mechanical, electrical, and robotics engineers, AI/ML researchers, and business development

YOU HAVE

  • Substantial previous experience in operating and automating production systems in both cloud and bare-metal environments, deploying and administering Linux systems and/or wide-area networks, and building new tools and/or extending existing tools to add new capabilities
  • A track record of accelerating developer productivity through improved tooling, automation, and education
  • A track record of partnering with stakeholders to deliver solutions throughout the development process
  • A solid foundation in Python, Linux, and networking
  • Commitment to continuous learning and willingness to pick up new languages or technologies as needed, to solve real problems and deliver business impact

NICE TO HAVES

  • A desire to work with a small collaborative team, with a high degree of autonomy and responsibility
  • Motivation to work on challenging real-world engineering problems without prior solutions
  • Excitement about joining coworkers who strive to be inclusive, thoughtful, and down-to-earth
  • Self-direction and enjoyment in figuring out which problem is most important to work on
  • Previous experience with one or more of the following: deploying client-side software, including protecting source code, establishing secure licensing, and performing release engineering; setting up and scaling developer tooling and CI/CD systems; building ML or IoT data pipelines that process images and metadata from live deployments; or managing high-bandwidth deep learning or supercomputing hardware

SAMPLE WEEK IN THE LIFE

  • Monday: Start the week with a team meeting to discuss ongoing projects and explore potential collaborations. Resume work on the rollout of BigProxy v2 in the development environment, refining probing tests to enhance its reliability. Also, schedule a discussion with our Tailscale account representative to renew our contract.
  • Tuesday: Address an urgent issue with the networking backplane of one of our GPU clusters not performing optimally. Conduct a troubleshooting session with the cluster provider to adjust the NCCL topology file, following unexpected changes on their end.
  • Wednesday: Develop a new alert in Datadog to monitor the performance of the GPU cluster backplane, ensuring it is adaptable for use with various providers.
  • Thursday: Collaborate with a colleague on deploying a PyPI server in our cloud infrastructure. Continue the implementation and testing of BigProxy v2, which was paused on Tuesday.
  • Friday: Lead a presentation at the weekly engineering deep dive to discuss the features and potential rollout of BigProxy v2, which consolidates all connections from remote deployments to the cloud through a single channel and simplifies SSH access to GPU clusters outside AWS/GCP. Gather and incorporate feedback from the team to finalize the deployment strategy.

SALARY RANGE: $165,000 - $210,000 a year

    Base pay is one element of our total rewards package, which may also include comprehensive benefits and equity, depending on eligibility. The annual base salary range for this position is $165,000 to $210,000. The actual base pay offered will be determined by factors such as years of relevant experience, skills, and education. Decisions are made on a case-by-case basis.


