What are the responsibilities and job description for the Senior AI Infrastructure Engineer position at Signify Technology?
Job Title : Senior AI Infrastructure Engineer
Location : Remote but must be located in the Bay Area
Salary Range : $200,000-$250,000 + Equity
About the Company
They are a fast-growing startup in the 3D generation space, focused on creating tools for 3D artists and game developers. With over 1 million users, their platform is at the forefront of revolutionizing the creation of 3D content using advanced AI and machine learning. Their products enable game developers to quickly generate high-quality 3D models. As they continue to expand, they are looking for an experienced Senior AI Infrastructure Engineer to help scale their AI and machine learning infrastructure.
About the Role
In this role, the engineer will be responsible for managing GPU clusters and training models on them, scaling data processing workflows, and optimizing the performance of AI models on cloud infrastructure. They will work hands-on with large-scale datasets and GPUs to build and scale the infrastructure required to support cutting-edge AI applications such as Text-to-3D and Image-to-3D generation. The ideal candidate will have experience managing their own GPU clusters (8 GPUs), scaling workloads, and working with large image datasets in a cloud environment.
Responsibilities
- GPU Cluster Management : Lead the training and inference processes for image-based AI models on GPU clusters. Manage and scale a cluster of 8 GPUs, ensuring efficient operation and optimal performance. This includes setup, monitoring, and troubleshooting of GPU resources.
- Data Processing & Scaling : Work directly with large-scale data processing workflows. Ensure data is processed, cleaned, and ready for training. Scale data pipelines to support high throughput in cloud environments such as AWS or Azure.
- Model Tuning & Training : Work with teams to fine-tune AI models on large image datasets. Train models from scratch or fine-tune pre-trained models for specific use cases, ensuring high performance and scalability. Fine-tuning models on multi-GPU setups will be a critical part of the role.
- Cloud Infrastructure : Utilize cloud platforms like AWS or Azure to manage and scale GPU clusters. Optimize cloud resources for large-scale training jobs and ensure infrastructure supports the growing demands of their AI models.
- Collaboration & Innovation : Collaborate closely with AI and ML teams to deploy new algorithms, experiment with distributed training, and enhance infrastructure. Play a key role in scaling their GenAI products and ensuring systems can handle millions of AI operations per month.
Required Skills
Salary : $200,000 - $250,000