What are the responsibilities and job description for the Machine Learning Infrastructure Specialist position at GLASS Imaging?
About the Job
Glass Imaging is looking for a highly skilled Machine Learning Infrastructure Specialist to redesign and develop the backbone of our ML training and evaluation ecosystem. In this role, you will have the freedom to shape everything from GPU allocation and data management to experiment tracking and evaluation pipelines.
Your primary responsibilities will include designing and building systems for GPU resource allocation, dataset management, experiment tracking, and evaluation pipelines. You will also improve the automation of our ML training and testing infrastructure and implement automated dataset versioning and validation.
To be successful in this role, you will need strong software engineering skills, experience designing and building infrastructure for ML training workflows, and familiarity with performance profiling and optimization of ML training. You should also have expertise in Python, Linux scripting, and common ML frameworks (e.g., PyTorch, TensorFlow). Experience with GPU management, distributed computing, and training pipeline optimization is highly desirable.