What are the responsibilities and job description for the Research And Development Engineer position at Emaago?
A top-tier HFT firm is looking for a Senior Research Engineer to build AI systems that can perform previously impossible tasks or achieve unprecedented levels of performance. They want people with solid engineering skills who are comfortable working with large distributed systems and strive to write quality, well-tested code.
The most outstanding deep learning results are increasingly attained at massive scale, and achieving them requires engineers who are comfortable working in large distributed systems. They expect engineering to play a key role in most major future advances in AI.
In This Role, You Will
- Build and own data pipelines operating on internet-scale data spanning the text, image, and audio modalities (see the sketch after this list)
- Collaborate with many teams to incorporate the latest research into pre-training datasets
- Research new methods for improving these datasets alongside researchers
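Purely as an illustration of the kind of work the bullets above describe, and not code from the posting or the team, the minimal Python sketch below shows one single-process pipeline stage: route raw records by modality, apply a toy quality filter, and write the survivors out as a JSONL shard. All names (RawRecord, keep_record, write_shard) and the filter heuristics are invented for this example.

```python
# Hypothetical pipeline stage: filter raw multimodal records and emit a shard.
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Iterator


@dataclass
class RawRecord:
    uid: str
    modality: str  # "text", "image", or "audio"
    payload: str   # text content, or a URI pointing at an image/audio asset
    source: str    # e.g. a crawl snapshot or corpus name


def keep_record(rec: RawRecord) -> bool:
    """Toy quality filter: drop unknown modalities and empty payloads."""
    if rec.modality not in {"text", "image", "audio"}:
        return False
    if rec.modality == "text":
        return len(rec.payload.split()) >= 5  # placeholder length heuristic
    return bool(rec.payload)                  # non-text: require an asset URI


def process(records: Iterable[RawRecord]) -> Iterator[RawRecord]:
    """Single-process stand-in for what would be a distributed map step."""
    for rec in records:
        if keep_record(rec):
            yield rec


def write_shard(records: Iterable[RawRecord], out_path: Path) -> int:
    """Write surviving records as JSON lines; return how many were written."""
    count = 0
    with out_path.open("w") as f:
        for rec in records:
            f.write(json.dumps(rec.__dict__) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    demo = [
        RawRecord("1", "text", "a short but plausible training document", "crawl-2024"),
        RawRecord("2", "image", "s3://bucket/img/000123.jpg", "licensed-images"),
        RawRecord("3", "audio", "", "podcast-dump"),  # dropped: missing asset URI
    ]
    n = write_shard(process(demo), Path("shard-00000.jsonl"))
    print(f"wrote {n} records")
```

At internet scale this step would of course run as a distributed job over many shards rather than a single process; the sketch only shows the shape of the record flow.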
You Might Thrive In This Role If You
- Enjoy working at the cutting edge of large language model research
- Have experience running complicated processing on very large datasets
- Are comfortable working in a fast-paced, dynamic environment - research can evolve quite rapidly!
About The Team
They strongly believe in the importance of data and have seen repeatedly how large an impact a focus on data quality can have across all of their projects. The Data Processing team brings this focus to the flagship models, owning the pipelines that turn raw data into the high-quality, diverse, multimodal datasets used to train the largest models. They work closely with teams focused on data acquisition, data quality, and multimodal data throughout Research.
In addition to building new datasets, they collaborate on data research and acquisition to explore ways to get more out of data, including questions around efficiency, efficacy, and diversity. They also own and continuously improve the infrastructure used across several teams to prepare data for training models small and large.
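As one concrete (and again purely hypothetical) example of a preparation step such infrastructure commonly includes, the sketch below performs exact deduplication by content hash. Production systems typically use distributed and fuzzy variants (e.g. MinHash-based near-duplicate detection), but the overall shape of the step is similar.

```python
# Illustrative only: exact deduplication of text documents by content hash.
import hashlib
from typing import Iterable, Iterator


def dedup_exact(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document the first time its normalized content is seen."""
    seen: set[bytes] = set()
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc


if __name__ == "__main__":
    corpus = ["Hello world.", "hello world.", "Something new."]
    print(list(dedup_exact(corpus)))  # the duplicate second entry is dropped
```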