What are the responsibilities and job description for the Research Engineer, Speech Foundation Models position at Tykhe Inc?
We are seeking a highly skilled and experienced Research Lead for Speech, Audio, and Conversational AI to join our innovative team. In this role, you will spearhead the research and development of cutting-edge technologies in speech processing, text-to-speech (TTS), audio analysis, and real-time conversational AI. You will push the boundaries of what's possible in automatic speech recognition (ASR), speaker identification, diarization, speech synthesis, voice cloning, dubbing, and audio generation.
Key Responsibilities:
- Apply the state of the art in speech/audio and large language models to develop advanced Audio Language Models and Speech Language Models.
- Research, architect, and deploy new generative AI methods such as autoregressive, causal, and diffusion models.
- Design and implement low-latency end-to-end models with multilingual speech/audio as both input and output.
- Conduct experiments to evaluate and improve the performance of these models, focusing on accuracy, naturalness, efficiency, and real-time capabilities across multiple languages.
- Stay at the forefront of advancements in speech processing, audio analysis, and large language models, integrating new techniques into our foundation models.
- Collaborate with cross-functional teams to integrate these foundation models into Tykhe's AI stack and products.
- Publish research findings in top-tier conferences and journals such as INTERSPEECH, ICASSP, ICLR, ICML, NeurIPS, and IEEE/ACM Transactions on Audio, Speech, and Language Processing.
- Mentor and guide junior researchers and engineers, fostering a collaborative and innovative team environment.
- Drive the adoption of best practices in model development, including rigorous testing, documentation, and ethical considerations in multilingual AI.
Qualifications:
- Ph.D. in Computer Science, Electrical Engineering, or a related field with a focus on speech processing, audio analysis, and machine learning.
- Experience training speech/audio models for representation (e.g., W2V-BERT, SONAR, AST), generation (e.g., HiFi-GAN, VQ-GAN, AudioLDM), Conformers, and multilingual multitask models (e.g., SeamlessM4T).
- Expertise with audio language models such as AudioPaLM, Moshi, and SeamlessM4T.
- Proven track record of developing and applying novel neural network architectures such as Transformers, Mixture of Experts, diffusion models, and state space models (e.g., Mamba, Samba).
- Extensive experience in developing and optimizing models for low-latency, real-time applications.
- Strong background in multilingual speech recognition, voice cloning, dubbing, and synthesis, with an understanding of the challenges specific to different language families.
- Proficiency in deep learning frameworks (e.g., TensorFlow, PyTorch) and experience deploying large-scale speech and audio models.
- Demonstrated expertise in high-performance computing with proficiency in Python, C/C++, CUDA, and kernel-level programming for AI applications.
- Experience with audio signal processing techniques and their application in end-to-end neural models.