What are the responsibilities and job description for the Data Engineer with Data Hub position at Avanciers?
Role :: Data Engineer (Data Hub)
Location :: Austin, TX (Onsite)
Full-time
Client Note - We are looking for a candidate who has worked with the "DataHub" tool, which is used for data lineage. Resumes that mention "data hub" only once, in the generic data-warehouse sense, are not relevant.
Key skills: DataHub customization, Java experience.
Development on DataHub using Java; strong data cataloguing experience.
Job Description:
Directed projects involving data cataloging with the DataHub open-source framework, anomaly detection using machine learning models, and Spark-based frameworks.
- Ingested metadata for assets from the data lake and from upstream and downstream systems.
- Developed custom API solutions that push ETL pipeline metadata to DataHub. This enriched impact analysis by identifying the pipelines that read from or write to a given data asset.
- Provided a holistic picture of end-to-end lineage that helped with PII identification, governance, and impact analysis.
- Improved the performance of Spark-based applications while preserving functionality.
- Provided recommendations on the design and development of ETL pipelines using Spark. Developed and maintained a Spark-based custom client framework providing a config-as-code mechanism for data enrichment and transfer.
- Supported Spark version upgrades and executed AWS cost-optimization initiatives for platform-wide efficiency.
- Worked with ML engineers to create features from profiled batch data and identify anomalies in data patterns.
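The impact analysis described above amounts to a reachability query over a lineage graph. As a minimal illustration of the idea (not DataHub's actual API), the sketch below uses hypothetical asset names and a plain breadth-first search to find everything downstream of a changed asset:

```python
from collections import defaultdict, deque

def build_lineage(edges):
    """Build an adjacency map from (upstream, downstream) pairs."""
    graph = defaultdict(set)
    for upstream, downstream in edges:
        graph[upstream].add(downstream)
    return graph

def downstream_impact(graph, asset):
    """Return every asset/pipeline reachable downstream of `asset` (BFS)."""
    seen = set()
    queue = deque([asset])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical lineage edges: pipelines reading/writing datasets.
edges = [
    ("raw.orders", "etl.clean_orders"),
    ("etl.clean_orders", "curated.orders"),
    ("curated.orders", "report.daily_sales"),
    ("raw.customers", "etl.clean_customers"),
]
print(sorted(downstream_impact(build_lineage(edges), "raw.orders")))
```

In DataHub itself, this kind of query is answered by the lineage graph built from ingested and pushed metadata; the sketch only shows why a complete edge set makes impact analysis a simple traversal.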
The ideal candidate would be:
An experienced Data Engineer with a strong background in data lineage, data cataloging, and custom tool development using DataHub and Java. Expertise in utilizing the DataHub open-source framework for data cataloging, metadata ingestion, and end-to-end lineage visualization. Proficient in the development of custom APIs to integrate ETL pipelines with DataHub, enriching impact analysis and enabling seamless identification of data flow across systems.
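The config-as-code mechanism mentioned in the description can be pictured as a declarative spec that drives transformations. The real framework is Spark-based; the plain-Python sketch below, with hypothetical operation names (`rename`, `upper`), only illustrates the pattern of configuration driving the data flow:

```python
import json

# Hypothetical config-as-code spec: each step declares a transform
# to apply to incoming records, with no transform logic in user code.
CONFIG = json.loads("""
{
  "steps": [
    {"op": "rename", "from": "cust_id", "to": "customer_id"},
    {"op": "upper",  "field": "country"}
  ]
}
""")

def apply_steps(record, steps):
    """Apply the configured transforms to one record (a plain dict)."""
    out = dict(record)
    for step in steps:
        if step["op"] == "rename":
            out[step["to"]] = out.pop(step["from"])
        elif step["op"] == "upper":
            out[step["field"]] = out[step["field"]].upper()
    return out

row = {"cust_id": 42, "country": "us"}
print(apply_steps(row, CONFIG["steps"]))
# {'country': 'US', 'customer_id': 42}
```

In a Spark version of this pattern, each configured step would map to a DataFrame operation, so enrichment and transfer logic lives in reviewable config rather than bespoke code.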
Core Skills & Expertise:
- In-depth knowledge of DataHub for data lineage, metadata management, and anomaly detection.
- Java development expertise for creating custom API solutions and enhancing DataHub functionality.
- Hands-on experience with Spark for data processing, performance optimization, and framework development.
- Strong background in ETL pipeline development and optimization, particularly with Spark and custom config-as-code mechanisms.
- Proficient in working with AWS for platform optimization and cost reduction.
- Experience working alongside ML engineers to profile and analyze batch data, creating features and detecting anomalies in data patterns.
- Ability to visualize and maintain a holistic picture of end-to-end data lineage, facilitating PII identification, governance, and impact analysis.
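The anomaly-detection skill listed above can be as simple as flagging outliers in profiled batch metrics. As a hedged sketch (hypothetical row counts, a basic z-score rule rather than any specific ML model used on the project):

```python
from statistics import mean, pstdev

def zscore_anomalies(values, threshold=2.5):
    """Return indices of values whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical daily row counts from batch profiling; the spike stands out.
row_counts = [100, 102, 98, 101, 99, 100, 500, 103, 97, 100]
print(zscore_anomalies(row_counts))  # [6]
```

In practice, ML engineers would replace the z-score rule with learned models over many such profiled features, but the input shape (per-batch summary statistics) is the same.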