What are the responsibilities and job description for the Data Engineer position at Definitive Healthcare, US?
We are looking for a talented Data Engineer to join our team and help us build and maintain robust data infrastructure and pipelines. If you are passionate about data and have a strong background in Python, Spark, AWS, SQL, SSIS, and related technologies, we want to hear from you!
Responsibilities:
- Design and Develop Data Pipelines:
  - Build and maintain scalable data pipelines using Python, Spark, Databricks, SQL, and SSIS.
  - Implement data workflows and ETL processes using Apache Airflow and SSIS.
- Data Integration and Management:
  - Integrate data from various sources (AWS, on-premises) into a unified data warehouse.
  - Handle a variety of data formats, such as CSV, text, XML, Parquet, and Delta.
  - Ensure data quality and integrity through effective data cleansing and curation practices.
  - Manage and optimize data storage solutions, ensuring high availability and performance.
  - Automate observability of data and workloads.
- Metadata Management and Governance:
  - Implement and manage Unity Catalog for metadata management.
  - Ensure data governance policies are followed, including data security, privacy, and compliance.
  - Develop and maintain data documentation and data dictionaries.
  - Automate data observability across pipelines.
- Performance Tuning and Troubleshooting:
  - Optimize Spark jobs for performance and efficiency.
  - Investigate and resolve performance bottlenecks in Spark applications.
  - Utilize JVM tuning techniques to improve application performance.
- Data Maturity Lifecycle:
  - Implement and manage the Medallion architecture for the data maturity lifecycle.
  - Ensure data is appropriately processed and categorized at each stage (bronze, silver, gold) to maximize its usability and value.
- Collaboration and Continuous Improvement:
  - Work closely with data scientists, analysts, and other stakeholders to understand data needs and deliver solutions.
  - Implement CI/CD pipelines to automate deployment and testing of data infrastructure.
  - Stay up to date with the latest industry trends and technologies to continuously improve data engineering practices.
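The bronze/silver/gold flow described in the responsibilities above can be sketched in miniature. This is a hedged, toy illustration: in production these layers would be Delta tables processed with Spark on Databricks, but plain Python stands in here so the sketch is self-contained, and all field names and cleansing rules are invented for the example.

```python
# Toy Medallion (bronze -> silver -> gold) flow. Field names and rules are
# illustrative assumptions, not Definitive Healthcare's actual schema.

# Bronze: raw records as ingested, warts and all.
bronze = [
    {"patient_id": "P1", "visits": "3"},
    {"patient_id": "P2", "visits": "oops"},  # bad value to be cleansed out
    {"patient_id": "P1", "visits": "2"},
]

def to_silver(rows):
    """Cleanse and type the raw rows, dropping records that fail validation."""
    silver = []
    for row in rows:
        try:
            silver.append({"patient_id": row["patient_id"],
                           "visits": int(row["visits"])})
        except ValueError:
            continue  # curation: discard (or quarantine) unparseable rows
    return silver

def to_gold(rows):
    """Aggregate the cleansed rows into an analytics-ready summary."""
    totals = {}
    for row in rows:
        totals[row["patient_id"]] = totals.get(row["patient_id"], 0) + row["visits"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # -> {'P1': 5}
```

The point of the pattern is that each layer adds value: bronze preserves the raw feed for reprocessing, silver enforces quality and types, and gold serves aggregated, consumer-ready data.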
Required Skills and Qualifications:
- Technical Skills:
  - Hands-on SQL and Python or Scala programming experience.
  - Strong experience with SSIS, Apache Spark, and Databricks.
  - Hands-on experience with Apache Airflow or similar workflow orchestration tools.
  - Knowledge of data cleansing and curation techniques.
  - Familiarity with Unity Catalog or other metadata management tools.
  - Understanding of data governance principles and best practices.
  - Experience with cloud platforms (AWS).
  - Proficiency in CI/CD tools and practices (e.g., Jenkins, GitLab CI).
  - Experience with JVM tuning and Spark job performance investigation.
  - Experience with the Medallion architecture for the data maturity lifecycle.
- Soft Skills:
  - Excellent problem-solving and analytical skills.
  - Strong communication and collaboration skills.
  - Ability to work independently and as part of a team.
  - Detail-oriented with a focus on delivering high-quality work.
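As a concrete illustration of the JVM tuning and Spark performance work called for above, the sketch below shows the kind of executor JVM options a data engineer might adjust. It is a hedged configuration fragment, not a recommended production setting: the memory sizes and garbage-collector thresholds are placeholder assumptions that depend entirely on the workload.

```shell
# Illustrative spark-submit invocation; memory sizes and GC thresholds are
# placeholder assumptions, not tuned recommendations.
spark-submit \
  --conf spark.executor.memory=8g \
  --conf spark.executor.extraJavaOptions="-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35" \
  my_pipeline.py
```

Investigating a bottleneck typically means reading the Spark UI's stage and GC metrics first, then iterating on settings like these.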
Preferred Qualifications:
- Certification in cloud platforms (e.g., AWS Certified Data Analytics).
- Familiarity with SQL and NoSQL databases.
- Experience in a similar role within a fast-paced, data-driven environment.