What are the responsibilities and job description for the Senior Big Data Engineer position at Definitive Healthcare, US?
Responsibilities:
- Design and Develop Data Pipelines:
- Build and maintain scalable data pipelines using Python, Spark, and Databricks.
- Implement data workflows and ETL processes using Apache Airflow (see the DAG sketch after this list).
- Data Integration and Management:
- Integrate data from various sources (AWS, GCP, on-premises) into a unified data warehouse.
- Handle a variety of data formats such as CSV, text, XML, Parquet, and Delta.
- Ensure data quality and integrity through effective data cleansing and curation practices.
- Manage and optimize data storage solutions, ensuring high availability and performance.
- Automate observability of data and workloads.
- Metadata Management and Governance:
- Implement and manage Unity Catalog for metadata management.
- Ensure data governance policies are followed, including data security, privacy, and compliance.
- Develop and maintain data documentation and data dictionaries.
- Automate data observability across pipelines.
- Performance Tuning and Troubleshooting:
- Optimize Spark jobs for performance and efficiency.
- Investigate and resolve performance bottlenecks in Spark applications.
- Utilize JVM tuning techniques to improve application performance.
- Data Maturity Lifecycle:
- Implement and manage the Medallion architecture for the data maturity lifecycle (see the pipeline sketch after this list).
- Ensure data is appropriately processed and categorized at different stages (bronze, silver, gold) to maximize its usability and value.
- Collaboration and Continuous Improvement:
- Work closely with data scientists, analysts, and other stakeholders to understand data needs and deliver solutions.
- Implement CI/CD pipelines to automate deployment and testing of data infrastructure.
- Stay up to date with the latest industry trends and technologies to continuously improve data engineering practices.
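As a rough illustration of the pipeline work described above, the sketch below shows a minimal bronze/silver/gold (Medallion) flow in PySpark with Delta tables. The paths, database/table names, columns, and cleansing rules are hypothetical placeholders, not Definitive Healthcare's actual pipelines.

```python
# Minimal Medallion-style (bronze/silver/gold) flow in PySpark with Delta.
# Paths, database/table names, and cleansing rules are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-demo").getOrCreate()

# Bronze: land raw CSV files as-is in a Delta table
# (assumes the bronze/silver/gold schemas already exist).
raw = spark.read.option("header", "true").csv("/mnt/landing/orders/*.csv")
raw.write.format("delta").mode("append").saveAsTable("bronze.orders_raw")

# Silver: cleanse, type, and deduplicate the raw records.
silver = (spark.table("bronze.orders_raw")
          .dropDuplicates(["order_id"])
          .filter(F.col("order_id").isNotNull())
          .withColumn("amount", F.col("amount").cast("double"))
          .withColumn("order_ts", F.to_timestamp("order_ts")))
silver.write.format("delta").mode("overwrite").saveAsTable("silver.orders")

# Gold: aggregate into a consumption-ready table for analysts.
gold = (spark.table("silver.orders")
        .groupBy("customer_id")
        .agg(F.count("*").alias("order_count"),
             F.sum("amount").alias("total_amount")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold.customer_orders")
```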
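Similarly, orchestration with Apache Airflow could look roughly like the DAG sketch below. The task bodies, IDs, and schedule are assumptions for illustration; in practice each task would typically trigger a Databricks job rather than run locally.

```python
# Minimal Airflow DAG sketch (Airflow 2.4+ style); task bodies, IDs, and
# schedule are placeholders, not the actual production workflow.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_bronze(**context):
    ...  # e.g., trigger a Databricks job that lands raw files into bronze tables

def refine_silver(**context):
    ...  # cleanse and deduplicate bronze data into silver tables

def publish_gold(**context):
    ...  # build aggregated gold tables for downstream analysts

with DAG(
    dag_id="medallion_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    bronze = PythonOperator(task_id="ingest_bronze", python_callable=ingest_bronze)
    silver = PythonOperator(task_id="refine_silver", python_callable=refine_silver)
    gold = PythonOperator(task_id="publish_gold", python_callable=publish_gold)

    bronze >> silver >> gold
```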
Required Skills and Qualifications:
- Technical Skills:
- Hands-on Python or Scala programming.
- Strong experience with Apache Spark and Databricks.
- Hands-on experience with Apache Airflow or similar workflow orchestration tools.
- Data modeling and processing fundamentals for large-scale data volumes.
- Knowledge of data cleansing and curation techniques.
- Familiarity with Unity Catalog or other metadata management tools.
- Understanding of data governance principles and best practices.
- Experience with cloud platforms (AWS and GCP).
- Strong understanding of normalization and denormalization.
- Proficiency in CI/CD tools and practices (e.g., Jenkins, GitLab CI, etc.).
- Experience with JVM tuning and Spark job performance investigation (see the tuning sketch after this list).
- Experience with the Medallion architecture for the data maturity lifecycle.
- Familiarity with containerization technologies.
- Soft Skills:
- Excellent problem-solving and analytical skills.
- Strong communication and collaboration skills.
- Ability to work independently and as part of a team.
- Detail-oriented with a focus on delivering high-quality work.
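For the Spark performance and JVM tuning expectations listed above, the sketch below shows the kind of configuration knobs involved. The specific values and table/column names are illustrative assumptions, not recommended settings.

```python
# Illustrative Spark tuning knobs; the values are starting-point placeholders,
# not recommendations, and the table/column names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuned-job")
         # Shuffle parallelism and adaptive query execution.
         .config("spark.sql.shuffle.partitions", "400")
         .config("spark.sql.adaptive.enabled", "true")
         # Executor sizing.
         .config("spark.executor.memory", "8g")
         .config("spark.executor.cores", "4")
         # JVM-level tuning passed to executors (e.g., garbage collector choice).
         .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
         .getOrCreate())

# Repartition on the aggregation key to reduce shuffle skew, and cache only
# when the DataFrame is reused downstream.
df = spark.table("silver.orders").repartition(200, "customer_id").cache()
```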
Preferred Qualifications:
- Certification in cloud platforms (AWS Certified Data Analytics, Google Cloud Professional Data Engineer, etc.).
- Familiarity with SQL and NoSQL databases.
- Experience in a similar role within a fast-paced, data-driven environment.