What are the responsibilities and job description for the Databricks Engineer/ Databricks SME - W2 ONLY - HYBRID - GA/TX position at Indus River Technologies Inc.?
Job Details
*NOTE: W2 ONLY - HYBRID - Atlanta, GA/ Dallas, TX.
Title: Databricks SME
Location: Atlanta, GA/ Dallas, TX (HYBRID)
Duration: 6-12 month contract
Key Responsibilities:
Databricks Expertise & Development: Lead the design, implementation, and optimization of end-to-end data pipelines using Databricks and Apache Spark for large-scale data processing. Leverage Delta Lake for scalable and reliable data storage, ensuring smooth integration with cloud platforms like Azure or AWS.
Data Profiling & Quality Assurance: Conduct comprehensive data profiling using tools like Great Expectations or Apache Griffin to analyze data quality, detect anomalies, and generate detailed reports on data completeness, consistency, and accuracy. Collaborate with the team to implement data cleansing and quality rules using Databricks notebooks and Python.
Data Modeling & Architecture: Design and implement data models using tools such as dbt (data build tool) and dimensional modeling patterns such as star schema or snowflake schema to support analytical and business intelligence needs. Utilize Azure Synapse Analytics or Amazon Redshift for building data lakes and warehousing solutions, ensuring scalable and efficient data models.
Collaboration & Cross-functional Engagement: Partner with data engineers, data scientists, and business stakeholders to translate business requirements into efficient data solutions. Utilize tools like Jira, Confluence, and Slack for effective communication and project management in an agile environment.
Performance Optimization & Troubleshooting: Monitor and optimize Databricks workflows and Apache Spark jobs for performance and scalability. Use Databricks Runtime for execution, Apache Airflow for orchestration, and Datadog or New Relic for system monitoring and troubleshooting, ensuring high availability and performance of data pipelines and models.