What are the responsibilities and job description for the Big Data Engineer position at Eateam?
Job Overview :
We're seeking a highly skilled Data Engineer / Big Data Engineer to build scalable data pipelines, develop ML models, and integrate big data systems. You'll work with structured, semi-structured, and unstructured data, focusing on optimizing data systems, building ETL pipelines, and deploying AI models in cloud environments.
Key Responsibilities :
Data Ingestion : Build scalable ETL pipelines using Apache Spark, Talend, AWS Glue, Google Dataflow, or Apache NiFi. Ingest data from APIs, file systems, and databases.
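To illustrate the extract/transform/load shape of such a pipeline, here is a minimal standard-library sketch; in practice the same three stages would run as a Spark or Glue job against real sources, and the CSV input, field names, and SQLite sink here are purely hypothetical stand-ins.

```python
import csv
import io
import sqlite3

def extract(raw_csv: str) -> list[dict]:
    """Extract: parse rows from a CSV feed (stand-in for an API or file source)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows: list[dict]) -> list[dict]:
    """Transform: coerce types and drop rows missing required fields."""
    out = []
    for row in rows:
        if not row.get("user_id"):
            continue  # skip incomplete records
        out.append({"user_id": int(row["user_id"]),
                    "amount": round(float(row["amount"]), 2)})
    return out

def load(rows: list[dict], conn: sqlite3.Connection) -> None:
    """Load: write cleaned rows into a warehouse table (SQLite as a stand-in)."""
    conn.execute("CREATE TABLE IF NOT EXISTS events (user_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO events VALUES (:user_id, :amount)", rows)

raw = "user_id,amount\n1,19.994\n,5.00\n2,7.5\n"
conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
print(conn.execute("SELECT COUNT(*), SUM(amount) FROM events").fetchone())
# → (2, 27.49): the row with no user_id is dropped, amounts are rounded
```

The same extract/transform/load separation carries over directly when each stage is swapped for a distributed equivalent.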
Data Transformation & Validation : Use Pandas, Apache Beam, and Dask for data cleaning, transformation, and validation. Automate data quality checks with pytest, unittest.
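A small Pandas sketch of this cleaning-plus-checks pattern (assuming pandas is installed; the column names and sample data are hypothetical, and the assertion-style checks are the kind a pytest or unittest suite would automate):

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate orders, coerce amounts to numeric, discard unparseable rows."""
    out = df.drop_duplicates(subset="order_id")
    out = out.assign(amount=pd.to_numeric(out["amount"], errors="coerce"))
    return out.dropna(subset=["amount"]).reset_index(drop=True)

def check_quality(df: pd.DataFrame) -> None:
    """Data quality checks, written as assertions a test runner would execute."""
    assert df["order_id"].is_unique, "duplicate order ids"
    assert (df["amount"] >= 0).all(), "negative amounts"
    assert not df["amount"].isna().any(), "missing amounts"

raw = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "amount": ["10.0", "10.0", "bad", "4.5"],
})
cleaned = clean(raw)
check_quality(cleaned)
print(cleaned.to_dict("records"))
# → [{'order_id': 1, 'amount': 10.0}, {'order_id': 3, 'amount': 4.5}]
```

Keeping the checks separate from the cleaning step means they can run both in CI and as a gate inside the production pipeline.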
Big Data Systems : Process large datasets with Hadoop, Kafka, Apache Flink, Apache Hive. Stream real-time data using Kafka, Google Cloud Pub/Sub.
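The core operation behind real-time stream processing is windowed aggregation. As a conceptual sketch only (a real deployment would run this inside Flink or Kafka Streams against a live topic, not an in-memory list), here is a tumbling-window count over a simulated event stream:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_secs=60):
    """Count events per key within fixed, non-overlapping (tumbling) windows --
    the aggregation a Flink or Kafka Streams job would perform continuously."""
    windows = defaultdict(int)
    for ts, key in events:
        bucket = ts - (ts % window_secs)  # start timestamp of the window
        windows[(bucket, key)] += 1
    return dict(windows)

# Simulated stream of (unix_timestamp, event_type) pairs.
stream = [(0, "click"), (30, "click"), (61, "view"), (65, "click"), (130, "view")]
print(tumbling_window_counts(stream))
# → {(0, 'click'): 2, (60, 'view'): 1, (60, 'click'): 1, (120, 'view'): 1}
```

Real engines add the hard parts this sketch omits: late/out-of-order events, watermarks, and fault-tolerant state.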
Task Queues : Manage asynchronous processing with Celery, RQ, RabbitMQ, or Kafka. Implement retry mechanisms and track task status.
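To make the retry-and-status-tracking requirement concrete, here is a minimal pure-Python sketch of the pattern; Celery provides this out of the box (automatic retries with backoff, a result backend for task state), and the task id, statuses, and flaky task below are illustrative only:

```python
import time

STATUS = {}  # task_id -> "PENDING" | "RETRY" | "SUCCESS" | "FAILURE"

def run_with_retry(task_id, fn, max_retries=3, backoff=0.0):
    """Run fn, retrying on failure with exponential backoff, recording task
    status the way a Celery result backend would."""
    STATUS[task_id] = "PENDING"
    for attempt in range(max_retries + 1):
        try:
            result = fn()
            STATUS[task_id] = "SUCCESS"
            return result
        except Exception:
            if attempt == max_retries:
                STATUS[task_id] = "FAILURE"
                raise
            STATUS[task_id] = "RETRY"
            time.sleep(backoff * (2 ** attempt))  # exponential backoff between attempts

calls = {"n": 0}
def flaky():
    """Simulated task that fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(run_with_retry("task-1", flaky), STATUS["task-1"])
# → ok SUCCESS (after two retries)
```

In a broker-backed setup the same loop is distributed: the queue (RabbitMQ, Kafka) redelivers the message and workers update the shared status store.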
Scalability : Optimize for performance with distributed processing (Spark, Flink), parallelization (joblib), and data partitioning.
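Data partitioning and parallel processing go together: records are hash-partitioned by key so each partition can be processed independently, which is what Spark does before a shuffle. A small sketch with hypothetical record fields, using a thread pool as a stand-in for a worker cluster:

```python
from concurrent.futures import ThreadPoolExecutor

def partition(records, key, num_partitions):
    """Hash-partition records by key so each partition is independent work."""
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        parts[hash(rec[key]) % num_partitions].append(rec)
    return parts

def summarize(part):
    """Per-partition aggregation (the parallelizable unit of work)."""
    return sum(r["amount"] for r in part)

records = [{"user": f"u{i % 4}", "amount": i} for i in range(100)]
parts = partition(records, "user", 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    totals = list(pool.map(summarize, parts))
print(sum(totals))  # partition-then-aggregate preserves the global total
```

Because the aggregation is associative, the per-partition results combine into the same answer regardless of how records were distributed, which is the property that makes distributed processing correct.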
Cloud & Storage : Work with AWS, Azure, GCP, Databricks. Store and manage data with S3, BigQuery, Redshift, Synapse Analytics, and HDFS.
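One common thread across these stores is Hive-style partitioned layout (`year=/month=/day=` path segments), which lets engines such as Hive, Presto, and Spark prune partitions at query time. A sketch of generating such keys, with the bucket and prefix names purely hypothetical:

```python
from datetime import date

def object_key(prefix: str, event_date: date, filename: str) -> str:
    """Build a Hive-style partitioned object key (year=/month=/day=),
    the layout S3, HDFS, and most SQL-on-files engines can prune on."""
    return (f"{prefix}/year={event_date.year}"
            f"/month={event_date.month:02d}"
            f"/day={event_date.day:02d}/{filename}")

# Bucket "my-data-lake" and prefix "analytics/events" are hypothetical names.
key = object_key("analytics/events", date(2024, 3, 7), "part-0000.parquet")
print(f"s3://my-data-lake/{key}")
# → s3://my-data-lake/analytics/events/year=2024/month=03/day=07/part-0000.parquet
```

Writing data under a consistent partition scheme up front is what later makes date-ranged queries cheap, since the engine never opens files outside the requested partitions.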
Required Skills :
ETL Data Processing : Expertise in Apache Spark, AWS Glue, Google Dataflow, Talend.
Big Data Tools : Proficient with Hadoop, Kafka, Apache Flink, Hive, Presto.
Databases : Strong experience with MySQL, PostgreSQL, MongoDB, Cassandra.
Machine Learning : Hands-on with TensorFlow, PyTorch, Scikit-learn, XGBoost.
Cloud Platforms : Experience with AWS, Azure, GCP, Databricks.
Task Management : Familiar with Celery, RQ, RabbitMQ, Kafka.
Version Control : Git for source code management.
Desirable Skills :
Real-time Data Processing : Experience with Apache Pulsar, Google Cloud Pub/Sub.
Data Warehousing : Familiarity with Redshift, BigQuery, Synapse Analytics.
Scalability Optimization : Knowledge of load balancing (NGINX, HAProxy) and parallel processing.
Data Governance : Use of MLflow, DVC, or other tools for model and data versioning.
Tools & Technologies :
ETL : Apache Spark, Talend, AWS Glue, Google Dataflow.
Big Data : Hadoop, Kafka, Apache Flink, Presto.
Databases : MySQL, PostgreSQL, MongoDB, Cassandra.
Cloud : AWS, GCP, Azure, Databricks.
Storage : S3, BigQuery, Redshift, Synapse Analytics, HDFS.
Version Control : Git.