What are the responsibilities and job description for the Data Engineer position at O3 Technology Solutions?
Job Details
Job Description:
We are seeking a skilled Data Engineer with expertise in building and optimizing the data pipelines and infrastructure that support ML and AI applications. The ideal candidate will have strong programming skills in Python and Scala, with hands-on experience in Apache Spark and other big data processing frameworks.
Key Responsibilities:
Data Pipeline Development
- Design, build, and maintain scalable ETL/ELT data pipelines for structured and unstructured data.
- Develop real-time and batch data processing pipelines using Apache Spark (PySpark, Scala); a minimal batch example follows this list.
- Optimize data workflows for performance, reliability, and cost efficiency.
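For illustration, a batch ETL pipeline of the kind this role builds might look like the following PySpark sketch; the paths, columns, and aggregation are hypothetical placeholders, not an actual pipeline at O3 Technology Solutions.

```python
from pyspark.sql import SparkSession, functions as F

# A minimal batch ETL sketch in PySpark. All paths, column names, and the
# output location are hypothetical placeholders.
spark = SparkSession.builder.appName("orders_daily_etl").getOrCreate()

# Extract: read raw, semi-structured JSON events from a data lake path.
raw = spark.read.json("s3://example-lake/raw/orders/2024-01-01/")

# Transform: clean, filter, and aggregate into a reporting-friendly shape.
daily_revenue = (
    raw.filter(F.col("status") == "completed")
       .withColumn("order_date", F.to_date("created_at"))
       .groupBy("order_date", "region")
       .agg(F.sum("amount").alias("revenue"),
            F.count("*").alias("orders"))
)

# Load: write partitioned Parquet to the curated zone of the lake.
daily_revenue.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://example-lake/curated/daily_revenue/"
)
```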
Big Data & Cloud Engineering
- Work with distributed data processing frameworks such as Apache Spark, Hadoop, or Kafka.
- Implement data lake, data warehouse, and data mart architectures.
- Leverage cloud-based data solutions (AWS, Azure, or Google Cloud Platform) for storage, transformation, and analytics.
ML & AI Infrastructure Support
- Design data pipelines for ML model training, evaluation, and deployment.
- Support feature engineering, data validation, and model inference processes; a feature-preparation sketch follows this list.
- Collaborate with Data Scientists and ML Engineers to ensure high-quality data availability for AI models.
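As a sketch of the feature-preparation work above, the snippet below aggregates per-user features and applies lightweight validation before handoff to Data Scientists; the table names and feature definitions are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F

# A sketch of preparing features for ML model training. Table names and
# feature definitions are illustrative assumptions, not a prescribed design.
spark = SparkSession.builder.appName("feature_prep").getOrCreate()

events = spark.read.parquet("s3://example-lake/curated/user_events/")

# Feature engineering: aggregate per-user behavioural features.
features = (
    events.groupBy("user_id")
          .agg(
              F.count("*").alias("event_count"),
              F.avg("session_seconds").alias("avg_session_seconds"),
              F.max("event_ts").alias("last_seen_ts"),
          )
)

# Lightweight data validation before handoff: fail fast if the table is
# empty or the join key is ever null.
assert features.count() > 0, "feature table is empty"
assert features.filter(F.col("user_id").isNull()).count() == 0, \
    "null user_id found"

features.write.mode("overwrite").parquet(
    "s3://example-lake/features/user_features/"
)
```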
Database & Storage Optimization
- Work with SQL and NoSQL databases (e.g., PostgreSQL, Redshift, Snowflake, BigQuery, Cassandra, MongoDB).
- Optimize database performance, indexing, and query execution for large datasets; an indexing example follows this list.
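One routine optimization task might look like the following Python sketch, assuming PostgreSQL and the psycopg2 driver; the table, columns, and connection string are hypothetical.

```python
import psycopg2

# Sketch: inspect the plan for a slow filter-and-sort query, then add a
# composite index covering both the filter and the sort. The DSN, table,
# and columns are placeholders.
conn = psycopg2.connect("dbname=analytics user=etl")  # placeholder DSN
with conn, conn.cursor() as cur:
    # Inspect the current plan for a frequent query on a large table.
    cur.execute(
        "EXPLAIN ANALYZE SELECT * FROM orders "
        "WHERE customer_id = %s ORDER BY created_at DESC LIMIT 50;",
        (42,),
    )
    for (line,) in cur.fetchall():
        print(line)

    # A composite index lets the planner satisfy the filter and the sort
    # in a single index scan instead of a full sort.
    cur.execute(
        "CREATE INDEX IF NOT EXISTS idx_orders_customer_created "
        "ON orders (customer_id, created_at DESC);"
    )
conn.close()
```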
Security & Compliance
- Implement data security best practices, including encryption, access controls, and auditing; a PII-masking sketch follows this list.
- Ensure compliance with data privacy and security standards such as GDPR, CCPA, and PCI DSS.
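A common compliance-driven task is pseudonymising direct identifiers before data reaches a broadly accessible zone; the PySpark sketch below shows one way this could be done, with a placeholder salt and illustrative column names.

```python
from pyspark.sql import SparkSession, functions as F

# Sketch: pseudonymise direct identifiers before data lands in a broadly
# accessible zone. Paths, column names, and the salt are placeholders.
spark = SparkSession.builder.appName("pii_masking").getOrCreate()

customers = spark.read.parquet("s3://example-lake/raw/customers/")

masked = (
    customers
    # Replace the raw email with a salted hash so records stay joinable
    # without exposing the identifier itself.
    .withColumn(
        "email_hash",
        F.sha2(F.concat(F.lit("example-salt:"), F.col("email")), 256),
    )
    .drop("email", "phone_number")
)

masked.write.mode("overwrite").parquet("s3://example-lake/masked/customers/")
```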
Required Skills & Experience:
Programming Languages:
- Proficient in Python & Scala (Scala preferred).
- Experience with PySpark & Apache Spark for distributed data processing.
Big Data & Cloud Technologies:
- Apache Spark (PySpark, Scala), Hadoop, Hive, Kafka.
- Experience with Cloud Data Platforms such as AWS (Glue, EMR, Redshift), Azure (Databricks, Synapse), or Google Cloud Platform (BigQuery, Dataflow).
- Working knowledge of containerization (Docker, Kubernetes).
Data Engineering & Pipeline Development:
- Strong experience in ETL/ELT development using Spark, Airflow, or Dataflow.
- Familiarity with orchestration tools (Apache Airflow, Prefect, Dagster, or AWS Step Functions); a minimal Airflow DAG sketch follows this list.
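As an orchestration example, a minimal daily Airflow DAG (assuming Airflow 2.4+ for the `schedule` argument) might wire an extract step ahead of a Spark transform; the DAG id, schedule, and commands are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal Airflow orchestration sketch: a daily DAG that runs an extract
# step before a Spark transform. The DAG id, schedule, and commands are
# placeholders for illustration only.
with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract_raw_orders",
        bash_command="python extract_orders.py",  # hypothetical script
    )
    transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit transform_orders.py",  # hypothetical script
    )
    # Run the transform only after extraction succeeds.
    extract >> transform
```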
Database & Query Optimization:
- Proficiency in SQL (PostgreSQL, Snowflake, Redshift, BigQuery, or MySQL).
- Experience with NoSQL databases (MongoDB, Cassandra, DynamoDB, or HBase).
ML & AI Data Infrastructure:
- Understanding of data preprocessing for ML models.
- Experience working with ML pipelines, feature stores, and model serving frameworks.