Data Engineer
Who are you?
You are a seasoned Data Engineer with a deep understanding of data modeling, massively parallel processing (both real-time and batch), and bringing machine learning capabilities into large-scale production systems. You have experience at a cutting-edge startup and are passionate about building the data infrastructure that fuels the world’s first intelligent agent. You are a team player with excellent collaboration and communication skills and a “can-do” approach.
What will you be doing?
- Build, maintain, and scale data pipelines for both batch and real-time data processing across multiple sources and ecosystems.
- Design and implement robust APIs and integrate diverse data systems to support data collection and aggregation.
- Develop and manage advanced data architectures, including lakehouses, streamhouses, and data warehouses.
- Collaborate with data scientists and other stakeholders to implement effective data solutions and integrate large language models (LLMs) into our systems.
- Work with cross-functional teams to define business needs and translate them into technical implementations that leverage your deep understanding of data architectures and software engineering best practices.
- Develop and lead initiatives to manage, monitor, and debug data systems, enhancing their reliability, efficiency, and overall quality.
What should you have?
- 8+ years of experience in designing and managing sophisticated lakehouse and data warehouse architectures, ensuring scalable, efficient, and reliable data storage solutions.
- 8+ years of experience building and maintaining ETLs using Apache Spark.
- 5+ years of experience working with streaming technologies (e.g., Apache Kafka, Pub/Sub) and implementing real-time data pipelines using stream processing technologies (e.g., Spark Streaming, Cloud Functions).
- 8+ years of experience with SQL and distributed query engines such as Presto and Trino, with a strong focus on analyzing and optimizing query plans to develop efficient and complex queries.
- 5+ years of experience developing APIs using Python, with proficiency in asynchronous programming and task queues.
- Proven expertise in deploying and managing Spark applications on enterprise-grade platforms such as Amazon EMR, Kubernetes (K8s), and Google Cloud Dataproc.
- Solid understanding of distributed systems and experience with open table formats such as Apache Paimon and Apache Iceberg.
- 5+ years of experience developing infrastructure that brings machine learning capabilities to production, using solutions such as Kubeflow, SageMaker, and Vertex AI.
- 8+ years of experience writing production-grade Python code and working with both relational and non-relational databases.
- Solid understanding of software engineering concepts, design patterns, and best practices, with the ability to architect solutions and integrate different system components.
- Proven experience working with unstructured data, complex data sets, and data modeling.
- Advantage – Demonstrated experience orchestrating containerized applications in AWS and GCP using EKS and GKE.
- Advantage – Proficiency in Scala and Java.