Data Engineer
Who are you?
You are a seasoned Data Engineer with a deep understanding of data modeling, massively parallel processing (both real-time and batch), and bringing machine learning capabilities into large-scale production systems. You have experience at a cutting-edge startup and are passionate about building the data infrastructure that fuels the world’s first intelligent agent. You are a team player with excellent collaboration and communication skills and a “can-do” approach.
What will you be doing?
- Build, maintain, and scale data pipelines for both batch and real-time data processing across multiple sources and ecosystems.
- Design and implement robust APIs and integrate diverse data systems to support data collection and aggregation.
- Develop and manage advanced data architectures, including lakehouses, streamhouses, and data warehouses.
- Collaborate with data scientists and other stakeholders to implement effective data solutions and integrate large language models (LLMs) into our systems.
- Work with cross-functional teams to define business needs and translate them into technical implementations that leverage your deep understanding of data architectures and software engineering best practices.
- Develop and lead initiatives to manage, monitor, and debug data systems, enhancing their reliability, efficiency, and overall quality.
What should you have?
- 8+ years of experience in designing and managing sophisticated lakehouse and data warehouse architectures, ensuring scalable, efficient, and reliable data storage solutions.
- 8+ years of experience building and maintaining ETLs using Apache Spark.
- 5+ years of experience working with streaming technologies (e.g., Apache Kafka, Pub/Sub) and implementing real-time data pipelines using stream processing technologies (e.g., Spark Streaming, Cloud Functions).
- 8+ years of experience with SQL and distributed query engines such as Presto and Trino, with a strong focus on analyzing and optimizing query plans to develop efficient and complex queries.
- 5+ years of experience developing APIs using Python, with proficiency in asynchronous programming and task queues.
- Proven expertise in deploying and managing Spark applications on enterprise-grade platforms such as Amazon EMR, Kubernetes (K8s), and Google Cloud Dataproc.
- Solid understanding of distributed systems and experience with open table formats such as Apache Paimon and Apache Iceberg.
- 5+ years of experience developing infrastructure that brings machine learning capabilities to production, using solutions such as Kubeflow, SageMaker, and Vertex AI.
- 8+ years of experience writing production-grade Python code and working with both relational and non-relational databases.
- Solid understanding of software engineering concepts, design patterns, and best practices, with the ability to architect solutions and integrate different system components.
- Proven experience working with unstructured data, complex data sets, and data modeling.
- Advantage – Demonstrated experience orchestrating containerized applications in AWS and GCP using EKS and GKE.
- Advantage – Proficiency in Scala and Java.