ANKIT ARORA

Dubai, UAE

Summary

🧠 Innovative and results-driven Data Architect & Engineering Leader with 13+ years of experience building scalable, cloud-native data platforms and intelligent analytics systems.
🚀 Proven track record in designing high-throughput event ingestion architectures, real-time and batch pipelines, and AI-enabled data marts that improve business agility and optimize data pipelines, delivering significant time and cost savings.

👨‍💻 Experienced Data Engineer excelling in transforming complex datasets into actionable insights. 🤝 Known for a collaborative approach and 🧩 problem-solving mindset, driving impactful improvements in data infrastructure.
🔄 Adept at translating business goals into robust data solutions using modern technologies such as AWS ☁️, Spark ⚡, Trino 🔍, Kafka 📨, and Kubernetes 🐳.
💸 Recognized for driving efficiency through metadata governance, and machine learning–ready feature stores 🧬.
📐 Known for leading strategic design initiatives, authoring architecture blueprints, and enabling enterprise-wide data-driven decision-making.


Overview

14 years of professional experience
4 years of post-secondary education

Work history

Lead Data Engineer

Careem
Dubai, UAE
06.2022 - 06.2025


🗃️ Careem Data Platform
🔧 Led end-to-end data warehouse delivery initiatives, working closely with cross-functional teams including product managers, analysts, and engineers.
⚙️ Spearheaded the development of a versatile Spark job orchestration framework that submits SQL jobs via JSON, enabling dynamic configuration, optimization, and reuse.
⏱️ Designed and deployed both real-time and batch ingestion mechanisms, improving data freshness, availability, and agility for analytical teams.
🧠 Built an AI-first Data Mart framework that enables users to run natural language queries in real-time through MCP AI server and other LLM-backed tools—translating business questions into dynamic SQL execution.
🤖 Ingested and operationalized inferred data from ChatGPT and similar AI tools, integrating it seamlessly into Careem’s pipeline ecosystem for enhanced decision intelligence.
📏 Pioneered a comprehensive data quality framework leveraging Great Expectations and OpenMetadata—standardizing both internal and external validation layers.
🪄 Orchestrated master data pipelines using Apache Airflow with common DAG templates, facilitating parallelism and dependency-based data mart builds.
☁️ Provisioned and configured scalable AWS EMR clusters via Terraform, optimizing for compute efficiency, autoscaling, and cost tracking.
🧾 Integrated Hive metadata layers into EMR, Trino, and Presto environments—seamlessly connecting cataloged datasets across platforms.
💸 Achieved major cost savings by refactoring SQL logic, reducing EMR runtimes, tagging resources, and enforcing cloud governance rules.
🚀 Championed CI/CD automation with GitHub Actions, enhancing deployment agility and integrity across production pipelines.
🐳 Migrated Spark workloads to Kubernetes, enabling fine-tuned resource isolation and observability.
📊 Built and launched a Superset-based data reporting interface, drastically improving query performance and data exploration for analysts.
📉 Integrated a low-latency Druid metrics layer for real-time dashboards, supporting high-concurrency access patterns and business-critical KPIs.
🧩 Delivered a Customer 360° Data Mart and a hybrid Feature Store (offline — Hive/Trino; online — Redis/Kafka), enabling sub-10 ms feature retrievals for ML models and experiment pipelines.
🪣 Architected a robust S3-backed data lake, consolidating raw system logs and structured data feeds into a unified schema-driven platform.
🧭 Maintained a centralized metadata and lineage management system, streamlining data discoverability and governance for end users.
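A minimal sketch of the JSON-driven, dependency-aware job orchestration pattern described above. Jobs are declared in JSON with optional dependencies and the planner returns a safe execution order; job names and config keys are illustrative, not the actual framework, and the real runner would hand each step's SQL to Spark.

```python
# Hypothetical sketch: JSON job specs with dependency-ordered execution planning.
# In the real framework each planned step's SQL would be passed to spark.sql();
# here we only show the declarative config and the ordering logic.
import json
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def plan_jobs(config_json: str) -> list[str]:
    """Return job names in an order that respects declared dependencies."""
    config = json.loads(config_json)
    # Map each job to the set of jobs it depends on (its predecessors).
    graph = {job["name"]: set(job.get("depends_on", [])) for job in config["jobs"]}
    return list(TopologicalSorter(graph).static_order())

spec = json.dumps({
    "jobs": [
        {"name": "customer_mart", "depends_on": ["raw_events"], "sql": "SELECT 1"},
        {"name": "raw_events", "sql": "SELECT 1"},
        {"name": "kpi_summary", "depends_on": ["customer_mart"], "sql": "SELECT 1"},
    ]
})
order = plan_jobs(spec)  # raw_events precedes customer_mart, which precedes kpi_summary
```

Declaring jobs as data rather than code is what enables the "dynamic configuration, optimization, and reuse" called out above: the same runner serves every mart build.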


🗃️ Event Platform & Feature Store
⚡ Designed and implemented an enterprise-scale Event Ingestion Platform Architecture, optimized for ultra-low latency (sub-second) and high-volume (60K+ events/sec) analytics using Kafka, Spark Streaming, and S3/Kafka sinks.
🧑‍💻 Developed a Self-Service Event Routing UI empowering downstream teams to seamlessly subscribe to real-time topics with zero engineering dependency.
📄 Authored and drove adoption of high-impact architectural blueprints via technical design documents:
 ✅ Mini-App Session Stitching Design Document (Owner): Unified fragmented session journeys across SuperApp verticals (Food, Pay, Ride) with hybrid SQL attribution and platform metadata enforcement.
 ✅ Decoupled Event Ingestion Architecture Blueprint (Contributor): Redefined ingestion architecture to decouple compute and storage, cutting latency by 40% and reducing infra cost by ~20%.
🧠 Positioned Careem’s data platform for AI-readiness by contributing to online & offline feature store architecture—powering ML models and experimentation with millisecond response times.
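The self-service event routing idea above can be sketched as a small fan-out table: downstream teams register subscriptions to event types, and each incoming event is delivered to every subscriber's sink topic. This is a hypothetical in-memory model; the production system would sit on Kafka, and all names here are illustrative.

```python
# Hypothetical sketch of self-service event routing: subscriptions map an
# event type to one or more sink topics, so new consumers onboard with a
# config change rather than an engineering change.
from collections import defaultdict

class EventRouter:
    def __init__(self) -> None:
        self._routes: dict[str, list[str]] = defaultdict(list)

    def subscribe(self, event_type: str, sink_topic: str) -> None:
        """Register a downstream sink topic for an event type."""
        self._routes[event_type].append(sink_topic)

    def route(self, event: dict) -> list[tuple[str, dict]]:
        """Fan an event out to every subscribed sink topic."""
        return [(topic, event) for topic in self._routes.get(event["type"], [])]

router = EventRouter()
router.subscribe("ride.completed", "analytics.rides")
router.subscribe("ride.completed", "ml.features")
deliveries = router.route({"type": "ride.completed", "ride_id": "r-1"})
```

In a Kafka deployment the `route` step would run inside a streaming job, producing each `(topic, event)` pair to the corresponding output topic.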

Senior Data Engineer

S&P Global
03.2019 - 06.2022

HR Analytics

  • Created the HR Repository from scratch for the People team, providing a platform for business stakeholders, the reporting team, and the data science team.
  • Assisted in setting strategic direction for database, infrastructure, and technology through research and development activities.
  • Served as Data Architect to build the HR Repository on the AWS platform.
  • Designed and implemented data pipeline components covering integration, storage, processing, and analysis of business data.
  • Led the development of project outputs such as business cases, solution vision and design, user requirements, solution mockups, prototypes, technical architecture, test cases, and deployment plans.
  • Managed data assets per enterprise standards, guidelines, and policies.
  • Streamlined data flows and models, improving the consistency, quality, accessibility, and security of data.
  • Performed data analysis on business problems such as attrition analysis, a people-movement dashboard, and NLP analysis of surveys using machine learning and Python.
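As a concrete illustration of the attrition analysis mentioned above, the standard attrition metric is leavers divided by average headcount over the period. This is a generic formula sketch with made-up numbers, not S&P Global figures.

```python
# Illustrative attrition-rate calculation: leavers over average headcount.
# All inputs are hypothetical example values.
def attrition_rate(leavers: int, headcount_start: int, headcount_end: int) -> float:
    """Attrition for a period = leavers / average headcount in that period."""
    avg_headcount = (headcount_start + headcount_end) / 2
    return leavers / avg_headcount

rate = attrition_rate(leavers=12, headcount_start=980, headcount_end=1020)  # 0.012
```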

Senior Data Engineer

UnitedHealth Group
08.2016 - 03.2019

Unified Data Warehouse

  • Designed the interfaces between the UDW and various OLTP systems (mainframe systems, Oracle databases, Db2 databases, a Hadoop-based data lake, network-based MQs, and real-time systems).
  • Designed the data warehouse framework.
  • Architected the metadata repository, including control tables, a UNIX local repository, archiving of old information, and traceable fields in each table.
  • After go-live, handled UDW scalability, component optimization, purging, and ongoing improvements.
  • Developed ETL jobs.
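The control-table pattern mentioned above can be sketched as a small table that records the status of each batch load, so restarts and audits can query what has been loaded. Table and column names here are hypothetical; SQLite stands in for the warehouse database.

```python
# Illustrative sketch of an ETL control table: each batch load is recorded
# with a status so downstream jobs and audits can check load state.
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the warehouse database
con.execute("""
    CREATE TABLE etl_control (
        batch_id   INTEGER PRIMARY KEY,
        table_name TEXT NOT NULL,
        status     TEXT NOT NULL,
        loaded_at  TEXT NOT NULL
    )
""")
con.execute(
    "INSERT INTO etl_control (table_name, status, loaded_at) VALUES (?, ?, ?)",
    ("claims_fact", "SUCCESS", "2016-09-01"),
)
status = con.execute(
    "SELECT status FROM etl_control WHERE table_name = ?", ("claims_fact",)
).fetchone()[0]
```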

Senior Data Engineer

UnitedHealth Group
09.2013 - 08.2016

Oxford Data Warehouse

  • Served as onsite coordinator and system-level owner, managing the team in India.
  • Performed root-cause analysis for recurring failures and applied fixes where necessary.
  • Identified areas of improvement, addressed them on a priority basis, and provided solutions with low-level designs.
  • Performed impact analysis on enhancements to ensure data quality.
  • Provided daily production support, including monitoring, fixing failures, and working on service calls.
  • Migrated the project from on-premises to the cloud.
  • Recreated the ETL jobs in the AWS cloud.
  • Performed lift-and-shift of data using schema migration.
  • Tested and validated data after loading to the cloud, matching it against the on-premises database.
  • Performed impact analysis of downstream applications and set up new connections for them.
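The post-migration validation step above typically starts with comparing row counts per table between the on-premises source and the cloud target. A minimal sketch with hypothetical table names and counts:

```python
# Illustrative post-migration check: flag tables whose row counts differ
# between on-prem source and cloud target. Names and counts are made up.
def validate_migration(source_counts: dict[str, int],
                       target_counts: dict[str, int]) -> list[str]:
    """Return the tables whose row counts do not match after migration."""
    return [t for t in source_counts if target_counts.get(t) != source_counts[t]]

mismatches = validate_migration(
    {"members": 100, "claims": 250},
    {"members": 100, "claims": 249},
)  # -> ["claims"]
```

Row counts are only a first pass; column-level checksums or sampled record comparisons usually follow.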

Data Engineer

Tata Consultancy Services
05.2013 - 08.2013

Hyper Growth

  • Analyzed existing DataStage jobs' functionality and redesigned them for better performance.
  • Designed component sequencing to optimize performance.
  • Redesigned database tables to fine-tune extraction and loading.
  • Implemented low-level designs, high-level designs, unit test cases, and system test cases for jobs and modules.

Data Engineer

Tata Consultancy Services
11.2012 - 05.2013

BI Clinical Prism

  • Analyzed technical requirements and designed solutions.
  • Implemented automated sequencing of ETL data loads based on logical dependencies across data entities.
  • Designed system test cases for ETL jobs and sequences.
  • Unit-tested ETL components.
  • Implemented complex logic in PL/SQL procedures reusable across DataStage jobs.
  • Optimized performance of DataStage jobs and Oracle queries.
  • Developed UNIX scripts to execute DataStage jobs and sequences.

Data Engineer

Tata Consultancy Services
01.2012 - 11.2012

MDM Reporting

  • Data warehousing project generating reports from master data.
  • Developed ETL pipelines.
  • Performed unit testing.

Web Developer

Concept
05.2011 - 09.2011

Multiple Web Portal

  • Gathered requirements.
  • Developed the front end and back end of web portals using PHP and MySQL.

Education

B.Tech. - Computer Engineering

Jaipur Engineering College and Research Centre
India
06.2007 - 05.2011

Skills

🗂️ Data Technologies & Platforms
🐘 Hadoop
⚡ Spark
🐍 Python
🧱 Data Modeling
🏗️ Data Architecture
🛡️ Data Governance
✔️ Data Quality
🔁 Online and Offline Feature Store
🔧 ETL Pipeline Design
🦴 Feast (Feature Store)
🤖 Generative AI
🧠 AI-Powered Data Mart
📈 Real-time Analytics

☁️ Cloud Platforms & Infrastructure
🖥️ Amazon EC2
📦 Amazon ECR
🌩️ Amazon EMR
🔄 Amazon ECS
🐳 Docker
📨 Kafka
🐙 Kubernetes
🧬 AWS Glue
⚙️ AWS Lambda
💻 UNIX Shell
🔍 Athena
☁️ Google Cloud
🚀 GCP Compute
🧭 Amazon S3
🏢 Amazon Redshift

🔍 Query Engines & Databases
❄️ Snowflake
🧮 Trino
🚇 Presto
🐬 MySQL
🟠 Oracle
🐝 Hive
📁 HDFS
📊 BigQuery

📉 Monitoring & Workflow Orchestration
📈 Amazon CloudWatch
⏰ Apache Airflow
🗓️ TWS (Tivoli Workload Scheduler)

📊 Visualization & Metadata
📉 Tableau
📘 OpenMetadata
🧪 Great Expectations
🔎 Amundsen

🧰 Data Migration & Analysis
🛠️ AWS DMS
🔬 Exploratory Data Analysis
📐 Hypothesis Testing
📊 Inferential Statistics
🧠 Machine Learning

Accomplishments

  • 🚀 Designed and scaled a high-throughput event platform handling 60K+ events/sec with sub-second latency across real-time analytics pipelines.
  • 🧠 Delivered an AI-first Data Mart, enabling natural language business queries through MCP AI server and ChatGPT-inferred prompt integration.
  • ⚙️ Built a Customer 360° Data Mart and Feature Store (offline via Hive/Trino; online via Redis/Kafka) supporting sub-10 ms feature retrievals for ML models and experiment pipelines.
  • 📄 Authored & implemented core design documents that redefined session stitching and decoupled event ingestion for SuperApp verticals.
  • ⛓ Migrated Spark workloads to Kubernetes, automated deployment via GitHub Actions, and drove major cost savings through optimized EMR/Trino workloads.

Timeline

Lead Data Engineer

Careem
06.2022 - 06.2025

Senior Data Engineer

S&P Global
03.2019 - 06.2022

Senior Data Engineer

UnitedHealth Group
08.2016 - 03.2019

Senior Data Engineer

UnitedHealth Group
09.2013 - 08.2016

Data Engineer

Tata Consultancy Services
05.2013 - 08.2013

Data Engineer

Tata Consultancy Services
11.2012 - 05.2013

Data Engineer

Tata Consultancy Services
01.2012 - 11.2012

Web Developer

Concept
05.2011 - 09.2011

B.Tech. - Computer Engineering

Jaipur Engineering College and Research Centre
06.2007 - 05.2011