🧠 Innovative and results-driven Data Architect & Engineering Leader with 13+ years of experience building scalable, cloud-native data platforms and intelligent analytics systems.
🚀 Proven track record of designing high-throughput event ingestion architectures, real-time and batch pipelines, and AI-enabled data marts, improving business agility and optimizing data pipelines for significant time and cost savings.
👨💻 Experienced Data Engineer excelling in transforming complex datasets into actionable insights. 🤝 Known for a collaborative approach and 🧩 problem-solving mindset, driving impactful improvements in data infrastructure.
🔄 Adept at translating business goals into robust data solutions using modern technologies such as AWS ☁️, Spark ⚡, Trino 🔍, Kafka 📨, and Kubernetes 🐳.
💸 Recognized for driving efficiency through metadata governance and machine learning-ready feature stores 🧬.
📐 Known for leading strategic design initiatives, authoring architecture blueprints, and enabling enterprise-wide data-driven decision-making.
🗃️ Careem Data Platform
🔧 Led end-to-end data warehouse delivery initiatives, working closely with cross-functional teams including product managers, analysts, and engineers.
⚙️ Spearheaded the development of a versatile Spark job orchestration framework that submits SQL jobs via JSON, enabling dynamic configuration, optimization, and reuse.
⏱️ Designed and deployed both real-time and batch ingestion mechanisms, improving data freshness, availability, and agility for analytical teams.
🧠 Built an AI-first Data Mart framework that lets users run natural language queries in real time through an MCP AI server and other LLM-backed tools, translating business questions into dynamic SQL execution.
🤖 Ingested and operationalized inferred data from ChatGPT and similar AI tools, integrating it seamlessly into Careem’s pipeline ecosystem for enhanced decision intelligence.
📏 Pioneered a comprehensive data quality framework leveraging Great Expectations and OpenMetadata—standardizing both internal and external validation layers.
🪄 Orchestrated master data pipelines using Apache Airflow with common DAG templates, facilitating parallelism and dependency-based data mart builds.
☁️ Provisioned and configured scalable AWS EMR clusters via Terraform, optimizing for compute efficiency, autoscaling, and cost tracking.
🧾 Integrated Hive metadata layers into EMR, Trino, and Presto environments—seamlessly connecting cataloged datasets across platforms.
💸 Achieved major cost savings by refactoring SQL logic, reducing EMR runtimes, tagging resources, and enforcing cloud governance rules.
🚀 Championed CI/CD automation with GitHub Actions, enhancing deployment agility and integrity across production pipelines.
🐳 Migrated Spark workloads to Kubernetes, enabling fine-tuned resource isolation and observability.
📊 Built and launched a Superset-based data reporting interface, drastically improving query performance and data exploration for analysts.
📉 Integrated a low-latency Druid metrics layer for real-time dashboards, supporting high-concurrency access patterns and business-critical KPIs.
🧩 Delivered a Customer 360° Data Mart and a hybrid Feature Store (offline — Hive/Trino; online — Redis/Kafka), enabling sub-10 ms feature retrievals for ML models and experiment pipelines.
🪣 Architected a robust S3-backed data lake, consolidating raw system logs and structured data feeds into a unified schema-driven platform.
🧭 Maintained a centralized metadata and lineage management system, streamlining data discoverability and governance for end users.
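The JSON-driven Spark job orchestration described above can be sketched as follows. This is a minimal illustration, not the actual framework: the spec field names (`name`, `sql`, `params`, `spark_conf`) and the example tables are hypothetical stand-ins for whatever schema the real system uses.

```python
import json
from string import Template

def render_job(spec_json: str) -> dict:
    """Turn a JSON job spec into a runnable Spark SQL job definition.

    One generic Spark application can execute any job defined this way,
    so new pipelines need only a config change, not a code change.
    """
    spec = json.loads(spec_json)
    # Substitute runtime parameters (partition date, target table, ...)
    # into the SQL template.
    sql = Template(spec["sql"]).substitute(spec.get("params", {}))
    return {
        "name": spec["name"],
        "sql": sql,
        # Tuning knobs (executor sizing, shuffle partitions) ride along
        # with the job instead of being hard-coded per pipeline.
        "spark_conf": spec.get("spark_conf", {}),
    }

job = render_job(json.dumps({
    "name": "daily_trips_mart",  # hypothetical job name
    "sql": "INSERT OVERWRITE ${target} SELECT * FROM trips WHERE ds = '${ds}'",
    "params": {"target": "marts.daily_trips", "ds": "2024-01-01"},
    "spark_conf": {"spark.sql.shuffle.partitions": "200"},
}))
```

In a setup like this, the rendered `sql` and `spark_conf` would be handed to a single reusable Spark entry point for execution.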
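The dependency-based data mart builds mentioned above boil down to topological ordering of a mart dependency graph, which Airflow DAG templates can express. A minimal sketch with Python's standard library (the mart names and graph are invented for illustration):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each mart maps to the upstream
# tables/marts it reads from.
deps = {
    "customer_360": {"trips_fact", "payments_fact"},
    "trips_fact": {"raw_trips"},
    "payments_fact": {"raw_payments"},
    "raw_trips": set(),
    "raw_payments": set(),
}

# static_order() yields a build order where every upstream table
# appears before the marts that depend on it; independent branches
# (trips vs. payments) are free to run in parallel.
build_order = list(TopologicalSorter(deps).static_order())
```

An orchestrator consuming this ordering can fan out independent branches concurrently and gate each mart on its upstreams, which is essentially what a templated Airflow DAG does with task dependencies.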
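The data quality framework above builds on the expectation pattern that Great Expectations formalizes. A toy sketch of the idea in plain Python (the function name mimics Great Expectations' naming convention but is not its actual API; the sample rows are invented):

```python
def expect_column_values_not_null(rows, column):
    """Expectation-style check: every row must carry a non-null value
    in `column`. Returns a result object rather than raising, so a
    validation layer can aggregate and report failures."""
    failures = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"success": not failures, "failed_rows": failures}

rows = [{"trip_id": 1}, {"trip_id": None}, {"trip_id": 3}]
result = expect_column_values_not_null(rows, "trip_id")
# result flags row index 1 as failing the expectation
```

In the real framework, suites of such expectations run against each dataset at ingestion and publication boundaries, and results feed the metadata catalog (OpenMetadata) for visibility.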
🗃️ Event Platform & Feature Store
⚡ Designed and implemented an enterprise-scale Event Ingestion Platform Architecture, optimized for ultra-low latency (sub-second) and high-volume (60K+ events/sec) analytics using Kafka, Spark Streaming, and S3/Kafka sinks.
🧑💻 Developed a Self-Service Event Routing UI empowering downstream teams to seamlessly subscribe to real-time topics with zero engineering dependency.
📄 Authored and drove adoption of high-impact architectural blueprints via technical design documents:
✅ Mini-App Session Stitching Design Document (Owner): Unified fragmented session journeys across SuperApp verticals (Food, Pay, Ride) with hybrid SQL attribution and platform metadata enforcement.
✅ Decoupled Event Ingestion Architecture Blueprint (Contributor): Redefined ingestion architecture to decouple compute and storage, cutting latency by 40% and reducing infra cost by ~20%.
🧠 Positioned Careem’s data platform for AI-readiness by contributing to online & offline feature store architecture—powering ML models and experimentation with millisecond response times.
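The online side of the feature store above serves features by entity key for millisecond-level reads. A minimal in-memory sketch of that access pattern, with a plain dict standing in for the Redis hash layer (class and feature names are hypothetical):

```python
class OnlineFeatureStore:
    """In-memory stand-in for a Redis-backed online store: features
    are keyed by entity id and read back as a flat hash, the shape
    that makes sub-10 ms point lookups possible."""

    def __init__(self):
        self._store = {}

    def put(self, entity_id, features: dict):
        # Materialization job writes the latest feature values per entity.
        self._store[entity_id] = dict(features)

    def get(self, entity_id, names):
        # Serving path: single keyed lookup, then project requested features.
        row = self._store.get(entity_id, {})
        return {n: row.get(n) for n in names}

store = OnlineFeatureStore()
store.put("user:42", {"trips_7d": 5, "avg_fare": 11.2})
features = store.get("user:42", ["trips_7d", "avg_fare"])
```

In production, `put` corresponds to streaming/batch materialization from the offline store (Hive/Trino) into Redis via Kafka, and `get` is the hot path hit by ML models at inference time.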
HR Analytics
Unified Data Warehouse
Oxford Data Warehouse
Hyper Growth
BI Clinical Prism
MDM Reporting
Multiple Web Portals
🗂️ Data Technologies & Platforms
🐘 Hadoop
⚡ Spark
🐍 Python
🧱 Data Modeling
🏗️ Data Architecture
🛡️ Data Governance
✔️ Data Quality
🔁 Online and Offline Feature Store
🔧 ETL Pipeline Design
🦴 Feast (Feature Store)
🤖 Generative AI
🧠 AI-Powered Data Mart
📈 Real-Time Analytics
☁️ Cloud Platforms & Infrastructure
🖥️ Amazon EC2
📦 Amazon ECR
🌩️ Amazon EMR
🔄 Amazon ECS
🐳 Docker
📨 Kafka
🐙 Kubernetes
🧬 AWS Glue
⚙️ AWS Lambda
💻 UNIX Shell
🔍 Athena
☁️ Google Cloud
🚀 GCP Compute
🧭 Amazon S3
🏢 Amazon Redshift
🔍 Query Engines & Databases
❄️ Snowflake
🧮 Trino
🚇 Presto
🐬 MySQL
🟠 Oracle
🐝 Hive
📁 HDFS
📊 BigQuery
📉 Monitoring & Workflow Orchestration
📈 Amazon CloudWatch
⏰ Apache Airflow
🗓️ TWS (Tivoli Workload Scheduler)
📊 Visualization & Metadata
📉 Tableau
📘 OpenMetadata
🧪 Great Expectations
🔎 Amundsen
🧰 Data Migration & Analysis
🛠️ AWS DMS
🔬 Exploratory Data Analysis
📐 Hypothesis Testing
📊 Inferential Statistics
🧠 Machine Learning