Data Engineer
& Researcher.

currently:

Advocating for open source | MS Data Science @ University of Washington, Seattle

<pk />
Pavankumar Suresh
* Engineering Philosophy
01 / SIMPLICITY

Simple Beats Clever

Clean schemas and robust, vectorized pipelines beat complex, fragile architectures every single time. Good design is quiet and highly maintainable.

02 / STANDARD

Open Standards Only

Deep focus on open formats like Parquet, Arrow, and Hudi. Your data belongs to your engineering stack, free from proprietary vendor lock-in.

03 / USER-CENTRIC

Customer First

Engineering starts with understanding the user's requirements. We work backward from core user needs to build platforms that solve real problems.

* Work Experience

University of Washington

Research Assistant, Health Sciences · Jan 2026 - Present · Seattle, WA

Building an end-to-end air quality data pipeline ingesting data from 5 heterogeneous instruments into a unified DuckDB + Parquet schema, enabling researchers to run statistical analyses without manual data wrangling. Designing a standardized ingestion layer that harmonizes multi-format sensor data with automated parsing, schema validation, and cloud storage on AWS S3.

University of Washington

Student Researcher, Bioengineering · Mar 2026 - Present · Seattle, WA

Contributing to AutoRELATE, a multimodal LLM research project evaluating AI-based clinical communication assessment. Focused on annotation pipelines and model evaluation workflows, investigating multimodal inputs to enhance model performance for equitable, scalable clinician training tools.

Kimberly-Clark

Data Engineer · Aug 2023 - Aug 2025 · Bengaluru, India

Architected a cloud-native automated pricing engine for 30+ Huggies SKUs on Amazon India using Azure Data Factory, Databricks, and Data Lakes, contributing to a 2.14% revenue increase. Engineered Power BI reporting across 6 APAC markets, reducing manual requests by 80%. Optimized master data pipelines for a 56% runtime reduction and led a zero-downtime production storage migration to Azure Blob Storage across 20+ Databricks notebooks.

Kimberly-Clark

Data Engineering Intern · Jan 2023 - Jul 2023 · Bengaluru, India

Developed anomaly detection models in PySpark for scalable log processing and built Power BI dashboards that identified 50+ security findings in 3 months. Analyzed 200+ phishing emails to design organization-wide training programs and built MLflow-based monitoring pipelines for real-time alerts on data and prediction drift.

Université du Québec en Outaouais

Student Researcher, MITACS Globalink · May 2022 - Aug 2022 · Gatineau, Canada

Awarded MITACS Globalink Research Scholarship to develop a thermal imaging dataset for sports injury diagnosis. Collected and processed 20,000+ thermal images across 4 anatomical regions to support deep learning segmentation models.

* Projects
LanceDB x PuppyGraph
Postgres-wire proxy so PuppyGraph can query LanceDB tables over JDBC. Key fix: LEFT JOIN rewrite in the DuckDB session layer to handle PuppyGraph's getColumns query.
DuckDB LanceDB PuppyGraph
FHIR to Parquet
Transformed 51 GB of Synthea FHIR JSON into Parquet, yielding a 39x query speedup. Presented at a Seattle data meetup to Databricks and Uber staff engineers.
FHIR Parquet DuckDB
ClickHouse vs DuckDB
High-fidelity clinical analytics benchmark comparing DuckDB and ClickHouse performance on 14.8M patient observations.
ClickHouse DuckDB Benchmarks
Kafka & Spark Streaming
Designed a Kafka–Spark streaming pipeline that processes 50K+ ticket purchase events and handles late-arriving data. Integrated layered dbt analytics under a Medallion Architecture on DuckDB to support consistent metrics.
Kafka Spark dbt DuckDB
Privacy-Preserving PII Label Detection Using Machine Learning
Built an ML classifier (TF-IDF, SMOTE, Random Forest) reaching 96.6% accuracy on a 70K-record synthetic dataset, outperforming the Sherlock baseline. Published in IEEE ICCCNT 2023.
IEEE Machine Learning Data Privacy Research
* Writing
* Tech Stack
python · sql · duckdb · apache hudi · lancedb · apache spark · pyspark · databricks · kafka · airflow · azure data factory · dbt · azure · aws · snowflake · pandas · numpy · power bi · tableau