BillingShield — a healthcare payment-integrity platform, built end-to-end.
A full medallion pipeline on CMS Medicare data: 10M+ provider-procedure claims flow through Bronze → Silver → Gold Delta tables on Databricks via a PySpark + dbt transformation layer, feeding an XGBoost fraud classifier with SHAP explanations that is served through FastAPI and a Streamlit dashboard.
The decision I'd defend
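Splitting train/test at the NPI (National Provider Identifier) level, not the row level. With a row-level split, rows from the same provider land on both sides, so the model can recognize the provider instead of learning fraud patterns, and accuracy comes out inflated by 10+ points. It's the kind of thing that looks fine in a notebook and breaks in production.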
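
A minimal sketch of what an NPI-level split can look like, using only the Python standard library. The function name `split_by_npi`, the `npi` / `hcpcs` / `label` field names, and the toy rows are all hypothetical and not taken from the BillingShield codebase; the idea is simply to hold out a fraction of providers rather than a fraction of rows.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_by_npi(
    rows: Sequence[Dict[str, object]],
    test_fraction: float = 0.2,
    seed: int = 42,
    npi_key: str = "npi",  # hypothetical column name
) -> Tuple[List[Dict[str, object]], List[Dict[str, object]]]:
    """Assign every row of a given provider (NPI) to exactly one side of the split."""
    # Sort the unique NPIs so the shuffle is reproducible for a given seed.
    npis = sorted({row[npi_key] for row in rows})
    random.Random(seed).shuffle(npis)

    # Hold out a fraction of providers, not a fraction of rows.
    n_test = max(1, round(len(npis) * test_fraction))
    test_npis = set(npis[:n_test])

    train = [row for row in rows if row[npi_key] not in test_npis]
    test = [row for row in rows if row[npi_key] in test_npis]
    return train, test


# Toy data (made up): 10 providers, 5 provider-procedure rows each.
rows = [
    {"npi": f"NPI{p:03d}", "hcpcs": f"code{c}", "label": p % 2}
    for p in range(10)
    for c in range(5)
]
train_rows, test_rows = split_by_npi(rows, test_fraction=0.2)

# Sanity check: no provider should appear on both sides of the split.
train_npis = {r["npi"] for r in train_rows}
test_npis = {r["npi"] for r in test_rows}
overlap = train_npis & test_npis
```

With a row-level shuffle, `overlap` would contain nearly every provider; here it is empty by construction. In a scikit-learn workflow, `GroupShuffleSplit` with `groups` set to the NPI column is a common way to get the same guarantee.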
