Estimating the Burden of Underreported Fungal Diseases in the United States
Using large-scale electronic health records and U.S. Census data, I estimate the prevalence of multiple invasive fungal diseases at national and state levels,
including in states where these infections are not reportable. I model demographic, geographic, and climate-associated risk factors to identify patterns in
individual- and population-level disease risk. This work addresses major gaps in fungal disease surveillance and contributes to national estimates of fungal disease burden.
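
One common way to scale EHR-derived rates to a census population is direct standardization (post-stratification). The sketch below illustrates that general technique only; it is not necessarily the method used in this project. The `Stratum` fields, the function name, and all counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    """One demographic stratum (hypothetical fields for illustration)."""
    name: str
    ehr_cases: int          # patients with the diagnosis in the EHR sample
    ehr_patients: int       # all patients in the EHR sample for this stratum
    census_population: int  # census count for the same stratum


def standardized_prevalence(strata, per=100_000):
    """Weight each stratum's EHR prevalence by its census population."""
    total_population = sum(s.census_population for s in strata)
    if total_population == 0:
        raise ValueError("census population must be positive")
    expected_cases = 0.0
    for s in strata:
        if s.ehr_patients == 0:
            raise ValueError(f"no EHR patients in stratum {s.name!r}")
        stratum_rate = s.ehr_cases / s.ehr_patients
        expected_cases += stratum_rate * s.census_population
    return expected_cases / total_population * per


# Made-up counts, for illustration only.
example = [
    Stratum("age 18-64", ehr_cases=10, ehr_patients=10_000, census_population=1_000_000),
    Stratum("age 65+", ehr_cases=5, ehr_patients=10_000, census_population=3_000_000),
]
print(standardized_prevalence(example))
```

Weighting by census counts keeps an EHR sample that over-represents one group from dominating the overall estimate.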
Benchmarking EHR-Derived Fungal Disease Estimates Against National Inpatient Data
To evaluate the reliability of electronic health record data for fungal disease surveillance, I benchmarked EHR-derived estimates of histoplasmosis and blastomycosis
against the HCUP National Inpatient Sample. This project assesses how well real-world clinical data capture national patterns of disease and informs best practices for
using EHRs in epidemiologic research on underreported infections.
Automated Crosswalk Generation for Clinical Diagnostic Code Standardization
I developed a web-scraping–based informatics pipeline to harmonize clinical diagnostic codes across ICD-9, SNOMED CT, and ICD-10 for large-scale fungal disease surveillance.
Given an ICD-9 or SNOMED CT input code, the system programmatically queries authoritative online resources, extracts hierarchical and relational mappings, and applies rule-based
validation to generate standardized ICD-10 equivalents. This tool supports reproducible epidemiologic analyses by enabling consistent case identification across heterogeneous
clinical datasets and legacy health record systems.
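
The rule-based validation step might look something like the sketch below, which filters candidate ICD-10 strings by format and by an expected category prefix. This is an assumption about what such rules could be, not the pipeline's actual logic. The function name, the regular expression, and the candidate list are illustrative, and no scraping is performed here.

```python
import re

# Assumed ICD-10-CM shape: letter, digit, alphanumeric, then an optional
# dot followed by one to four alphanumerics (for example "B40" or "B40.9").
ICD10_PATTERN = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")


def validate_candidates(candidates, expected_prefix):
    """Normalize candidate codes, then keep well-formed, in-category, unique ones."""
    seen = set()
    valid = []
    for raw in candidates:
        code = raw.strip().upper()
        if not ICD10_PATTERN.match(code):
            continue  # not shaped like an ICD-10 code (e.g. an ICD-9 code)
        if not code.startswith(expected_prefix):
            continue  # outside the expected disease category
        if code in seen:
            continue  # duplicate after normalization
        seen.add(code)
        valid.append(code)
    return valid


# Illustrative candidate strings, not real scraped output.
candidates = [" b40.9", "B40.9", "116.0", "B40", "A15.0"]
print(validate_candidates(candidates, expected_prefix="B40"))
```

Keeping the rules as small, ordered checks makes it easy to log which rule rejected a given candidate when the mapping is audited.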
Automated Overdose Surveillance from Unstructured Public Health Records
For Marin County Health and Human Services, I developed an automated OCR-based pipeline to process Poison Control and Coroner’s Office reports ingested as PDFs. Using
text parsing, cleaning, and validation scripts, I transformed more than a decade of unstructured historical records into tabular data suitable for analysis. I then conducted
epidemiologic analyses on these data to support internal dashboards, surveillance, and public health reporting.
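
As a rough illustration of the parsing and validation stage, the sketch below pulls labeled fields out of OCR text and normalizes the date. The field labels, the date format, and the sample text are all hypothetical; the real reports and scripts are not shown here, and the OCR step itself is assumed to have already run.

```python
import re
from datetime import datetime

# Hypothetical field labels; OCR output often confuses ":" and ";".
FIELD_PATTERN = re.compile(
    r"^[ \t]*(Report ID|Date|Substance)[ \t]*[:;][ \t]*(.+?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_report(text):
    """Extract labeled fields from OCR text into one flat record."""
    record = {}
    for label, value in FIELD_PATTERN.findall(text):
        key = label.lower().replace(" ", "_")
        record[key] = value
    if "date" in record:
        try:
            # Assumed MM/DD/YYYY input; stored as an ISO 8601 string.
            parsed = datetime.strptime(record["date"], "%m/%d/%Y")
            record["date"] = parsed.date().isoformat()
        except ValueError:
            record["date"] = None  # flag unparseable dates for manual review
    return record


# Synthetic text standing in for OCR output.
sample = "Report ID: 2015-0042\n  date ; 03/14/2015  \nSubstance: example\nnoise line\n"
print(parse_report(sample))
```

Setting an unparseable date to `None` rather than dropping the record keeps the row available for manual review.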
Building and Analyzing a Global Dataset of Historical Tax Revolts
I built an informatics pipeline to structure nearly 400 historical tax revolt events from an unstructured economic history text. I then used NLP and large language model APIs
to classify revolt characteristics and infer spatiotemporal patterns, supporting early dissertation research by Economics PhD candidate Eva Davoine.