Projects

Current research collaborations, applied informatics work, and personal explorations. Note: Write-ups and interactive tools for each project are under construction.

Current Projects

Data-Driven Severity Phenotyping of Rare Fungal Infections Using Electronic Health Records

For my master’s capstone, I’m analyzing large-scale electronic health record (EHR) data to characterize how disease severity manifests in the rare fungal infections histoplasmosis and blastomycosis. I’m using EHR-derived clinical indicators—including diagnoses, care setting utilization, comorbidity burden, and antifungal treatment patterns—to model patient trajectories across acute, post-acute, and longer-term phases of illness. By applying unsupervised statistical methods to these multidimensional indicators, I aim to identify data-driven severity phenotypes that are clinically interpretable and reflect real-world disease complexity, particularly in contexts where prospective cohort studies are infeasible due to disease rarity.

CocciCast: Automated Coccidioidomycosis Forecasting Platform

In collaboration with PhD candidate Simon Camponuri, I’m developing a user-facing web platform for an automated coccidioidomycosis (Valley fever) forecasting model in California. My work focuses on building the data infrastructure and automation pipeline that ingests historical and ongoing environmental data alongside coccidioidomycosis case data, and integrates these inputs with an established environmental forecasting model described here. The goal is to support scalable, reproducible cocci risk forecasting by streamlining data updates and model execution within an accessible web-based UI.

Professional Projects

Estimating the Burden of Underreported Fungal Diseases in the United States

Using large-scale electronic health records and U.S. Census data, I estimate the prevalence of multiple invasive fungal diseases at national and state levels, including in states where these infections are not reportable. I model demographic, geographic, and climate-associated risk factors to identify patterns in individual- and population-level disease risk. This work addresses major gaps in fungal disease surveillance and contributes to national estimates of fungal disease burden.

Benchmarking EHR-Derived Fungal Disease Estimates Against National Inpatient Data

To evaluate the reliability of electronic health record data for fungal disease surveillance, I benchmarked EHR-derived estimates of histoplasmosis and blastomycosis against the HCUP National Inpatient Sample. This project assesses how well real-world clinical data capture national patterns of disease and informs best practices for using EHRs in epidemiologic research on underreported infections.

Automated Crosswalk Generation for Clinical Diagnostic Code Standardization

I developed a web-scraping–based informatics pipeline to harmonize clinical diagnostic codes across ICD-9, SNOMED CT, and ICD-10 for large-scale fungal disease surveillance. Given an ICD-9 or SNOMED CT input code, the system programmatically queries authoritative online resources, extracts hierarchical and relational mappings, and applies rule-based validation to generate standardized ICD-10 equivalents. This tool supports reproducible epidemiologic analyses by enabling consistent case identification across heterogeneous clinical datasets and legacy health record systems.
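As an illustration of what a rule-based validation step could look like, here is a minimal Python sketch. It assumes validation includes a format check on candidate ICD-10-CM codes plus de-duplication. The function names, the simplified pattern, and the sample strings are hypothetical and are not taken from the actual pipeline; the scraping and mapping steps are omitted.

```python
import re

# Simplified ICD-10-CM shape (an assumption for illustration): a letter, a digit,
# one alphanumeric, then an optional decimal point followed by 1-4 alphanumerics.
ICD10_PATTERN = re.compile(r"[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?")


def normalize_code(raw: str) -> str:
    """Trim whitespace and uppercase a scraped code string."""
    return raw.strip().upper()


def validate_candidates(candidates: list[str]) -> list[str]:
    """Keep unique, well-formed candidate codes in first-seen order."""
    seen = set()
    valid = []
    for raw in candidates:
        code = normalize_code(raw)
        # fullmatch rejects strings with anything before or after the code.
        if ICD10_PATTERN.fullmatch(code) and code not in seen:
            seen.add(code)
            valid.append(code)
    return valid


if __name__ == "__main__":
    # Hypothetical scraped candidates: messy casing, a duplicate, a non-ICD-10 string.
    scraped = [" b39.9 ", "B39.9", "115.90", "B40.9", "not a code"]
    print(validate_candidates(scraped))
```

A format check like this only confirms that a candidate looks like an ICD-10 code; confirming that it is the correct equivalent of the input code would still depend on the extracted mappings.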

Automated Overdose Surveillance from Unstructured Public Health Records

For Marin County Health and Human Services, I developed an automated OCR-based pipeline to process Poison Control and Coroner’s Office reports ingested as PDFs. Using robust text parsing, cleaning, and validation scripts, I transformed over a decade of unstructured historical records into tabulated data suitable for analysis. I then conducted epidemiologic analyses on these data to support internal dashboards, surveillance, and public health reporting.
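The parsing and validation stages can be sketched roughly as follows. This is a minimal Python example that assumes OCR has already produced plain text and that each report contains labeled "Field: value" lines. The field names, the date format, and the sample report are hypothetical; real report layouts will differ, and the OCR step itself is omitted.

```python
import re
from datetime import datetime

# Hypothetical labeled fields. The separator class tolerates a colon
# misread as a semicolon, which is a common kind of OCR slip.
LINE_PATTERN = re.compile(r"^(case id|date|substance)\s*[:;]\s*(.+)$", re.IGNORECASE)
REQUIRED_FIELDS = ("case_id", "date", "substance")


def parse_report(text: str) -> dict:
    """Extract labeled fields from one report's OCR text into a flat record."""
    record = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            key = match.group(1).lower().replace(" ", "_")
            record[key] = match.group(2).strip()
    return record


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = [f"missing {key}" for key in REQUIRED_FIELDS if key not in record]
    if "date" in record:
        try:
            # Assumed MM/DD/YYYY format for illustration.
            datetime.strptime(record["date"], "%m/%d/%Y")
        except ValueError:
            problems.append("unparseable date")
    return problems


if __name__ == "__main__":
    sample = "  Case ID: MC-0001\nDate ; 03/14/2015\nSubstance: acetaminophen \nNotes follow"
    row = parse_report(sample)
    print(row, validate_record(row))
```

Returning a list of problems rather than raising on the first failure makes it easier to flag records for manual review instead of silently dropping them.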

Building and Analyzing a Global Dataset of Historical Tax Revolts

I built an informatics pipeline to structure nearly 400 historical tax revolt events from an unstructured economic history text. I then conducted computational analyses using NLP and large language model APIs to classify revolt characteristics and infer spatiotemporal patterns. This work supported early dissertation research by Economics PhD candidate Eva Davoine.

Personal Projects

A Multimodal Analysis of Getting Killed by Geese

Blending my two passions, music and quantitative data analysis, I explore the sonic and lyrical elements of Getting Killed, the latest album by the booming NY rock band Geese. Using audio embeddings and text analysis, I compare relationships between tracks and explore how these patterns align with listener response.
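One common way to compare tracks through embeddings is cosine similarity. The Python sketch below assumes each track has already been reduced to a fixed-length vector. The track labels and vectors are placeholders rather than real embeddings, and the similarity measure used in the actual analysis may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def most_similar_pair(embeddings: dict[str, list[float]]) -> tuple[str, str, float]:
    """Return the two track names with the highest cosine similarity, plus the score."""
    names = sorted(embeddings)
    if len(names) < 2:
        raise ValueError("need at least two tracks to compare")
    best = None
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = cosine_similarity(embeddings[first], embeddings[second])
            if best is None or score > best[2]:
                best = (first, second, score)
    return best


if __name__ == "__main__":
    # Placeholder vectors standing in for per-track audio embeddings.
    tracks = {
        "track_a": [1.0, 0.0, 0.0],
        "track_b": [0.9, 0.1, 0.0],
        "track_c": [0.0, 1.0, 0.0],
    }
    print(most_similar_pair(tracks))
```

Cosine similarity ignores vector magnitude, so two tracks can score as similar even when the embeddings differ in overall scale.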

Website Development

Building websites for myself and friends began as an exercise in learning new coding languages and has gradually become a space to explore how people interact with information. As I work through these projects, I’ve become more aware of my place in an increasingly digital age and of how my decisions can meaningfully shape how information is accessed and understood.