Ananya Athreyas
UC Berkeley · Data Science
ML systems portfolio

project-first homepage

I work on ML pipelines, model adaptation, and data workflows.

A lot of my work involves taking raw data—text, logs, and trace-like workflow data—structuring it, and building systems to train and evaluate models on top of it. I spend most of my time on data quality, evaluation, and iteration rather than just model choice.

This page is a breakdown of what I've built and how it actually works, especially the pipeline decisions, evaluation loops, and tradeoffs behind the final result.

what I build

ML pipelines, multimodal processing systems, and data workflows built around unstructured inputs.

how I work

Start with messy data, make it usable, build an evaluation loop, and iterate until the signal is real.

current focus

Applied ML systems where pipeline design, model behavior, and practical outcomes all matter.

selected work

Systems I've built, the data they used, and the outcomes they produced.

Multimodal pipeline · #01

Video Summarization Pipeline

Built an end-to-end pipeline that combined transcription, OCR, frame processing, and NLP to turn long-form recordings into structured, timestamped summaries.

Python · Whisper · EasyOCR · OpenCV · NLP

what I built

  • Connected audio transcription, visual text extraction, and frame-level processing into one workflow for long-form video.
  • Designed the pipeline to handle noisy recordings, variable quality, and inconsistent visual context.
  • Structured the output into readable timestamped summaries so the system was useful beyond raw transcription.
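A minimal sketch of the merge step behind these bullets: grouping speech and on-screen text into timestamped windows. The data shapes here are illustrative placeholders, not the pipeline's actual schema; in the real pipeline the segments came from Whisper (audio) and EasyOCR (frames).

```python
from collections import OrderedDict


def merge_modalities(transcript, ocr_hits, window=30.0):
    """Merge (start_sec, text) transcript segments and (ts_sec, text) OCR
    detections into fixed time windows, returning timestamped entries."""
    entries = {}
    for start, text in transcript:
        key = int(start // window)
        entries.setdefault(key, {"speech": [], "screen": []})["speech"].append(text)
    for ts, text in ocr_hits:
        key = int(ts // window)
        entries.setdefault(key, {"speech": [], "screen": []})["screen"].append(text)

    summary = []
    for key in sorted(entries):
        e = entries[key]
        stamp = f"{int(key * window // 60):02d}:{int(key * window % 60):02d}"
        # Deduplicate repeated OCR text (slides persist across frames)
        screen = " | ".join(OrderedDict.fromkeys(e["screen"]))
        summary.append((stamp, " ".join(e["speech"]), screen))
    return summary
```

The windowing is the key design choice: noisy OCR repeats across consecutive frames, so deduplicating inside a window keeps the summary readable.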

result / signal

  • Reduced manual review time by 60–70%
  • Combined transcription, OCR, frame processing, and NLP in one pipeline

ML pipeline · #02

UFO Sightings: Machine Learning Pipeline

Analyzed 80K+ NUFORC reports to extract signal from noisy, witness-reported text through cleaning, feature engineering, clustering, and supervised modeling.

Python · pandas · regex · scikit-learn · K-means · DBSCAN · Neural networks

what I built

  • Built a preprocessing pipeline for messy text-heavy reports and engineered features that made downstream modeling tractable.
  • Used K-means and DBSCAN to explore structure before moving into supervised classification.
  • Iterated on the full workflow rather than only model choice, improving the baseline through repeated evaluation and refinement.
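To make the preprocessing step concrete, here is a small sketch of the kind of regex cleaning and feature engineering involved. The feature names and cleaning rules are illustrative assumptions, not the project's actual schema.

```python
import re


def clean_report(text):
    """Normalize a witness-reported sighting description."""
    text = text.lower()
    text = re.sub(r"\(\(.*?\)\)", " ", text)   # drop parenthetical editor notes
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # strip punctuation/symbols
    return re.sub(r"\s+", " ", text).strip()


def engineer_features(text):
    """Turn free text into a few numeric/boolean features for modeling."""
    cleaned = clean_report(text)
    tokens = cleaned.split()
    return {
        "n_tokens": len(tokens),
        "mentions_light": int(any(t.startswith("light") for t in tokens)),
        "mentions_triangle": int("triangle" in tokens or "triangular" in tokens),
        "mentions_duration": int(bool(re.search(r"\d+\s*(second|minute|hour)", cleaned))),
    }
```

Features like these are what made clustering and supervised modeling tractable on text this noisy: the models see stable signals instead of raw free-form prose.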

result / signal

  • Worked on 80K+ records
  • Improved baseline performance from 1.2% to 7.6%

Research systems · #03

Citation Network Analysis at Scale

Worked with 750GB+ of scholarly citation data on HPC systems to study influence, brokerage, and knowledge flow across research networks.

Python · HPC · network analysis · doc2vec · similarity modeling

what I built

  • Processed large-scale citation data in a high-performance computing environment rather than relying on local workflows.
  • Computed network metrics such as betweenness and brokerage to model influence and structural position.
  • Applied representation and similarity methods to study relationships between research areas and scholarly communities.
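For a sense of the network metrics involved: betweenness centrality measures how often a node sits on shortest paths between others, which is one way to quantify brokerage. The real computation ran on HPC over very large graphs; this is a minimal pure-Python sketch of Brandes' algorithm for small, unweighted directed graphs.

```python
from collections import deque


def betweenness(graph):
    """Brandes' betweenness centrality for an unweighted directed graph
    given as {node: [successors]}."""
    bc = {v: 0.0 for v in graph}
    for s in graph:
        # BFS from s, counting shortest paths (sigma) and predecessors
        sigma = {v: 0 for v in graph}; sigma[s] = 1
        dist = {v: -1 for v in graph}; dist[s] = 0
        preds = {v: [] for v in graph}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies in reverse BFS order
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

A node with high betweenness in a citation graph brokers knowledge flow between communities that would otherwise be poorly connected.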

result / signal

  • Worked on 750GB+ of citation data
  • Combined network science and NLP methods

Text classification · #04

Spam vs. Ham Email Classifier

Built a binary text classifier for spam detection with an emphasis on feature engineering, model comparison, and fast iteration.

Python · pandas · scikit-learn

what I built

  • Engineered 20+ text features to capture useful signals beyond a default baseline approach.
  • Compared preprocessing and model choices in a workflow that stayed easy to inspect and improve.
  • Used the project to practice hands-on evaluation rather than treating text classification as a black-box task.
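A few illustrative examples of what hand-engineered email features can look like; these specific features are assumptions for the sketch, not the project's actual feature set, which fed into scikit-learn models.

```python
import re


def email_features(subject, body):
    """Extract a handful of numeric features from an email's text."""
    text = f"{subject} {body}"
    words = re.findall(r"[A-Za-z']+", text)
    caps = [w for w in words if w.isupper() and len(w) > 1]
    return {
        "num_exclamations": text.count("!"),
        "frac_caps_words": len(caps) / max(len(words), 1),
        "has_url": int(bool(re.search(r"https?://", text))),
        "has_dollar": int("$" in text),
        "subject_len": len(subject),
    }
```

Simple features like these stay easy to inspect, so when a model misclassifies an email you can see exactly which signals it was reacting to.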

result / signal

  • 20+ engineered features
  • Reached 90% accuracy

research

I'm most interested in systems where modeling decisions only make sense when the data pipeline and evaluation are solid.

At the Berkeley Institute for Data Science, I work with 750GB+ of scholarly citation data on HPC systems to analyze knowledge flow, influence, and structural position in research networks.
My work combines network metrics such as betweenness and brokerage with representation and similarity methods like doc2vec to study relationships between fields, papers, and communities.
Across projects, I care most about turning ambiguous, unstructured data into something a model or downstream system can actually use—then building the evaluation loop needed to tell whether the system is really improving.

contact

I'm looking for opportunities where I can contribute to applied machine learning, evaluation-heavy workflows, and real systems built on top of messy data.
ananya.athreyas@berkeley.edu · linkedin.com/in/ananya-athreyas
Gold Presidential Volunteer Service Award · East Bay SPCA