Data Science Student
RBC Borealis (LSi – Cohort Fall 2025)
Built a PySpark ingestion pipeline and trained scikit-learn classifiers to categorize 40k records with full experiment logging.
What I Did
I built a PySpark ingestion pipeline over three enterprise sources, trained scikit-learn classifiers with stratified cross-validation to categorize 40k records by type, and published structured labeled datasets to PostgreSQL. I logged model configurations, class distributions, and validation metrics across runs to ensure reproducibility.
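The training step can be illustrated with a minimal sketch of stratified cross-validation in scikit-learn. It is not the project's actual code: the data is synthetic (generated with make_classification), and the choice of LogisticRegression, five folds, and the accuracy and macro-F1 metrics are assumptions made for illustration.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the labeled records (assumption:
# the real features and record types are not shown in this write-up).
X, y = make_classification(
    n_samples=600,
    n_features=12,
    n_informative=6,
    n_classes=3,
    weights=[0.6, 0.3, 0.1],
    random_state=0,
)

# Hypothetical model choice; any scikit-learn classifier fits here.
model = LogisticRegression(max_iter=1000)

# StratifiedKFold keeps each fold's class proportions close to the
# proportions in the full dataset, which matters for imbalanced types.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1_macro"])

# Collect what a run log would need: configuration, class distribution,
# and validation metrics averaged over the folds.
run_record = {
    "model": type(model).__name__,
    "params": model.get_params(),
    "class_distribution": dict(Counter(y.tolist())),
    "mean_accuracy": float(scores["test_accuracy"].mean()),
    "mean_f1_macro": float(scores["test_f1_macro"].mean()),
}
```

Fixing random_state on both the data generator and the splitter makes the fold assignment repeatable, which is one precondition for comparing runs.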
Impact
The pipeline processed 40k records into structured labeled datasets in PostgreSQL, ready for downstream use. Because each run's model configuration, class distribution, and validation metrics were logged, any run could be reproduced and compared against the others.
What I Learned
I gained experience with PySpark for distributed ingestion across multiple enterprise sources, scikit-learn classifier training with stratified cross-validation, and structured experiment logging for reproducibility.
Key Highlights
Built a PySpark ingestion pipeline over three enterprise sources.
Trained scikit-learn classifiers with stratified cross-validation to categorize 40k records by type.
Published structured labeled datasets to PostgreSQL.
Logged model configurations, class distributions, and validation metrics across runs to ensure reproducibility.
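One way to structure this kind of run logging is an append-only JSON Lines file, sketched below with the standard library only. The helper names log_run and load_runs, the record fields, and the file name runs.jsonl are hypothetical, and the sample values are placeholders rather than results from the project.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_run(log_path, config, class_distribution, metrics):
    """Append one run record as a JSON line and return the record."""
    # Hash the key-sorted config so identical configurations share an ID,
    # which makes repeated runs easy to group and compare.
    config_id = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode("utf-8")
    ).hexdigest()[:12]
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "config_id": config_id,
        "config": config,
        "class_distribution": class_distribution,
        "metrics": metrics,
    }
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def load_runs(log_path):
    """Read every logged run back as a list of dicts."""
    with Path(log_path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Placeholder values only (assumptions, not project results).
demo_log = Path(tempfile.mkdtemp()) / "runs.jsonl"
demo_config = {"model": "LogisticRegression", "C": 1.0, "n_splits": 5}
demo_classes = {"type_a": 120, "type_b": 60}

first = log_run(demo_log, demo_config, demo_classes, {"f1_macro": 0.0})
second = log_run(demo_log, dict(demo_config), demo_classes, {"f1_macro": 0.0})
runs = load_runs(demo_log)
```

Appending one line per run keeps earlier records untouched, so the log doubles as a history of what was tried and with which settings.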