Experience

Data Science Student

RBC Borealis (LSi – Cohort Fall 2025)

Oct 2025 – Dec 2025, Toronto (Remote)

Built a PySpark ingestion pipeline and trained scikit-learn classifiers to categorize 40k records with full experiment logging.

What I Did

I built a PySpark ingestion pipeline over three enterprise sources, trained scikit-learn classifiers with stratified cross-validation to categorize 40k records by type, and published structured labeled datasets to PostgreSQL. I logged model configurations, class distributions, and validation metrics across runs to ensure reproducibility.
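The per-run logging described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the function name `build_run_record`, the record fields, and all sample values are hypothetical. It shows one way to bundle a model configuration, the class distribution, and cross-validation scores into a single record that can be compared across runs.

```python
import hashlib
import json
from collections import Counter


def build_run_record(config, labels, fold_scores):
    """Bundle what is needed to reproduce and compare one training run.

    Hypothetical helper for illustration; field names are assumptions.
    """
    # Serialize with sorted keys so identical settings always hash the same,
    # regardless of the order in which the config dict was written.
    config_json = json.dumps(config, sort_keys=True)
    run_id = hashlib.sha256(config_json.encode("utf-8")).hexdigest()[:12]
    return {
        "run_id": run_id,
        "config": config,
        "class_distribution": dict(Counter(labels)),
        "fold_scores": list(fold_scores),
        "mean_score": sum(fold_scores) / len(fold_scores),
    }


# Made-up values for illustration only; not results from the project.
record = build_run_record(
    config={"model": "LogisticRegression", "C": 1.0, "n_splits": 5, "seed": 42},
    labels=["invoice", "contract", "invoice", "report"],
    fold_scores=[0.75, 0.5, 1.0, 0.75],
)

# One JSON object per run can be appended to a log file or stored in a table.
print(json.dumps(record, sort_keys=True))
```

Deriving the run ID from the configuration means two runs with identical settings share an ID, which makes it easy to spot when a metric changed without a configuration change.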

Impact

The pipeline processed 40k records into structured labeled datasets ready for downstream use. Logging the model configuration, class distribution, and validation metrics for every run made the runs reproducible and directly comparable.

What I Learned

I gained experience with PySpark for distributed ingestion across multiple enterprise sources, scikit-learn classifier training with stratified cross-validation, and structured experiment logging for reproducibility.

Key Highlights

Built a PySpark ingestion pipeline over three enterprise sources, trained scikit-learn classifiers with stratified cross-validation to categorize 40k records by type, and published structured labeled datasets to PostgreSQL.
Logged model configurations, class distributions, and validation metrics across runs to ensure reproducibility.

Tech Stack

PySpark, Scikit-learn, PostgreSQL, Classification, Cross-validation
