Experience

Data Science Student

RBC Borealis (LSi – Cohort Fall 2025)

Oct 2025 – Dec 2025, Toronto (Remote)

Built a PySpark ingestion pipeline and trained scikit-learn classifiers to categorize 40k records with full experiment logging.

What I Did

I built a PySpark ingestion pipeline over three enterprise sources, trained scikit-learn classifiers with stratified cross-validation to categorize 40k records by type, and published structured labeled datasets to PostgreSQL. I logged model configurations, class distributions, and validation metrics across runs to ensure reproducibility.
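The stratified cross-validation step can be sketched roughly as follows. This is a minimal illustration, not the project code: the synthetic data from `make_classification`, the choice of `LogisticRegression`, the five folds, and the macro-F1 metric are all assumptions made for the example.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the labeled records (assumption:
# the real features and record types are not shown on this page).
X, y = make_classification(
    n_samples=400,
    n_features=10,
    n_informative=5,
    n_classes=3,
    weights=[0.6, 0.3, 0.1],
    random_state=0,
)

# Stratified folds keep each class's share roughly constant across splits,
# which matters when some record types are rare.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)  # hypothetical model choice
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")

# Class counts are worth recording alongside the scores for each run.
class_distribution = dict(Counter(y.tolist()))
print("class distribution:", class_distribution)
print("macro-F1 per fold:", scores.round(3).tolist())
```

Fixing `random_state` on both the data generator and the splitter is what makes a rerun produce the same folds, so scores from different model configurations stay comparable.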

Impact

The pipeline processed 40k records into structured labeled datasets ready for downstream use. The experiment logging ensured all runs were reproducible and comparable.
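One way to make runs reproducible and comparable is an append-only log with one record per run. The sketch below uses only the standard library; the `log_run` and `load_runs` helpers, the JSON Lines layout, the field names, and the metric values are hypothetical placeholders, not the project's logging format or its results.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def log_run(log_path, run_id, model_config, labels, metrics):
    """Append one run (config, class distribution, metrics) as a JSON line."""
    record = {
        "run_id": run_id,
        "model_config": model_config,
        "class_distribution": dict(sorted(Counter(labels).items())),
        "metrics": metrics,
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def load_runs(log_path):
    """Read every logged run back as a list of dicts."""
    with Path(log_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Placeholder labels and metric values, for illustration only.
log_file = Path(tempfile.mkdtemp()) / "runs.jsonl"
labels = ["a", "a", "b"]
log_run(log_file, "run-001", {"model": "LogisticRegression", "C": 1.0},
        labels, {"f1_macro": 0.5})
log_run(log_file, "run-002", {"model": "LogisticRegression", "C": 0.1},
        labels, {"f1_macro": 0.75})

runs = load_runs(log_file)
best = max(runs, key=lambda r: r["metrics"]["f1_macro"])
print(best["run_id"], best["model_config"])
```

Because each line holds the full configuration and class distribution next to the metrics, any two runs can be compared later without rerunning them.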

What I Learned

I gained experience with PySpark for distributed ingestion across multiple enterprise sources, scikit-learn classifier training with stratified cross-validation, and structured experiment logging for reproducibility.

Key Highlights

  • Built a PySpark ingestion pipeline over three enterprise sources.

  • Trained scikit-learn classifiers with stratified cross-validation to categorize 40k records by type.

  • Published structured labeled datasets to PostgreSQL.

  • Logged model configurations, class distributions, and validation metrics across runs to ensure reproducibility.

Tech Stack

PySpark, scikit-learn, PostgreSQL, Classification, Cross-validation

Tags

research, ml, data-engineering
