AI Engineer & Data Engineer with 10+ years across Canada's top financial institutions. Currently building production-grade AI systems — LLM agents, RAG pipelines, multi-agent orchestrators — while architecting Azure Databricks platforms ingesting 40M+ records daily at TD Bank.
I'm Scott Shi, an AI Engineer & Data Engineer at TD Bank in Toronto, currently building production-grade AI systems alongside enterprise data platforms.
I've spent 10+ years across Canada's top financial institutions (TD, Sun Life, Desjardins) and the NFL, delivering pipelines, cloud migrations, and ML feature stores that run at real scale.
On the AI side, I'm actively building LLM-based agents, RAG pipelines, and multi-agent orchestrators.
Architected and delivered an end-to-end data platform on Azure Databricks (PySpark / Delta Lake) ingesting 40M+ records daily, improving pipeline SLA adherence from 94% to 99.8%. Engineered a storage optimization layer using Delta Lake partitioning, Z-ordering, and compaction that cut compute costs by 35% and query latency by 40%. Built a self-serve analytics framework enabling 5+ business teams to independently access curated datasets, reducing ad-hoc requests by 60%. Drove data governance across 50+ tables, establishing lineage documentation and data contracts.
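For a flavor of what that storage optimization looks like in practice, here's a minimal PySpark sketch; the table and column names (raw.events, ingest_date, customer_id) are illustrative, not the production ones.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided automatically on Databricks

# Partition by a low-cardinality date column so date-bounded queries
# prune whole directories instead of scanning the full table.
(spark.table("raw.events")
     .write.format("delta")
     .partitionBy("ingest_date")
     .mode("overwrite")
     .saveAsTable("curated.events"))

# Compact small files and co-locate rows on a high-cardinality filter column;
# Z-ordering lets data skipping work on columns other than the partition key.
spark.sql("OPTIMIZE curated.events ZORDER BY (customer_id)")

# Reclaim files left behind by compaction (default 7-day retention window).
spark.sql("VACUUM curated.events")
```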
Delivered enterprise data projects across TD Bank, Sun Life, Desjardins, and the NFL:
Monitored and maintained daily data ingestion processes across multiple platforms serving NFL operational reporting. Troubleshot ingestion failures, latency issues, and pipeline bottlenecks. Implemented automated checks, alerts, and validation frameworks to improve reliability and reduce downtime.
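A simplified version of the kind of automated check involved, assuming a Delta table with an ingest_date column (the table name and threshold are hypothetical):

```python
from datetime import date
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def validate_daily_load(table: str, min_rows: int) -> None:
    """Fail fast when today's partition is missing or suspiciously thin."""
    todays_rows = (spark.table(table)
                   .where(F.col("ingest_date") == F.lit(str(date.today())))
                   .count())
    if todays_rows < min_rows:
        # In production this would page an alerting channel rather than raise.
        raise RuntimeError(f"{table}: expected >= {min_rows:,} rows, got {todays_rows:,}")

validate_daily_load("ops.game_day_feeds", min_rows=100_000)
```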
Designed a full-cycle distributed ETL platform on Azure Databricks processing 5+ TB of financial data daily. Led and mentored a 3-engineer squad. Achieved a 60% reduction in Spark job execution time through AQE tuning, shuffle optimization, and right-sized clusters. Reduced storage costs 25% via partition pruning and Parquet compression.
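The AQE and shuffle tuning behind that speedup boils down to a handful of session configs like these; the advisory partition size is workload-dependent, not a universal value.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Let Spark re-plan joins and partition counts from runtime shuffle statistics.
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Merge many tiny post-shuffle partitions into fewer, right-sized ones.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Split skewed shuffle partitions so a single hot key can't stall the stage.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Advisory size per post-shuffle partition; tune against executor memory.
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")
```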
Built 8 ML feature pipelines powering ASAP credit risk models over billions of customer records within tight latency SLAs. Achieved 50% processing speed improvement via Spark tuning, broadcast joins, and caching. Developed a modular PySpark library and Airflow DAG orchestration that reduced pipeline failure rate by 70%.
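Two of the techniques named there, sketched with hypothetical table names: broadcast the small dimension table so the billion-row side never shuffles, and cache a frame that several feature aggregations reuse.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

transactions = spark.table("risk.transactions")    # billions of rows
segments = spark.table("risk.customer_segments")   # small dimension table

# Broadcasting the small side ships it to every executor, so the
# billion-row fact table never has to shuffle for the join.
joined = transactions.join(F.broadcast(segments), "customer_id")

# Cache once; multiple downstream feature aggregations reuse the same scan.
joined.cache()

monthly_spend = (joined.groupBy("customer_id", "txn_month")
                 .agg(F.sum("amount").alias("monthly_spend")))
txn_velocity = (joined.groupBy("customer_id", "txn_month")
                .agg(F.count("*").alias("txn_count")))
```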
Led root-cause analysis and remediation of 20+ data discrepancies between on-prem and cloud systems, achieving 100% reconciliation accuracy across billions of records. Migrated legacy SSIS and T-SQL pipelines to modular PySpark, cutting reporting errors by 45% and pipeline execution time by 30%.
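The reconciliation itself followed a simple two-gate pattern, sketched here with hypothetical table names: exact row counts first, then an order-independent content check via per-row hashes.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

legacy = spark.table("onprem_mirror.balances")   # snapshot of the on-prem output
migrated = spark.table("cloud.balances")         # output of the new PySpark pipeline

# Gate 1: row counts must match exactly.
assert legacy.count() == migrated.count(), "row-count mismatch"

# Gate 2: hash every row over a fixed column order, then diff the hash sets.
# (concat_ws skips nulls, so production code would null-guard each column.)
def row_hashes(df):
    return df.select(F.sha2(F.concat_ws("||", *sorted(df.columns)), 256).alias("h"))

mismatched = row_hashes(legacy).exceptAll(row_hashes(migrated)).count()
assert mismatched == 0, f"{mismatched:,} rows differ between systems"
```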
Built and maintained complex multi-source data pipelines serving 5+ internal product teams, resolving scalability bottlenecks and reducing ad-hoc query SLA breaches by 40%. Conducted data profiling and quality analysis across TB-scale datasets, implementing cleansing and enrichment strategies aligned with enterprise data integrity standards. Collaborated end-to-end with project managers and stakeholders across Agile delivery cycles.
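The profiling step reduces to a single aggregation pass per table, along these lines (table name illustrative): null counts plus approximate distinct counts per column, which together surface the columns that need cleansing or enrichment.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("analytics.customer_events")

# One pass over the data: null count and approximate distinct count per column.
profile = df.agg(
    *[F.sum(F.col(c).isNull().cast("long")).alias(f"{c}__nulls") for c in df.columns],
    *[F.approx_count_distinct(c).alias(f"{c}__distinct") for c in df.columns],
)
profile.show(vertical=True, truncate=False)
```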
Engineered and scaled the Centralized Campaign View platform using SAS and Spark, consolidating TB-scale data from 8+ digital channels to support 20+ Agile Marketing Pods. Developed Spark ingestion pipelines handling terabytes of customer interaction data for personalized campaign targeting of 3M+ policyholders. Automated batch jobs via SAS Macros, cutting manual processing by 50% and supporting $100M+ in annual marketing spend.
Built SAS/SQL ETL scripts to extract and transform data from multiple sources, enabling post-campaign analytics (ROI, response rate, incremental revenue) that informed strategic budget allocation. Delivered campaign performance reports using KPIs — ROI, response rate, new subscriber growth — that drove data-backed decisions across sales and marketing leadership.
A production-grade AI data platform with three integrated systems: a SQL Agent (LLaMA 3.3 70B + DuckDB) that converts natural language to SQL and explains results in plain English; a RAG pipeline (ChromaDB + ONNX embeddings) for business document Q&A with source citations and hallucination prevention; and a multi-agent orchestrator that routes queries via intent classification. Full observability via MLflow, served as a FastAPI REST API with Streamlit UI, containerized with Docker Compose, deployed via GitHub Actions CI/CD.
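At its core the orchestrator is a router: classify intent, then dispatch to the SQL agent or the RAG pipeline. Here's a toy version with the LLM calls stubbed out; the real system generates SQL with LLaMA 3.3 70B and retrieves from ChromaDB.

```python
import duckdb

con = duckdb.connect()  # in-memory DuckDB standing in for the analytics database
con.execute("CREATE TABLE orders AS SELECT range AS order_id FROM range(100)")

def classify_intent(question: str) -> str:
    """Stub for the LLM intent classifier."""
    keywords = ("total", "count", "average")
    return "sql" if any(w in question.lower() for w in keywords) else "rag"

def sql_agent(question: str) -> str:
    generated_sql = "SELECT count(*) FROM orders"  # stub for LLM text-to-SQL
    rows = con.execute(generated_sql).fetchall()
    return f"SQL agent result: {rows[0][0]}"

def rag_agent(question: str) -> str:
    return "RAG agent: retrieve from vector store, answer with citations."  # stub

def route(question: str) -> str:
    return sql_agent(question) if classify_intent(question) == "sql" else rag_agent(question)

print(route("What is the total count of orders?"))   # -> SQL agent result: 100
print(route("Summarize the Q3 compliance policy."))  # -> RAG agent stub
```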
Led on-prem to Azure cloud migration for TD Bank's critical data infrastructure. Achieved 100% reconciliation accuracy across systems serving billions of records. Migrated legacy SSIS packages and T-SQL stored procedures to modular PySpark pipelines, cutting reporting errors by 45% and execution time by 30%. Maintained CI/CD hygiene via Bitbucket with zero-downtime deployments.
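The migration pattern, roughly: each stored procedure becomes a small, unit-testable PySpark function. A sketch with generic names (the real logic is TD-internal):

```python
from pyspark.sql import DataFrame, SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def daily_balance_summary(accounts: DataFrame, transactions: DataFrame) -> DataFrame:
    """Stand-in for a legacy T-SQL stored procedure, now testable in isolation."""
    return (transactions
            .join(accounts, "account_id")
            .groupBy("account_id", "txn_date")
            .agg(F.sum("amount").alias("net_change"),
                 F.count("*").alias("txn_count")))

summary = daily_balance_summary(spark.table("core.accounts"),
                                spark.table("core.transactions"))
summary.write.format("delta").mode("overwrite").saveAsTable("reporting.daily_balances")
```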
Most teams use Delta Lake but skip the tuning. Z-ordering, compaction strategies, and partition pruning are where the real performance gains live — here's exactly what we did.
After building production ML feature pipelines for major financial institutions, here's what makes the difference between pipelines that work in dev and ones that hold up under real SLAs.
Cloud migrations fail at data reconciliation. Here's the systematic approach — data profiling, iterative validation, Alteryx-based guardrails — that got us to perfect accuracy at TD Bank.