AI Engineer & Data Engineer with 10+ years across Canada's top financial institutions. Currently building production-grade AI systems — LLM agents, RAG pipelines, multi-agent orchestrators — while architecting Azure Databricks platforms ingesting 40M+ records daily at TD Bank.
I'm Scott Shi, an AI Engineer & Data Engineer at TD Bank in Toronto, currently building production-grade AI systems alongside enterprise data platforms.
I've spent 10+ years across Canada's top financial institutions (TD, Sun Life, Desjardins) and the NFL, delivering pipelines, cloud migrations, and ML feature stores that run at real scale.
On the AI side, I'm actively building LLM-based agents, RAG pipelines, and multi-agent orchestrators.
Architected and delivered an end-to-end data platform on Azure Databricks (PySpark / Delta Lake) ingesting 40M+ records daily, improving pipeline SLA adherence from 94% to 99.8%. Engineered a storage optimization layer using Delta Lake partitioning, Z-ordering, and compaction that cut compute costs by 35% and query latency by 40%. Built a self-serve analytics framework enabling 5+ business teams to independently access curated datasets, reducing ad-hoc requests by 60%. Drove data governance across 50+ tables, establishing lineage documentation and data contracts.
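For a flavor of what that storage optimization looks like in practice, here's a minimal PySpark sketch; the table and column names (raw.events, ingest_date, customer_id) are illustrative, not the production ones.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided automatically on Databricks

# Partition by a low-cardinality date column so date-bounded queries
# prune whole directories instead of scanning the full table.
(spark.table("raw.events")
     .write.format("delta")
     .partitionBy("ingest_date")
     .mode("overwrite")
     .saveAsTable("curated.events"))

# Compact small files and co-locate rows on a high-cardinality filter column;
# Z-ordering lets data skipping work on columns other than the partition key.
spark.sql("OPTIMIZE curated.events ZORDER BY (customer_id)")

# Reclaim files left behind by compaction (default 7-day retention window).
spark.sql("VACUUM curated.events")
```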
Delivered enterprise data projects across TD Bank, Sun Life, Desjardins, and the NFL:
Monitored and maintained daily data ingestion processes across multiple platforms serving NFL operational reporting. Troubleshot ingestion failures, latency issues, and pipeline bottlenecks. Implemented automated checks, alerts, and validation frameworks to improve reliability and reduce downtime.
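A simplified version of the kind of automated check involved, assuming a Delta table with an ingest_date column (the table name and threshold are hypothetical):

```python
from datetime import date
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def validate_daily_load(table: str, min_rows: int) -> None:
    """Fail fast when today's partition is missing or suspiciously thin."""
    todays_rows = (spark.table(table)
                   .where(F.col("ingest_date") == F.lit(str(date.today())))
                   .count())
    if todays_rows < min_rows:
        # In production this would page an alerting channel rather than raise.
        raise RuntimeError(f"{table}: expected >= {min_rows:,} rows, got {todays_rows:,}")

validate_daily_load("ops.game_day_feeds", min_rows=100_000)
```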
Designed a full-cycle distributed ETL platform on Azure Databricks processing 5+ TB of financial data daily. Led and mentored a 3-engineer squad. Achieved a 60% reduction in Spark job execution time through AQE tuning, shuffle optimization, and right-sized clusters. Reduced storage costs 25% via partition pruning and Parquet compression.
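The AQE and shuffle tuning behind that speedup boils down to a handful of session configs like these; the advisory partition size is workload-dependent, not a universal value.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Let Spark re-plan joins and partition counts from runtime shuffle statistics.
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Merge many tiny post-shuffle partitions into fewer, right-sized ones.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Split skewed shuffle partitions so a single hot key can't stall the stage.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Advisory size per post-shuffle partition; tune against executor memory.
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")
```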
Built 8 ML feature pipelines powering ASAP credit risk models over billions of customer records within tight latency SLAs. Achieved 50% processing speed improvement via Spark tuning, broadcast joins, and caching. Developed a modular PySpark library and Airflow DAG orchestration that reduced pipeline failure rate by 70%.
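Two of the techniques named there, sketched with hypothetical table names: broadcast the small dimension table so the billion-row side never shuffles, and cache a frame that several feature aggregations reuse.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

transactions = spark.table("risk.transactions")    # billions of rows
segments = spark.table("risk.customer_segments")   # small dimension table

# Broadcasting the small side ships it to every executor, so the
# billion-row fact table never has to shuffle for the join.
joined = transactions.join(F.broadcast(segments), "customer_id")

# Cache once; multiple downstream feature aggregations reuse the same scan.
joined.cache()

monthly_spend = (joined.groupBy("customer_id", "txn_month")
                 .agg(F.sum("amount").alias("monthly_spend")))
txn_velocity = (joined.groupBy("customer_id", "txn_month")
                .agg(F.count("*").alias("txn_count")))
```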
Led root-cause analysis and remediation of 20+ data discrepancies between on-prem and cloud systems, achieving 100% reconciliation accuracy across billions of records. Migrated legacy SSIS and T-SQL pipelines to modular PySpark, cutting reporting errors by 45% and pipeline execution time by 30%.
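The reconciliation itself followed a simple two-gate pattern, sketched here with hypothetical table names: exact row counts first, then an order-independent content check via per-row hashes.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

legacy = spark.table("onprem_mirror.balances")   # snapshot of the on-prem output
migrated = spark.table("cloud.balances")         # output of the new PySpark pipeline

# Gate 1: row counts must match exactly.
assert legacy.count() == migrated.count(), "row-count mismatch"

# Gate 2: hash every row over a fixed column order, then diff the hash sets.
# (concat_ws skips nulls, so production code would null-guard each column.)
def row_hashes(df):
    return df.select(F.sha2(F.concat_ws("||", *sorted(df.columns)), 256).alias("h"))

mismatched = row_hashes(legacy).exceptAll(row_hashes(migrated)).count()
assert mismatched == 0, f"{mismatched:,} rows differ between systems"
```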
Built and maintained complex multi-source data pipelines serving 5+ internal product teams, resolving scalability bottlenecks and reducing ad-hoc query SLA breaches by 40%. Conducted data profiling and quality analysis across TB-scale datasets, implementing cleansing and enrichment strategies aligned with enterprise data integrity standards. Collaborated end-to-end with project managers and stakeholders across Agile delivery cycles.
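The profiling step reduces to a single aggregation pass per table, along these lines (table name illustrative): null counts plus approximate distinct counts per column, which together surface the columns that need cleansing or enrichment.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("analytics.customer_events")

# One pass over the data: null count and approximate distinct count per column.
profile = df.agg(
    *[F.sum(F.col(c).isNull().cast("long")).alias(f"{c}__nulls") for c in df.columns],
    *[F.approx_count_distinct(c).alias(f"{c}__distinct") for c in df.columns],
)
profile.show(vertical=True, truncate=False)
```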
Engineered and scaled the Centralized Campaign View platform using SAS and Spark, consolidating TB-scale data from 8+ digital channels to support 20+ Agile Marketing Pods. Developed Spark ingestion pipelines handling terabytes of customer interaction data for personalized campaign targeting of 3M+ policyholders. Automated batch jobs via SAS Macros, cutting manual processing by 50% and supporting $100M+ in annual marketing spend.
Built SAS/SQL ETL scripts to extract and transform data from multiple sources, enabling post-campaign analytics (ROI, response rate, incremental revenue) that informed strategic budget allocation. Delivered campaign performance reports using KPIs — ROI, response rate, new subscriber growth — that drove data-backed decisions across sales and marketing leadership.
A production-grade AI data platform with three integrated systems: a SQL Agent (LLaMA 3.3 70B + DuckDB) that converts natural language to SQL and explains results in plain English; a RAG pipeline (ChromaDB + ONNX embeddings) for business document Q&A with source citations and hallucination prevention; and a multi-agent orchestrator that routes queries via intent classification. Full observability via MLflow, served as a FastAPI REST API with Streamlit UI, containerized with Docker Compose, deployed via GitHub Actions CI/CD.
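At its core the orchestrator is a router: classify intent, then dispatch to the SQL agent or the RAG pipeline. Here's a toy version with the LLM calls stubbed out; the real system generates SQL with LLaMA 3.3 70B and retrieves from ChromaDB.

```python
import duckdb

con = duckdb.connect()  # in-memory DuckDB standing in for the analytics database
con.execute("CREATE TABLE orders AS SELECT range AS order_id FROM range(100)")

def classify_intent(question: str) -> str:
    """Stub for the LLM intent classifier."""
    keywords = ("total", "count", "average")
    return "sql" if any(w in question.lower() for w in keywords) else "rag"

def sql_agent(question: str) -> str:
    generated_sql = "SELECT count(*) FROM orders"  # stub for LLM text-to-SQL
    rows = con.execute(generated_sql).fetchall()
    return f"SQL agent result: {rows[0][0]}"

def rag_agent(question: str) -> str:
    return "RAG agent: retrieve from vector store, answer with citations."  # stub

def route(question: str) -> str:
    return sql_agent(question) if classify_intent(question) == "sql" else rag_agent(question)

print(route("What is the total count of orders?"))   # -> SQL agent result: 100
print(route("Summarize the Q3 compliance policy."))  # -> RAG agent stub
```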
Led on-prem to Azure cloud migration for TD Bank's critical data infrastructure. Achieved 100% reconciliation accuracy across systems serving billions of records. Migrated legacy SSIS packages and T-SQL stored procedures to modular PySpark pipelines, cutting reporting errors by 45% and execution time by 30%. Maintained CI/CD hygiene via Bitbucket with zero-downtime deployments.
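The migration pattern, roughly: each stored procedure becomes a small, unit-testable PySpark function. A sketch with generic names (the real logic is TD-internal):

```python
from pyspark.sql import DataFrame, SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def daily_balance_summary(accounts: DataFrame, transactions: DataFrame) -> DataFrame:
    """Stand-in for a legacy T-SQL stored procedure, now testable in isolation."""
    return (transactions
            .join(accounts, "account_id")
            .groupBy("account_id", "txn_date")
            .agg(F.sum("amount").alias("net_change"),
                 F.count("*").alias("txn_count")))

summary = daily_balance_summary(spark.table("core.accounts"),
                                spark.table("core.transactions"))
summary.write.format("delta").mode("overwrite").saveAsTable("reporting.daily_balances")
```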
Most teams use Delta Lake but skip the tuning. Z-ordering, compaction strategies, and partition pruning are where the real performance gains live — here's exactly what we did.
After building production ML feature pipelines for major financial institutions, here's what makes the difference between pipelines that work in dev and ones that hold up under real SLAs.
Cloud migrations fail at data reconciliation. Here's the systematic approach — data profiling, iterative validation, Alteryx-based guardrails — that got us to perfect accuracy at TD Bank.