Clean Data, Smarter AI: Build, Validate, and Govern AI-Ready Data Pipelines

Before each live online session, Tech Training will provide a Zoom link and any required class materials.

Explore the tools, techniques, and architectural patterns that data teams use to profile, clean, govern, and trace data throughout its lifecycle, from raw ingestion to model-ready gold datasets.

Program Description

AI models are only as reliable as the data they learn from. In this hands-on workshop, work with profiling tools, implement a medallion pipeline in dbt, apply validation gates, handle PII, detect bias in training data, and process unstructured sources such as PDFs and images.

This workshop is best suited for data engineers, data scientists, and ML practitioners who suspect data quality is limiting model performance.

Learning Objectives

Learners will have the opportunity to:
1. Profile datasets using tools like ydata-profiling and Great Expectations to identify quality failures across accuracy, completeness, latency, and bias dimensions
2. Design and implement a medallion architecture (Bronze, Silver, Gold) in dbt with validation gates between each tier
3. Trace data lineage using OpenLineage and dbt to support audit and compliance workflows
4. Detect and redact PII using Presidio and custom NER pipelines, including handling right-to-be-forgotten requirements
5. Apply systematic bias detection frameworks to training data and explore mitigation strategies
6. Process unstructured sources including PDFs, images, and audio using extraction tools and connect them to the medallion framework
7. Evaluate chunking and embedding quality as a data quality concern, not just an engineering decision

Topic Outline

Topics include:

Session 1: The Data Quality Problem
- Why data quality breaks AI: live demo of clean vs. dirty data in a RAG pipeline
- The four dimensions of data quality: accuracy, completeness, latency, and bias
- Profiling and diagnostics with ydata-profiling and Great Expectations
- Pattern recognition: common failure modes in AI training data
- Lab: Data quality audit across three datasets of varying quality
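
As a rough illustration of what a completeness profile measures, the sketch below computes per-column fill rates and counts exact duplicate rows in plain Python. It is not the ydata-profiling or Great Expectations API; the function names, the sample rows, and the set of values treated as "missing" are all assumptions made for this example.

```python
from collections import Counter

# Values treated as missing. This set is an assumption for the example;
# real datasets usually need their own list of sentinel values.
MISSING = (None, "", "N/A")


def profile_completeness(rows, missing=MISSING):
    """Return the fraction of non-missing values for each column."""
    columns = sorted({key for row in rows for key in row})
    total = len(rows)
    report = {}
    for col in columns:
        present = sum(1 for row in rows if row.get(col) not in missing)
        report[col] = present / total if total else 0.0
    return report


def duplicate_count(rows):
    """Count rows that exactly repeat an earlier row."""
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in seen.values())


# Hypothetical sample data: one row with gaps and one exact duplicate.
sample_rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "", "age": None},
    {"id": 3, "email": "c@example.com", "age": 41},
    {"id": 3, "email": "c@example.com", "age": 41},
]

completeness = profile_completeness(sample_rows)
duplicates = duplicate_count(sample_rows)
```

A dedicated profiling tool reports far more (distributions, correlations, type inference), but fill rate and duplicate count are often among the first signals in a quality audit.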

Session 2: Medallion Architecture and Transformation Design
- Bronze, Silver, Gold: quality guarantees at each tier and design patterns for AI workloads
- Implementing medallion tiers in dbt: writing, testing, and layering transformations
- Validation gates with Great Expectations or Soda between tier transitions
- Mapping medallion patterns to lakehouse (Databricks Delta Lake) and warehouse architectures
- Lab: Build a three-tier dbt pipeline from raw ingestion to model-ready data
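
The idea of a validation gate between tiers can be sketched in plain Python: rows are promoted only if they pass every check, and the whole batch is rejected when too many rows fail. This is a simplified stand-in, not dbt, Great Expectations, or Soda syntax; `Check`, `GateError`, `run_gate`, and the sample checks are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    """A named row-level rule (hypothetical helper for this sketch)."""
    name: str
    predicate: Callable[[dict], bool]


class GateError(Exception):
    """Raised when a batch fails the gate and must not be promoted."""


def run_gate(rows, checks, max_failure_rate=0.0):
    """Return (passing_rows, failures_per_check) or raise GateError."""
    failures = {check.name: 0 for check in checks}
    passed = []
    for row in rows:
        row_ok = True
        for check in checks:
            if not check.predicate(row):
                failures[check.name] += 1
                row_ok = False
        if row_ok:
            passed.append(row)
    failure_rate = 1 - len(passed) / len(rows) if rows else 0.0
    if failure_rate > max_failure_rate:
        raise GateError(f"{failure_rate:.0%} of rows failed: {failures}")
    return passed, failures


# Hypothetical Bronze-to-Silver rules.
silver_checks = [
    Check("order_id_present", lambda r: r.get("order_id") is not None),
    Check(
        "amount_non_negative",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
]

bronze_rows = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": None, "amount": 5.00},
    {"order_id": 3, "amount": 42.00},
]

# Tolerate up to half the rows failing; failing rows are dropped, not promoted.
silver_rows, failure_counts = run_gate(bronze_rows, silver_checks, max_failure_rate=0.5)
```

Setting `max_failure_rate=0.0` turns the gate into a strict one that blocks promotion if any row fails, which is one possible design choice for a Silver-to-Gold transition.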

Session 3: Lineage, Governance, and Bias
- Implementing data lineage with OpenLineage and dbt for full traceability
- PII detection and redaction pipelines using Presidio and custom NER
- Right to be forgotten: technical patterns for compliance in training datasets
- Systematic bias detection frameworks and case studies of bias propagation
- Lab: Lineage audit and PII sweep on a sample dbt project
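
To give a sense of what a PII sweep does, the sketch below replaces matches with typed placeholders and counts them using two regular expressions. It deliberately does not use Presidio or an NER model; the patterns, labels, and the `redact` function are assumptions for illustration, and regex alone will miss many real-world PII formats such as names and addresses.

```python
import re

# Illustrative patterns only: a simple email shape and a US-style
# 3-3-4 phone number. Production sweeps need much broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text):
    """Replace each match with <LABEL> and return (text, counts per label)."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"<{label}>", text)
        counts[label] = n
    return text, counts


redacted, found = redact("Contact jane.doe@example.com or 650-555-0100.")
```

Keeping the per-label counts alongside the redacted text makes it possible to log how much PII each source contributed without storing the values themselves.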

Session 4: Unstructured Data Quality and Capstone
- OCR and NLP-based extraction with unstructured.io: handling PDFs, images, and audio
- Applying quality standards to unstructured extraction output
- Chunking strategy and embedding quality as data quality concerns
- Capstone lab: end-to-end pipeline from mixed corpus to governed gold dataset with full lineage
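
One way to treat chunking as a data quality concern is to measure the chunks it produces. The sketch below splits text into fixed-size word windows with overlap and reports simple size statistics. The word-based splitting, the default sizes, and the `chunk_words` and `chunk_report` names are assumptions for this example; it is not the unstructured.io API, and real pipelines often chunk by tokens or document structure instead.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words that share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def chunk_report(chunks, min_words=3):
    """Summarize chunk sizes and flag chunks below `min_words`."""
    lengths = [len(chunk.split()) for chunk in chunks]
    return {
        "count": len(lengths),
        "shortest": min(lengths, default=0),
        "longest": max(lengths, default=0),
        "too_short": sum(1 for n in lengths if n < min_words),
    }


# Hypothetical 12-word document, chunked into 5-word windows with 2 words shared.
document = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(document, size=5, overlap=2)
report = chunk_report(chunks)
```

A high `too_short` count or a wide gap between `shortest` and `longest` can indicate that the chunking strategy is producing fragments that may embed poorly.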

Custom training workshops are available for this program

Custom workshops are technology training sessions structured around individual or group learning objectives.

Special Group Rates

For groups of 5 or more within the same team or department, special rates are available. Please contact techtraining@stanford.edu for more details.

University IT Technology Training sessions are available to a wide range of participants, including Stanford University staff, faculty, students, and employees of Stanford Hospitals & Clinics, such as Stanford Health Care, Stanford Health Care Tri-Valley, Stanford Medicine Partners, and Stanford Medicine Children's Health.

Additionally, some of these programs are open to interested individuals not affiliated with Stanford, allowing for broader community engagement and learning opportunities.