AI & Machine Learning
01 · AI / Orchestration

LangGraph Agentic Pipeline Orchestrator

A multi-step SPED data pipeline required manual, sequential execution every week — steps were skipped, run out of order, and failures were silent.

  • LangGraph StateGraph with gate node + conditional routing: downstream stages only run when ingestion succeeds
  • Fan-out parallelism to 4 concurrent stages (LRE, IEP compliance, predictions, LEA reports) with a custom TypedDict reducer for safe concurrent state writes
  • Local Ollama LLM (qwen2.5:7b) analyzes failure logs and suggests fixes — student data never leaves the server
  • Tkinter GUI monitor with real-time streaming log output for non-technical operators
LangGraph · Ollama · qwen2.5:7b · Python · subprocess · Tkinter
75–90% runtime reduction · replaced 4–6 hr manual weekly process with one command
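The fan-out stage above depends on a reducer that merges concurrent state writes safely. A minimal sketch of that idea in plain Python — LangGraph attaches such a reducer to a state channel via `Annotated`; the state shape, field name, and stage names here are illustrative, not the project's actual schema:

```python
import operator
from typing import Annotated, TypedDict

class PipelineState(TypedDict):
    # List-valued channel: parallel stages append partial updates and the
    # Annotated reducer (operator.add) concatenates them deterministically.
    stage_logs: Annotated[list[str], operator.add]

def merge_updates(current: PipelineState, updates: list[PipelineState]) -> PipelineState:
    """Apply each parallel stage's partial update using the channel reducer."""
    logs = list(current["stage_logs"])
    for update in updates:
        logs = operator.add(logs, update["stage_logs"])  # the declared reducer
    return {"stage_logs": logs}

state: PipelineState = {"stage_logs": ["ingest: ok"]}
fan_out = [
    {"stage_logs": ["LRE: done"]},
    {"stage_logs": ["IEP compliance: done"]},
]
merged = merge_updates(state, fan_out)
```

Without a reducer, two concurrent writers to the same key would race; declaring the merge function makes fan-out order-independent for list channels.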
02 · Machine Learning

6-Domain Student Outcome Prediction System

Districts only learned about poor student outcomes after the fact — no forward-looking indicators existed for proactive intervention.

  • 6-domain prediction: CAASPP ELA, CAASPP Math, ELPAC, Chronic Absenteeism, College/Career, Suspension — each with a dedicated analyzer class
  • Feature engineering from multi-year SQL Server data: prior year scores, year-over-year deltas, 3-year rolling means, trend slopes
  • Per-district model selection across 4 candidate models (Random Forest, Linear, ARIMA, Holt-Winters) ranked by R² on held-out validation data
  • OpenAI (GPT-4o-mini) + Gemini (gemini-1.5-flash) fallback chain generates plain-language administrative narratives with hallucination validation
scikit-learn · statsmodels · ARIMA · Holt-Winters · OpenAI API · Gemini API · matplotlib
Deployed 20+ districts · proactive intervention 4–6 months before year-end outcomes
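The per-district model selection step can be sketched as a held-out R² ranking. A minimal illustration in plain Python, using invented validation numbers and only two of the four candidate model names:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def select_best_model(validation_actuals, candidate_predictions):
    """Rank candidates by R^2 on held-out validation data; return the winner."""
    scores = {name: r_squared(validation_actuals, preds)
              for name, preds in candidate_predictions.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical held-out scale scores for one district
actual = [2480, 2495, 2510, 2530]
candidates = {
    "random_forest": [2482, 2493, 2512, 2528],
    "linear":        [2470, 2500, 2505, 2545],
}
best, scores = select_best_model(actual, candidates)
```

Running the selection per district, rather than globally, lets each district keep whichever model its own history favors.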
03 · ML Forecasting

Per-School Suspension Rate Forecasting

Suspension rates varied wildly by school and had different seasonal patterns — a single global model was unreliable for planning intervention resources.

  • 5 models trained per school: ETS (Holt-Winters), ARIMA, Prophet, Random Forest, linear baseline — with MAE evaluation against 2024–2025 actuals
  • Ensemble of top 3 (ETS + ARIMA + Prophet) weighted by inverse validation error
  • Per-school comparison charts showing all model projections vs. actuals; summary CSV ranked by RMSE
  • ARIMA exception handling for schools with too-short history to converge
Prophet · ARIMA · ETS · scikit-learn · statsmodels · matplotlib · seaborn
Ensemble outperformed any single model in 70%+ of schools tested
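The inverse-validation-error weighting described above can be sketched as follows: each model's weight is proportional to 1/MAE, normalized to sum to one. The MAE values and forecast numbers below are invented for illustration:

```python
def inverse_error_weights(mae_by_model):
    """Weight each model by 1/MAE, normalized so weights sum to 1."""
    inv = {m: 1.0 / e for m, e in mae_by_model.items()}
    total = sum(inv.values())
    return {m: v / total for m, v in inv.items()}

def ensemble_forecast(forecasts, weights):
    """Weighted average of aligned per-model forecast horizons."""
    horizon = len(next(iter(forecasts.values())))
    return [sum(weights[m] * forecasts[m][t] for m in forecasts)
            for t in range(horizon)]

mae = {"ets": 0.5, "arima": 1.0, "prophet": 2.0}   # illustrative validation errors
w = inverse_error_weights(mae)
combined = ensemble_forecast(
    {"ets": [4.0, 4.2], "arima": [4.4, 4.6], "prophet": [5.0, 5.2]}, w)
```

The better-validated model dominates the blend but never silences the others, which is what lets the ensemble hedge against any single model's seasonal blind spot.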
04 · AI / Algorithm

School Recommendation Engine

No systematic way existed to find nearby California schools that consistently outperform a given school across multiple indicators, while handling real-world data gaps correctly.

  • Haversine 30-mile radius search narrows a statewide dataset to geographically relevant candidates
  • Progressive multi-indicator filter: applies each available metric sequentially — schools without data for an indicator are excluded from that step only, not globally
  • Correctly distinguishes "no data" from "underperforming" — a critical design decision that naive threshold filtering misses
  • Folium interactive map output + ranked summary CSV by number of indicators outperformed
Python · pandas · Haversine · Folium · geopy · googlemaps
6 indicators · statewide coverage · data-gap-aware recommendation algorithm
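The 30-mile radius narrowing step can be sketched with the standard Haversine formula. The school records and coordinates below are hypothetical examples, not project data:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def schools_within_radius(origin, schools, radius_miles=30.0):
    """Keep only schools whose coordinates fall inside the radius."""
    lat0, lon0 = origin
    return [s for s in schools
            if haversine_miles(lat0, lon0, s["lat"], s["lon"]) <= radius_miles]

origin = (32.7157, -117.1611)  # San Diego
schools = [
    {"name": "nearby",  "lat": 32.80, "lon": -117.10},   # a few miles away
    {"name": "distant", "lat": 34.05, "lon": -118.24},   # Los Angeles
]
nearby = schools_within_radius(origin, schools)
```

Filtering on distance first keeps the subsequent multi-indicator comparison small enough to run interactively on a statewide dataset.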
Data Engineering & ETL
05 · Data Engineering

SEIS Input Data Pipeline

SEIS export files from 25 districts required weekly manual combining, normalizing, and routing into a single warehouse-ready format.

  • Multi-source ingestion combining per-district SEIS CSV exports with configurable column normalization
  • JCCS merge: 4 sub-entities aggregated into a single district record before routing
  • File routing by LEA code to district-specific destinations; CLI stage flags for selective re-runs
Python · pandas · pathlib · argparse
Eliminated weekly manual file processing across 25 LEAs
06 · ETL / Validation

CALPADS Extract Uploader

25+ districts' CALPADS extract files arrived with delimiter errors, wrong LEA codes, and encoding issues — manual upload was silently corrupting warehouse data.

  • Pre-flight LEA code validation in column 3 against expected values before any upload begins
  • Auto-fix: trailing delimiter detection with `--fix` (preview) and `--apply-fix` (correct in-place) CLI flags
  • Idempotent truncate+reload pattern: safe to rerun after partial failure; SQLAlchemy fast_executemany for bulk performance
  • Freshness upsert to SQL Server DataDateTime table after each successful load — visible in Power BI staleness indicators
Python · pandas · SQLAlchemy · pyodbc · argparse
Auto-fix resolved 80% of errors · eliminated silent data corruption
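The trailing-delimiter auto-fix can be sketched as follows, assuming caret-delimited CALPADS rows and the behavior described above: detect rows carrying exactly one spurious trailing delimiter and strip it, leaving conforming rows untouched. Field counts and row contents below are invented:

```python
def fix_trailing_delimiters(lines, expected_fields, delimiter="^"):
    """Strip a single spurious trailing delimiter from over-long rows.

    Returns (fixed_lines, number_of_rows_fixed) so a --fix preview mode
    can report counts without writing anything back.
    """
    fixed, n_fixed = [], 0
    for line in lines:
        fields = line.split(delimiter)
        if len(fields) == expected_fields + 1 and fields[-1] == "":
            line = line[: -len(delimiter)]
            n_fixed += 1
        fixed.append(line)
    return fixed, n_fixed

rows = ["SENR^0612345^1234567^A^",   # trailing caret: 5 fields, last empty
        "SENR^0612345^7654321^B"]    # clean row: left as-is
fixed, n = fix_trailing_delimiters(rows, expected_fields=4)
```

Keeping detection separate from writing is what allows the preview (`--fix`) versus in-place correction (`--apply-fix`) split described above.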
07 · ETL / Automation

CALPADS ODS Download Pipeline

CALPADS ODS reports needed regular download for 25+ LEAs across both REST API-accessible and portal-only formats, using two different credential sets.

  • REST API with token auth for 5 report types; Selenium browser automation for 2 portal-only SSRS reports
  • Dual-account LEA routing: `lea_data.py` maps each district to its assigned CALPADS account automatically
  • `--download` / `--upload` mode flags: re-run uploads without re-downloading already-retrieved files
Python · requests · Selenium · pandas · SQLAlchemy
25+ districts · 7 report types · replaced 45-min manual weekly session
08 · ETL / SFTP

SouthFork SIRAS Pipeline

A rural LEA's SEIS export schema was entirely different from the county's required 80+ column SIRAS format — requiring weekly manual field mapping, date normalization, and disability code translation.

  • Playwright downloads the raw SEIS export in a single unattended session
  • 80+ column pandas transform: disability code lookup dicts (30+ entries), grade level code translation, multi-format date normalization to SIRAS required format
  • Column count validation before output; Paramiko SFTP upload to county server
  • subprocess orchestrator chains all 3 stages with stage-level error logging
Python · Playwright · pandas · paramiko
Automated weekly SIRAS delivery for rural LEA · eliminated manual schema translation
09 · ETL / Automation

IVA SIS Extract Pipeline

A charter school's Aeries SIS CALPADS extracts required navigating a complex UI, unzipping downloads, renaming files to a standard convention, and uploading via SFTP each cycle.

  • Playwright automation with Kendo UI-aware selectors navigates Aeries SIS, selects each CALPADS extract type, and downloads the generated ZIP
  • Subprocess unzip + regex-based rename to `IVA_<type>.txt` naming convention
  • Paramiko SFTP uploads all `IVA_*.txt` files to staging server; each stage independently runnable for reruns
Python · Playwright · paramiko · zipfile
End-to-end automation replacing multi-step manual portal + upload workflow
10 · ETL / Format Engineering

BEST Academy STAS Attendance Pipeline

A charter school's SIS generated attendance in a proprietary CSV format entirely incompatible with the CALPADS STAS 23-field caret-delimited fixed format required for state submission.

  • Playwright downloads the attendance CSV from the SIS portal in an unattended session
  • 23-field STAS converter: maps source columns, derives `other_days` (total minus present minus absent), calculates dynamic school-year string
  • Paramiko SFTP delivers the formatted file; python-dotenv for credential management
Python · Playwright · pandas · paramiko · python-dotenv
Automated weekly CALPADS attendance submission · replaced manual format conversion
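The converter's core derivations can be sketched in a few lines: `other_days` as total minus present minus absent, the dynamic school-year string, and the caret-joined 23-field output row. The field layout below is a placeholder, not the actual STAS field order:

```python
def derive_other_days(total_days, present_days, absent_days):
    """STAS `other_days` as described: total minus present minus absent."""
    return total_days - present_days - absent_days

def school_year_string(year_start):
    """Dynamic school-year string, e.g. 2024 -> '2024-2025'."""
    return f"{year_start}-{year_start + 1}"

def to_stas_line(fields):
    """Join a 23-field record with the caret delimiter STAS requires."""
    if len(fields) != 23:
        raise ValueError(f"STAS requires 23 fields, got {len(fields)}")
    return "^".join(str(f) for f in fields)

other = derive_other_days(total_days=180, present_days=168, absent_days=9)
year = school_year_string(2024)
# Placeholder layout: record type, school year, 20 unshown fields, other_days
line = to_stas_line(["STAS", year] + [""] * 20 + [other])
```

Validating the field count before joining catches mapping bugs at conversion time rather than at state submission.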
11 · Data Quality

SEIS Data Corrector

SEIS exports contained incorrect district name assignments caused by student transfer timing, propagating errors into downstream compliance reports.

  • Modular pipeline: combine raw exports → correct via reference mapping → organize into per-district CSVs
  • Backup of originals before any correction; slugified output filenames for cross-platform compatibility
  • Reference mapping table updated independently of the pipeline code
Python · pandas · python-slugify
Correct district assignments in all downstream compliance reporting
12 · Data Engineering

File Format Conversion Utilities

Education data files arrived in varying formats and encodings (CSV/TXT, UTF-8/CP1252) incompatible with legacy CALPADS upload systems.

  • CSV-to-TXT streaming converter: memory-efficient row-by-row processing for large files
  • UTF-8 to CP1252 re-encoder: handles characters outside the CP1252 range with configurable replacement
  • Multi-year historical CSV concatenator: combines annual exports with consistent header management
Python · pandas
All file types uploadable to legacy CALPADS systems without manual intervention
13 · Data Quality

Extract Checker — Pre-Upload Validation

Mis-routed or misconfigured CALPADS extract files would silently overwrite correct warehouse data with no error raised during upload.

  • Pre-flight scan checks the LEA code in column 3 of every extract file against expected values for each of 17 districts
  • Generates a pass/fail report before any upload begins — bad files never reach the database
  • Config-driven: expected LEA codes maintained in a separate mapping file, not hardcoded
Python · pandas
Prevented warehouse corruption · all bad files caught at validation stage
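The pre-flight LEA code check can be sketched as a scan over each file's rows, comparing column 3 against the expected value from the mapping file. The delimiter, row contents, and LEA codes below are invented for illustration:

```python
def validate_lea_codes(lines, expected_lea, delimiter="^", column=3):
    """Pre-flight check: every row's LEA code (1-based column index)
    must match the expected value for this district.

    Returns (passed, bad_row_numbers) so a report can point at exact rows.
    """
    bad = []
    for i, line in enumerate(lines, start=1):
        fields = line.split(delimiter)
        if len(fields) < column or fields[column - 1] != expected_lea:
            bad.append(i)
    return (not bad), bad

rows = [
    "SENR^2024^3712345^...",   # matches the expected code (hypothetical)
    "SENR^2024^3799999^...",   # mis-routed content from another district
]
passed, bad_rows = validate_lea_codes(rows, expected_lea="3712345")
```

Because the scan runs before any upload, a mis-routed file produces a failed report instead of a silent overwrite.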
14 · Data Quality

Verify Dropouts — Enrollment Verification

Students exit-coded as dropouts may have re-enrolled elsewhere, inflating dropout rates in federal compliance reports — verifying each SSID manually against live CALPADS was taking 3–4 hours per LEA.

  • ThreadPoolExecutor parallel Selenium workers: each worker has its own ChromeDriver instance, logs to a per-worker queue
  • Deduplication runs before workers start; batch REST API write-back with chunked requests and single-item fallback on 404
  • Worker count and batch size configurable via environment variables; JSON audit logs with timing per worker
Python · Selenium · ThreadPoolExecutor · requests
3–4 hr → 25–35 min per LEA · 8× speedup with 8 parallel workers
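The parallel verification design can be sketched with the stdlib `concurrent.futures` module. Here the real Selenium lookup is replaced by a placeholder function, and each thread logs to its own queue keyed by thread name, mirroring the per-worker ChromeDriver + log design described above (SSIDs and the re-enrollment rule are invented):

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def verify_ssid(ssid):
    """Placeholder for the real Selenium lookup against live CALPADS."""
    return {"ssid": ssid, "re_enrolled": ssid.endswith("7")}

def make_worker(log_queues):
    def worker(ssid):
        # Each thread writes to its own queue so worker logs never interleave.
        q = log_queues.setdefault(threading.current_thread().name, queue.Queue())
        result = verify_ssid(ssid)
        q.put(f"verified {ssid}")
        return result
    return worker

# Deduplicate before dispatch, as described above
ssids = sorted({"1000000001", "1000000007", "1000000007"})
log_queues: dict[str, queue.Queue] = {}
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(make_worker(log_queues), ssids))
```

Deduplicating first means no two workers ever hit the portal for the same SSID, which matters when each lookup costs tens of seconds of browser time.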
Compliance & Analytics
15 · Compliance

IEP Compliance Pipeline

21 LEAs were manually tracking IEP annual review and triennial re-evaluation deadlines in spreadsheets — missing a federal IDEA deadline is a compliance violation.

  • SQL query identifies students with IEPs due within configurable windows
  • Playwright downloads invitation and assessment plan PDFs from SEIS with MFA detection and handling
  • PDF date parser extracts meeting dates, assessment dates, and invitation dates across multiple form layouts
  • Computes days until/past deadline per student; applies green/yellow/orange/red/grey risk scoring based on encoded California IDEA rules (annual ≤365 days, triennial ≤3 years, assessment plan ≤60 days)
Python · Playwright · pdfplumber · openpyxl · SQLAlchemy · Power BI
21 districts · zero missed IEP deadlines after deployment
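The deadline computation and risk banding can be sketched as below. Only the annual-review-within-365-days rule comes from the description above; the band widths (14 and 45 days) are illustrative, not the production thresholds:

```python
from datetime import date, timedelta

ANNUAL_REVIEW_DAYS = 365  # encoded rule: annual IEP review within 365 days

def days_until_due(last_annual_meeting, today):
    """Days remaining before the annual review deadline (negative = past due)."""
    due = last_annual_meeting + timedelta(days=ANNUAL_REVIEW_DAYS)
    return (due - today).days

def risk_tier(days_left):
    """Green/yellow/orange/red banding; grey means no date could be parsed.
    Band widths here are illustrative only."""
    if days_left is None:
        return "grey"
    if days_left < 0:
        return "red"
    if days_left <= 14:
        return "orange"
    if days_left <= 45:
        return "yellow"
    return "green"

tier = risk_tier(days_until_due(date(2024, 9, 1), today=date(2025, 8, 20)))
```

Treating an unparseable PDF date as grey rather than green is the same "no data is not good news" principle the recommendation engine applies.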
16 · Compliance

LRE Compliance Reporting

15 LEAs needed IDEA Least Restrictive Environment placement percentages calculated from SEIS data and compared against California state benchmarks — a calculation that required deep understanding of federal LRE category definitions.

  • 4 federal LRE placement categories (≥80% GE time, 40–79%, <40%, separate school) correctly encoded from IDEA regulation, validated against known state reports
  • Preschool LRE uses different category definitions; JCCS 4 sub-entities aggregate separately before district rollup
  • Per-LEA Jupyter notebooks run headlessly via nbconvert runner; SQL Server write-back for Power BI
Python · pandas · Jupyter · nbconvert · SQLAlchemy
15 districts · automated LRE reporting · replaced days-long manual Excel process
17 · Analytics

EAP At-Risk Student Identification

14 LEAs had no consistent method to identify students needing SST referrals, 504 plans, or SEL support — each coordinator used a different ad-hoc process.

  • Multi-source join: CAASPP assessment scores, ELPAC results, discipline records (suspension days), and SPED indicators at the student level
  • 3 policy-encoded risk flags: SST (academic performance thresholds), 504 (eligibility indicators), SEL (suspension ≥5 days)
  • Color-coded Excel reports via openpyxl; SQL Server write-back for Power BI weekly review dashboards
Python · pandas · openpyxl · SQLAlchemy · pyodbc
14 LEAs · replaced 14 ad-hoc processes with one consistent system
18 · Analytics

Student Service Calculations & Predictions

15+ LEAs had no automated way to calculate current special education service hours by type (SLP, OT, PT) or project future service demand for resource planning.

  • Per-LEA Jupyter notebooks calculate service hours by type from SEIS data; nbconvert runner executes all notebooks headlessly in batch
  • Service hour projections using trend analysis for resource planning across the school year
  • openpyxl formatted reports; Selenium assists with data extraction from web portals where needed
Python · pandas · Jupyter · nbconvert · openpyxl · Selenium
15+ LEAs · weekly automated service calculation replacing manual Excel aggregation
19 · Demo / BI

IEP Compliance Demo Environment

Client prospects needed to experience the full iTAAP platform without accessing production student data — a demo environment required matching compliance logic without real records.

  • Complete sanitized Power BI demo suite (10+ .pbix files) across all dashboard types, using synthetic data
  • Python script exports 60+ IEP compliance color-coding rules (Initial eval, Annual IEP, Tri-Annual) across 3 compliance dimensions to machine-readable legend CSV
  • Documentation generation: rule descriptions formatted for both human review and system reference
Power BI Desktop · Python · pandas · openpyxl
Enabled client demos across all iTAAP dashboard types without production data exposure
Automation & Web Scraping
20 · Automation

SEIS Column Selection Automation

Each SEIS data export required manually clicking ~103 checkboxes in the SEIS portal per LEA, taking 30–60 minutes per cycle — and any missed column required starting over for that LEA.

  • 103 required columns externalized to `columns_needed.txt`; per-LEA credentials in git-ignored `credentials.json`
  • Selenium iterates through the column list, selects each checkbox, triggers export download, and saves with timestamps
  • Explicit waits (not fixed sleeps), retry logic for transient failures, and per-run timestamped log file
Python · Selenium · ChromeDriver
30–60 min → ~8 min unattended · eliminated missed-column errors entirely
21 · Automation

CALPADS 8.1a ODS Report Scraper

The CALPADS 8.1a Student Profile Exits report is embedded inside a nested SSRS iframe and cannot be accessed via API — downloading it required manual portal navigation every week.

  • Playwright handles CALPADS login, outer page navigation, iframe frame-switching (nested context manager)
  • Dynamic date parameterization injected into SSRS report viewer; nested popup + download event capture
  • Custom date-stamped filename written to output directory on success
Python · Playwright
Weekly CALPADS SSRS report automated · consistent date-stamped output
22 · Automation

SBCUSD SEIS Multi-Report Automation

A large school district needed 4 different SEIS report types downloaded weekly from different portal sections, with optional MFA and a cross-month datepicker that reset on navigation.

  • Single-session Playwright automation: MFA detection via regex match on page text before attempting bypass
  • Saved search URL execution for 2 report types; Service Tracker navigation for 1; MFA-aware general download for 1
  • Cross-month datepicker traversal: detects current displayed month and clicks navigation arrows as needed
  • Paramiko SFTP upload after all 4 downloads complete in one session
Python · Playwright · paramiko
4 SEIS report types automated in a single unattended session for a large district
23 · Automation

Automated Weekly Email Alert System

18 school sites had no regular visibility into their performance metrics — administrators only saw data when they logged into Power BI, which many never did.

  • Gmail API OAuth2 pipeline (credentials.json → token.json) with scoped send-only permissions
  • Dynamic metric selection: rotates 1 General Ed + 1 SPED metric from a configurable pool of 12 indicators each week
  • SQL Server queries build HTML-formatted data tables with inline styling for email clients
  • Test mode: single flag sends to internal addresses only — safe for non-technical staff to test
Python · Gmail API · OAuth2 · pandas · SQLAlchemy
18 school sites · 12 metric pool · zero-touch weekly delivery
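The weekly metric rotation can be sketched as a deterministic pick from two pools keyed on the ISO week number, so the same week always selects the same 1 General Ed + 1 SPED pair. The metric names below are invented placeholders, not the actual indicator pool:

```python
# Hypothetical pools: 6 General Ed + 6 SPED = the 12-indicator pool described
GENERAL_ED_POOL = ["chronic_absenteeism", "suspension_rate", "caaspp_ela",
                   "caaspp_math", "elpac", "graduation_rate"]
SPED_POOL = ["iep_compliance", "lre_placement", "service_hours",
             "triennial_due", "assessment_plans", "dropout_verification"]

def metrics_for_week(iso_week):
    """Deterministic rotation: 1 General Ed + 1 SPED metric per week,
    cycling through each pool by week number."""
    return (GENERAL_ED_POOL[iso_week % len(GENERAL_ED_POOL)],
            SPED_POOL[iso_week % len(SPED_POOL)])

ge, sped = metrics_for_week(14)
```

Keying on the week number rather than random choice makes reruns idempotent: resending a failed week's email picks the same metrics.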
24 · Automation

SQL Stored Procedure Runner

14 stored procedures per district database needed sequential execution with a validation checkpoint in the middle — manual runs caused missed steps, wrong order, and no audit record.

  • Python CLI + Tkinter GUI runner executes 3 SP groups per database in correct sequence
  • VerifyDropOuts checkpoint between groups: if it fails, downstream SPs are blocked (prevents calculations on unverified dropout data)
  • Each SP execution is timed and logged to a timestamped CSV with stdout/stderr capture
Python · pyodbc · Tkinter
Zero missed SP steps or wrong-order executions since deployment
25 · Automation

Power BI Bulk Refresh Automation

150+ .pbix files across 6 directories needed regular refresh after each pipeline run — manually opening and refreshing each file in Power BI Desktop was taking hours.

  • PowerShell WinForms GUI with folder picker and progress display
  • Smart completion detection: monitors Power BI Desktop UI state (title bar text) + CPU-idle fallback for files that don't update the title
  • Resume-from-log: skips files already refreshed in a previous run; configurable concurrency limits to avoid memory exhaustion
PowerShell · WinForms · Windows Process API
150+ .pbix files · single-click automated bulk refresh with audit log
26 · Web Scraping

ASHA Pro Finder Async Scraper

Districts needed to identify California SLP and audiologist contractors from the ASHA directory — no bulk export existed, requiring individual profile page visits for each of thousands of providers.

  • Phase 1: async Playwright paginates the Coveo search API with POST requests (up to 6,000 profiles), extracting IDs and metadata via BeautifulSoup regex matching on inconsistent HTML attributes
  • Phase 2: async profile scraping extracts 18 fields (name, certifications, phone, address, specialty, education) per page
  • Per-record error isolation: failed pages log a warning and continue — the full run never aborts on a single bad page
  • Results streamed to UTF-8-sig CSV incrementally; configurable concurrency limits to avoid rate limiting
asyncio · Playwright · BeautifulSoup · Python · pandas
~6,000 SLP & audiologist profiles · 18 fields per record · complete California directory
Business Intelligence & Reporting
27 · Business Intelligence

Power BI Reporting Suite

25+ districts needed consistent, self-service reporting across 9 analytics domains — building and maintaining dashboards manually for each district was not scalable.

  • 9 dashboard types × 20+ districts = 150+ production .pbix files: iTAAP main, Mini, EAP, IEP Compliance, SPP, Student Service, Service Projection, SST/504, SchoolCard
  • Template-based deployment: changes propagate to all districts by updating the template and redistributing
  • All dashboards connect live to SQL Server; automated refresh via PowerShell automation after each pipeline run
Power BI Desktop · SQL Server (live) · DAX · Python · PowerShell
150+ production dashboards · 25+ districts · 9 dashboard types · weekly automated refresh
28 · SQL / Analytics

SQL Server Stored Procedure Library

Complex compliance metrics needed consistent calculation logic across 94 district databases — ad-hoc queries per district produced inconsistent results that broke reporting.

  • 14 SP types replicated per district: chronic absenteeism with day-multiplier projection, IDEA Indicators 9–10 disproportionality (age filtering, ethnicity normalization, LRE bucketing, ~320KB), LRE calculation, geospatial comparable schools (geography data type), ELPAC, graduation cohorts, CCR
  • Window functions, dynamic SQL, statistical mode calculation for ethnicity majority determination
  • Geospatial comparable schools SP uses SQL Server geography type for radius-based school matching
T-SQL · SQL Server · geography data type
~94 district SP instances · powers all Power BI dashboards · consistent compliance metrics
29 · Observability

Data Freshness Tracker

Power BI dashboard consumers had no way to know if they were looking at fresh or stale data — two silent pipeline failures went unnoticed until decisions were already made.

  • Python monitors file timestamps for all pipeline output directories using pathlib
  • Upserts last-modified timestamp to SQL Server DataDateTime table after each successful pipeline stage
  • Power BI reads the DataDateTime table live — each dashboard shows a "last updated" indicator in the corner
Python · pathlib · pandas · SQLAlchemy
Caught 2 silent pipeline failures before decisions were made on stale data
30 · Tooling

Power BI File Data Extractor

Power BI .pbix files are opaque ZIP containers — auditing embedded data models or dashboard structure required opening each file in Power BI Desktop manually, one at a time.

  • pbix_extractor.py treats .pbix as ZIP archive and uses pbixray to extract all embedded data model tables to CSV for programmatic audit
  • pbix_analyzer.py reads the Report/Layout JSON file inside .pbix — encoded as UTF-16-LE with a binary preamble — to extract visual structure, DAX measures, and report metadata
  • CLI design: directory-level batch processing across all .pbix files in a folder
Python · pbixray · zipfile · pandas
Programmatic .pbix audit without Power BI Desktop · supports 150+ file reporting suite
Software Engineering
31 · Go / Systems

SBCUSD Aeries API Go Client

12 Aeries SIS dataset types needed to be ingested into MongoDB for analytics — high-volume iteration over all schools required reliable per-entity error recovery that Python threading made complex.

  • Go CLI with `-fetch=all` and comma-separated selective mode; `flag` package with `regexp` validation of year format
  • Dynamic high school discovery: queries the loaded schools collection for `HighGradeLevel ≥ 9` — no hardcoded school lists, no maintenance when schools change
  • Per-school error recovery with idiomatic Go `(value, error)` returns — a single API failure logs and continues without aborting the full run
  • `init()` guard validates config placeholders before any network call; MongoDB ping confirms connectivity at startup
  • Idempotent `clearCollection` + `InsertMany` pattern safe to rerun without duplicate data
Go · MongoDB · net/http · go.mongodb.org/mongo-driver · encoding/json
12 dataset types · dynamic school classification · zero maintenance · compiled single binary
32 · Web App

California School Radar Map

Administrators had no interactive way to find and compare nearby schools on multiple performance indicators geographically — static reports required a GIS analyst to produce and were instantly outdated.

  • Streamlit app backed by SQL Server CTE view joining 12 indicators (CAASPP, ELPAC, chronic absenteeism, suspension, graduation, CCR) across 10,000+ schools
  • Vectorized Haversine distance search using NumPy broadcasting: sub-second proximity filtering on the full dataset without a spatial index
  • UI: zip-code geocoding jump, click-to-move map centering, dynamic indicator selection from sidebar, Streamlit session state preserving position/selections across interactions
  • Radar chart renders any two selected schools side-by-side across all 6 performance indicators for direct comparison
Python · Streamlit · Plotly Mapbox · NumPy · pandas · SQL Server
10,000+ California schools mapped · live proximity filtering · sub-second performance
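The vectorized Haversine search can be sketched with NumPy broadcasting: one origin against arrays of every school's latitude and longitude, computed in a single pass with no Python loop. Coordinates below are hypothetical examples:

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8

def haversine_miles_vec(origin, lats, lons):
    """Distance in miles from one origin to every school at once.

    lats/lons are NumPy arrays; broadcasting computes all distances in
    a single vectorized expression, which is why the proximity filter
    stays sub-second even over 10,000+ rows.
    """
    lat0, lon0 = np.radians(origin[0]), np.radians(origin[1])
    lat, lon = np.radians(lats), np.radians(lons)
    a = (np.sin((lat - lat0) / 2) ** 2
         + np.cos(lat0) * np.cos(lat) * np.sin((lon - lon0) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# Origin in San Diego; one nearby school and one in Los Angeles
lats = np.array([32.80, 34.05])
lons = np.array([-117.10, -118.24])
dist = haversine_miles_vec((32.7157, -117.1611), lats, lons)
within_30 = dist <= 30.0
```

The boolean mask `within_30` then indexes straight back into the school DataFrame, so no spatial index is needed for interactive filtering.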