Rui Bian | Senior Applied Data Scientist & AI Analytics Engineer

System Capabilities

ML Systems

Predictive Modeling · Classification · Anomaly Detection · Model Evaluation & Monitoring · Feature Engineering · SHAP Explainability · scikit-learn · XGBoost · ARIMA / Holt-Winters · Prophet · PyTorch · MLflow · Model Operationalization

Data Engineering

Batch & Streaming Pipelines · ETL Design · Data Quality Systems · SQL Server / T-SQL · Apache Spark · Microsoft Fabric · pandas / numpy · SQLAlchemy · Playwright · Selenium · asyncio · SFTP Automation

Production Systems

FastAPI / Flask · REST APIs · LangGraph · Ollama (Local LLM) · OpenAI API · Gemini API · Docker · CI/CD · Azure · AWS · MongoDB · Multi-Tenant Architecture

Languages & Tools

Python · Go (Golang) · SQL / T-SQL · PowerShell · Power BI / DAX · Streamlit · Plotly · Mapbox · Git

Selected Production ML Systems

End-to-end production systems covering ML design, pipeline architecture, API deployment, and operational monitoring, built for a platform serving 30+ California school districts.

01 AI / Orchestration

Agentic ML Pipeline Orchestration System

Problem: Multi-step data pipelines had to be run manually, steps were sometimes missed, and failures came with no diagnosis.

Design: LangGraph state machine with gate-node conditional routing and 4-stage fan-out parallelism; chose local Ollama over a cloud LLM so that student records never leave the server, eliminating compliance risk at the architecture level; per-district fault isolation at the node level, so a single failure triggers LLM diagnosis without cascading; multi-tenant configuration eliminates per-district code changes.

LangGraph · Ollama · Python · subprocess · Tkinter

↓ 75–90% runtime · 30+ districts · 100K+ records/run · node-level fault isolation · zero data egress
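The per-district fault isolation described above can be illustrated with the standard library alone. This is a minimal sketch, not the LangGraph implementation: it uses a thread pool with a per-task exception boundary in place of graph nodes, and `run_district` and `diagnose` are hypothetical stand-ins for a pipeline stage and the local LLM diagnosis step.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_district(district: str) -> int:
    """Hypothetical pipeline step: returns the number of records processed."""
    if district == "district_b":  # simulate one failing tenant
        raise RuntimeError("upstream export missing")
    return 1000


def diagnose(district: str, error: Exception) -> str:
    """Hypothetical stand-in for the local LLM failure diagnosis."""
    return f"{district}: {type(error).__name__}: {error}"


def run_all(districts: list[str], max_workers: int = 4) -> tuple[dict, dict]:
    """Fan out one task per district; a failure is recorded, never re-raised."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_district, d): d for d in districts}
        for future in as_completed(futures):
            district = futures[future]
            try:
                results[district] = future.result()
            except Exception as exc:  # isolate the failure to this district
                failures[district] = diagnose(district, exc)
    return results, failures


results, failures = run_all(["district_a", "district_b", "district_c"])
```

In this sketch the two healthy districts complete normally while the failing one ends up in `failures` with a diagnosis string, so the run as a whole is never aborted.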

02 Machine Learning

Multi-Domain Risk Prediction & Decision-Support System

Problem: Districts had no early visibility into student risk across academic, attendance, and behavioral domains.

Design: Chose per-district model selection over a single global model to account for demographic and policy variation across districts; a dual-API fallback chain (GPT-4 → Gemini → structured JSON) degrades gracefully, so report generation survives any single API outage; idempotent batch scoring over the SQL Server warehouse.

scikit-learn · ARIMA · Holt-Winters · OpenAI API · Gemini API

20+ districts · 6 concurrent domains · 3-layer graceful degradation · idempotent reruns
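The three-layer fallback chain can be sketched as an ordered list of providers with a structured-JSON last resort. The provider functions below are hypothetical stand-ins that simulate outages; no real OpenAI or Gemini client calls are shown, and the payload shape is an assumption.

```python
import json
from typing import Callable


def call_primary(prompt: str) -> str:
    """Hypothetical stand-in for the first LLM API; simulates an outage."""
    raise ConnectionError("primary API unavailable")


def call_secondary(prompt: str) -> str:
    """Hypothetical stand-in for the second LLM API; simulates an outage."""
    raise TimeoutError("secondary API timed out")


def structured_fallback(metrics: dict) -> str:
    """Last layer: no narrative, just the metrics as structured JSON."""
    return json.dumps({"narrative": None, "metrics": metrics}, sort_keys=True)


def generate_report(
    metrics: dict, providers: list[Callable[[str], str]]
) -> tuple[str, str]:
    """Try each provider in order; fall back to structured JSON if all fail."""
    prompt = f"Summarize these district metrics: {json.dumps(metrics, sort_keys=True)}"
    for provider in providers:
        try:
            return provider.__name__, provider(prompt)
        except Exception:
            continue  # degrade to the next layer
    return "structured_fallback", structured_fallback(metrics)


source, report = generate_report(
    {"chronic_absenteeism": 0.12}, [call_primary, call_secondary]
)
```

With both simulated providers down, `source` is `"structured_fallback"` and `report` still carries the metrics, which is the behavior the "survives any single API outage" claim relies on.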

03 Compliance Automation

Automated Compliance Monitoring & Risk Scoring System

Problem: 21 districts tracked federal IEP deadlines manually in spreadsheets, with no automated risk scoring or visibility.

Design: 4-stage pipeline with per-district fault isolation: SQL extraction → MFA-aware Playwright → PDF date parsing → IDEA risk scoring; chose computable rule encoding over heuristic date logic, with federal thresholds (365-day annual, 3-year triennial, 60-day assessment) enforced as auditable first-class rules; a single-district failure never aborts a multi-district run.

Playwright · PDF parsing · SQL Server · Power BI

21 districts · 0 missed IEP deadlines · computable-rule compliance · district-level fault isolation
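One way to encode the deadline windows as explicit, auditable rules is shown below. The window lengths come from the text; the rule names, the approximation of three years as 3 × 365 days, and the color cutoffs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DeadlineRule:
    name: str
    window_days: int  # days allowed after the anchor date


# Windows taken from the text; 3 years is approximated as 3 * 365 days here.
RULES = {
    "annual_review": DeadlineRule("annual_review", 365),
    "triennial_reevaluation": DeadlineRule("triennial_reevaluation", 3 * 365),
    "initial_assessment": DeadlineRule("initial_assessment", 60),
}


def due_date(rule: DeadlineRule, anchor: date) -> date:
    """Deadline = anchor date plus the rule's window."""
    return anchor + timedelta(days=rule.window_days)


def risk_level(days_remaining: int) -> str:
    """Hypothetical cutoffs; the real thresholds are not given in the text."""
    if days_remaining < 0:
        return "red"
    if days_remaining <= 14:
        return "orange"
    if days_remaining <= 30:
        return "yellow"
    return "green"


def score(rule_name: str, anchor: date, today: date) -> tuple[date, str]:
    """Return (due date, risk color) for one rule and one anchor date."""
    rule = RULES[rule_name]
    due = due_date(rule, anchor)
    return due, risk_level((due - today).days)


due, level = score("annual_review", anchor=date(2024, 3, 1), today=date(2025, 2, 20))
```

Because each rule is data rather than scattered date arithmetic, the table of windows can be reviewed and changed in one place.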

04 Visualization

Geospatial School Analytics & Search Platform

Problem: There was no interactive way to explore and compare school performance geographically across California at scale.

Design: Chose vectorized NumPy Haversine over SQL-side distance queries, moving compute to the application layer for predictable sub-second response at 10K+ scale; progressive filtering enforces a correctness invariant: “no data” is explicitly distinguished from “underperforming”, which prevents false negatives in sparse-data regions; radar chart across 6 indicators.

Streamlit · Plotly Mapbox · NumPy · SQL Server

10K+ schools · sub-second search · correctness-first design · no false negatives on sparse data
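The distance search and the "no data" invariant can be sketched as follows. This is a plain-Python, scalar version of the Haversine formula, swapped in so the example needs only the standard library; the system described above applies the same formula to whole coordinate arrays with NumPy. The sample schools, scores, and threshold are hypothetical.

```python
import math
from typing import Optional

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def classify(score: Optional[float], threshold: float) -> str:
    """Keep missing data as its own state instead of treating it as a low score."""
    if score is None:
        return "no data"
    return "underperforming" if score < threshold else "meets threshold"


# Hypothetical sample rows: (name, lat, lon, score or None).
SCHOOLS = [
    ("School A", 34.0522, -118.2437, 0.80),
    ("School B", 34.1478, -118.1445, None),
    ("School C", 37.7749, -122.4194, 0.30),
]


def search(lat: float, lon: float, radius_km: float, threshold: float) -> list[tuple[str, str]]:
    """Return (name, status) for schools within the radius, nearest first."""
    hits = []
    for name, s_lat, s_lon, score in SCHOOLS:
        distance = haversine_km(lat, lon, s_lat, s_lon)
        if distance <= radius_km:
            hits.append((distance, name, classify(score, threshold)))
    return [(name, status) for _, name, status in sorted(hits)]


nearby = search(34.0522, -118.2437, radius_km=25.0, threshold=0.5)
```

The school with a missing score comes back as "no data" rather than being silently ranked as underperforming or dropped from the results.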

05 Data Engineering

Hybrid API/Automation Government Data Ingestion Pipeline

Problem: 30+ districts required 45+ minutes of manual portal navigation per cycle to retrieve compliance reports.

Design: Chose dual-mode ingestion over Selenium-only: REST API routes provide reliable structured ingestion where available, and Selenium is reserved for portal-locked flows, minimizing the automation surface area; automatic LEA-to-credential routing handles 30+ districts without per-district code changes; idempotent truncate-reload prevents state corruption after partial failures.

Python · REST API · Selenium · SQLAlchemy

30+ districts · 7 report types · minimal Selenium surface area · idempotent truncate-reload
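The dual-mode routing can be pictured as a lookup against per-LEA configuration. In this sketch the registry contents, LEA codes, report type names, and environment variable names are all hypothetical; the point is only that adding a district means adding a config entry, not code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeaConfig:
    lea_code: str
    credential_env: str     # name of the env var holding the secret, not the secret
    api_reports: frozenset  # report types available over REST for this LEA


# Hypothetical registry; in practice this would be loaded from a config file.
REGISTRY = {
    "1234567": LeaConfig("1234567", "LEA_1234567_TOKEN", frozenset({"enrollment", "attendance"})),
    "7654321": LeaConfig("7654321", "LEA_7654321_TOKEN", frozenset()),
}


def plan_ingestion(lea_code: str, report_type: str) -> tuple[str, str]:
    """Return (mode, credential_env): REST when available, browser automation otherwise."""
    config = REGISTRY[lea_code]
    mode = "rest_api" if report_type in config.api_reports else "selenium"
    return mode, config.credential_env


plan_a = plan_ingestion("1234567", "enrollment")
plan_b = plan_ingestion("7654321", "enrollment")
```

Here the first LEA is routed to the REST path and the second falls back to browser automation, each with its own credential reference.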

06 Go / APIs

High-Reliability Concurrent SIS Data Ingestion Service

Problem: 12 SIS dataset types needed reliable automated ingestion into MongoDB, with no single-school failure aborting the full run.

Design: Chose Go over Python because goroutines provide lower memory overhead than asyncio for high-concurrency API fan-out; per-goroutine error isolation: individual timeouts and auth failures are logged and skipped without blocking concurrent streams; retry semantics scoped at school granularity, not dataset level; dynamic school discovery eliminates hardcoded configuration.

Go · MongoDB · REST API · net/http

12 dataset types · goroutine fan-out · per-school fault isolation · retry at school granularity

Experience

2023 — Present

Lead Data Scientist

Expatiate Communications · Pasadena, CA

Sole architect and owner of 32 production systems serving 30+ school districts; standardized district onboarding around configuration files, reducing new-district setup from weeks of engineering work to hours of configuration
Built LangGraph agentic orchestration achieving 75–90% pipeline runtime reduction; designed 6-domain outcome prediction with per-district model selection; enabled compliance teams to run multi-district risk pipelines without engineering support
Deployed operator-first platform on Microsoft Fabric: non-technical staff run complex compliance workflows via GUI launchers; automated Gmail API reporting eliminated manual weekly reporting burden across 18 school sites

2017 — 2022

Ph.D. Researcher — Computer Engineering

University of Delaware · Newark, DE

Built internet-scale data collection systems using Python + AWS, processing millions of network probes for BGP routing and proxy ecosystem analysis
Published at IEEE INFOCOM 2024, Elsevier Computer Networks 2022, and ACM SIGCOMM CCR 2019; GPA 3.96 / 4.0
TPC member and reviewer: IEEE INFOCOM, DSN, TNSE, Computer Networks

Prior Experience

Engineering & Research Roles

7+ years across research and industry positions

Broad engineering background spanning systems, data, and applied research prior to doctoral studies
B.S. in Engineering — University of Science and Technology of China (USTC)

Platform Design Principles

Three non-negotiable engineering constraints applied across all 32 production systems — not policies, but structural guarantees.

Fault Isolation by Default

Every system isolates failures at the tenant boundary. One school or district failing never cascades to others — enforced at the goroutine or node level, not by try/catch wrapping.

Per-district error recovery node in LangGraph pipelines — failure triggers LLM diagnosis, not pipeline abort
Per-goroutine isolation in Go SIS Service — individual timeouts logged and skipped without blocking concurrent streams
Single MFA failure never aborts IEP Compliance Pipeline across 21 districts
Pre-flight Extract_checker validates all inputs before any database write

Idempotent Operations

All ETL pipelines use truncate-reload semantics — never append-only. Any pipeline is safe to rerun after partial failure without data corruption, duplication, or manual cleanup.

Truncate-reload in CALPADS Pipeline — consistent state guaranteed regardless of failure point
Idempotent batch scoring in Risk Prediction System — results reproducible given same input snapshot
30+ districts on shared infrastructure with strict tenant data isolation
Test mode in every user-facing system — no accidental production sends or overwrites
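Truncate-reload can be sketched with an in-memory SQLite database standing in for the warehouse. The table name and columns are hypothetical, and SQLite's `DELETE` is used because SQLite has no `TRUNCATE` statement; the idea is that delete and insert share one transaction, so a failed load leaves the previous snapshot intact.

```python
import sqlite3


def truncate_reload(conn: sqlite3.Connection, table: str, rows: list[tuple]) -> None:
    """Replace the table contents atomically: delete everything, then insert the snapshot."""
    # `table` is a trusted internal name here; never interpolate user input into SQL.
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(f"DELETE FROM {table}")  # SQLite has no TRUNCATE statement
        conn.executemany(f"INSERT INTO {table} (student_id, value) VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (student_id INTEGER PRIMARY KEY, value TEXT NOT NULL)")

snapshot = [(1, "a"), (2, "b")]
truncate_reload(conn, "staging", snapshot)
truncate_reload(conn, "staging", snapshot)  # rerun: same end state, no duplicates

try:
    # The second row violates NOT NULL, so the whole load fails midway.
    truncate_reload(conn, "staging", [(3, "c"), (4, None)])
except sqlite3.IntegrityError:
    pass  # the failed load is rolled back, including its DELETE

rows = conn.execute("SELECT student_id, value FROM staging ORDER BY student_id").fetchall()
```

After two successful runs and one failed run, the table still holds exactly the last good snapshot, which is what makes a rerun safe.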

Privacy by Architecture

AI inference on student data runs locally via Ollama — a structural constraint, not a configuration option. Eliminates PII exposure risk regardless of downstream code changes or vendor policy.

Local Ollama qwen2.5:7b for failure analysis — zero student data sent to cloud LLM APIs
No student PII in any external API payload — enforced by design, not by access control policy
Multi-tenant architecture: district A cannot access district B data by construction
Structured audit log maintained locally for all pipeline runs

Engineering Philosophy

Reliability is a product feature. Systems handling public-sector student data must remain correct under partial failure, degraded upstream inputs, and operational retries. The design approach prioritizes isolation boundaries, deterministic recovery semantics, and operational transparency, not as optimizations but as first-class requirements that shape every architectural decision from the start.

Systems scale humans, not just compute. Every system I build is operated by non-technical staff: compliance coordinators, district program managers, and school administrators. Architectural decisions account for the human operating layer: test modes before production sends, color-coded risk signals instead of raw scores, and single-command automation for workflows that previously required engineering involvement. The measure of leverage is not pipeline throughput but whether the systems enable people who couldn't do the work before to do it reliably now.

High-Impact Engineering

Transforming EdTech Intelligence at Scale

Expatiate Communications · Lead Data Scientist

The Challenge

School districts lacked predictive visibility into IEP compliance, academic progress, and operational risk — relying on slow, fragmented manual data aggregation across disparate assessment platforms.

Architecture

Designed a LangGraph agentic pipeline orchestrator with gate-node conditional routing, fan-out parallelism across 4 concurrent stages, and a local Ollama LLM (qwen2.5:7b) that analyzes failures and suggests fixes — no student data leaves the server
Built a 6-domain ML prediction system (CAASPP ELA/Math, ELPAC, Chronic Absenteeism, College/Career, Suspension) using scikit-learn, ARIMA, and Holt-Winters with per-district model selection; integrated OpenAI + Gemini APIs for plain-language administrative narratives
Automated IEP compliance tracking across 21 districts: Playwright PDF download with MFA handling → PDF date parsing → green/yellow/orange/red deadline risk scoring → Power BI dashboards; zero missed IEP deadlines after deployment
Engineered 32 production tools covering ETL, compliance reporting, browser automation (Playwright + Selenium), async web scraping, SFTP delivery, a Go REST API client to MongoDB, and 150+ Power BI dashboards across 9 dashboard types
Built automated Gmail API email alert system delivering data-driven weekly performance summaries to 18 school sites with dynamic metric selection from 12 indicators

Business Impact

Platform deployed across 30+ California school districts. LangGraph agentic automation achieved a 75–90% reduction in pipeline processing time. Replaced dozens of hours of weekly manual work — data collection, compliance tracking, report generation, and stakeholder communication — with single-command automated pipelines.

Internet-Scale Transparent Proxy Analysis

University of Delaware · Ph.D. Research · IEEE INFOCOM 2024

The Challenge

Transparent proxies silently intercept and modify web traffic without user awareness, but their true prevalence, behavior, and network-wide impact were poorly understood at scale.

Methodology

Designed a large-scale active measurement system to detect and fingerprint transparent proxies across global internet paths
Built Python-based data collection and analysis pipelines processing millions of network probes
Developed novel detection heuristics combining HTTP header analysis and TCP-level signals

Academic Impact

Published at IEEE INFOCOM 2024 — one of the top-ranked venues in computer networking, revealing the significant hidden influence of transparent proxies on internet traffic integrity.

Mapping the Open Proxy Ecosystem

University of Delaware · Ph.D. Research · Computer Networks 2022

The Challenge

The open proxy landscape — used for anonymization, censorship circumvention, and malicious activity — had never been comprehensively characterized in terms of scale, geography, and behavior.

Methodology

Crawled, scanned, and analyzed 436,000+ open proxies across the global internet
Built large-scale data collection infrastructure using Python and AWS for distributed scanning
Applied statistical modeling and traffic analysis to characterize proxy behavior, uptime, and abuse patterns

Academic Impact

Published in Computer Networks (Elsevier), 2022 — delivering the first comprehensive analysis of the open proxy ecosystem and its security implications at internet scale.

Anycast Routing & Remote Peering Effects

University of Delaware · Ph.D. Research · ACM SIGCOMM CCR 2019

The Challenge

Remote peering in BGP networks was known to distort anycast routing decisions, but the extent of this unintended impact on global traffic distribution — including for major cloud providers — had not been passively quantified.

Methodology

Developed a passive BGP measurement methodology to infer anycast catchment boundaries without active probing
Analyzed global BGP routing tables and AS-path data across hundreds of vantage points
Correlated routing anomalies with remote peering relationships at internet exchange points (IXPs)

Academic Impact

Published in ACM SIGCOMM Computer Communication Review, 2019 — a flagship networking venue — establishing foundational methodology for passive anycast analysis used in subsequent internet measurement research.

In Progress

AI Cloaking & Content Differentiation on the Open Web

Independent Research · Target: IMC / WWW / USENIX Security

The Challenge

As AI crawlers become ubiquitous, websites are moving beyond binary blocking (robots.txt) to a more sophisticated, unmeasured tactic: returning HTTP 200 responses to both humans and AI bots, but serving degraded, watermarked, or "poisoned" content specifically to crawlers like GPTBot.

Methodology

Twin-crawler framework (Playwright) visiting Tranco Top 10,000 domains — once as a standard browser UA, once as GPTBot
DOM tree structural comparison and text similarity scoring via Jaccard & TF-IDF cosine similarity
Sector-level taxonomy: paywall injection, text truncation, gibberish poisoning, visual watermarking
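The two text similarity scores named above can be computed with the standard library. This is a sketch under stated assumptions: the tokenizer is a simple lowercase alphanumeric split, the TF-IDF weighting uses a smoothed IDF (the text does not specify one), and the sample page texts are invented.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a: str, b: str) -> float:
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """One sparse TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF so terms shared by every document keep a non-zero weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in tokenized]


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical responses for the same URL under two user agents.
human_view = "Full article text about school funding and policy"
bot_view = "Full article text"

jaccard_score = jaccard(human_view, bot_view)
vec_human, vec_bot = tfidf_vectors([human_view, bot_view])
cosine_score = cosine(vec_human, vec_bot)
```

A truncated bot response scores well below 1.0 on both measures, while identical responses score 1.0, so a threshold on either score can flag candidate pages for manual review.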

Novelty

Unlike prior work measuring blocking, this measures deception — filling a critical gap in understanding how the web's content landscape diverges between human and AI readers.

In Progress

LLM-Hallucinated Infrastructure Domains as an Attack Surface

Independent Research · Target: NDSS / CCS / USENIX Security

The Challenge

LLMs are widely used to generate Infrastructure-as-Code (Terraform, Kubernetes YAML, Nginx configs). If a model hallucinates a plausible but unregistered domain endpoint, an attacker could register that domain to intercept live API traffic or credentials from deployed systems.

Methodology

1,000+ DevOps-focused prompts submitted to GPT-4o, Claude 3.5 Sonnet, and Llama-3-70B
Regex extraction of all generated domains, filtered against known public registries
DNS resolution + Registrar API queries to quantify hallucination rate and live registrability of phantom endpoints
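The extraction and filtering steps can be sketched as below. The regular expression, the allowlist, and the sample generated config are all assumptions; DNS resolution and registrar lookups are left out because they need network access. A pattern this loose will also match dotted file names such as `main.tf`, so a real pipeline would need an extra filter against valid TLDs.

```python
import re

# Loose hostname pattern: dot-separated labels followed by an alphabetic TLD.
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b",
    re.IGNORECASE,
)

# Hypothetical allowlist standing in for "known public registries".
KNOWN_DOMAINS = {"amazonaws.com", "docker.io"}


def extract_domains(text: str) -> list[str]:
    """Return the unique, lowercased domains found in the text, sorted."""
    return sorted({m.group(0).lower() for m in DOMAIN_RE.finditer(text)})


def is_known(domain: str, known: set[str]) -> bool:
    """True if the domain is a known domain or a subdomain of one."""
    return any(domain == k or domain.endswith("." + k) for k in known)


def candidate_phantoms(text: str, known: set[str] = KNOWN_DOMAINS) -> list[str]:
    """Domains that survive the allowlist and would go on to DNS checks."""
    return [d for d in extract_domains(text) if not is_known(d, known)]


# Hypothetical model output mixing real and made-up endpoints.
generated = '''
resource "aws_s3_bucket" "logs" { bucket = "logs.s3.amazonaws.com" }
image: docker.io/library/nginx
endpoint = "https://api.metrics-collector.example/v1/ingest"
'''

found = extract_domains(generated)
phantoms = candidate_phantoms(generated)
```

Only the endpoint outside the allowlist remains in `phantoms`; in the methodology above, that list is what would then be checked for DNS resolution and registrability.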

Novelty

Distinct from package hallucination studies — this targets DNS-level infrastructure interception, a critical supply chain risk not previously measured in the LLM security literature.

Professional Credentials

DataCamp

4 Active Certifications · Issued 2026 · Valid through 2028

AI Engineer for Developers Associate

AI Engineer for Data Scientists Associate

Data Scientist Associate

Data Engineer Associate

Google Cybersecurity Professional Certificate

Coursera · Issued Aug 2023

Thought Leadership

Selected Publications & Patents

Silent Observers Make a Difference: A Large-scale Analysis of Transparent Proxies on the Internet.

Rui Bian et al. | IEEE INFOCOM, 2024.

Shining a Light on Dark Places: A Comprehensive Analysis of Open Proxy Ecosystem.

Rui Bian et al. | Computer Networks, 2022.

Towards Passive Analysis of Anycast in Global Routing: Unintended Impact of Remote Peering.

Rui Bian et al. | ACM SIGCOMM CCR, 2019.

Patent: Manufacturing method of micro lens / 一种微透镜的制作方法.

Gang Liu, Ying Xiong, Rui Bian et al. | CN104614936B.

Academic Service

Extensive peer review contributions ensuring the integrity and quality of high-tier network science and security venues.

Key TPC / Reviewer Roles:

IEEE INFOCOM ('17, '18, '19, '20, '21)
IEEE/IFIP DSN ('19, '21, '22)
IEEE Transactions on Network Science and Engineering (TNSE)
Computer Networks
IEEE ITEC, IEEE RTC, IEEE SmartSys

Let's Connect

Open to Senior Data Scientist, AI/ML Engineer, and leadership roles. Based in Los Angeles — open to hybrid and remote.

Email LinkedIn GitHub

Resume
