Open to Opportunities

Senior ML Engineer & Data Scientist

Building scalable ML systems for high-volume, real-world decisioning problems

Ph.D. Computer Engineering · Cybersecurity & Distributed Systems · Large-Scale Data Processing

  • Designed and deployed production ML systems serving 30+ school districts — risk scoring, compliance automation, predictive analytics
  • End-to-end systems ownership: batch & real-time pipelines · model operationalization · AWS / Azure · Python · SQL Server
  • Strong focus on model reliability, system scalability, and production monitoring · IEEE INFOCOM · ACM SIGCOMM

30+ School Districts · 150+ Power BI Dashboards

Rui Bian, PhD — AI & Data Science Leader
32 Systems Shipped
30+ School Districts Served
150+ Power BI Dashboards
18 School Sites Automated
10K+ California Schools Mapped
75–90% Pipeline Time Saved

System Capabilities

ML Systems

Predictive Modeling · Classification · Anomaly Detection · Model Evaluation & Monitoring · Feature Engineering · SHAP Explainability · scikit-learn · XGBoost · ARIMA / Holt-Winters · Prophet · PyTorch · MLflow · Model Operationalization

Data Engineering

Batch & Streaming Pipelines · ETL Design · Data Quality Systems · SQL Server / T-SQL · Apache Spark · Microsoft Fabric · pandas / NumPy · SQLAlchemy · Playwright · Selenium · asyncio · SFTP Automation

Production Systems

FastAPI / Flask · REST APIs · LangGraph · Ollama (Local LLM) · OpenAI API · Gemini API · Docker · CI/CD · Azure · AWS · MongoDB · Multi-Tenant Architecture

Languages & Tools

Python · Go (Golang) · SQL / T-SQL · PowerShell · Power BI / DAX · Streamlit · Plotly · Mapbox · Git

Selected Production ML Systems

End-to-end production systems covering ML design, pipeline architecture, API deployment, and operational monitoring, deployed on a platform that serves 30+ California school districts.

01 AI / Orchestration

Agentic ML Pipeline Orchestration System

Problem: Multi-step data pipelines required manual execution, were prone to missed steps, and offered no failure diagnosis.
Design: LangGraph state machine with gate-node conditional routing and 4-stage fan-out parallelism; chose local Ollama over a cloud LLM so that student records never leave the server, eliminating compliance risk at the architecture level; per-district fault isolation at the node level, so a single failure triggers LLM diagnosis without cascading; multi-tenant configuration eliminates per-district code changes.
LangGraph · Ollama · Python · subprocess · Tkinter
↓ 75–90% runtime · 30+ districts · 100K+ records/run · node-level fault isolation · zero data egress
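
The routing idea can be sketched without LangGraph itself. The following is a minimal, standard-library illustration under stated assumptions: the node names (`gate`, `extract`, `load`, `diagnose`), the state keys, and the `simulate_failure` flag are all hypothetical, the `diagnose` node is only a placeholder for the local LLM analysis, and the fan-out here is across districts with a thread pool rather than the 4-stage fan-out of the real pipeline.

```python
from concurrent.futures import ThreadPoolExecutor


def gate(state: dict) -> dict:
    # Gate node: decide whether this district should run at all.
    return {**state, "admitted": bool(state.get("source_path"))}


def extract(state: dict) -> dict:
    # Hypothetical stage; a real one would pull data for the district.
    if state.get("simulate_failure"):
        raise RuntimeError("extract timed out")
    return {**state, "rows": 100}


def load(state: dict) -> dict:
    # Hypothetical stage; records how many rows were written.
    return {**state, "loaded": state["rows"]}


def diagnose(state: dict) -> dict:
    # Placeholder for the local LLM failure analysis.
    return {**state, "diagnosis": f"{state['failed_node']}: {state['error']}"}


NODES = {"gate": gate, "extract": extract, "load": load, "diagnose": diagnose}


def route(node: str, state: dict) -> str:
    """Conditional routing: pick the next node from the current state."""
    if "error" in state and node != "diagnose":
        return "diagnose"
    if node == "gate":
        return "extract" if state["admitted"] else "end"
    return {"extract": "load", "load": "end", "diagnose": "end"}[node]


def run_district(state: dict) -> dict:
    """Walk one district through the graph; a failure becomes state, not a crash."""
    node = "gate"
    while node != "end":
        try:
            state = NODES[node](state)
        except Exception as exc:
            state = {**state, "error": str(exc), "failed_node": node}
        node = route(node, state)
    return state


def run_all(districts: list[dict], workers: int = 4) -> list[dict]:
    # Fan out: each district runs its own graph; map preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_district, districts))
```

Because each district carries its own state through the graph, a failing district ends at `diagnose` while the others finish normally.
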
02 Machine Learning

Multi-Domain Risk Prediction & Decision-Support System

Problem: Districts had no early visibility into student risk across academic, attendance, and behavioral domains.
Design: Chose per-district model selection over a global model to account for demographic and policy variation across districts; dual-API fallback chain (GPT-4 → Gemini → structured JSON) degrades gracefully, so report generation survives any single API outage; idempotent batch scoring over the SQL Server warehouse.
scikit-learn · ARIMA · Holt-Winters · OpenAI API · Gemini API
20+ districts · 6 concurrent domains · 3-layer graceful degradation · idempotent reruns
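
A minimal sketch of the three-layer fallback, assuming each LLM provider is wrapped in a plain callable that takes the scores and returns text. The function name `generate_report` and the shape of the returned dictionary are illustrative assumptions, not the production interface, and no real API client is called here.

```python
import json
from typing import Callable


def generate_report(
    scores: dict,
    providers: list[tuple[str, Callable[[dict], str]]],
) -> dict:
    """Try each provider in order; fall back to structured JSON if all fail."""
    errors = []
    for name, call in providers:
        try:
            text = call(scores)
            if text.strip():
                return {"source": name, "narrative": text, "errors": errors}
            errors.append(f"{name}: empty response")
        except Exception as exc:
            # Record the failure and move on to the next layer.
            errors.append(f"{name}: {exc}")
    # Final layer needs no network: emit the raw scores as structured JSON.
    return {
        "source": "structured_json",
        "narrative": json.dumps(scores, sort_keys=True),
        "errors": errors,
    }


# Hypothetical stand-ins for the two API wrappers.
def primary_llm(scores: dict) -> str:
    raise TimeoutError("primary API unavailable")


def secondary_llm(scores: dict) -> str:
    return f"Attendance risk is {scores['attendance']:.0%}."


report = generate_report(
    {"attendance": 0.25},
    [("primary", primary_llm), ("secondary", secondary_llm)],
)
```

The caller always receives a report plus the list of errors encountered, so an outage is visible without blocking delivery.
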
03 Compliance Automation

Automated Compliance Monitoring & Risk Scoring System

Problem: 21 districts tracked federal IEP deadlines manually in spreadsheets, with no automated risk scoring or visibility.
Design: 4-stage pipeline with per-district fault isolation: SQL extraction → MFA-aware Playwright → PDF date parsing → IDEA risk scoring; chose computable rule encoding over heuristic date logic, so federal thresholds (365-day annual, 3-year triennial, 60-day assessment) are enforced as auditable first-class rules; a single-district failure never aborts a multi-district run.
Playwright · PDF parsing · SQL Server · Power BI
21 districts · 0 missed IEP deadlines · computable-rule compliance · district-level fault isolation
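
One way to encode the deadlines as data rather than scattered date arithmetic is sketched below. The three windows come from the description above; the color cut-offs (14 and 30 days), the rule keys, and the approximation of three years as 1,095 days are assumptions made for illustration only.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DeadlineRule:
    name: str
    window_days: int


# Windows from the federal thresholds named above.
# Assumption: 3 years is approximated as 3 * 365 days (leap days ignored).
RULES = {
    "annual": DeadlineRule("Annual IEP review", 365),
    "triennial": DeadlineRule("Triennial reevaluation", 3 * 365),
    "assessment": DeadlineRule("Assessment timeline", 60),
}


def due_date(rule_key: str, anchor: date) -> date:
    """Deadline = anchor date (last review, referral, ...) plus the rule window."""
    return anchor + timedelta(days=RULES[rule_key].window_days)


def risk_level(rule_key: str, anchor: date, today: date) -> str:
    """Map days remaining to a color band; the cut-offs are hypothetical."""
    days_left = (due_date(rule_key, anchor) - today).days
    if days_left < 0:
        return "red"      # already overdue
    if days_left <= 14:
        return "orange"   # due within two weeks
    if days_left <= 30:
        return "yellow"   # due within a month
    return "green"
```

Keeping the windows in one table means an auditor can read the rules directly, and a threshold change is a one-line edit.
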
04 Visualization

Geospatial School Analytics & Search Platform

Problem: No interactive way to geographically explore and compare school performance across California at scale.
Design: Chose vectorized NumPy Haversine over SQL-side distance queries, moving compute to the application layer for predictable sub-second response at 10K+ scale; progressive filtering enforces a correctness invariant in which “no data” is explicitly distinguished from “underperforming”, preventing false negatives in sparse-data regions; radar chart across 6 indicators.
Streamlit · Plotly Mapbox · NumPy · SQL Server
10K+ schools · sub-second search · correctness-first design · no false negatives on sparse data
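
A vectorized Haversine search might look like the sketch below. It assumes school coordinates are already loaded into NumPy arrays of degrees; the function names, the 6,371 km mean Earth radius, and the nearest-first ordering are illustrative choices, not taken from the platform.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat: float, lon: float, lats: np.ndarray, lons: np.ndarray) -> np.ndarray:
    """Great-circle distance from one point to many points, in kilometres."""
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2, lon2 = np.radians(lats), np.radians(lons)
    a = (
        np.sin((lat2 - lat1) / 2.0) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    )
    # Clip guards against tiny floating-point overshoot outside [0, 1].
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def schools_within(lat: float, lon: float, lats: np.ndarray, lons: np.ndarray,
                   radius_km: float) -> np.ndarray:
    """Indices of schools inside the radius, nearest first."""
    d = haversine_km(lat, lon, lats, lons)
    idx = np.flatnonzero(d <= radius_km)
    return idx[np.argsort(d[idx])]
```

All distances are computed in a single array pass, which is what keeps the search cost flat as the number of schools grows into the thousands.
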
05 Data Engineering

Hybrid API/Automation Government Data Ingestion Pipeline

Problem: 30+ districts required 45+ minutes of manual portal navigation per cycle to retrieve compliance reports.
Design: Chose dual-mode over Selenium-only: REST API routes provide reliable structured ingestion where available, and Selenium is reserved for portal-locked flows, minimizing automation surface area; automatic LEA-to-credential routing handles 30+ districts without per-district code changes; idempotent truncate-reload prevents state corruption after partial failures.
Python · REST API · Selenium · SQLAlchemy
30+ districts · 7 report types · minimal Selenium surface area · idempotent truncate-reload
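
The truncate-reload pattern can be illustrated with the standard-library `sqlite3` module standing in for SQL Server and SQLAlchemy. The `reports` table and its columns are hypothetical, and SQLite uses `DELETE FROM` because it has no `TRUNCATE` statement.

```python
import sqlite3


def truncate_reload(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """Replace the table contents atomically: either all new rows or the old rows."""
    with conn:  # commits on success, rolls back if anything inside raises
        conn.execute("DELETE FROM reports")
        conn.executemany(
            "INSERT INTO reports (lea_code, report, value) VALUES (?, ?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports ("
    "lea_code TEXT NOT NULL, report TEXT NOT NULL, value INTEGER NOT NULL)"
)

batch = [("001", "enrollment", 10)]
truncate_reload(conn, batch)
truncate_reload(conn, batch)  # a rerun leaves exactly one copy, no duplicates

try:
    # The second row violates NOT NULL, so the whole reload is rolled back.
    truncate_reload(conn, [("002", "enrollment", 5), ("003", "enrollment", None)])
except sqlite3.IntegrityError:
    pass

rows_after = conn.execute("SELECT lea_code, report, value FROM reports").fetchall()
```

Wrapping the delete and the inserts in one transaction is what makes a mid-load failure harmless: the table keeps its previous contents until the new load fully succeeds.
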
06 Go / APIs

High-Reliability Concurrent SIS Data Ingestion Service

Problem: 12 SIS dataset types needed reliable automated ingestion into MongoDB, with no single-school failure causing a full-run abort.
Design: Chose Go over Python because goroutines provide lower memory overhead than asyncio for high-concurrency API fan-out; per-goroutine error isolation: individual timeouts and auth failures are logged and skipped without blocking concurrent streams; retry semantics scoped at school granularity, not dataset level; dynamic school discovery eliminates hardcoded configuration.
Go · MongoDB · REST API · net/http
12 dataset types · goroutine fan-out · per-school fault isolation · retry at school granularity
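
The service itself is written in Go. Purely to keep the sketches on this page in one language, the retry-at-school-granularity idea is shown below with Python's `asyncio`; the function names, the retry count, the timeout, and the `fake_fetch` stand-in are all assumptions rather than the service's actual interface.

```python
import asyncio


async def fetch_with_retry(school_id, fetch, attempts=3, timeout=5.0, backoff=0.0):
    """Retry one school independently; never raise into the caller."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            data = await asyncio.wait_for(fetch(school_id), timeout=timeout)
            return {"school": school_id, "ok": True, "data": data, "attempts": attempt}
        except Exception as exc:  # timeouts and auth errors are recorded, not fatal
            last_error = exc
            await asyncio.sleep(backoff * attempt)
    return {"school": school_id, "ok": False, "error": repr(last_error), "attempts": attempts}


async def ingest(school_ids, fetch, **retry_options):
    """Fan out over schools; one school's failure does not block the others."""
    tasks = (fetch_with_retry(s, fetch, **retry_options) for s in school_ids)
    return await asyncio.gather(*tasks)


# Hypothetical fetcher: s2 fails once and then succeeds, s3 always fails.
calls = {}


async def fake_fetch(school_id):
    calls[school_id] = calls.get(school_id, 0) + 1
    if school_id == "s2" and calls[school_id] < 2:
        raise ConnectionError("connection reset")
    if school_id == "s3":
        raise PermissionError("auth failed")
    return [school_id]


results = asyncio.run(ingest(["s1", "s2", "s3"], fake_fetch))
```

Scoping the retry loop to a single school means a flaky school costs only its own retries, while the rest of the run proceeds at full speed.
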

Experience

2023 — Present
Lead Data Scientist

Expatiate Communications · Pasadena, CA

  • Sole architect and owner of 32 production systems serving 30+ school districts; standardized district onboarding to configuration files — reduced new-district setup from weeks of engineering work to hours of configuration
  • Built LangGraph agentic orchestration achieving 75–90% pipeline runtime reduction; designed 6-domain outcome prediction with per-district model selection; enabled compliance teams to run multi-district risk pipelines without engineering support
  • Deployed operator-first platform on Microsoft Fabric: non-technical staff run complex compliance workflows via GUI launchers; automated Gmail API reporting eliminated manual weekly reporting burden across 18 school sites
2017 — 2022
Ph.D. Researcher — Computer Engineering

University of Delaware · Newark, DE

  • Built internet-scale data collection systems using Python + AWS, processing millions of network probes for BGP routing and proxy ecosystem analysis
  • Published at IEEE INFOCOM 2024, Elsevier Computer Networks 2022, ACM SIGCOMM CCR 2019 — GPA 3.96 / 4.0
  • TPC member and reviewer: IEEE INFOCOM, DSN, TNSE, Computer Networks
Prior Experience
Engineering & Research Roles

7+ years across research and industry positions

  • Broad engineering background spanning systems, data, and applied research prior to doctoral studies
  • B.S. in Engineering — University of Science and Technology of China (USTC)

Platform Design Principles

Three non-negotiable engineering constraints applied across all 32 production systems — not policies, but structural guarantees.

Fault Isolation by Default

Every system isolates failures at the tenant boundary. One school or district failing never cascades to others — enforced at the goroutine or node level, not by try/catch wrapping.

  • Per-district error recovery node in LangGraph pipelines — failure triggers LLM diagnosis, not pipeline abort
  • Per-goroutine isolation in Go SIS Service — individual timeouts logged and skipped without blocking concurrent streams
  • Single MFA failure never aborts IEP Compliance Pipeline across 21 districts
  • Pre-flight Extract_checker validates all inputs before any database write

Idempotent Operations

All ETL pipelines use truncate-reload semantics — never append-only. Any pipeline is safe to rerun after partial failure without data corruption, duplication, or manual cleanup.

  • Truncate-reload in CALPADS Pipeline — consistent state guaranteed regardless of failure point
  • Idempotent batch scoring in Risk Prediction System — results reproducible given same input snapshot
  • 30+ districts on shared infrastructure with strict tenant data isolation
  • Test mode in every user-facing system — no accidental production sends or overwrites

Privacy by Architecture

AI inference on student data runs locally via Ollama — a structural constraint, not a configuration option. Eliminates PII exposure risk regardless of downstream code changes or vendor policy.

  • Local Ollama qwen2.5:7b for failure analysis — zero student data sent to cloud LLM APIs
  • No student PII in any external API payload — enforced by design, not by access control policy
  • Multi-tenant architecture: district A cannot access district B data by construction
  • Structured audit log maintained locally for all pipeline runs

Engineering Philosophy

Reliability is a product feature. Systems handling public-sector student data must remain correct under partial failure, degraded upstream inputs, and operational retries. My design approach prioritizes isolation boundaries, deterministic recovery semantics, and operational transparency, not as optimizations but as first-class requirements that shape every architectural decision from the start.
Systems scale humans, not just compute. Every system I build is operated by non-technical staff: compliance coordinators, district program managers, school administrators. Architectural decisions account for the human operating layer: test modes before production sends, color-coded risk signals instead of raw scores, and single-command automation for workflows that previously required engineering involvement. The measure of leverage is not pipeline throughput but whether the systems enable people who could not do the work before to do it reliably now.

High-Impact Engineering

Transforming EdTech Intelligence at Scale

Expatiate Communications · Lead Data Scientist

The Challenge

School districts lacked predictive visibility into IEP compliance, academic progress, and operational risk — relying on slow, fragmented manual data aggregation across disparate assessment platforms.

Architecture

  • Designed a LangGraph agentic pipeline orchestrator with gate-node conditional routing, fan-out parallelism across 4 concurrent stages, and a local Ollama LLM (qwen2.5:7b) that analyzes failures and suggests fixes — no student data leaves the server
  • Built a 6-domain ML prediction system (CAASPP ELA/Math, ELPAC, Chronic Absenteeism, College/Career, Suspension) using scikit-learn, ARIMA, and Holt-Winters with per-district model selection; integrated OpenAI + Gemini APIs for plain-language administrative narratives
  • Automated IEP compliance tracking across 21 districts: Playwright PDF download with MFA handling → PDF date parsing → green/yellow/orange/red deadline risk scoring → Power BI dashboards; zero missed IEP deadlines after deployment
  • Engineered 32 production tools covering ETL, compliance reporting, browser automation (Playwright + Selenium), async web scraping, SFTP delivery, a Go REST API client to MongoDB, and 150+ Power BI dashboards across 9 dashboard types
  • Built automated Gmail API email alert system delivering data-driven weekly performance summaries to 18 school sites with dynamic metric selection from 12 indicators
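
Per-district model selection can be sketched as a holdout comparison. To stay dependency-free, the example below uses three simple forecasters (last value, historical mean, simple exponential smoothing) as stand-ins for the ARIMA and Holt-Winters models named above; the function names, the two-point holdout, and the sample series are all hypothetical.

```python
from statistics import mean


def naive(history, horizon):
    # Repeat the last observed value.
    return [history[-1]] * horizon


def average(history, horizon):
    # Repeat the historical mean.
    return [mean(history)] * horizon


def ses(history, horizon, alpha=0.5):
    # Simple exponential smoothing with a fixed smoothing factor.
    level = history[0]
    for y in history[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon


CANDIDATES = {"naive": naive, "average": average, "ses": ses}


def mae(actual, predicted):
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def select_model(series, holdout=2):
    """Fit on the early part, score on the held-out tail, keep the lowest MAE."""
    train, test = series[:-holdout], series[-holdout:]
    scores = {name: mae(test, f(train, holdout)) for name, f in CANDIDATES.items()}
    return min(scores, key=scores.get), scores


# Two hypothetical districts with different dynamics pick different models.
trending = [10, 12, 14, 16, 18, 20]
oscillating = [50, 70, 50, 70, 60, 60]
best_trending, _ = select_model(trending)
best_oscillating, _ = select_model(oscillating)
```

Running the selection separately for each district is what lets a steadily trending district and a noisy one end up with different models.
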

Business Impact

Platform deployed across 30+ California school districts. LangGraph agentic automation achieved a 75–90% reduction in pipeline processing time. Replaced dozens of hours of weekly manual work — data collection, compliance tracking, report generation, and stakeholder communication — with single-command automated pipelines.

Internet-Scale Transparent Proxy Analysis

University of Delaware · Ph.D. Research · IEEE INFOCOM 2024

The Challenge

Transparent proxies silently intercept and modify web traffic without user awareness, but their true prevalence, behavior, and network-wide impact were poorly understood at scale.

Methodology

  • Designed a large-scale active measurement system to detect and fingerprint transparent proxies across global internet paths
  • Built Python-based data collection and analysis pipelines processing millions of network probes
  • Developed novel detection heuristics combining HTTP header analysis and TCP-level signals

Academic Impact

Published at IEEE INFOCOM 2024 — one of the top-ranked venues in computer networking, revealing the significant hidden influence of transparent proxies on internet traffic integrity.

Mapping the Open Proxy Ecosystem

University of Delaware · Ph.D. Research · Computer Networks 2022

The Challenge

The open proxy landscape — used for anonymization, censorship circumvention, and malicious activity — had never been comprehensively characterized in terms of scale, geography, and behavior.

Methodology

  • Crawled, scanned, and analyzed 436,000+ open proxies across the global internet
  • Built large-scale data collection infrastructure using Python and AWS for distributed scanning
  • Applied statistical modeling and traffic analysis to characterize proxy behavior, uptime, and abuse patterns

Academic Impact

Published in Computer Networks (Elsevier), 2022 — delivering the first comprehensive analysis of the open proxy ecosystem and its security implications at internet scale.

Anycast Routing & Remote Peering Effects

University of Delaware · Ph.D. Research · ACM SIGCOMM CCR 2019

The Challenge

Remote peering in BGP networks was known to distort anycast routing decisions, but the extent of this unintended impact on global traffic distribution — including for major cloud providers — had not been passively quantified.

Methodology

  • Developed a passive BGP measurement methodology to infer anycast catchment boundaries without active probing
  • Analyzed global BGP routing tables and AS-path data across hundreds of vantage points
  • Correlated routing anomalies with remote peering relationships at internet exchange points (IXPs)

Academic Impact

Published in ACM SIGCOMM Computer Communication Review, 2019 — a flagship networking venue — establishing foundational methodology for passive anycast analysis used in subsequent internet measurement research.

In Progress

AI Cloaking & Content Differentiation on the Open Web

Independent Research · Target: IMC / WWW / USENIX Security

The Challenge

As AI crawlers become ubiquitous, websites are moving beyond binary blocking (robots.txt) to a more sophisticated, unmeasured tactic: returning HTTP 200 responses to both humans and AI bots, but serving degraded, watermarked, or "poisoned" content specifically to crawlers like GPTBot.

Methodology

  • Twin-crawler framework (Playwright) visiting Tranco Top 10,000 domains — once as a standard browser UA, once as GPTBot
  • DOM tree structural comparison and text similarity scoring via Jaccard & TF-IDF cosine similarity
  • Sector-level taxonomy: paywall injection, text truncation, gibberish poisoning, visual watermarking
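
The two text-similarity scores can be sketched in plain Python as follows. The tokenizer, the smoothed IDF weighting, and the function names are assumptions made for illustration; the study's actual scoring may differ, and DOM-tree comparison is not covered here.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    # Lowercased alphanumeric word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a: str, b: str) -> float:
    """Set overlap of tokens: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def tfidf_cosine(a: str, b: str, corpus: list[str]) -> float:
    """Cosine similarity of TF-IDF vectors, with IDF estimated from `corpus`."""
    docs = [tokens(d) for d in corpus]
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))

    def vec(text: str) -> dict:
        tf = Counter(tokens(text))
        # Smoothed IDF so unseen terms do not divide by zero.
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}

    va, vb = vec(a), vec(b)
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


# Hypothetical pair: the page as served to a browser and to a crawler.
human_view = "Full article text about school funding reform in California"
bot_view = "Full article text about school"
```

A low score on either measure for the same URL flags a pair of responses worth inspecting, even though both requests returned HTTP 200.
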

Novelty

Unlike prior work measuring blocking, this measures deception — filling a critical gap in understanding how the web's content landscape diverges between human and AI readers.

In Progress

LLM-Hallucinated Infrastructure Domains as an Attack Surface

Independent Research · Target: NDSS / CCS / USENIX Security

The Challenge

LLMs are widely used to generate Infrastructure-as-Code (Terraform, Kubernetes YAML, Nginx configs). If a model hallucinates a plausible but unregistered domain endpoint, an attacker could register that domain to intercept live API traffic or credentials from deployed systems.

Methodology

  • 1,000+ DevOps-focused prompts submitted to GPT-4o, Claude 3.5 Sonnet, and Llama-3-70B
  • Regex extraction of all generated domains, filtered against known public registries
  • DNS resolution + Registrar API queries to quantify hallucination rate and live registrability of phantom endpoints
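
The extraction and filtering steps can be sketched as below. The regular expression, the `KNOWN_SUFFIXES` allowlist, and the sample text are illustrative assumptions; a pattern this simple will also match things like file names and attribute paths, so a real study would filter against a public-suffix list. The registrar lookup is omitted, and `resolves` is shown only as one possible DNS check.

```python
import re
import socket

# Dot-separated labels followed by an alphabetic TLD; deliberately permissive.
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,24}\b",
    re.IGNORECASE,
)

# Hypothetical allowlist of well-known provider suffixes.
KNOWN_SUFFIXES = ("amazonaws.com", "googleapis.com", "github.com")


def extract_domains(text: str) -> list[str]:
    """All distinct domain-like strings in generated output, lowercased and sorted."""
    return sorted({m.group(0).lower() for m in DOMAIN_RE.finditer(text)})


def is_known(domain: str) -> bool:
    return any(domain == s or domain.endswith("." + s) for s in KNOWN_SUFFIXES)


def candidate_domains(text: str) -> list[str]:
    """Domains that are not on the allowlist and therefore need a DNS check."""
    return [d for d in extract_domains(text) if not is_known(d)]


def resolves(domain: str) -> bool:
    """True if the name currently resolves; requires network access."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except socket.gaierror:
        return False


# Sample output using reserved .example names, so nothing here is a real endpoint.
generated = '''
upstream = "https://metrics.internal-cache.example/v1"
bucket   = "logs.s3.amazonaws.com"
image: registry.build-tools.example/app:latest
'''
candidates = candidate_domains(generated)
```

Names that survive the allowlist and fail to resolve are the ones worth checking for registrability.
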

Novelty

Distinct from package hallucination studies — this targets DNS-level infrastructure interception, a critical supply chain risk not previously measured in the LLM security literature.

Professional Credentials

DataCamp

4 Active Certifications · Issued 2026 · Valid through 2028

AI Engineer for Developers Associate
AI Engineer for Data Scientists Associate
Data Scientist Associate
Data Engineer Associate

Google Cybersecurity Professional Certificate

Coursera · Issued Aug 2023

Thought Leadership

Selected Publications & Patents

Silent Observers Make a Difference: A Large-scale Analysis of Transparent Proxies on the Internet.

Rui Bian et al. | IEEE INFOCOM, 2024.

Shining a Light on Dark Places: A Comprehensive Analysis of Open Proxy Ecosystem.

Rui Bian et al. | Computer Networks, 2022.

Towards Passive Analysis of Anycast in Global Routing: Unintended Impact of Remote Peering.

Rui Bian et al. | ACM SIGCOMM CCR, 2019.

Patent: Manufacturing method of micro lens / 一种微透镜的制作方法.

Gang Liu, Ying Xiong, Rui Bian et al. | CN104614936B.

Academic Service

Extensive peer review contributions ensuring the integrity and quality of high-tier network science and security venues.

Key TPC / Reviewer Roles:

  • IEEE INFOCOM ('17, '18, '19, '20, '21)
  • IEEE/IFIP DSN ('19, '21, '22)
  • IEEE Transactions on Network Science and Engineering (TNSE)
  • Computer Networks
  • IEEE ITEC, IEEE RTC, IEEE SmartSys

Let's Connect

Open to Senior Data Scientist, AI/ML Engineer, and leadership roles. Based in Los Angeles — open to hybrid and remote.

Email LinkedIn GitHub Resume
