Open to Opportunities

Scaling AI Capabilities from Research to Production.

AI & Data Science Leader with a Ph.D. in Computer Engineering. Built and deployed the iTAAP production data platform across 25+ California school districts — spanning agentic AI orchestration, multi-model ML prediction, browser automation, geospatial visualization, and full-lifecycle ETL. Published at IEEE INFOCOM, ACM SIGCOMM, and Elsevier.

25+ School Districts

150+ Power BI Dashboards

Rui Bian, PhD — AI & Data Science Leader
32 Systems Shipped
25+ School Districts Served
150+ Power BI Dashboards
18 School Sites Automated
10K+ California Schools Mapped
75–90% Pipeline Time Saved

Experience

2023 — Present
Lead Data Scientist

Expatiate Communications · Pasadena, CA

  • Architected the iTAAP data platform on Microsoft Fabric, serving 25+ California school districts
  • Designed an agentic AI ETL system that cut data processing time by 75–90%
  • Built Assessment Integration Engine (iReady, NWEA, IXL → State Testing levels) and Power BI/DAX compliance dashboards
  • Developed iTAAP Insights Email Alerts and secure cross-team data endpoints
  • Mentor data science interns on internal tooling and automation workflows
2017 — 2022
Ph.D. Researcher — Computer Engineering

University of Delaware · Newark, DE

  • Conducted internet-scale measurement research on BGP routing, transparent proxies, and open proxy ecosystems
  • Published at IEEE INFOCOM 2024, Computer Networks (Elsevier) 2022, and ACM SIGCOMM CCR 2019
  • Built large-scale Python/AWS data collection pipelines processing millions of network probes
  • TPC member and reviewer: IEEE INFOCOM, DSN, TNSE, Computer Networks, and others
  • GPA: 3.96 / 4.0 — graduated December 2022
Prior Experience
Engineering & Research Roles

7+ years across research and industry positions

  • Broad engineering background spanning systems, data, and applied research prior to doctoral studies
  • B.S. in Engineering — University of Science and Technology of China (USTC)

Core Strengths

Innovation

First-principles thinking — designing novel systems when off-the-shelf tools fall short, and choosing the right technology for each problem rather than defaulting to the familiar.

  • LangGraph + local Ollama LLM for privacy-preserving agentic ETL — student data never leaves the server
  • Progressive filtering algorithm that separates "no data" from "underperforming" — handles real-world data gaps correctly
  • Dual-API fallback chain (OpenAI → Gemini → structured JSON) — report generation never fails on a single API outage
  • Go CLI in a Python-dominant stack — deliberate language selection for reliability, not default
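
A minimal sketch of the fallback chain idea listed above, assuming each provider is wrapped as a plain function that either returns text or raises. The names `generate_report`, `openai_stub`, and `gemini_stub` are hypothetical stand-ins; no real API client is called here.

```python
from typing import Callable, Sequence

# Hypothetical provider type: takes a prompt, returns narrative text or raises.
Provider = Callable[[str], str]


def generate_report(prompt: str, providers: Sequence[Provider], metrics: dict) -> dict:
    """Try each provider in order; fall back to structured JSON if all fail."""
    errors = []
    for provider in providers:
        try:
            text = provider(prompt)
            return {"source": provider.__name__, "narrative": text, "errors": errors}
        except Exception as exc:  # any outage, timeout, or quota error
            errors.append(f"{provider.__name__}: {exc}")
    # Last resort: no narrative, but the raw metrics are still delivered.
    return {"source": "structured_json", "narrative": None, "metrics": metrics, "errors": errors}


def openai_stub(prompt: str) -> str:
    """Stand-in for a primary provider that is currently down."""
    raise TimeoutError("simulated outage")


def gemini_stub(prompt: str) -> str:
    """Stand-in for a secondary provider that responds normally."""
    return f"Summary for: {prompt}"


result = generate_report("District A attendance", [openai_stub, gemini_stub], {"attendance": 0.94})
fallback = generate_report("District A attendance", [openai_stub], {"attendance": 0.94})
```

Here `result` comes from the second provider and records the first failure, while `fallback` carries only the structured metrics.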

Reliability

Production systems designed to fail gracefully, recover automatically, and be safe for non-technical operators — with test modes, audit trails, and idempotent reruns.

  • Idempotent ETL pipelines (truncate + reload) — always safe to rerun after failure without data corruption
  • Per-school error recovery — a single API or browser failure never aborts the full multi-district run
  • Pre-flight validation (Extract_checker) catches bad files before any database write
  • Test mode in every user-facing system — no accidental production sends or overwrites

Domain Expertise

Deep knowledge across two very different fields — California K-12 education compliance and internet-scale network measurement — enabling contributions that generalist engineers cannot make.

  • Encoded federal IEP legal rules (IDEA) as computable thresholds: annual 365-day, triennial 3-year, assessment plan 60-day
  • IDEA LRE placement calculations matching California state benchmark categories
  • CALPADS, SEIS, Aeries SIS, CAASPP, ELPAC data structures and compliance requirements
  • IEEE INFOCOM 2024, ACM SIGCOMM CCR 2019, Elsevier Computer Networks 2022 — peer-reviewed network science publications

Leadership

Solo architect, builder, and owner of a 32-project production platform serving 25+ independent organizations — from design to deployment to ongoing support.

  • Multi-tenant architecture scales to new districts without code changes — onboarding is configuration, not engineering
  • Designed every system for non-technical operators: GUI tools, color-coded reports, double-click launchers
  • TPC member and reviewer: IEEE INFOCOM ('17–'21), IEEE/IFIP DSN ('19–'22), Elsevier Computer Networks, TNSE
  • Mentor data science interns on internal tooling, automation, and production engineering practices

High-Impact Engineering

Transforming EdTech Intelligence at Scale

Expatiate Communications · Lead Data Scientist

The Challenge

School districts lacked predictive visibility into IEP compliance, academic progress, and operational risk — relying on slow, fragmented manual data aggregation across disparate assessment platforms.

Architecture

  • Designed a LangGraph agentic pipeline orchestrator with gate-node conditional routing, fan-out parallelism across 4 concurrent stages, and a local Ollama LLM (qwen2.5:7b) that analyzes failures and suggests fixes — no student data leaves the server
  • Built a 6-domain ML prediction system (CAASPP ELA/Math, ELPAC, Chronic Absenteeism, College/Career, Suspension) using scikit-learn, ARIMA, and Holt-Winters with per-district model selection; integrated OpenAI + Gemini APIs for plain-language administrative narratives
  • Automated IEP compliance tracking across 21 districts: Playwright PDF download with MFA handling → PDF date parsing → green/yellow/orange/red deadline risk scoring → Power BI dashboards; zero missed IEP deadlines after deployment
  • Engineered 32 production tools covering ETL, compliance reporting, browser automation (Playwright + Selenium), async web scraping, SFTP delivery, a Go CLI client that loads Aeries SIS REST API data into MongoDB, and 150+ Power BI dashboards across 9 dashboard types
  • Built an automated Gmail API email alert system delivering data-driven weekly performance summaries to 18 school sites, with dynamic metric selection from a pool of 12 indicators

Business Impact

Platform deployed across 25+ California school districts. LangGraph agentic automation achieved a 75–90% reduction in pipeline processing time. Replaced dozens of hours of weekly manual work — data collection, compliance tracking, report generation, and stakeholder communication — with single-command automated pipelines.

Internet-Scale Transparent Proxy Analysis

University of Delaware Ph.D. Research · IEEE INFOCOM 2024

The Challenge

Transparent proxies silently intercept and modify web traffic without user awareness, but their true prevalence, behavior, and network-wide impact were poorly understood at scale.

Methodology

  • Designed a large-scale active measurement system to detect and fingerprint transparent proxies across global internet paths
  • Built Python-based data collection and analysis pipelines processing millions of network probes
  • Developed novel detection heuristics combining HTTP header analysis and TCP-level signals

Academic Impact

Published at IEEE INFOCOM 2024 — one of the top-ranked venues in computer networking, revealing the significant hidden influence of transparent proxies on internet traffic integrity.

Mapping the Open Proxy Ecosystem

University of Delaware Ph.D. Research · Computer Networks 2022

The Challenge

The open proxy landscape — used for anonymization, censorship circumvention, and malicious activity — had never been comprehensively characterized in terms of scale, geography, and behavior.

Methodology

  • Crawled, scanned, and analyzed 436,000+ open proxies across the global internet
  • Built large-scale data collection infrastructure using Python and AWS for distributed scanning
  • Applied statistical modeling and traffic analysis to characterize proxy behavior, uptime, and abuse patterns

Academic Impact

Published in Computer Networks (Elsevier), 2022 — delivering the first comprehensive analysis of the open proxy ecosystem and its security implications at internet scale.

Anycast Routing & Remote Peering Effects

University of Delaware Ph.D. Research · ACM SIGCOMM CCR 2019

The Challenge

Remote peering in BGP networks was known to distort anycast routing decisions, but the extent of this unintended impact on global traffic distribution — including for major cloud providers — had not been passively quantified.

Methodology

  • Developed a passive BGP measurement methodology to infer anycast catchment boundaries without active probing
  • Analyzed global BGP routing tables and AS-path data across hundreds of vantage points
  • Correlated routing anomalies with remote peering relationships at internet exchange points (IXPs)

Academic Impact

Published in ACM SIGCOMM Computer Communication Review, 2019 — a flagship networking venue — establishing foundational methodology for passive anycast analysis used in subsequent internet measurement research.

In Progress

AI Cloaking & Content Differentiation on the Open Web

Independent Research · Target: IMC / WWW / USENIX Security

The Challenge

As AI crawlers become ubiquitous, websites are moving beyond binary blocking (robots.txt) to a more sophisticated, unmeasured tactic: returning HTTP 200 responses to both humans and AI bots, but serving degraded, watermarked, or "poisoned" content specifically to crawlers like GPTBot.

Methodology

  • Twin-crawler framework (Playwright) visiting Tranco Top 10,000 domains — once as a standard browser UA, once as GPTBot
  • DOM tree structural comparison and text similarity scoring via Jaccard & TF-IDF cosine similarity
  • Sector-level taxonomy: paywall injection, text truncation, gibberish poisoning, visual watermarking

Novelty

Unlike prior work measuring blocking, this measures deception — filling a critical gap in understanding how the web's content landscape diverges between human and AI readers.

In Progress

LLM-Hallucinated Infrastructure Domains as an Attack Surface

Independent Research · Target: NDSS / CCS / USENIX Security

The Challenge

LLMs are widely used to generate Infrastructure-as-Code (Terraform, Kubernetes YAML, Nginx configs). If a model hallucinates a plausible but unregistered domain endpoint, an attacker could register that domain to intercept live API traffic or credentials from deployed systems.

Methodology

  • 1,000+ DevOps-focused prompts submitted to GPT-4o, Claude 3.5 Sonnet, and Llama-3-70B
  • Regex extraction of all generated domains, filtered against known public registries
  • DNS resolution + Registrar API queries to quantify hallucination rate and live registrability of phantom endpoints

Novelty

Distinct from package hallucination studies — this targets DNS-level infrastructure interception, a critical supply chain risk not previously measured in the LLM security literature.

Production Portfolio

Selected work from the iTAAP production platform — agentic AI orchestration, multi-model ML forecasting, full-lifecycle ETL, compliance automation, geospatial visualization, and systems engineering across 25+ California school districts.

01 AI / Orchestration

LangGraph SPED Pipeline Orchestrator

Agentic LangGraph state machine with gate-node conditional routing, 4-stage fan-out parallelism, and local Ollama LLM that diagnoses failures and suggests fixes. Student data never leaves the server.

LangGraph · Ollama · Python · subprocess · Tkinter
↑ 75–90% runtime reduction · replaced 4–6 hr manual process
02 Machine Learning

6-Domain Student Outcome Prediction

Per-district ML system predicting CAASPP ELA/Math, ELPAC, Chronic Absenteeism, College/Career, and Suspension. Per-district model selection across RF / Linear / ARIMA / Holt-Winters with OpenAI + Gemini narrative generation.

scikit-learn · ARIMA · Holt-Winters · OpenAI API · Gemini API
20+ districts · 6 prediction domains · AI-written reports
03 ML Forecasting

Per-School Suspension Rate Forecasting

5 models trained per school (ETS, ARIMA, Prophet, Random Forest, linear baseline). Ensemble of top 3 weighted by inverse validation error. Per-school comparison charts and ranked summary CSV.

Prophet · ARIMA · ETS · scikit-learn · statsmodels
Ensemble outperformed any single model in 70%+ of schools
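
The inverse-error weighting described in this card can be sketched as below. The function name and all numbers are illustrative assumptions, not values or results from the project.

```python
def inverse_error_ensemble(forecasts: dict[str, float], val_errors: dict[str, float], top_k: int = 3) -> float:
    """Blend the top_k models with the lowest validation error, weighted by 1/error."""
    best = sorted(val_errors, key=val_errors.get)[:top_k]
    weights = {m: 1.0 / max(val_errors[m], 1e-9) for m in best}  # guard against zero error
    total = sum(weights.values())
    return sum(forecasts[m] * weights[m] for m in best) / total


# Illustrative forecasts (suspension rate, %) and validation errors for one school.
forecasts = {"ets": 4.0, "arima": 5.0, "prophet": 6.0, "rf": 9.0, "linear": 10.0}
val_errors = {"ets": 1.0, "arima": 2.0, "prophet": 4.0, "rf": 8.0, "linear": 8.0}

blended = inverse_error_ensemble(forecasts, val_errors)
```

With these numbers the three lowest-error models get weights 1, 0.5, and 0.25, so the blend is 8 / 1.75, roughly 4.57, and leans toward the most accurate model.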
04 AI / Algorithm

School Recommendation Engine

Progressive multi-indicator filtering algorithm that finds nearby California schools outperforming a given school. Handles data gaps correctly — "no data" is not treated as "underperforming." Haversine radius search + Folium map output.

Python · Haversine · Folium · pandas · geopy
6 indicators · data-gap-aware algorithm · statewide coverage
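
One possible reading of the data-gap handling, sketched under stated assumptions: indicators are applied one at a time, a missing value on either side is skipped instead of counted as a failure, higher values are treated as better, and at least one real comparison is required. The function and indicator names are hypothetical.

```python
from typing import Optional

Indicators = dict[str, Optional[float]]


def outperforms(candidate: Indicators, target: Indicators, names: list[str]) -> bool:
    """Apply indicators one at a time; skip any indicator where either side has no data."""
    compared = 0
    for name in names:
        c, t = candidate.get(name), target.get(name)
        if c is None or t is None:
            continue  # "no data" is neither a pass nor a fail
        if c <= t:
            return False  # real data shows the candidate is not better
        compared += 1
    return compared > 0  # require at least one real comparison


names = ["ela", "math", "attendance"]
target = {"ela": 50.0, "math": 40.0, "attendance": None}

school_a = {"ela": 60.0, "math": None}   # better on ELA, math unknown
school_b = {"ela": 45.0, "math": 70.0}   # worse on ELA
school_c = {"ela": None, "math": None}   # nothing to compare

results = [outperforms(s, target, names) for s in (school_a, school_b, school_c)]
```

School A passes on the one indicator it reports, school B is excluded by real data, and school C is excluded only because nothing can be compared.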
05 Compliance

IEP Compliance Pipeline

4-step pipeline encoding federal IDEA law as computable thresholds: SQL query → Playwright PDF download with MFA handling → PDF date extraction → green/yellow/orange/red risk scoring published to Power BI dashboards.

Playwright · PDF parsing · SQL Server · Power BI
21 districts · hundreds of students tracked · zero missed deadlines
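
The final scoring step might look like the sketch below. The 365-day, 3-year, and 60-day windows come from the text; the color cutoffs (30 and 60 days remaining) and the function name are hypothetical, and the triennial window is approximated as 3 × 365 days.

```python
from datetime import date, timedelta

# Windows named in the text; triennial approximated as 3 * 365 days here.
WINDOWS = {"annual": 365, "triennial": 3 * 365, "assessment_plan": 60}


def risk_color(kind: str, last_event: date, today: date) -> str:
    """Map days remaining until the deadline to a color band.

    The 30- and 60-day cutoffs are hypothetical, not the platform's values.
    """
    due = last_event + timedelta(days=WINDOWS[kind])
    remaining = (due - today).days
    if remaining < 0:
        return "red"  # deadline already passed
    if remaining <= 30:
        return "orange"
    if remaining <= 60:
        return "yellow"
    return "green"


today = date(2025, 3, 1)
colors = [
    risk_color("annual", date(2024, 6, 1), today),           # 92 days left
    risk_color("assessment_plan", date(2025, 2, 15), today), # 46 days left
    risk_color("annual", date(2024, 3, 15), today),          # 14 days left
    risk_color("annual", date(2024, 2, 1), today),           # 29 days overdue
]
```

The four sample cases fall into green, yellow, orange, and red respectively.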
06 Analytics

At-Risk Student Identification

Cross-domain risk flagging across 14 LEAs: joins CAASPP, ELPAC, discipline, and SPED data to assign SST, 504, and SEL flags. Color-coded Excel reports via openpyxl + SQL Server write-back for Power BI dashboards.

Python · pandas · openpyxl · SQL Server · Power BI
14 LEAs · replaced 14 ad-hoc processes with one standard system
07 Visualization

California School Radar Map

Streamlit app with Plotly Mapbox scatter map, vectorized Haversine radius search, zip-code geocoding, click-to-move centering, and radar chart comparison across 6 performance indicators for any two schools side by side.

Streamlit · Plotly Mapbox · NumPy · SQL Server
10,000+ California schools · live proximity filtering · sub-second
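
A vectorized Haversine radius search can be sketched with NumPy as follows. The coordinates are approximate city centers used as stand-ins for school locations, and the 25 km radius is an arbitrary example value.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat0: float, lon0: float, lats: np.ndarray, lons: np.ndarray) -> np.ndarray:
    """Great-circle distance from one point to every point in the arrays."""
    lat0, lon0 = np.radians(lat0), np.radians(lon0)
    lats, lons = np.radians(lats), np.radians(lons)
    a = np.sin((lats - lat0) / 2) ** 2 + np.cos(lat0) * np.cos(lats) * np.sin((lons - lon0) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Approximate city-center coordinates, not real school data.
names = np.array(["Pasadena", "Los Angeles", "San Francisco"])
lats = np.array([34.1478, 34.0522, 37.7749])
lons = np.array([-118.1445, -118.2437, -122.4194])

dist = haversine_km(34.1478, -118.1445, lats, lons)
within_25km = names[dist <= 25.0]
```

Computing every distance in one array expression, with no Python loop, is what keeps a proximity filter over thousands of points fast.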
08 Data Pipeline

CALPADS ODS Hybrid Pipeline

Dual-mode pipeline: REST API with token auth for 5 report types, Selenium browser automation for portal-only reports. Dual-credential routing maps each LEA to its assigned CALPADS account automatically.

Python · requests · Selenium · SQLAlchemy
25+ districts · 7 report types · replaced 45-min manual session
09 Web Scraping

ASHA Pro Finder Async Scraper

Async two-phase Playwright scraper: Phase 1 paginates Coveo search API to collect profile IDs; Phase 2 scrapes each profile with BeautifulSoup regex ID matching. Per-record error isolation — failures log and continue without aborting.

asyncio · Playwright · BeautifulSoup · Python
~6,000 SLP & audiologist profiles · name · cert · phone · address
10 Data Engineering

CALPADS Multi-District Extract Uploader

Validation-first ETL: pre-flight LEA code check across 25+ districts, auto-fix for trailing delimiter errors (dry-run + apply-fix flags), idempotent truncate+reload, freshness tracking upserted to SQL Server for Power BI staleness indicators.

Python · pandas · SQLAlchemy · pyodbc
Auto-fix resolved 80% of errors · eliminated silent data corruption
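
A sketch of a trailing-delimiter fix with a dry-run default. The caret delimiter, the three-field layout, and the function name are assumptions for illustration; only surplus empty fields at the end of a row are trimmed.

```python
def fix_trailing_delimiters(lines: list[str], expected_fields: int, delimiter: str = "^", apply_fix: bool = False):
    """Report rows with surplus trailing delimiters; rewrite them only when apply_fix is set."""
    fixed, report = [], []
    for number, line in enumerate(lines, start=1):
        fields = line.split(delimiter)
        extra = len(fields) - expected_fields
        # Only trim when every surplus field at the end of the row is empty.
        if extra > 0 and all(f == "" for f in fields[-extra:]):
            report.append(number)
            fields = fields[:expected_fields]
        fixed.append(delimiter.join(fields))
    return (fixed if apply_fix else list(lines)), report


rows = ["0001^Smith^9", "0002^Lee^10^", "0003^Diaz^^"]

dry_lines, dry_report = fix_trailing_delimiters(rows, expected_fields=3)
fixed_lines, fixed_report = fix_trailing_delimiters(rows, expected_fields=3, apply_fix=True)
```

The dry run returns the rows unchanged along with the affected line numbers, so an operator can review the report before rerunning with `apply_fix=True`.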
11 Go / Systems

Aeries SIS API Go Client

Go CLI fetching 12 Aeries SIS dataset types into MongoDB. Dynamically discovers high schools by HighGradeLevel field — no hardcoded lists. Per-school error recovery continues on API failure. `init()` guard validates config before any network call.

Go · MongoDB · REST API · net/http
12 dataset types · dynamic high school discovery · zero maintenance
12 Automation

Gmail API Weekly Alert System

OAuth2 Gmail API pipeline querying SQL Server and dynamically selecting metrics from a configurable pool of 12 indicators. Constructs HTML-formatted tables and delivers weekly performance summaries to 18 school sites. Test mode prevents accidental sends.

Gmail API · OAuth2 · Python · SQL Server
18 school sites · 12 dynamic indicators · fully automated weekly

Tech Ecosystem

LLM & Agentic AI

LangGraph · Ollama (Local LLM) · OpenAI API · Gemini API · Agentic AI Design · RAG · Prompt Engineering · Fan-out Parallelism · MCP

ML & Forecasting

scikit-learn · Prophet · ARIMA · Holt-Winters (ETS) · Ensemble Methods · Feature Engineering · statsmodels · Predictive Modeling

Data Engineering

Python / pandas / NumPy · SQL Server / T-SQL · SQLAlchemy / pyodbc · ETL Pipelines · Playwright · Selenium · asyncio · Paramiko / SFTP

Visualization & Systems

Streamlit · Plotly Mapbox · Power BI / DAX · Folium · Go (Golang) · MongoDB · Multi-tenant Architecture · Gmail API / OAuth2

Professional Credentials

DataCamp

4 Active Certifications · Issued 2026 · Valid through 2028

AI Engineer for Developers Associate
AI Engineer for Data Scientists Associate
Data Scientist Associate
Data Engineer Associate

Google Cybersecurity Professional Certificate

Coursera · Issued Aug 2023

Thought Leadership

Selected Publications & Patents

Silent Observers Make a Difference: A Large-scale Analysis of Transparent Proxies on the Internet.

Rui Bian et al. | IEEE INFOCOM, 2024.

Shining a Light on Dark Places: A Comprehensive Analysis of Open Proxy Ecosystem.

Rui Bian et al. | Computer Networks, 2022.

Towards Passive Analysis of Anycast in Global Routing: Unintended Impact of Remote Peering.

Rui Bian et al. | ACM SIGCOMM CCR, 2019.

Patent: Manufacturing method of micro lens / 一种微透镜的制作方法.

Gang Liu, Ying Xiong, Rui Bian et al. | CN104614936B.

Academic Service

Extensive peer review contributions ensuring the integrity and quality of high-tier network science and security venues.

Key TPC / Reviewer Roles:

  • IEEE INFOCOM ('17, '18, '19, '20, '21)
  • IEEE/IFIP DSN ('19, '21, '22)
  • IEEE Transactions on Network Science and Engineering (TNSE)
  • Computer Networks
  • IEEE ITEC, IEEE RTC, IEEE SmartSys

Let's Connect

Open to Senior Data Scientist, AI/ML Engineer, and leadership roles. Based in Los Angeles — open to hybrid and remote.