Hi, I'm
Focused on video generation, image generation, and multimodal AI research. Also a passionate content creator making AI-powered short films.
AI Researcher × Full-Stack Builder × Content Creator
Career Direction
AI Engineer / Machine Learning Engineer
Research Areas
Triple Identity
Full-Stack Builder
Product design, algorithm R&D, engineering, testing & deployment — end-to-end capability
AI Researcher
Deep dive into video generation and multimodal domains, tracking cutting-edge papers
Content Creator
Directing, filming & editing — creating AI short films and cinematic driving footage
Cross-disciplinary skill stack
Full-Stack Builder focused on LLM fine-tuning, agent systems, RAG architecture, and production-oriented backend delivery, differentiated by causal inference and measurement skills.
LLM & GenAI Engineering
Core strengths around model integration, fine-tuning, alignment, and inference optimization.
Agent Systems
Product-oriented agent orchestration, tool use, workflow automation, and guardrail design.
RAG & Knowledge Systems
Retrieval, knowledge organization, query transformation, and context engineering across document AI systems.
Machine Learning & Multimodal
A combined view of classical ML, deep learning, and multimodal modeling that matches an applied-AI profile.
Optimization, Infra & MLOps
Distributed training, inference optimization, service APIs, and deployment-minded engineering support.
Causal Inference & Analytics
The strongest differentiator for showing that you can measure impact, not just build models or workflows.
Full-stack AI platforms, document intelligence systems, and model-tuning workflows
Agentic GraphRAG (Vertical Domain)
No Neo4j: LangExtract builds a Python-dict knowledge graph of entities + relations alongside a Chroma store, and a 3-tool agent picks vector / graph / hybrid retrieval with multi-hop traversal. Extractions carry char_interval for traceability.
NL2SQL Data-Analysis Agent
A Vanna-forked ReAct agent that turns a one-sentence question into SQL, runs it on MySQL, and returns a table + chart + explanation. Accuracy comes from RAG over three Milvus collections (DDL / business docs / historical SQL).
AI Document Review Agent v2.0
Full-stack document review: MinerU parses the PDF, a LangChain v1.1 + DeepSeek pipeline flags grammar issues and over-definitive language, streams each onto the PDF at its bounding box, with custom rules and human-in-the-loop review.
OpenClaw Skill Development
A practical study of OpenClaw's Skill system (teach an agent via SKILL.md, not code plugins), a complete Daily Briefing skill built from scratch, and a Lobster workflow chaining search → summarize → approve → push.
Harness Engineering in Practice
Output quality = model capability × design level. The four pillars of engineering an agent runtime (codebase-as-truth / mechanized constraints / feedback loops / entropy mgmt). Measured: model unchanged, the Harness alone lifts Terminal Bench 52.8% → 66.5%.
Agent Long/Short-Term Memory System
Short-term SessionManager (truncation MAX_HISTORY=20 + rolling summary) + long-term MEMORY.md (flips to RAG past 2000 tokens), unified by a MemoryManager hub; production swaps in mem0 (LLM judge ADD/UPDATE/DELETE/NONE) + Milvus.
OpenClaw Multi-Agent Orchestration
Multi-agent reduced to three MCP primitives (spawn/send/history), with six modes on top (Hub-Spoke/Pipeline/Hierarchical/Routing/P2P/Fleet). Understanding Hub one-directional dispatch, the subagent-layer sessions_send ban, and why P2P has zero production cases.
Multimodal Vision LLM (PandaGPT)
ImageBind binds 6 modalities into one space, a single linear projection feeds Vicuna — PandaGPT trains only on image-text yet emergently understands audio/depth. Plus VPT visual-prompt tuning for pathology downstream transfer.
TensorRT Inference Optimization
Shipping a trained model to the edge: ONNX → TensorRT engine build → layer/tensor fusion (Conv+BN+ReLU collapses into one CBR kernel) → INT8/FP16 PTQ calibration → a custom NMS plugin (IPluginV2) → SSD object-detection inference. The senior MLSys piece the portfolio lacks.
YOLOv12 Steel Surface Defect Detection
An Ultralytics YOLOv12 detector trained on NEU-DET: 6 defect classes, ~5000 images, full train → val → predict pipeline for automated steel quality inspection. A reproducible recipe + an illustrative inference demo.
AI Analyst — an LLM that builds its own models
An LLM acting as an analyst: it orchestrates tools via Function-Calling — Text2SQL (create_sql_agent) pulls features from MySQL, then it fits interpretable models on the fly (linear regression to decompose spend + a decision tree to find drivers) and returns an actionable recommendation. The net-new angle is an LLM that builds its own models, not NL→SQL→chart.
PF-Net 3D Point-Cloud Completion
A different data modality: 3D unordered point sets. GAN-based completion with PF-Net (Point Fractal Network) — on ShapeNet-Part, a self-supervised 512-point crop as GT, a multi-scale FPS encoder (1920-d) + residual pyramid decoder fill the hole coarse(64)→center2(128)→fine(512), constrained by Chamfer Distance + an adversarial loss.
Cross-Platform Spatial Interaction Layer (Quest + Vision Pro)
A study-derived case from the SpatialXR Unity video courses: OpenXR at the base, a forking device layer (Meta XR SDK vs PolySpatial/Metal), and a constant XR Interaction Toolkit on top. One chain: hand-skeleton → pinch → ray → grab → poke World-Space UI. A companion runnable Unity project is open-sourced.
Colocated Large-Space Multiplayer MR
A study-derived case from the SpatialXR video courses: the core hard problem of colocated multiplayer MR — headsets' local frames converge to ONE shared origin via spatial anchors + alignment, plus player/object state sync over a public-internet relay, targeting Pico large-space. The Netcode SDK is unverified (video-only). Not a shipped Unity app.
Academic research and technical exploration
Efficient Video Generation with Diffusion Models
CVPR 2026Your Name, et al.
A novel efficient video diffusion architecture that significantly reduces computational cost while maintaining generation quality.
A Unified Framework for Multimodal Temporal Understanding
NeurIPS 2025Your Name, et al.
A unified multimodal temporal understanding framework integrating visual, language, and audio signals for temporal reasoning.
Technical insights and reflections
AI Short Films · Cinematic Driving · Visual Stories
AI-Generated Cyber City
A cyberpunk city short film generated with Sora and Runway
Mountain Road Sunset Drive
4K cinematic driving footage capturing sunset on mountain roads
AI × Traditional Animation
A traditional Chinese animation short made with AI tools
City Night Cruise
Night driving through the city with neon lights and traffic
Quick answers to a few common questions
I'm open to AI Engineer and Machine Learning Engineer roles, focused on video generation, image generation, and multimodal systems. Full-time positions or high-impact contract work are both welcome.
Let's connect