"""A 3-axis overview of the VLM landscape (the survey-standard organization): ① Architecture / objective axis — HOW vision meets language ② Capability / extension axis — WHAT it can do ③ Training-stage axis — HOW it is built (a left-to-right pipeline) Reuses gen_vlm.build_data() for names / colors / taglines / learn links, and renders a static (but linked + tooltip'd) layered diagram. Self-contained HTML. Run: .venv_robot_paradigms/bin/python gen_vlm_axes.py -> robot_vlm_axes.html """ import html as H import gen_vlm AXIS_A = ["Backbone", "Contrastive", "Masked", "Generative", "Bridge", "Native"] AXIS_B = ["Grounding", "AnyRes", "Video", "Document", "Unified", "Efficient", "MoE", "Agentic"] STAGES = [ ("① Vision pretraining", "learn to see (often label-free)", ["vit", "ssl"]), ("② VL pretraining / alignment", "align or generate over image+text", ["clip", "siglip", "flava", "beit3", "blip", "coca", "git", "chameleon"]), ("③ Connector / bridge", "wire a vision encoder into an LLM", ["flamingo", "blip2", "llava", "qwenvl"]), ("④ Instruction tuning", "become a helpful assistant", ["instructtune"]), ("⑤ Preference (RLHF / DPO)", "be truthful, not hallucinated", ["mmrlhf"]), ("⑥ Inference-time", "no weight change: tools & retrieval", ["agentic-vlm", "mm-rag", "frontier"]), ] def esc(t): return H.escape(str(t), quote=True) def render(): d = gen_vlm.build_data() byid = {p["id"]: p for p in d["paradigms"]} fam = {f["key"]: f for f in d["families"]} def chip(pid): p = byid.get(pid) if not p: return "" c = fam[p["family"]]["color"] url = (p.get("learn") or {}).get("url", "") a_open = '' % ( c, esc(url), esc(p["tagline"])) return a_open + esc(p["short"]) + "" def fam_card(fkey): f = fam[fkey] kids = [p for p in d["paradigms"] if p["family"] == fkey] chips = "".join(chip(p["id"]) for p in kids) return ('
The same models, organized the way recent surveys do: by architecture (how vision meets language), by capability (what it can do), and by training stage (how it's built). Hover a chip for its one-line idea; click to read.