InsightsNet/texannotate

LPSB: LaTeX-PDF Semantic Bridge (LaTeX Rainbow 2.0)

LPSB extracts semantic structure from LaTeX compilations and aligns it with PDF page coordinates. It produces a single event log (*.lpsb.json) that can be rebuilt into a tree (*.structure.json) and can be enriched with MathML from a LuaLaTeX pass.

Status / Version Notice (READ THIS FIRST)

This repository is LaTeX Rainbow 2.0: a complete rewrite / restructuring of the original codebase. It is under active development and not stable.

If you need a version that works today, use the v1 branch (the previous implementation).

Key Features

  • PDF structure events: emits start/end events for common PDF/UA-ish roles.
    • Structure: Document, Sect, Div
    • Headings: H1-H6
    • Lists: L, LI, Lbl
    • Tables: Table, TR (rows)
    • Figures: Figure, Caption
    • Inline: Strong, Em, Link, Reference
  • Math capture
    • PDFLaTeX: captures display math environments as Formula (e.g. equation, align).
    • LuaLaTeX extension (lpsb-luamath): captures all math including $...$ and can emit MathML via luamml.
  • Robust logs: solver.py includes fault-tolerant parsing for “dirty” JSON emitted by TeX.
  • Batchable via Docker: scripts run against arXiv source tarballs, producing reproducible output directories.
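As an illustration of the fault-tolerant parsing mentioned above, here is a minimal sketch of a loader for "dirty" logs. It is not the logic in solver.py. It assumes the log holds one JSON object per line and that the most common damage is an unescaped TeX backslash (for example \alpha inside a string); the function name `load_events` and the field names in the usage note are hypothetical.

```python
import json
import re

VALID_ESCAPES = '"\\/bfnrtu'   # characters that may legally follow a backslash in JSON
ESCAPE_PAIR = re.compile(r"\\(.)")


def _escape_stray_backslashes(line):
    """Double every backslash that does not start a valid JSON escape."""
    def fix(match):
        if match.group(1) in VALID_ESCAPES:
            return match.group(0)          # keep valid pairs such as \\ or \n
        return "\\\\" + match.group(1)     # \a becomes \\a
    return ESCAPE_PAIR.sub(fix, line)


def load_events(text):
    """Parse one JSON object per line; repair or skip damaged lines.

    Returns (events, skipped), where skipped counts unrecoverable lines.
    """
    events, skipped = [], 0
    for raw in text.splitlines():
        line = raw.strip().rstrip(",")     # tolerate array-style trailing commas
        if not line or line in ("[", "]"):
            continue
        for candidate in (line, _escape_stray_backslashes(line)):
            try:
                events.append(json.loads(candidate))
                break
            except json.JSONDecodeError:
                continue
        else:
            skipped += 1                   # neither the raw nor the repaired line parsed
    return events, skipped
```

A known blind spot of this approach: macros such as \beta or \nu begin with a valid JSON escape (\b, \n) and are therefore decoded as control characters rather than repaired. A production loader would need a stricter rule or an escaping step on the TeX side.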

Quick Start

Build the recommended Docker image (MathML enabled)

The scripts prefer a local image tag lpsb-texlive:latest if available.

docker build -f docker/Dockerfile.latest -t lpsb-texlive:latest docker

Optional: build TeX Live 2023 image (for older arXiv toolchains)

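Some arXiv sources pin a TeX Live release (via the texlive_version field in 00README.json). This matters most for biblatex: a pre-generated .bbl file is tied to a specific bbl format version and may be rejected by a biblatex release that expects a different one.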

If you want reproducible builds for TL2023 papers, build this tag too:

docker build -f docker/Dockerfile.tl2023 -t lpsb-texlive:TL2023-historic docker

Note: docker/Dockerfile.tl2023 also vendors luamml runtime into the historic image (under TEXMFLOCAL) so the LuaLaTeX pass can emit MathML even on frozen TeX Live releases.

Optional: build TeX Live 2022 image (for older biblatex .bbl formats)

Older arXiv sources (pre-00README.json) may ship a pre-generated biblatex .bbl that is format-version strict. For those, compiling with a newer TeX Live can fail (or silently diverge).

This repo provides a TL2022 historic image that also vendors luamml runtime (so the LuaLaTeX pass can still emit MathML):

docker build -f docker/Dockerfile.tl2022 -t lpsb-texlive:TL2022-historic docker

Single file (local directory)

Copy lpsb.sty into the same directory as your main.tex and add:

\usepackage{lpsb}

Then compile:

docker run --rm -v "$(pwd)":/workdir -w /workdir lpsb-texlive:latest \
  pdflatex -interaction=nonstopmode main.tex

Output: main.lpsb.json (Structure only)

Dual-Track Compilation (MathML Support)

To extract MathML and inline formulas ($...$), run a second pass with LuaLaTeX:

  1. Pass 1 (Structure): Run PDFLaTeX as above.
  2. Pass 2 (Math): Add \usepackage{lpsb-luamath} to your tex file (or inject it), then run:

docker run --rm -v "$(pwd)":/workdir -w /workdir lpsb-texlive:latest \
  lualatex -interaction=nonstopmode main.tex

Output: main.lpsb-math.json (Math events with MathML)

Note: The official batch scripts (script/test_lua_batch.sh) handle package injection automatically.
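If you are not using the batch scripts, package injection can be done with a few lines of Python. The sketch below is one possible approach, not a description of what script/test_lua_batch.sh does: it inserts the \usepackage line directly after \documentclass. The function name `inject_package` is hypothetical.

```python
import re

# \documentclass with an optional [options] group and a mandatory {class} group.
DOCCLASS = re.compile(r"\\documentclass\s*(\[[^\]]*\])?\s*\{[^}]*\}")


def inject_package(tex_source, package="lpsb-luamath"):
    """Return tex_source with a usepackage line added after the document class."""
    line = "\\usepackage{" + package + "}"
    if line in tex_source:
        return tex_source                  # already present, nothing to do
    match = DOCCLASS.search(tex_source)
    if match is None:
        raise ValueError("no \\documentclass found")
    end = match.end()
    return tex_source[:end] + "\n" + line + tex_source[end:]
```

The regular expression does not skip commented-out \documentclass lines or class options that contain nested brackets, so treat it as a starting point and write the result to a copy of main.tex rather than overwriting the original.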

Build a structure tree

Use the Python solver to reconstruct the DOM tree:

python3 solver.py main.lpsb.json --output main.structure.json --validate

Optional: extract table cells (TR/TD) from the PDF

By default, LPSB does not hook tabular internals (to avoid TeX alignment/rule artifacts). If you need TR/TD, extract them from the compiled PDF using pdfplumber:

python3 solver.py main.lpsb.json --pdf main.pdf --extract-cells --output main.structure.json

Details: see docs/table_cells.md.
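For orientation, PDF-side extraction with pdfplumber can be sketched as follows. This is an illustrative example, not the code behind --extract-cells: it uses pdfplumber's `find_tables()` and the `cells` bounding boxes of each detected table, then groups cells into rows by their top edge. The helper names and the 2-point tolerance are assumptions, and spans are not handled.

```python
def group_rows(cells, tol=2.0):
    """Group cell boxes (x0, top, x1, bottom) into rows by their top edge."""
    rows = []
    for cell in sorted(cells, key=lambda c: (c[1], c[0])):
        # Same row if the top edge is within tol of the row's first cell.
        if rows and abs(rows[-1][0][1] - cell[1]) <= tol:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    # Order the cells of each row from left to right.
    return [sorted(row, key=lambda c: c[0]) for row in rows]


def table_cells(pdf_path):
    """Return one dict per detected table: page number, table bbox, rows of cell boxes."""
    import pdfplumber  # third-party; imported lazily so group_rows has no dependencies

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.find_tables():
                tables.append({
                    "page": page.page_number,
                    "bbox": table.bbox,
                    "rows": group_rows(table.cells),
                })
    return tables
```

Calling `table_cells("main.pdf")` yields TR/TD-like geometry that can be matched against Table events by page and bounding box. Detection quality depends on pdfplumber's table settings and on whether the table has ruled lines.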

Optional (LuaLaTeX): table pass (TR/TD + colspan + bbox)

If you want TR/TD from a LuaLaTeX pass (similar to the MathML pipeline), see:

  • docs/lua_table_pass.md

Batch processing (arXiv source tarballs)

Put *.tar.gz into data/download/, then run:

# Recommended: one-shot gold pipeline
# - Stage A (gold): pdflatex -> PDF + *.lpsb.json
# - Stage B (enrich): lualatex -> *.lpsb-math.json (MathML) + *.lpsb-table.json
# - Merge: *.lpsb.merged.json
bash script/batch_compile_all.sh

Outputs:

  • compile_results/<paper>/: per-paper outputs (gold + enrich + logs)
    • _pdflatex/: gold PDF + *.lpsb.json
    • _lualatex/: *.lpsb-math.json (+ MathML when available) and *.lpsb-table.json
    • *.lpsb.merged.json: merged output (structure enriched with mathml and table events)

TeX Live selection:

  • New arXiv sources: read 00README.json / texlive_version.
  • Old arXiv sources (no 00README.json): if a biblatex .bbl exists, infer TeX Live from the header (% $ biblatex bbl format version X.Y $) and pick a matching historic image (heuristic).
  • The script prints: TeX Live selected: <year> (image: <docker-tag>).
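The selection rules above can be sketched in Python. This is a hedged illustration rather than the batch script's implementation: it assumes texlive_version is a top-level numeric key in 00README.json, and the mapping from bbl format version to TeX Live year is passed in by the caller because this README does not list it. `select_texlive` is a hypothetical name.

```python
import json
import re
from pathlib import Path

# Matches the header line "% $ biblatex bbl format version X.Y $".
BBL_HEADER = re.compile(r"%\s*\$\s*biblatex bbl format version\s+(\d+\.\d+)\s*\$")


def select_texlive(src_dir, bbl_to_year):
    """Return (year, reason), or (None, reason) when nothing can be inferred."""
    src = Path(src_dir)

    # Rule 1: newer arXiv sources state the release explicitly.
    readme = src / "00README.json"
    if readme.is_file():
        data = json.loads(readme.read_text(encoding="utf-8"))
        version = data.get("texlive_version")  # assumption: top-level key
        if version is not None:
            return int(version), "00README.json"

    # Rule 2: older sources; infer from the biblatex .bbl header (heuristic).
    for bbl in sorted(src.glob("*.bbl")):
        head = bbl.read_text(encoding="utf-8", errors="replace")[:2000]
        match = BBL_HEADER.search(head)
        if match and match.group(1) in bbl_to_year:
            fmt = match.group(1)
            return bbl_to_year[fmt], f"{bbl.name} (bbl format {fmt})"

    return None, "no hint found"
```

When the function returns None, a caller would typically fall back to the default image (lpsb-texlive:latest).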

Architecture

LPSB uses a 3-stage pipeline:

  1. Instrumentation (LaTeX): lpsb.sty hooks into LaTeX commands/environments and emits events + coordinates.
  2. Compilation (Docker): scripts compile papers in a container.
  3. Reconstruction (Python): solver.py parses the log, validates nesting, and builds a tree.

Optional:

  1. Math enrichment (LuaLaTeX): lpsb-luamath.sty and lpsb-math.lua capture math and emit MathML using luamml. merge_lpsb.py then merges the math events into the structure log by context-aware IDs.

┌─────────────────────────────────────────────────────────────────────┐
│                         LaTeX source (.tex)                          │
└─────────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼ (package injected / \usepackage{lpsb})
┌─────────────────────────────────────────────────────────────────────┐
│                          Structure pass (PDFLaTeX)                   │
│  lpsb.sty                                                             │
│  - hooks environments/commands                                        │
│  - emits start/end events + page/coords                               │
└─────────────────────────────────────────────────────────────────────┘
                │                                   │
                ▼                                   ▼
      ┌──────────────────────┐            ┌──────────────────────────┐
      │  PDF output (*.pdf)  │            │  event log (*.lpsb.json) │
      └──────────────────────┘            └──────────────────────────┘
                                                   │
                                                   ▼
                                         ┌──────────────────────────┐
                                         │ solver.py                │
                                         │ - parse/validate events  │
                                         │ - build structure tree   │
                                         └──────────────────────────┘
                                                   │
                                                   ▼
                                         ┌──────────────────────────┐
                                         │ *.structure.json (tree)  │
                                         └──────────────────────────┘
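The reconstruction step in the diagram above amounts to folding a flat list of start/end events into a tree with a stack. The sketch below shows the general technique only; the event schema (an "event" field with "start" or "end", and a "role" field) is an assumption made for this example and may differ from the real *.lpsb.json format.

```python
def build_tree(events):
    """Fold start/end events into a nested tree and collect nesting errors."""
    root = {"role": "Root", "children": []}
    stack = [root]          # stack[-1] is the currently open element
    errors = []

    for index, ev in enumerate(events):
        if ev["event"] == "start":
            node = {"role": ev["role"], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif ev["event"] == "end":
            # An end event must close the innermost open element.
            if len(stack) > 1 and stack[-1]["role"] == ev["role"]:
                stack.pop()
            else:
                errors.append(f"event {index}: unexpected end of {ev['role']}")

    # Anything still open at the end was never closed.
    for node in stack[1:]:
        errors.append(f"unclosed {node['role']}")
    return root, errors
```

Mismatched end events are reported and ignored here, which keeps the tree usable when a macro bypasses a hook. A stricter validator could instead pop back to the nearest matching ancestor.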

Optional MathML track (LuaLaTeX):

┌─────────────────────────────────────────────────────────────────────┐
│                          Math pass (LuaLaTeX)                        │
│  lpsb-luamath.sty + lpsb-math.lua                                    │
│  - captures inline + display math (incl. $...$)                      │
│  - emits Math events + MathML via luamml                             │
└─────────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
                      ┌──────────────────────────────┐
                      │ *.lpsb-math.json (Math/MathML)│
                      └──────────────────────────────┘
                                 │
                                 ▼
                      ┌──────────────────────────────┐
                      │ merge_lpsb.py / hybrid_merge  │
                      │ -> merged JSON (mathml fields)│
                      └──────────────────────────────┘
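The merge step at the bottom of the diagram can be pictured as a join on event IDs. The following is a simplified sketch, not merge_lpsb.py itself: the field names "event", "id", and "mathml" are assumptions for this example, and the context-aware part of the real ID scheme is not modeled.

```python
def merge_mathml(structure_events, math_events):
    """Attach MathML from the math log to structure start events with the same ID.

    Returns (merged, unmatched): merged is a new event list; unmatched holds
    math events whose ID does not occur in the structure log.
    """
    mathml_by_id = {
        ev["id"]: ev["mathml"]
        for ev in math_events
        if ev.get("id") and ev.get("mathml")
    }

    merged = []
    for ev in structure_events:
        ev = dict(ev)   # copy so the caller's events are not mutated
        if ev.get("event") == "start" and ev.get("id") in mathml_by_id:
            ev["mathml"] = mathml_by_id[ev["id"]]
        merged.append(ev)

    known_ids = {ev.get("id") for ev in structure_events}
    unmatched = [ev for ev in math_events if ev.get("id") not in known_ids]
    return merged, unmatched
```

Inline formulas ($...$) appear only in the LuaLaTeX log, so in this sketch they end up in `unmatched`. A full merge would have to place them into the structure by position or context instead of dropping them.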

Current Capabilities & Status

| Category | Feature | Status | Notes |
| --- | --- | --- | --- |
| Structure | Sections (Sect) | | Maps \section, \chapter etc. |
| | Headings (H1-H6) | | Captures titles and hierarchy level |
| Blocks | Lists (L, LI) | | itemize, enumerate, description |
| | List Labels (Lbl) | | Captures 1., a), etc. |
| | Paragraphs (P) | | Uses kernel paragraph hooks (para/begin, para/end) |
| Tables | Table Container | ✅ (floats) / ⚠️ (longtable) | Float table/table* are instrumented. longtable hooks are disabled due to visual artifacts. |
| | Rows (TR) | ⚠️ | Best-effort; longtable row semantics are not reliable via TeX hooks. |
| | Cells (TD) | ⚠️ (Lua) / ✅ (PDF-side) | Lua TD capture is best-effort; recommended approach is PDF-side extraction (extract_cells.py). |
| | Colspan/Rowspan | ✅ (PDF-side) / ⚠️ (Lua) | PDF-side extractor supports spans; Lua pass may infer colspan but is not authoritative. |
| | Cell Bbox | ✅ (Lua) / ✅ (PDF-side) | Lua bbox is approximate; PDF-side tends to be more stable across engines. |
| Math | Display Formulas | | equation, align, gather |
| | Inline Formulas | ✅ (LuaLaTeX) | $...$ captured by lpsb-luamath |
| | MathML | ✅ (LuaLaTeX) | Requires luamml runtime. Provided by lpsb-texlive:latest, and historic images lpsb-texlive:TL2022-historic / lpsb-texlive:TL2023-historic. |
| Refs | Citations | | \cite links to bibliography |
| | References | | \ref, \label linkages |
| Accessibility | Captions | | Figures and Tables |
| | Alt Text | | Not implemented |

Known Limitations

  1. PDFLaTeX output is the gold PDF: the pipeline treats the PDFLaTeX output as the visual reference, and the LuaLaTeX pass is used for enrichment only.
  2. Inline math in PDFLaTeX: $...$ is not hooked in PDFLaTeX. Use the LuaLaTeX math pass (lpsb-luamath) for inline math + MathML.
  3. MathML depends on luamml: if your container image does not ship luamml, mathml will be empty. Use lpsb-texlive:latest or the provided historic images with vendored luamml.
  4. Table extraction is still messy:
    • Hooking tabular internals in TeX is fragile and can create visual artifacts.
    • longtable is especially fragile; hooks are disabled to preserve rendering.
    • Recommended: PDF-side cell extraction (extract_cells.py) for reliable TD/row/col/span.
  5. Weird macros win: heavily customized macros/classes/packages can bypass hooks or reorder output in ways that break event nesting.

Repo Layout / Tools

  • Core
    • lpsb.sty: structure/coordinate event emitter
    • solver.py: fault-tolerant loader + tree builder
  • Math (LuaLaTeX)
    • lpsb-luamath.sty, lpsb-math.lua: capture math + emit MathML (via luamml)
  • Merge
    • script/merge_lpsb.py: merge structure + math/table by IDs
    • hybrid_merge.sh: batch merge test_output/ + test_output_lua/
  • Scripts
    • script/test_batch.sh: PDFLaTeX batch pass
    • script/test_lua_batch.sh: LuaLaTeX batch pass (uses lpsb-texlive:latest if present)
    • script/test_lua_math.sh: small smoke test
  • Docker
    • docker/Dockerfile.latest: recommended image (lpsb-texlive:latest) with luamml

Future Work

  • PDF content stream parsing for semantic marker extraction (BDC/EMC)
  • Recover full affine transforms (better geometry for nested boxes)
  • Integrate with PDF toolkits (pdfplumber / PyMuPDF) for richer alignment/debugging
  • Add a small GUI/visualizer to inspect source↔PDF alignment

License

MIT License
