LPSB extracts semantic structure from LaTeX compilations and aligns it with PDF page coordinates. It produces a single event log (`*.lpsb.json`) that can be rebuilt into a tree (`*.structure.json`) and can be enriched with MathML from a LuaLaTeX pass.
This repository is LaTeX Rainbow 2.0: a complete rewrite / restructuring of the original codebase. It is under active development and not stable.
If you need a version that works today, use the v1 branch (the previous implementation).
- PDF structure events: emits start/end events for common PDF/UA-ish roles.
  - Structure: `Document`, `Sect`, `Div`
  - Headings: `H1`-`H6`
  - Lists: `L`, `LI`, `Lbl`
  - Tables: `Table`, `TR` (rows)
  - Figures: `Figure`, `Caption`
  - Inline: `Strong`, `Em`, `Link`, `Reference`
- Math capture
  - PDFLaTeX: captures display math environments as `Formula` (e.g. `equation`, `align`).
  - LuaLaTeX extension (`lpsb-luamath`): captures all math including `$...$` and can emit MathML via `luamml`.
- Robust logs: `solver.py` includes fault-tolerant parsing for “dirty” JSON emitted by TeX.
- Batchable via Docker: scripts run against arXiv source tarballs, producing reproducible output directories.
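As a rough illustration of what fault-tolerant parsing of a "dirty" log can look like, the sketch below decodes a log line by line and sets aside records that fail to parse instead of aborting. It is not the logic in `solver.py`: the one-JSON-object-per-line layout, the function name `load_events_tolerant`, and the `type`/`role` field names used in the sample are all assumptions made for the example.

```python
import json


def load_events_tolerant(text):
    """Parse one JSON object per line, skipping lines that do not decode.

    Assumption: the log is line-delimited JSON, optionally wrapped in
    array brackets. The real *.lpsb.json layout may differ.
    Returns (events, rejected), where rejected holds (line_number, raw_line).
    """
    events, rejected = [], []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip().rstrip(",")  # tolerate trailing commas between records
        if not line or line in ("[", "]"):
            continue  # blank lines and array brackets carry no event
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            rejected.append((lineno, raw))  # keep the evidence for debugging
            continue
        if isinstance(obj, dict):
            events.append(obj)
        else:
            rejected.append((lineno, raw))  # valid JSON, but not an event object
    return events, rejected


# Hypothetical sample: the third line was truncated by TeX mid-record.
SAMPLE_LOG = (
    '[\n'
    '{"type": "start", "role": "Sect"},\n'
    '{"type": "start", "role": \n'
    '{"type": "end", "role": "Sect"}\n'
    ']'
)
```

Calling `load_events_tolerant(SAMPLE_LOG)` keeps the two intact records and reports the truncated one with its line number, so a single damaged record does not cost the rest of the log.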
The scripts prefer a local image tag `lpsb-texlive:latest` if available.

    docker build -f docker/Dockerfile.latest -t lpsb-texlive:latest docker

Some arXiv sources pin TeX Live (via `00README.json` / `texlive_version`). In particular, biblatex `.bbl` files can be version-strict.
If you want reproducible builds for TL2023 papers, build this tag too:

    docker build -f docker/Dockerfile.tl2023 -t lpsb-texlive:TL2023-historic docker

Note: `docker/Dockerfile.tl2023` also vendors the `luamml` runtime into the historic image (under `TEXMFLOCAL`) so the LuaLaTeX pass can emit MathML even on frozen TeX Live releases.
Older arXiv sources (pre-00README.json) may ship a pre-generated biblatex .bbl that is format-version strict.
For those, compiling with a newer TeX Live can fail (or silently diverge).
This repo provides a TL2022 historic image that also vendors the `luamml` runtime (so the LuaLaTeX pass can still emit MathML):

    docker build -f docker/Dockerfile.tl2022 -t lpsb-texlive:TL2022-historic docker

Copy `lpsb.sty` into the same directory as your `main.tex` and add:

    \usepackage{lpsb}

Then compile:

    docker run --rm -v "$(pwd)":/workdir -w /workdir lpsb-texlive:latest \
      pdflatex -interaction=nonstopmode main.tex

Output: `main.lpsb.json` (Structure only)
To extract MathML and inline formulas ($...$), run a second pass with LuaLaTeX:
- Pass 1 (Structure): Run PDFLaTeX as above.
- Pass 2 (Math): Add `\usepackage{lpsb-luamath}` to your tex file (or inject it), then run:

      docker run --rm -v "$(pwd)":/workdir -w /workdir lpsb-texlive:latest \
        lualatex -interaction=nonstopmode main.tex

Output: `main.lpsb-math.json` (Math events with MathML)

Note: The official batch scripts (`script/test_lua_batch.sh`) handle package injection automatically.
Use the Python solver to reconstruct the DOM tree:

    python3 solver.py main.lpsb.json --output main.structure.json --validate

By default, LPSB does not hook `tabular` internals (to avoid TeX alignment/rule artifacts). If you need TR/TD, extract them from the compiled PDF using pdfplumber:

    python3 solver.py main.lpsb.json --pdf main.pdf --extract-cells --output main.structure.json

Details: see `docs/table_cells.md`.

If you want TR/TD from a LuaLaTeX pass (similar to the MathML pipeline), see `docs/lua_table_pass.md`.
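To give a feel for PDF-side cell extraction, here is a minimal sketch built on pdfplumber's table finder. It is not the repository's extractor: the function names are hypothetical, the row grouping is a simple top-edge heuristic, and spanning cells are not handled. pdfplumber reports each cell as `(x0, top, x1, bottom)` with `top` measured from the top of the page.

```python
def group_cells_into_rows(cells, tol=2.0):
    """Group cell boxes (x0, top, x1, bottom) into rows by their top edge.

    A cell joins the current row when its top lies within `tol` points of
    the top of that row's first cell. Each row is returned left to right.
    """
    rows = []
    for box in sorted(cells, key=lambda b: (b[1], b[0])):
        if rows and abs(rows[-1][0][1] - box[1]) <= tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b[0]) for row in rows]


def extract_table_rows(pdf_path, page_index=0):
    """Return, for each table pdfplumber finds on one page, its cells grouped into rows."""
    import pdfplumber  # third-party; imported lazily so the helper above stays dependency-free

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return [group_cells_into_rows(table.cells) for table in page.find_tables()]
```

For a compiled paper, `extract_table_rows("main.pdf")` would yield one list of rows per detected table on the first page. Tables without ruled lines often need custom `table_settings` passed to `find_tables`, which this sketch leaves at the defaults.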
Put `*.tar.gz` into `data/download/`, then run:

    # Recommended: one-shot gold pipeline
    # - Stage A (gold): pdflatex -> PDF + *.lpsb.json
    # - Stage B (enrich): lualatex -> *.lpsb-math.json (MathML) + *.lpsb-table.json
    # - Merge: *.lpsb.merged.json
    bash script/batch_compile_all.sh

Outputs:

- `compile_results/<paper>/`: per-paper outputs (gold + enrich + logs)
  - `_pdflatex/`: gold PDF + `*.lpsb.json`
  - `_lualatex/`: `*.lpsb-math.json` (+ MathML when available) and `*.lpsb-table.json`
  - `*.lpsb.merged.json`: merged output (structure enriched with `mathml` and table events)
TeX Live selection:
- New arXiv sources: read `00README.json` / `texlive_version`.
- Old arXiv sources (no `00README.json`): if a biblatex `.bbl` exists, infer TeX Live from the header (`% $ biblatex bbl format version X.Y $`) and pick a matching historic image (heuristic).
- The script prints: `TeX Live selected: <year> (image: <docker-tag>)`.
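The selection order above can be sketched in a few lines of Python. This is an illustration, not the batch script's code: it assumes `texlive_version` is a top-level key of `00README.json`, and it takes the mapping from bbl format version to TeX Live year as a caller-supplied argument (`bbl_to_year`, a hypothetical name) rather than asserting any particular mapping.

```python
import json
import re
from pathlib import Path

# Matches the header line quoted above: % $ biblatex bbl format version X.Y $
BBL_HEADER = re.compile(r"%\s*\$\s*biblatex bbl format version\s+(\d+\.\d+)\s*\$")


def bbl_format_version(bbl_text):
    """Return the biblatex .bbl format version as a string, or None if absent."""
    match = BBL_HEADER.search(bbl_text)
    return match.group(1) if match else None


def select_texlive(source_dir, bbl_to_year, default="latest"):
    """Pick a TeX Live label for an unpacked arXiv source directory.

    `bbl_to_year` maps a bbl format version (e.g. "X.Y") to a TeX Live year.
    It must be supplied by the caller; no real mapping is implied here.
    """
    root = Path(source_dir)
    readme = root / "00README.json"
    if readme.is_file():
        data = json.loads(readme.read_text(encoding="utf-8"))
        # Assumption: texlive_version is a top-level key.
        pinned = data.get("texlive_version") if isinstance(data, dict) else None
        if pinned:
            return str(pinned)
    for bbl in sorted(root.glob("*.bbl")):
        version = bbl_format_version(bbl.read_text(encoding="utf-8", errors="replace"))
        if version in bbl_to_year:
            return str(bbl_to_year[version])
    return default
```

Keeping the version-to-year table outside the function makes the heuristic easy to audit and adjust when a new historic image is added.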
LPSB uses a 3-stage pipeline:
- Instrumentation (LaTeX): `lpsb.sty` hooks into LaTeX commands/environments and emits events + coordinates.
- Compilation (Docker): scripts compile papers in a container.
- Reconstruction (Python): `solver.py` parses the log, validates nesting, and builds a tree.
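The reconstruction stage is, at its core, a stack walk over start/end events. The sketch below shows that idea under stated assumptions: each event is a dict with `type` (`"start"` or `"end"`) and `role`, which are placeholder field names rather than the actual LPSB schema, and `build_tree` is a hypothetical name, not a function from `solver.py`.

```python
def build_tree(events):
    """Fold a flat start/end event list into a nested tree.

    Returns (root, problems). `root` is a synthetic container node and
    `problems` lists every nesting violation found along the way.
    """
    root = {"role": "Root", "children": []}
    stack = [root]
    problems = []
    for index, event in enumerate(events):
        kind, role = event.get("type"), event.get("role")
        if kind == "start":
            node = {"role": role, "children": []}
            stack[-1]["children"].append(node)  # attach to the currently open node
            stack.append(node)
        elif kind == "end":
            if len(stack) > 1 and stack[-1]["role"] == role:
                stack.pop()  # properly nested close
            else:
                problems.append(f"event {index}: unexpected end of {role}")
        else:
            problems.append(f"event {index}: unknown type {kind!r}")
    for node in stack[1:]:
        problems.append(f"unclosed {node['role']}")  # opened but never closed
    return root, problems
```

A mismatched end event is recorded and skipped rather than raised, so one misbehaving macro yields a report and a partial tree instead of a failed run.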
Optional:
- Math enrichment (LuaLaTeX): `lpsb-luamath.sty` + `lpsb-math.lua` capture math and emit MathML using `luamml`. `merge_lpsb.py` merges by context-aware IDs.
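Merging by ID amounts to a keyed join between the two logs. The sketch below assumes both logs carry a shared identifier under an `id` key and that math events store their markup under `mathml`; how the real context-aware IDs are constructed is not reproduced here, and `merge_math` is a hypothetical name rather than the interface of `merge_lpsb.py`.

```python
def merge_math(structure_events, math_events, key="id"):
    """Attach MathML from math events to structure events sharing the same ID.

    Structure events without a matching math event pass through unchanged.
    Math events with empty or missing MathML are ignored.
    """
    mathml_by_id = {
        ev[key]: ev["mathml"]
        for ev in math_events
        if key in ev and ev.get("mathml")
    }
    merged = []
    for ev in structure_events:
        out = dict(ev)  # shallow copy so the input lists stay untouched
        if out.get(key) in mathml_by_id:
            out["mathml"] = mathml_by_id[out[key]]
        merged.append(out)
    return merged
```

Because empty MathML is dropped before the join, an image without `luamml` produces a merged log that simply lacks `mathml` fields instead of carrying empty strings.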
┌─────────────────────────────────────────────────────────────────────┐
│ LaTeX source (.tex) │
└─────────────────────────────────────────────────────────────────────┘
│
▼ (package injected / \usepackage{lpsb})
┌─────────────────────────────────────────────────────────────────────┐
│ Structure pass (PDFLaTeX) │
│ lpsb.sty │
│ - hooks environments/commands │
│ - emits start/end events + page/coords │
└─────────────────────────────────────────────────────────────────────┘
│ │
▼ ▼
┌──────────────────────┐ ┌──────────────────────────┐
│ PDF output (*.pdf) │ │ event log (*.lpsb.json) │
└──────────────────────┘ └──────────────────────────┘
│
▼
┌──────────────────────────┐
│ solver.py │
│ - parse/validate events │
│ - build structure tree │
└──────────────────────────┘
│
▼
┌──────────────────────────┐
│ *.structure.json (tree) │
└──────────────────────────┘
Optional MathML track (LuaLaTeX):
┌─────────────────────────────────────────────────────────────────────┐
│ Math pass (LuaLaTeX) │
│ lpsb-luamath.sty + lpsb-math.lua │
│ - captures inline + display math (incl. $...$) │
│ - emits Math events + MathML via luamml │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ *.lpsb-math.json (Math/MathML)│
└──────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ merge_lpsb.py / hybrid_merge │
│ -> merged JSON (mathml fields)│
└──────────────────────────────┘
| Category | Feature | Status | Notes |
|---|---|---|---|
| Structure | Sections (Sect) | ✅ | Maps `\section`, `\chapter` etc. |
| | Headings (H1-H6) | ✅ | Captures titles and hierarchy level |
| Blocks | Lists (L, LI) | ✅ | `itemize`, `enumerate`, `description` |
| | List Labels (Lbl) | ✅ | Captures `1.`, `a)`, `•` etc. |
| | Paragraphs (P) | ✅ | Uses kernel paragraph hooks (`para/begin`, `para/end`) |
| Tables | Table Container | ✅ (floats) / disabled (longtable) | Float `table`/`table*` are instrumented. `longtable` hooks are disabled due to visual artifacts. |
| | Rows (TR) | Best-effort | `longtable` row semantics are not reliable via TeX hooks. |
| | Cells (TD) | Best-effort (Lua) | Recommended approach is PDF-side extraction (`extract_cells.py`). |
| | Colspan/Rowspan | ✅ (PDF-side) / inferred (Lua) | PDF-side extractor supports spans; Lua pass may infer colspan but is not authoritative. |
| | Cell Bbox | ✅ (Lua) / ✅ (PDF-side) | Lua bbox is approximate; PDF-side tends to be more stable across engines. |
| Math | Display Formulas | ✅ | `equation`, `align`, `gather` |
| | Inline Formulas | ✅ (LuaLaTeX) | `$...$` captured by `lpsb-luamath` |
| | MathML | ✅ (LuaLaTeX) | Requires `luamml` runtime. Provided by `lpsb-texlive:latest`, and historic images `lpsb-texlive:TL2022-historic` / `lpsb-texlive:TL2023-historic`. |
| Refs | Citations | ✅ | `\cite` links to bibliography |
| | References | ✅ | `\ref`, `\label` linkages |
| Accessibility | Captions | ✅ | Figures and Tables |
| | Alt Text | ❌ | Not implemented |
- PDF is gold = PDFLaTeX: the pipeline treats the PDFLaTeX output as the visual “gold” PDF. The LuaLaTeX pass is for enrichment only.
- Inline math in PDFLaTeX: `$...$` is not hooked in PDFLaTeX. Use the LuaLaTeX math pass (`lpsb-luamath`) for inline math + MathML.
- MathML depends on `luamml`: if your container image does not ship `luamml`, `mathml` will be empty. Use `lpsb-texlive:latest` or the provided historic images with vendored `luamml`.
- Table extraction is still messy:
  - Hooking `tabular` internals in TeX is fragile and can create visual artifacts.
  - `longtable` is especially fragile; hooks are disabled to preserve rendering.
  - Recommended: PDF-side cell extraction (`extract_cells.py`) for reliable TD/row/col/span.
- Weird macros win: heavily customized macros/classes/packages can bypass hooks or reorder output in ways that break event nesting.
- Core
  - `lpsb.sty`: structure/coordinate event emitter
  - `solver.py`: fault-tolerant loader + tree builder
- Math (LuaLaTeX)
  - `lpsb-luamath.sty`, `lpsb-math.lua`: capture math + emit MathML (via `luamml`)
- Merge
  - `script/merge_lpsb.py`: merge structure + math/table by IDs
  - `hybrid_merge.sh`: batch merge `test_output/` + `test_output_lua/`
- Scripts
  - `script/test_batch.sh`: PDFLaTeX batch pass
  - `script/test_lua_batch.sh`: LuaLaTeX batch pass (uses `lpsb-texlive:latest` if present)
  - `script/test_lua_math.sh`: small smoke test
- Docker
  - `docker/Dockerfile.latest`: recommended image (`lpsb-texlive:latest`) with `luamml`
- PDF content stream parsing for semantic marker extraction (BDC/EMC)
- Recover full affine transforms (better geometry for nested boxes)
- Integrate with PDF toolkits (pdfplumber / PyMuPDF) for richer alignment/debugging
- Add a small GUI/visualizer to inspect source↔PDF alignment
MIT License