HiPS: Hierarchical PDF Segmentation of Doctrinal Legal Books

arXiv CS, Monday 08 June 2026, 04:00 UTC. By Sabine Wehnert, Harikrishnan Changaramkulath, Ivan Habernal

arXiv:2509.00909v2

Abstract: PDF parsers have recently improved on page-level layout understanding. However, recovering a document-global section hierarchy with reliable boundaries remains brittle for deeply structured books: many systems expose only page-local heading roles, assume shallow depth, or rely on high-quality PDF tags or Table of Contents (TOC) metadata, and public gold-standard data for deep book hierarchies is scarce. We present HiPS for hierarchical PDF segmentation of doctrinal legal books and make two main contributions. First, we release a gold-standard benchmark of 49 open-access law books with 9,812 manually curated headings, hierarchy levels, and page anchors, enabling evaluation of title detection, hierarchy reconstruction, and section boundary assignment. Second, we introduce complementary segmentation pipelines: a TOC-based parser for books with reliable outline metadata and a TOC-free LLM-refined pipeline that combines OCR whitespace cues, XML typography, and local context. Across a broad comparison against open-source parsers and multimodal/LLM baselines, the TOC-based pipeline is strongest when metadata is complete, while the LLM-refined pipeline improves heading precision, deep-level recovery, and boundary quality when metadata is missing or noisy.
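To make "hierarchy reconstruction" and "section boundary assignment" concrete, here is a minimal sketch of how a flat list of TOC entries could be turned into a nested section tree with page ranges. It is not the HiPS implementation. The `Section` class, the `build_hierarchy` function, and the `(level, title, page)` input format are all hypothetical and chosen for illustration. It also assumes boundaries at page granularity, with a section ending on the page before the next heading of the same or a shallower level.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Section:
    """One node of the section tree (hypothetical structure, not from the paper)."""
    level: int
    title: str
    start_page: int
    end_page: Optional[int] = None
    children: List["Section"] = field(default_factory=list)


def build_hierarchy(entries: List[Tuple[int, str, int]], last_page: int) -> Section:
    """Nest flat (level, title, page) entries and assign page boundaries.

    Assumes entries are in reading order and that levels start at 1.
    """
    root = Section(level=0, title="ROOT", start_page=1, end_page=last_page)
    stack = [root]          # path from the root to the most recent heading
    flat: List[Section] = []

    for level, title, page in entries:
        node = Section(level, title, page)
        # Pop until the top of the stack is a strictly shallower heading,
        # which becomes the parent of the new node.
        while stack[-1].level >= level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
        flat.append(node)

    # A section runs until the next heading at the same or a shallower level.
    for i, node in enumerate(flat):
        node.end_page = last_page
        for later in flat[i + 1:]:
            if later.level <= node.level:
                # Page-level approximation: never end before the start page.
                node.end_page = max(node.start_page, later.start_page - 1)
                break
    return root
```

Calling `build_hierarchy` with entries such as `(1, "Part I", 1)`, `(2, "Chapter 1", 2)`, `(3, "A.", 2)` returns a root node whose `children` mirror the book's outline. A TOC-free pipeline would have to detect the headings and their levels first, which this sketch does not attempt.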

Originally published by arXiv CS
