Counting Distinct (Non-)Crossing Substrings in Optimal Time

arXiv CS Friday 05 June 2026, 04:00 UTC By Haruki Umezaki, Hiroki Shibata, Dominik Köppl, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai


arXiv:2506.22728v2 Abstract: Let $w$ be a string of length $n$. The problem of counting factors crossing a position, Problem 64 from the textbook "125 Problems in Text Algorithms" [Crochemore, Lecroq, and Rytter, 2021], asks for the number $\mathcal{C}(w,k)$ (resp. $\mathcal{N}(w,k)$) of distinct substrings of $w$ that have occurrences containing (resp. not containing) a position $k$ in $w$. The solutions provided in the textbook compute $\mathcal{C}(w,k)$ and $\mathcal{N}(w,k)$ in $O(n)$ time for a single position $k$ in $w$, so a direct application would require $O(n^2)$ time for all positions $k = 1, \ldots, n$ in $w$. These solutions are designed for constant-size alphabets. In this paper, we present new algorithms which compute $\mathcal{C}(w,k)$ in $O(n)$ total time for general ordered alphabets, and $\mathcal{N}(w,k)$ in $O(n)$ total time for linearly sortable alphabets, for all positions $k = 1, \ldots, n$ in $w$. We further derive model-dependent optimal bounds by separating the algorithms into preprocessing and linear-time postprocessing: for $\mathcal{C}$ the preprocessing is run reporting, and for $\mathcal{N}$ it is preprocessing based on longest previous non-overlapping factors (LPnF) and longest next factors (LNF). In particular, all values $\mathcal{C}(w,k)$ can be computed in $O(n\log n)$ time over general unordered alphabets, in which direct access to alphabet characters is restricted to equality tests, and in $O(n\log\sigma)$ time in the word RAM model, where $\sigma$ denotes the number of distinct characters occurring in $w$. For $\mathcal{N}(w,k)$, the equality-testing complexity over general unordered alphabets is $\Theta(n^2)$. We also show that our upper bounds are optimal for all of the aforementioned alphabet assumptions and computation models.
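To make the two quantities concrete, here is a minimal brute-force sketch in Python. It is not the paper's algorithm: it simply enumerates every occurrence of every substring, so it runs in polynomial time far above the $O(n)$ bounds stated in the abstract. The function name `crossing_counts` is hypothetical, and the sketch assumes that only non-empty substrings are counted and that positions are 1-indexed, as in $k = 1, \ldots, n$.

```python
def crossing_counts(w):
    """Brute-force reference for C(w, k) and N(w, k), for k = 1..n.

    Assumptions (not taken from the paper): only non-empty substrings
    are counted, and positions are 1-indexed.
    Returns two lists; entry k-1 holds the value for position k.
    """
    n = len(w)
    crossing = []      # crossing[k-1] = C(w, k)
    non_crossing = []  # non_crossing[k-1] = N(w, k)
    for k in range(1, n + 1):
        contains_k = set()  # substrings with an occurrence containing k
        avoids_k = set()    # substrings with an occurrence not containing k
        # Enumerate every occurrence w[i..j] (1-indexed, inclusive).
        for i in range(1, n + 1):
            for j in range(i, n + 1):
                s = w[i - 1:j]
                if i <= k <= j:
                    contains_k.add(s)
                else:
                    avoids_k.add(s)
        crossing.append(len(contains_k))
        non_crossing.append(len(avoids_k))
    return crossing, non_crossing


if __name__ == "__main__":
    # For w = "aba": position 2 is crossed by "ab", "aba", "b", "ba",
    # while only "a" has an occurrence avoiding it.
    print(crossing_counts("aba"))  # ([3, 4, 3], [3, 1, 3])
```

Note that a substring can contribute to both counts for the same $k$: in `"aba"` with $k = 1$, the substring `"a"` has one occurrence containing position 1 and another (at position 3) that does not. A reference like this is mainly useful for checking a faster implementation on short inputs.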


Originally published by arXiv CS.

