far: a (f)ast re-write of the Unix utility p(ar)

2020-12-16 · # cs

[Video] [Source code] [Python] [C++] [Emacs Lisp by esrh]

par is a formatting tool that inserts line breaks to make the length of each line less than a set number of characters, usually 79 (terminals historically have 80 width). Unfortunately, par is incredibly complicated and introduces random whitespace. So I made my own.

For far to make the paragraphs look good, it minimizes the variance of each line. However, it's constrained to use the fewest number of lines possible, so it doesn't generate really short lines. Finally, it ignores the last line when minimizing variance and tries to make the last line shorter than average, because a typical paragraph usually looks better if the last line is shorter. To summarize,

Minimize the variance of the lengths of each line...
...subject to the constraint that the number of lines is smallest
Ignore the last line, while making sure it's shorter than average

far uses dynamic programming to minimize variance. It tokenizes the paragraph by splitting on whitespace, and each subproblem in the dynamic program is a suffix of this token list. If $X$ is a vector that represents the length of each line in a given paragraph with $\providecommand{\cardhelper}{ \@ifstar{\cardstar}{\cardnostar} } \providecommand{\cardnostar}[1]{\lvert #1 \rvert} \providecommand{\cardstar}[1]{\left \lvert #1 \right \rvert} \cardhelper{X} = n$ lines, then the variance of line lengths is defined as

\begin{aligned} \providecommand{\Varhelper}{ \@ifstar{\Varstar}{\Varnostar} } \providecommand{\Varnostar}[1]{\mathbb{V}\text{ar}[#1]} \providecommand{\Varstar}[1]{\mathbb{V}\text{ar}\left[#1\right]} \Varhelper{X} &= \providecommand{\Ehelper}{ \@ifstar{\Estar}{\Enostar} } \providecommand{\Enostar}[1]{\mathbb{E}[#1]} \providecommand{\Estar}[1]{\mathbb{E}\left[#1\right]} \Ehelper{(X - \providecommand{\Ehelper}{ \@ifstar{\Estar}{\Enostar} } \providecommand{\Enostar}[1]{\mathbb{E}[#1]} \providecommand{\Estar}[1]{\mathbb{E}\left[#1\right]} \Ehelper{X})^2} = \frac{1}{n} \sum_{x \in X} \left ( x - \frac{1}{n} \sum_{x \in X} x \right )^2 \end{aligned}

The fewest number of lines constraint means $n$ is constant and so is the sum of line lengths $\sum_{x \in X} x$ because it is determined by two things: the characters in the tokens and the number of spaces introduced by merging two tokens (combining the words "hello" and "world" onto the same line gives "hello world", with an additional space). The characters stay the same, and the number of spaces is fixed if the number of lines is fixed. Each token starts off as its own line, and each merge reduces the number of lines by 1, so if two solutions have the same number of lines, they must have done the same number of merges.

Thus, minimizing $\providecommand{\Varhelper}{ \@ifstar{\Varstar}{\Varnostar} } \providecommand{\Varnostar}[1]{\mathbb{V}\text{ar}[#1]} \providecommand{\Varstar}[1]{\mathbb{V}\text{ar}\left[#1\right]} \Varhelper{X}$ is equivalent to minimizing the sum of squares $\sum_{x \in X} x^2$ if the number of lines is fixed. Recall that we are trying to minimize variance over the entire paragraph. The overall paragraph has some mean line length $\mu \coloneqq \providecommand{\Ehelper}{ \@ifstar{\Estar}{\Enostar} } \providecommand{\Enostar}[1]{\mathbb{E}[#1]} \providecommand{\Estar}[1]{\mathbb{E}\left[#1\right]} \Ehelper{X}$ . Each line will contribute $(x - \mu)^2$ to the overall paragraph's variance. So we want to minimize

\begin{aligned} (x_1 - \mu)^2 + (x_2 - \mu)^2 + \dotsb + (x_n - \mu)^2 \end{aligned}

where $x_i$ is the length of the $i$ -th line. Expanding,

\begin{aligned} [x_1^2 - 2 x_1 \mu + \mu^2] + \dotsb + [x_n^2 - 2 x_n \mu + \mu^2] \end{aligned}

We can ignore the $\mu^2$ terms since $\mu = \frac{1}{n} \sum_{x \in X} x$ is constant from the aforementioned logic (both $n$ and the total number of characters is constant) and reorganize into

\begin{aligned} [x_1^2 + \dotsb + x_n^2] - 2 \mu (x_1 + \dotsb + x_n) \end{aligned}

The last term is a constant, so minimizing the variance of the overall paragraph is equivalent to minimizing the variance for a suffix of the paragraph (both are minimizing the sum of squares). This is just the variance of the subproblem, so the dynamic programming is valid since optimal substructure holds. In practice, I skip calculating variance entirely and simply compute the sum of squares. I also do dynamic programming on the variance of each prefix instead of each suffix so it's easier to ignore the contribution of the last line.

Note that the reduction could have been quickly observed by using the variance identity

\begin{aligned} \providecommand{\Varhelper}{ \@ifstar{\Varstar}{\Varnostar} } \providecommand{\Varnostar}[1]{\mathbb{V}\text{ar}[#1]} \providecommand{\Varstar}[1]{\mathbb{V}\text{ar}\left[#1\right]} \Varhelper{X} &= \providecommand{\Ehelper}{ \@ifstar{\Estar}{\Enostar} } \providecommand{\Enostar}[1]{\mathbb{E}[#1]} \providecommand{\Estar}[1]{\mathbb{E}\left[#1\right]} \Ehelper{X^2} - \providecommand{\Ehelper}{ \@ifstar{\Estar}{\Enostar} } \providecommand{\Enostar}[1]{\mathbb{E}[#1]} \providecommand{\Estar}[1]{\mathbb{E}\left[#1\right]} \Ehelper{X}^2 \end{aligned}

along with the fact that $\providecommand{\Ehelper}{ \@ifstar{\Estar}{\Enostar} } \providecommand{\Enostar}[1]{\mathbb{E}[#1]} \providecommand{\Estar}[1]{\mathbb{E}\left[#1\right]} \Ehelper{X}$ is constant. Indeed, we essentially repeated this derivation.

That's it! The algorithm runs in $\mathcal{O}(NK)$ where $N$ is the number of characters in the input text and $K$ is the desired width. Since $K$ is usually fixed to some small constant (79, 72, etc.), this is essentially linear in $n$ and I suspect most of the running time is bottlenecked by just I/O (reading the input text and printing out the formatted text). Running with a width of 79 on a 1 MB file with over 20,000 lines takes under 200 milliseconds. For 100 MB, fmt takes around 11.9 seconds, par takes 15.7, and far takes 16.6. So far is slightly slower than the others, but certainly not enough to be noticeable for "reasonable" inputs, especially if output is redirected into a file rather than displayed to terminal.

Examples

original paragraph:

xxxxx xxx xxx xxxx xxxxxxxxx xx x xxxxxxxxx x xxxx xxxx xxxxxxx xxxxxxxx xxx
xxxxxxxxx xxxxxxxx xx xx xxxxx xxxxx xxxx xx x xxxx xx xxxxxxxx xxxxxxxx xxxx
xxx xxxx xxxx xxx xxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxx xxx xxxxx xx xxxx x xxxx
xxxxxxxx xxxx xxxx xx xxxxx xxxx xxxxx xxxx xxxxxxxxx xxx xxxxxxxxxxx xxxxxx
xxx xxxxxxxxx xxxx xxxx xx x xx xxxx xxx xxxx xx xxx xxx xxxxxxxxxxx xxxx xxxxx
x xxxxx xxxxxxx xxxxxxx xx xx xxxxxx xx xxxxx

fmt -w 72 (greedy algorithm):

xxxxx xxx xxx xxxx xxxxxxxxx xx x xxxxxxxxx x xxxx xxxx xxxxxxx xxxxxxxx
xxx xxxxxxxxx xxxxxxxx xx xx xxxxx xxxxx xxxx xx x xxxx xx xxxxxxxx
xxxxxxxx xxxx xxx xxxx xxxx xxx xxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxx xxx
xxxxx xx xxxx x xxxx xxxxxxxx xxxx xxxx xx xxxxx xxxx xxxxx xxxx
xxxxxxxxx xxx xxxxxxxxxxx xxxxxx xxx xxxxxxxxx xxxx xxxx xx x xx xxxx
xxx xxxx xx xxx xxx xxxxxxxxxxx xxxx xxxxx x xxxxx xxxxxxx xxxxxxx xx xx
xxxxxx xx xxxxx

Looking at the output of the greedy algorithm, because it always forms a line if it fits, it creates highly variable line lengths. For example, there are many "valleys" where a line is shorter than the lines adjacent to it, like lines 2 and lines 4, giving the paragraph a jagged appearance overall.

par 72 (with PARINIT set to rTbgqR B=.,?'_A_a_@ Q=_s>|):

xxxxx xxx xxx xxxx xxxxxxxxx xx x xxxxxxxxx x xxxx xxxx xxxxxxx xxxxxxxx
xxx xxxxxxxxx xxxxxxxx xx xx xxxxx xxxxx xxxx xx x xxxx xx xxxxxxxx
xxxxxxxx xxxx xxx xxxx xxxx xxx xxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxx
xxx xxxxx xx xxxx x xxxx xxxxxxxx xxxx xxxx xx xxxxx xxxx xxxxx xxxx
xxxxxxxxx xxx xxxxxxxxxxx xxxxxx xxx xxxxxxxxx xxxx xxxx xx x xx xxxx
xxx xxxx xx xxx xxx xxxxxxxxxxx xxxx xxxxx x xxxxx xxxxxxx xxxxxxx xx xx
xxxxxx xx xxxxx

par improves on fmt, but still creates a single large valley.

far 72:

xxxxx xxx xxx xxxx xxxxxxxxx xx x xxxxxxxxx x xxxx xxxx xxxxxxx
xxxxxxxx xxx xxxxxxxxx xxxxxxxx xx xx xxxxx xxxxx xxxx xx x xxxx
xx xxxxxxxx xxxxxxxx xxxx xxx xxxx xxxx xxx xxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxx xxx xxxxx xx xxxx x xxxx xxxxxxxx xxxx xxxx xx xxxxx
xxxx xxxxx xxxx xxxxxxxxx xxx xxxxxxxxxxx xxxxxx xxx xxxxxxxxx
xxxx xxxx xx x xx xxxx xxx xxxx xx xxx xxx xxxxxxxxxxx xxxx xxxxx
x xxxxx xxxxxxx xxxxxxx xx xx xxxxxx xx xxxxx

Finally, I would argue that far creates the most aesthetically pleasing paragraph because it minimizes the variance, creating the smoothest paragraph edge.

It's probably possible to modify PARINIT for par to work properly on this example, and in general par works quite well, but it's hard to work through the documentation to find precisely what to do and the recommended PARINIT in the man page should work well. In contrast far works well "out of the box" and for better or for worse, only has a single configuration option –- the desired line width.

Uses

This program is useful whenever writing plaintext in a monospace text editor, e.g. when editing LaTeX, markdown files, college essays, and emails. It's especially useful in vim, which lets you set the option 'formatprg' so the operator gq formats using the external program.

Appendix

The paragraph reformatting problem can be phrased in terms of three different $\ell^p$ -norms.

If $X$ is a vector that represents the length of each line for a given paragraph, we want the every line in the paragraph to be shorter than $K$ characters, so $\max(X) = \providecommand{\normhelper}{ \@ifstar{\normstar}{\normnostar} } \providecommand{\normnostar}[1]{\lVert #1 \rVert} \providecommand{\normstar}[1]{\left \lVert #1 \right \rVert} \normhelper{X}_\infty \leq K$ . We want to only reformat the paragraph in such a way that the number of lines is the shortest possible over all paragraphs $P$ that obey the length constraint, that is,

\begin{aligned} \providecommand{\normhelper}{ \@ifstar{\normstar}{\normnostar} } \providecommand{\normnostar}[1]{\lVert #1 \rVert} \providecommand{\normstar}[1]{\left \lVert #1 \right \rVert} \normhelper{X}_0 = \min_{P: \, \providecommand{\normhelper}{ \@ifstar{\normstar}{\normnostar} } \providecommand{\normnostar}[1]{\lVert #1 \rVert} \providecommand{\normstar}[1]{\left \lVert #1 \right \rVert} \normhelper{P}_\infty \leq K} \; \providecommand{\normhelper}{ \@ifstar{\normstar}{\normnostar} } \providecommand{\normnostar}[1]{\lVert #1 \rVert} \providecommand{\normstar}[1]{\left \lVert #1 \right \rVert} \normhelper{P}_0 \end{aligned}

where we abuse the $\ell^0$ -"norm". Finally, the goal of minimizing variance is equivalent to minimizing the $\ell^2$ norm of $X$ since the $\ell^2$ norm is the sum of squares. In summary, we have the optimization problem taken over $\mathcal{P} \coloneqq \{ \text{valid paragraphs $ X $ such that $ \providecommand{\normhelper}{ \@ifstar{\normstar}{\normnostar} } \providecommand{\normnostar}[1]{\lVert #1 \rVert} \providecommand{\normstar}[1]{\left \lVert #1 \right \rVert} \normhelper{X}_\infty \leq K $} \}$ of

\begin{aligned} \min_{\substack{X \in \mathcal{P} \\ \providecommand{\normhelper}{ \@ifstar{\normstar}{\normnostar} } \providecommand{\normnostar}[1]{\lVert #1 \rVert} \providecommand{\normstar}[1]{\left \lVert #1 \right \rVert} \normhelper{X}_0 = n}} \; \providecommand{\normhelper}{ \@ifstar{\normstar}{\normnostar} } \providecommand{\normnostar}[1]{\lVert #1 \rVert} \providecommand{\normstar}[1]{\left \lVert #1 \right \rVert} \normhelper{X}_2 \\ n = \min_{P \in \mathcal{P}} \; \providecommand{\normhelper}{ \@ifstar{\normstar}{\normnostar} } \providecommand{\normnostar}[1]{\lVert #1 \rVert} \providecommand{\normstar}[1]{\left \lVert #1 \right \rVert} \normhelper{P}_0 \end{aligned}

where we have written the problem entirely in terms of $\ell^p$ -norms. This perspective doesn't really help when coming up with an efficient algorithm, but it's kind of cute.

さみしいも、たのしい。Page source. Last updated: 2024-01-21.