Format commit messages with commitmsgfmt

Aug 19, 2018

I made a tool to format commit messages better than fmt(1), Vim, and par(1). It’s called commitmsgfmt and in this post I will present it and analyse its performance.

Introducing commitmsgfmt
Performance analysis

Introducing commitmsgfmt

commitmsgfmt is a filter for formatting text, like fmt(1), but specialized for commit messages: its primary purpose is to wrap text and its secondary purpose is to know when not to do so. It is neither a linter¹ for commit messages nor a general-purpose text formatter.

commitmsgfmt was written first and foremost for use with Vim’s 'formatprg' option, which names an executable that takes input to format on standard input and produces the formatted result on standard output. commitmsgfmt’s default behaviour reflects the default Vim runtime settings for Git commit messages²; most importantly this implies a text width of 72 columns.

If you don’t use Vim or if you don’t want text wrapped at 72 characters you can probably still use commitmsgfmt; see Installation and usage for your options.

When to use it

commitmsgfmt was written first and foremost for use with Vim’s 'formatprg' option but works in other environments, too.

If you have ever had any trouble formatting commit messages and your preferred text editor can delegate text formatting to a third-party, perhaps commitmsgfmt can alleviate that pain.
If you have had trouble formatting commit messages but your text editor cannot delegate text formatting you can still use the alternative Git hook integration.
If you have never had trouble formatting commit messages it is probably because you either write very simple messages or never spend effort formatting them. commitmsgfmt can help in both cases but yields highest return on investment for structured messages with complex constructs. Unstructured messages, such as one very long line, will be wrapped but won’t magically read well, and structured messages without complex content won’t benefit much over using fmt.
If you think all of this is nonsense and that the solution to text overflowing terminals in the 21st century is to abandon terminals, commitmsgfmt is very much for you.³ Or rather, it is for everyone you collaborate with, even despite Git 2.1.0’s change to pager wrapping.

When not to use it

Despite all that, commitmsgfmt may not fit your needs.

If neither your VCS nor your text editor of choice provides any means of integration, commitmsgfmt cannot help you, period.⁴
commitmsgfmt does not care whether your commit is a merge or not and will gladly attempt to format either, but merge commit messages occasionally use non-standard constructs commitmsgfmt does not cope well with. If this does not apply to you, you don’t have a problem. If it does apply to you and you are not willing to leave commitmsgfmt disabled for merge commits the tool will cause friction. The Linux kernel and Git projects are affected by this limitation.
If you need a general-purpose text formatter stay with fmt or try to work out par. commitmsgfmt should mostly work fine but makes assumptions that don’t apply in other contexts.
commitmsgfmt uses the greedy minimum number of lines algorithm to wrap text, like fmt and Vim’s internal format function. The advantage of this is that it is extremely easy to write and the result is intuitive. The disadvantage is that it can produce text with a very ragged right-hand side. par uses a dynamic programming algorithm and is far better at giving the appearance of text justification, typically yielding a prettier result. It is not a design goal of commitmsgfmt to do one or the other but it was a design choice to mimic Vim and fmt for comparison purposes.
If you write in non-Latin script I have no idea how well commitmsgfmt will work for you. It is Unicode-aware but only to count “characters” intuitively and not break between code points.
commitmsgfmt is a third-party dependency, which introduces some practical problems even with ready-built binaries. It may just be easier to deal with not having commitmsgfmt than to deal with integrating it.
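
The greedy approach mentioned above can be sketched in a few lines of awk. This is an illustration of the general technique only, not commitmsgfmt’s actual implementation, which also has to deal with lists, literals, comments, and so on. The function name greedy_wrap is made up for this example, the whole input is treated as one paragraph, and whether length() counts bytes or characters depends on the awk implementation and locale.

```shell
# greedy_wrap WIDTH: read words from standard input and emit lines of at
# most WIDTH characters, putting as many words on each line as will fit.
# Hypothetical helper for illustration; not part of commitmsgfmt.
greedy_wrap() {
    awk -v width="$1" '
    {
        for (i = 1; i <= NF; i++) {
            if (line == "") {
                line = $i                  # first word on a new line
            } else if (length(line) + 1 + length($i) <= width) {
                line = line " " $i         # the word fits: append it
            } else {
                print line                 # the word does not fit: break
                line = $i
            }
        }
    }
    END { if (line != "") print line }'
}

printf 'Foo foo bar baz\n' | greedy_wrap 8
```

A word longer than the width simply ends up alone on an overlong line. The raggedness discussed above comes from the fact that each line is filled without regard for how the following lines will turn out.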

Motivation

My reasons for writing commitmsgfmt include boredom, a desire to play with Rust, and an unhealthy capacity for obsessing over the immaterial. But more than anything I was fed up with adjusting various formatting related options in Vim and never getting exactly the desired behaviour. I wanted to be able to reliably format an entire buffer and not have it blow up. I had these requirements:

Prose should wrap at the expected message width, ideally not leaving spurious trailing spaces (even though Git strips these before committing).
Lists should retain their structure after wrapping, with continuation lines indented to align with the list item’s first text character.
Both numbered and unnumbered lists should work.
Comments absolutely should not be uncommented, which would be destructive, and there is no reason to reformat them at all even if wrapping were safe.
Certain constructs should be exempt from formatting, such as literals and the IEEE-style references I use a lot.
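
To make the list requirement concrete, here is a rough sketch of wrapping a single list item with a hanging indent. It is a simplified illustration under my own assumptions, not how commitmsgfmt does it: the function name wrap_list_item is hypothetical, only "-", "*", and "1." or "1)" style markers are recognized, and the input must be exactly one list item.

```shell
# wrap_list_item WIDTH: wrap one list item read from standard input and
# indent continuation lines to align with the first text character.
# Hypothetical helper for illustration; not part of commitmsgfmt.
wrap_list_item() {
    awk -v width="$1" '
    NR == 1 {
        # Marker: optional leading spaces, then "-", "*", or digits
        # followed by "." or ")", then at least one space.
        if (match($0, /^ *([-*]|[0-9]+[.)]) +/)) {
            line = substr($0, 1, RLENGTH)           # keep the marker
            indent = sprintf("%" RLENGTH "s", "")   # equally wide padding
            $0 = substr($0, RLENGTH + 1)            # remaining item text
        }
    }
    {
        for (i = 1; i <= NF; i++) {
            if (!started) {
                line = line $i                      # first word after marker
                started = 1
            } else if (length(line) + 1 + length($i) <= width) {
                line = line " " $i
            } else {
                print line
                line = indent $i                    # aligned continuation
            }
        }
    }
    END { if (started) print line }'
}

printf '%s\n' '- foo bar baz qux' | wrap_list_item 9
```

With a width of 9 the item is split after "bar" and the continuation line "baz qux" is indented by two spaces, the width of the "- " marker.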

The most satisfactory Vim configuration I managed achieves most of the critical points above: it wraps prose and lists and does not uncomment comments but it is not capable of conditionally suppressing wrapping settings and does not fulfill the minor aesthetic requirements. This is good enough that gqip (format in paragraph) generally does the right thing but often requires a bit of manual correction. Here it is:

setlocal formatoptions+=n
setlocal formatoptions+=q
" Amend default number-list formatting to include dash- and bullet lists.
setlocal formatlistpat=^\\s*[-*0-9]\\+[\]:.)}\\t\ ]\\s*

My goal was to use only

setlocal formatprg=commitmsgfmt

and have gggqG (go to first line, format to last line) Just Work™.

I did achieve that goal but then learned that it wasn’t quite the right goal. Instead I ended up with

setlocal formatprg=commitmsgfmt
nnoremap <buffer> <silent> <LocalLeader>gq :let b:cursorpos=winsaveview()<CR>gggqG:call winrestview(b:cursorpos)<CR>
nmap <buffer> gqip <LocalLeader>gq

which looks possibly even more complex than the starting point. It actually isn’t, though, because it doesn’t require maintenance and gggqG really does Just Work™ now. Here’s what the three lines do:

As a minimally necessary step, set 'formatprg' to commitmsgfmt.
Map <LocalLeader>gq to store the current cursor position, format the entire buffer, and finally restore the original cursor position. This is a convenience hack to work around an in hindsight obvious limitation of gggqG: Vim dutifully moves the cursor to the end of the buffer as instructed, but in practice that is often not a terribly useful place for the cursor to end up. This mapping avoids a mental context switch and as an added bonus I am much less likely to get the number of gs wrong. An equivalent mapping for visual mode may prove necessary over time but I don’t use visual mode often.
The gqip command is burned into my muscle memory but formatting a single paragraph doesn’t play nicely with a context-sensitive filter: commitmsgfmt ends up interpreting the first line in the paragraph as the subject line and disconnects it from the remaining lines, breaking the paragraph into two. I could remove that rule from commitmsgfmt but I am not willing to because it serves a specific purpose. Instead I override the default behaviour with the mapping described above.

Installation and usage

Refer to the official project development repository at https://gitlab.com/mkjeldsen/commitmsgfmt for detailed, up-to-date installation and usage instructions.

In summary:

Vim: use as 'formatprg'.
Not Vim: either use or write something like Vim’s 'formatprg' or fall back to the provided Git hook.
Message body width limit is configurable.

Performance analysis

I’ve repeatedly tested commitmsgfmt on selections of “interesting” commit messages from the Git and Linux kernel projects, both of which have massive histories and require good commit hygiene. In the end I randomly selected 5,000 non-merge commits from Git v2.16.0 (13.3% of non-merges) and 33,568 non-merge commits from Linux v4.15 (5% of non-merges), ran commitmsgfmt on their messages, and counted the line differences between the original and reformatted messages to determine commitmsgfmt’s accuracy. The result and analysis follow here.

The analysis is not meant to show how commitmsgfmt compares to, say, Vim. As a filter, commitmsgfmt is only intended to be on par with Vim’s internal format function while much easier to use.

Conclusion

In a randomly selected sample of 13.3% of 37,362 total non-merge commits from Git, commitmsgfmt was able to exactly reproduce 69% of the original commit messages, and a further 12% within a margin of 1 differing line. In a randomly selected sample of 5% of 671,355 total non-merge commits from the Linux kernel, commitmsgfmt was able to exactly reproduce 69% of the original commit messages, and a further 6% within a margin of 1 differing line.

The majority of the errors in the remaining commits were due to scenarios I did not build commitmsgfmt to handle, prime among them literals indented less than 4 spaces. Some opportunities for improvement remain but the inherent lack of formal structure means I don’t expect to be able to raise the volume of successful replication to 90%.

Replication ability does not translate directly into performance during active use but seems like an acceptable measure.

Method

This repository includes the following subtree:

r/commitmsgfmt/
├── bench
├── git-result
├── linux-result
├── main.R
├── plot-subject-period.R
├── subject-period
└── subjects-with-periods

For the rest of this analysis, paths will be relative to the r/commitmsgfmt/ directory.

bench is a Bash script that can randomly select a number of commits from a local repository, execute commitmsgfmt on their messages, and compare the originals to the reformatted ones. For every unique line length w in a given message, bench executes commitmsgfmt --width=w once to find the minimal diffstat (“diffstat”) as detailed below. bench then prints to standard output the commit id, the number of lines in the message, the w that produced the diffstat, and said diffstat. It is necessary to try all widths because many messages are not wrapped to commitmsgfmt’s default of 72 characters. bench uses shuf(1) 8.28 from GNU coreutils, with /dev/urandom, to select commits; shuf uses an unspecified random number generator.

A new benchmark can be performed with something like
```
$ git clone https://git.kernel.org/pub/scm/git/git.git
$ ./bench --sample git 5000 > git-sample
$ ./bench git-sample > git-result
/tmp/tmp.A1eSTxxgxu
```
git-result and linux-result are the benchmarks on which this analysis is performed. To test commitmsgfmt on those same commits for reproducibility or regressions, these files can be turned back into commit samples for feeding into bench with something like
```
$ awk 'NR == 1 { print "git" }
       NR  > 1 { print $1 }' git-result > git-sample
```
main.R is an R script that generates most of the analysis output referenced here.
plot-subject-period.R plots a graph of commits whose subject lines end in a period, in proportion to all commits, per year.
subject-period is the data fed into plot-subject-period.R, generated by subjects-with-periods.
subjects-with-periods extracts the total number of commits per year and the number of commits whose subject lines end in a period.

The diffstat is the smallest total of added and removed lines in the diff between the original message and the reformatted messages. By corollary, the diffstat halved also gives an upper bound on the number of changed lines. For instance, given the two executions

Input	`--width=4`	`--width=8`
`Foo foo bar baz`	`Foo -foo bar baz +foo +bar +baz`	`Foo -foo bar baz +foo bar +baz`

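the diffstat would be 3: --width=4 removes 1 line and adds 3 for a total of 4, while --width=8 removes 1 line and adds 2 for a total of 3, and the smaller total wins. I borrow the term from diffstat(1).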
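
As a rough sketch of how such a number can be obtained, the following counts added and removed lines with diff(1) for the example above. The helper name count_changed is made up, bench may well compute the value differently, and the blank separator line between subject and body is my assumption; it does not affect the counts.

```shell
# count_changed ORIGINAL REFORMATTED: print the number of removed plus added
# lines between two files. Hypothetical helper for illustration.
count_changed() {
    # Normal-format diff output prefixes removed lines with "<" and added
    # lines with ">". diff exits 1 when the files differ; that is expected.
    diff "$1" "$2" | grep -c '^[<>]'
}

tmp=$(mktemp -d)
printf 'Foo\n\nfoo bar baz\n'   > "$tmp/original"
printf 'Foo\n\nfoo\nbar\nbaz\n' > "$tmp/width4"   # result at --width=4
printf 'Foo\n\nfoo bar\nbaz\n'  > "$tmp/width8"   # result at --width=8

w4=$(count_changed "$tmp/original" "$tmp/width4") # 1 removed + 3 added
w8=$(count_changed "$tmp/original" "$tmp/width8") # 1 removed + 2 added

# The diffstat reported for the message is the smaller of the two totals.
if [ "$w4" -lt "$w8" ]; then echo "$w4"; else echo "$w8"; fi
rm -r "$tmp"
```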

As proof that commitmsgfmt is non-trivial, pass a piece of prose to commitmsgfmt twice, with different values for --width. The prose must be at least two lines and at least one line must not be a construct exempt from wrapping, such as a literal. Observe that commitmsgfmt produces different results. For example, take git@b6947af2294e and execute

$ for w in 42 72; do
    git -C git show -s --format=format:%B b6947af2294e |
    commitmsgfmt --width $w; done

Analysis

I’m interested in a general idea of how commitmsgfmt performs, as well as insight into its limitations. I’ll start by getting a bird’s-eye view of data point distributions with the summary function. Here is a table showing that output, rounded to two decimal places, arranged by factor for easy side-by-side comparison:

`summary( )`	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.	NA’s
`git$lines`	0.00	6.00	9.00	11.55	14.00	191.00	1
`linux$lines`	1.00	7.00	9.00	12.04	13.00	346.00
`git$width`	4.00	57.00	66.00	60.99	70.00	214.00	1
`linux$width`	2.00	58.00	68.00	64.16	72.00	967.00
`git$diffstat`	0.00	0.00	0.00	1.71	2.00	225.00	1
`linux$diffstat`	0.00	0.00	0.00	2.71	2.00	466.00

Messages tend to be a little under a dozen lines long and wrapped to a little below 70 characters per line. Both Git and Linux have abnormally wide lines, probably caused by the inclusion of some kind of system output. commitmsgfmt’s default width of 72, inherited from the default Vim runtime, seems justified; the 3rd quartile values also happen to be the modes.

The diffstat rows look highly promising: more than 50% of messages were unchanged and 75% of messages deviate by at most 2 changed lines. Reflowing a paragraph would produce a diffstat of at least 3 so that implies an in-line change with no added or removed breaks. The maxima are pretty bad but are hopefully outliers caused by pathological cases.

As for the NA: one commit, git@296fdc53bdd7, has no message. I’ll remove that with the na.omit function before further processing.

As mentioned, the diffstat counts every changed line at least twice so I can’t determine the precise proportion of changed and unchanged lines, but I can determine rough upper bounds from the halved total diffstat:

	Total lines	Diffstat upper bound	Proportion
Git	57,742	4,270	7.39%
Linux	404,036	45,400	11.24%

Framing this as 1 out of 10 lines changing does not look very promising with the mean message being longer than 10 lines. Fortunately the diffstat quartiles further up show us that isn’t what’s really happening. Visualising the diffstat distribution as an empirical cumulative distribution function can aid comprehension but at this point doesn’t tell us much new—a large volume of good results and a few bad results:

ECDF of "best-run" diffstats

Alternatively, drawn as a frequency polygon with geom_freqpoly(stat = "count") you can basically flip the graph upside down.

The signal seems to be somewhere in the last 10%. I can zoom in on that area with the quantile function:

Diffstat per quantile	90	95	99	100
Git	5	8	19	225
Linux	7	13	37	466

Extreme outliers account for only 1% but relative to the median message length the 95-99 quantiles seem a bit challenged.

I’ll return to the low-performing messages in a moment but I’m curious about how a line gets to wrap at over 900 characters. I’ll go on a tangent by inspecting the maximum line and width messages:

> git[c(which.max(git$lines), which.max(git$width)), ]
                                            id lines width diffstat
2134 6d9617f4f77510b4fa76fbabae2a5f4a9604577f   191    60        0
2454 7d90095abe322f72820f334839afb75c23e009ff     9   214        2
> linux[c(which.max(linux$lines), which.max(linux$width)), ]
                                            id lines width diffstat
16704 7fdd69c5af2160236e97668bc1fb7d70855c66ae   346    56      466
21125 a207f5937630dd35bd2550620bef416937a1365e    66   967       34

The longest message in Git, git@6d9617f4f775, is straightforward: neatly wrapped at 60 characters, containing 6 tables of measurements indented 4 spaces but no other special constructs. Replicated perfectly, but ultimately uninteresting. The longest message in Linux, linux@7fdd69c5af21, is dominated by an embedded patch. The patch is not indented, explaining the high (in fact maximum) diffstat, and had it been, the message would have replicated perfectly at a width of 56 characters.

The widest message in Git, git@7d90095abe32, is not wrapped at all. The body comprises a single bullet list, using *, and replicates perfectly at the specified width. It also wraps neatly with lower width specifications. The widest message in Linux, linux@a207f5937630, is owed to an unindented trace, but in this case the prose has unbalanced widths and the best commitmsgfmt could have done is a diffstat of 4 at width 76.

Returning to the diffstat analysis: I’m going to examine a small number of the worst offenders. The selection criteria are largely arbitrary: I just need enough commits to identify limitations but few enough that examining them is feasible. The 15 worst results, for instance⁵:

> head(git[with(git, order(diffstat, decreasing = TRUE)), ], 15)
                                           id lines width diffstat
3117 9e5972413b4873dc143c4046c6e74eb608ace32b   157    86      225
1070 36419c8ee41cecadf67dfeab2808ff2e5025ca52    94    70       84
2556 8262715b8eb157497ec1ee1cfcef778d526b2336    50    72       66
629  1f07c4e5cefec88d825045ade24eee71f6a2df47    78    70       62
3902 c755015f79ac03bb5afa0754c30e937887fc68ab    82    72       58
3396 adc446fe5d5f6dc1fb5edeaa9aa016ef94e70da1    54    66       52
3804 c251c83df276dc0bff4d008433268ad59b7a8df2    45    70       50
308  10450cf72b51baf3bac6a779fb4e47241af7ae5e    37    74       46
3136 9f50d32b9c20cc94b9882484ca9704af332a5622    87    68       46
4642 ed0b9d43097349f4d730472673c07f427480e14a    41    72       43
4479 e53e6b4433f264250c2e586167caf61721b0185c    50    73       42
1368 452320f1f53a579f891eba678993508e7cbf3339    85    60       41
1721 58babfffdeeecaa4d6edecaac1fb0c595218b801   102    70       40
3786 c189c4f2c42083b329605fb7b0583b29b73da086    40    68       40
4062 cf52b8f06389189bd32565c5c6adad75ac8a1a62    43    65       40
> head(linux[with(linux, order(diffstat, decreasing = TRUE)), ], 15)
                                            id lines width diffstat
16704 7fdd69c5af2160236e97668bc1fb7d70855c66ae   346    56      466
15235 74472233233f577eaa0ca6d6e17d9017b6e53150   308    57      450
16430 7d910c054be42515cd3e43f2e1bec8c536632de2   194    74      263
15678 77d2720059618b9b6e827a8b73831eb6c6fad63c   234    74      256
24745 bd4cf0ed331a275e9bf5a49e6d0fd55dffc551b8   205    69      226
5121  272725c7db4da1fd3229d944fc76d2e98e3a144e   169   114      202
8928  449809a66c1d0b1563dee84493e14bf3104d2d7e   146    76      194
26342 c92e8c02fe664155ac4234516e32544bec0f113d   140    89      187
18061 8a1435880f452430b41374d27ac4a33e7bd381ea   115   105      179
20853 9faaff593404a9c4e5abc6839a641635d7b9d0cd   167    78      179
4220  209045adc2bbdb2b315fa5539cec54d01cd3e7db   199   224      171
15479 7630b3e28dd827fffad13cc0aada14b00ec524d9   151   113      170
15741 7854ea6c28c6076050e24773eeb78e2925bd7411   182    72      170
24932 beb0f0a9fba1fa98b378329a9a5b0a73f25097ae   118    84      147
15495 765fb2f181cad669f2beb87842a05d8071f2be85   113    74      142

The poorest performing commit in Git, git@9e5972413b48, includes 9 tables of measurements, all at least 78 characters long and starting at column 0. Prose is actually wrapped at 70 characters and had the tables been indented the diffstat would have been 18 at width 70. The remaining errors are caused by setext-style headers whose tags were joined to their previous lines and could probably be fixed with some effort. I’ve already looked at the poorest performing commit in Linux; the runner-up, linux@74472233233f, also includes unindented output but in contrast the prose in that message would have replicated perfectly at width 54.
Tables often are not indented even when code samples are (git@36419c8ee41c). Tables come in too many variations for any kind of half-way sophisticated support to be possible. This is potentially problematic: if such listings are intentionally not indented because indentation is stylistically undesirable, commitmsgfmt’s need to process the entire message makes it useless compared to something like Vim’s gqip.
Some code samples are not indented (git@9f50d32b9c20, linux@449809a66c1d). This seems up to personal taste but there may be a case to be made for leading whitespace causing confusion, say, in code samples.
Some commits are simply inconsistently wrapped (git@1f07c4e5cefe, linux@bd4cf0ed331a), or possibly wrapped with an algorithm other than commitmsgfmt’s.
Subsequent paragraphs in list items with multiple paragraphs are not recognized as belonging to a list and have their indentation reset (linux@bd4cf0ed331a). Determining the impact of this limitation is… difficult. I would have to build a special tool that can preserve context to know whether a given indented block belongs to a list item or stands alone, at which point I would have essentially built support. The main obstacle is the lack of a clear indication of whether we’re formatting a list: an extra list item paragraph is indistinguishable from an improperly indented plain paragraph. A naive heuristic of list items consuming all paragraphs with leading spaces (resolving conflicts with literals) could be worth trying, though.
Literals and list items not preceded by a blank line get folded into the previous paragraph (git@c189c4f2c420, linux@7854ea6c28c6).
Tab-indented list items that span multiple lines have their continuation lines indented with spaces (derived from linux@7854ea6c28c6). Oops. This is a bug.
Usenet quotes get mangled (git@adc446fe5d5f). It would be trivial to not mangle them but proper support would have them reformatted as prose, preserving the original quote prefix; that takes more work. These get used but not often:
```
$ git -C git   log --no-merges --oneline --grep '^\s*>' | wc -l
108
$ git -C linux log --no-merges --oneline --grep '^\s*>' | wc -l
2130
```

The last thing I want to look at is what causes just a few errors. In particular, there are surprisingly many data points with diffstats 1 or 2:

`head(table( ))`	0	1	2	3	4	5
`git$diffstat`	3,475	12	594	80	316	44
`linux$diffstat`	23,338	489	1,765	781	2,119	483

The commits where the diffstat is 1 are interesting because commitmsgfmt either inserted or removed a single—probably blank—line; it didn’t modify an existing line. Here are the first 15 of those:

> head(git[git$diffstat == 1, ], 15)
                                          id lines width diffstat
807  28258afe912303f717fcdccaca124c1ae3e8e004     5    69        1
1269 4056c09114e66ce3c2368551f0122e83628750d6     4    55        1
1312 4280e5333354c6dddcd994bdacd3c6a11ac2da5e     4    19        1
1341 43bddeb43d86c2a8093aed0217137afd27eb821b     3    61        1
2280 74b46e32cb3907a4a062a0f11de5773054b7c71a     5    67        1
2680 886a39074be34d21afc6c1b8af1f7f4b3ef54dc5     4    38        1
2860 91af81a98ea5c5594c67a63abc933250e05c13c6     6    69        1
2896 93256315b2444601a35484f4fb76cd5723284201     5    66        1
3154 a028a98e9ae83231f0657fdb112f7d9c0cf0b98c     4    78        1
3956 ca016f0e4e7e508407fe17e121fcd520fbb7c865     5    67        1
4230 d7e3868cdfdc73c3de15296ecf32138a8308c07e     4    43        1
4359 df450923a2a08c50976f2d241a1c4992cf03b3a7     5    68        1
> head(linux[linux$diffstat == 1, ], 15)
                                          id lines width diffstat
4   000725d56a196e72dc22328324c5ec5506265736     6    72        1
119 00ca79a04bef1a1b30ef8afd992d905b6d986caf    11    67        1
127 00db8a8ecc91abeb46d1128a788a194018c51e77     7    45        1
204 017c59c042d01fc84cae7a8ea475861e702c77ab    13    72        1
256 01e1b69cfcdcfdd5b405165eaba29428f8b18a7c    10    73        1
328 0279b4cd86685b5eea467c1b74ce94f0add2c0a3     9    48        1
334 0287e43dda1a425da662f879dd27352021b0ca63    10    78        1
380 02d90fc343411d6dff26bbd64f0895a243e6f608     5    37        1
521 03e10e5ab5ba6511ddaf80085cf08c62e9336fa5     9    67        1
533 0400e3134b03336617138f9ebf2cd0f117ceef20     7    62        1
540 040fa1b9620cd019349414505828b2ffbeded7f8     8    39        1
543 041509db390cf97b09df0f51024f5d40407938db     9    44        1
616 04bedfa542d90ac7a1bbf28287e9861d0da21576    10    65        1
644 0501a3816e5b778830fc2157a6d6bb11a965fc2c    11    69        1
872 06a24dec10bc4014fc0974670627efed68f5da27    14    67        1

In Git all the single-line differences come from manually wrapped—or run-on—subject lines. None of them are longer than commitmsgfmt’s somewhat arbitrarily chosen subject line limit, but because the parser operates on lines the continuation lines are interpreted as body text and a “missing” blank line gets inserted before the first continuation line. Despite knowing that git log’s %s format placeholder joins consecutive subject lines to form a single subject I failed to anticipate this scenario.
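
For illustration, the joining behaviour of %s can be mimicked with a short awk filter. This is a sketch of the idea rather than Git’s actual code: the function name subject_like_git is made up, and a message with leading blank lines is not handled the way Git would handle it.

```shell
# subject_like_git: join all lines before the first blank line with single
# spaces, roughly how the %s placeholder reports a run-on subject line.
# Hypothetical helper for illustration.
subject_like_git() {
    awk '
    NF == 0 { exit }                          # first blank line ends the subject
    { printf "%s%s", sep, $0; sep = " " }     # join lines with one space
    END { print "" }                          # terminate the joined line'
}

printf 'A subject that was\nwrapped by hand\n\nBody text.\n' | subject_like_git
```

A line-oriented parser sees "wrapped by hand" as the start of the body, whereas this filter, like %s, reports a single subject "A subject that was wrapped by hand".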

With the command

$ git -C git rev-list --no-merges HEAD |
    while IFS= read c; do
    git -C git show --format=format:%B -s $c |
    awk -v c=$c 'NR == 2 { if ($0) print c }'; done

I find 101 messages whose second line—that is, the expected separator between the subject line and the message body—is non-empty. Most of those commits are from 2005, with none after 2007. This style evidently is a historical artifact not worth accommodating.

Relatedly, I can assess the sensibility of the choice of 90 columns as a hard upper limit on subject line length by inspecting actually long subject lines. The following command extracts the latest 50 commits with subject lines of at least 90 characters:

$ git -C git log --no-merges --format='%h %s' --abbrev=15 |
    grep --perl-regexp '^.{15} .{90}' |
    head -n 50 |
    while IFS=' ' read id s; do git -C git show -s $id; done

The output has a lot of overlap with that of the previous command; it shows that very long subject lines have only happened a few times a year since 2007, when the latest peak was. The diffstat-1 commits above also mostly date back to 2005-2007, none later than 2010.

In Linux the story is a little different. Among the 15 commits singled out above are a few wrapped or run-on subject lines, like in Git, but most of them are a curious case of a redundant blank line at the end of the message, which commitmsgfmt filters out. It’s not clear where this extra line comes from. Passing git show -s --format=format:%B for commits linux@000725d56a19 and linux@00ca79a04bef through xxd(1) clearly shows a second trailing 0A in the latter and git cat-file commit corroborates that. Yet with git show -s --format=raw, which is supposed to show the stored value, there is no difference, and patches generated with git format-patch are structurally identical. It could be a change in Git but grep -C3 trailing git/Documentation/RelNotes/*, though a very coarse-grained inspection, turns up nothing relevant.

Next up, where the diffstat is 2 commitmsgfmt made an in-place modification to an existing line that did not cause it to be wrapped differently:

> head(git[git$diffstat == 2, ], 15)
                                          id lines width diffstat
13  00889c8ca7017d7b9d07d47843e3a3e245a1de26     3    43        2
14  008c208c2c54f7bb97bfb7bc5dc606a4eb0d6837     7    57        2
39  01eea6f355f35098cc5038e94622e30ed31a9267     4    46        2
54  029443070a4e5b0290a2d09f3707bc486d84a961    13    71        2
59  02f3389d9647378ed864ff1cdfb6f0238b64ee91     8    66        2
67  034016391c475e98c38a9b715cd670b8b2d0c619     3    32        2
71  03a182107fdb36170a72b8a3d94de2b52e3f6668     3    46        2
78  03d311eda2d8c2d23855b9d3e904c7648925ab56     7    65        2
87  043d76050d3136b8684b5a3938e8bc0c1f8483fd     4    46        2
97  0499b24ad664d7f6a33d4cfe4a11912ae455b039     8    71        2
101 04d24455ceb0129195f60af6b62981542433ecf7     3    35        2
103 04f89259a67ba1ec291f023b70278d41ed665a13    10    72        2
104 05094f987c2f141ec9b873ad6d6303b0aca173a4     3    37        2
107 0532a5e46b88cd70c952a2bf5dc63681be32a2d9     7    64        2
110 056211053b7516a57ff7a6dd02f503ecef6fca70     7    63        2
> head(linux[linux$diffstat == 2, ], 15)
                                          id lines width diffstat
39  004b4f2d4448cff7f13871c05d744b00a7c74d4a     8    62        2
68  0072032d7babc4347556c1863919f3c532d9cf5b     6    71        2
89  0095d58b4a91b9fb57aeb781909355b232517c64     4    36        2
100 00a6cae288138ce0444ab6f48a81da12afe557aa     7    70        2
120 00cde6748255a84beecfdea4caeaf7c9cd05a527     9    57        2
144 01062ad31b531f2a3bcbfd9f6267e4e0c1103010     7    67        2
168 013db7575bdfb57d295e9a27186f52c7547ef2d2     4    52        2
213 01945fa248ff6d34f5fdb8106118910b77b76f91    18    72        2
221 01a6221a6a51ec47b9ae3ed42c396f98dd488c7e     8    71        2
236 01bbf2c7ddc93479eecebf8495848c0f362130c5     9    70        2
238 01c37be40fc35137a8b97e5176017958d57e401c     7    62        2
248 01d883d44a1ca8dc77486635d428cba63e7fdadf    16    70        2
266 01f37f2d53e14a05b7fc3601d182f31ac3b35847     4    55        2
344 0293a509405dccecc30783a5d729d615b68d6a77     6    37        2
345 02947ecb0de7a011215568263fd48f3d5b0f8573     5    27        2

These are overwhelmingly dominated by the stripping of a terminating period in subject lines, interspersed with a few one-line literals that were not recognized. Retaining the offending periods would be trivial but removing them was a deliberate stylistic choice: aesthetically I prefer not to end titles and subjects with periods and I treat commit subject lines the same way, and pragmatically I find that teaching authors to omit periods discourages excessively long subject lines. Fortunately, this practice has also fallen into disuse:

Subject lines ending in a period over time

Tangent: in extracting the data for the subject line period graph I made a curious discovery that can be illustrated with the command

$ git -C linux log --no-merges --date=iso --format='%ad %cd %h' |
    grep --extended-regexp '^(1970|20(0[1-4]|19|3.))'

Finally, where the diffstat is 3 commitmsgfmt probably added or removed a single line. What causes that?

> head(git[git$diffstat == 3, ], 15)
                                          id lines width diffstat
128 0673bb28d0b96393e526030c96e7dbfac2fc33a2     8    53        3
155 0795805053ad89a24954684fca2e69fea2bf50b9    11    67        3
208 0a61779994aa3de41d57bb85bd88a2f56c7ba7d8    24    64        3
224 0b9dca434f5d9208a26f47f7ec11453f1cfdfae8    11    58        3
229 0bdb28c9ccd85b1c606664154b6f6d39a4c315fd    10    65        3
238 0ca2972345d6de3b9eb845af4b0e9b701af120bd    18    68        3
318 10f5c526561604ba9677dc27643b5c9bfad36458    28    68        3
324 1126b419d6835f6b8c45ccfffc0ada9b09e32d87    21    64        3
333 1170e8026ae507c510619fceff969b1cf2900a28     9    72        3
354 124d80928d2a89e78212d03eb9fa3ba1aaa4d56b     6    53        3
417 1568fea01eb25a293f8dae570a31ca34d41b5442     7    61        3
473 17a8b25005bb09b03bee7ddac5412c7d29675eef    18    74        3
523 1a66a489d09e7b8629fa7e4184c78703f4eed335    16    68        3
623 1e41827d2d5cf0e4c6ebff91958fa47d69b7ff42    12    67        3
673 21246dbb9e0af278bc671e32e45a6cae4ad38b6f    18    70        3
> head(linux[linux$diffstat == 3, ], 15)
                                           id lines width diffstat
262  01ef66bbb65aa4db100b267778202d7657e244e4     9    66        3
274  0202775bc3a28f2436ea6ee13ef3eb0e8f237857    14    75        3
400  031005f78c8d0aebc17ddf7a34af9ffd48034d7d     7    65        3
406  0316fe8319ff62e527d0d91a3bc7df1c59eafae8    14    64        3
708  0571c7a4f58fc6070fb9d556e4896de864c2a593    20    26        3
788  06265442028254cce5a6f19dfa0bab11c88f06fe     8    44        3
895  06cd9a7dc8a58186060a91b6ddc031057435fd34    10    24        3
970  075db1502ffd4ff8c58020167484a6e123ae01a3    13   120        3
1170 0905bc94d5ad8a928eed26e0896857fb54dcb366     9    23        3
1227 0975b16274bad1f0bd5c5fd6ab759c5a9ee11949    12    55        3
1383 0abcc6df070687816b0ca0aefc3d64c62773063c     7    67        3
1429 0b0ef1d027008f019ced2d69e343bb1257326b12    10    79        3
1565 0c227c51b98c03c6e7fb4f342f930cf576292064    10    68        3
1646 0cb34dc2a37290f4069c5b01735c9725dc0a1b5c    12   128        3
1660 0ccff1a49def92d6b838a6da166c89004b3a4d0c     9    24        3

In Git it’s mainly orphan creation. This is a slightly unfortunate outcome. It’s impossible to say whether orphan elimination was deliberate in the offending (offended?) messages, such as git@0a61779994aa, but it is not a given that commitmsgfmt’s transformation is preferable and rather likely that it isn’t. The text limit could be relaxed for the last word of a paragraph but that is far from a reliable solution.
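To make the idea of relaxing the limit for the last word concrete, here is a minimal sketch of a greedy wrapper in awk. It is not how commitmsgfmt works; the function name `wrap_relaxed` and its `SLACK` parameter are hypothetical and exist only for illustration.

```shell
# wrap_relaxed WIDTH SLACK: greedy-wrap one paragraph read from stdin.
# SLACK is a hypothetical allowance granted only to the last word.
wrap_relaxed() {
    awk -v width="$1" -v slack="$2" '
    { for (i = 1; i <= NF; i++) words[++n] = $i }
    END {
        line = words[1]
        for (i = 2; i <= n; i++) {
            candidate = line " " words[i]
            # Only the final word may push the line past the normal width.
            limit = (i == n) ? width + slack : width
            if (length(candidate) <= limit) {
                line = candidate
            } else {
                print line
                line = words[i]
            }
        }
        if (n > 0) print line
    }'
}

printf 'aaa bbb cc\n' | wrap_relaxed 7 0   # strict: "cc" ends up alone
printf 'aaa bbb cc\n' | wrap_relaxed 7 3   # relaxed: a single line
```

Even in this toy case the relaxed line is 10 columns wide against a limit of 7, and any choice of slack is arbitrary, which hints at why the approach is not a reliable solution.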

In the special case of git@0795805053ad the wrapped line is a single unindented URL. This result is not desirable either: a URL is more difficult to work with when wrapped, for instance when clicking on terminal output. Personally I work around that problem, and a related problem of ragged right text caused by not wrapping URLs in prose, by moving URLs into references, but that sometimes produces awkward sentences instead. An exception to never wrap inside URLs could remove this limitation and possibly upgrade references to general footnotes, but the problem is not very pronounced. At the 3rd quartile of message widths, Git has 50 commits with URLs susceptible to this problem and Linux has 1079:

$ git -C git   log --oneline --grep '^\s\{0,3\}https\?://.\{64\}' | wc -l
50
$ git -C linux log --oneline --grep '^\s\{0,3\}https\?://.\{64\}' | wc -l
1079
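To see which lines that pattern counts, it can be run against a few made-up lines with plain grep instead of git log. This is only a sketch: the function name `count_long_urls` and the sample input are hypothetical, and `\s` and `\?` assume GNU grep.

```shell
# Pattern copied from the commands above; \s and \? are GNU grep extensions.
count_long_urls() {
    grep -c '^\s\{0,3\}https\?://.\{64\}'
}

# Hypothetical sample input: printf pads 0 to a run of 64 or 63 zeros.
long=$(printf '%064d' 0)
short=$(printf '%063d' 0)

# Only the first two lines match: the third is one character too short,
# the fourth is indented by four spaces, and the fifth URL follows prose.
printf '%s\n' \
    "https://$long" \
    "   http://$long" \
    "https://$short" \
    "    https://$long" \
    "See https://$long" |
    count_long_urls
```

The lines that match are those where a URL starts at most three whitespace characters into the line and continues for at least 64 characters after the scheme.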

The number of multi-word footnotes that would have been recognized by commitmsgfmt is less than half that, at 20 and 361 respectively:

$ git -C git   log --oneline --grep '^\[[[:digit:]]\+\]\(\s\+\S\+\)\{2\}' |
    wc -l
20
$ git -C linux log --oneline --grep '^\[[[:digit:]]\+\]\(\s\+\S\+\)\{2\}' |
    wc -l
361
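The footnote pattern can be checked the same way against made-up lines. Again a sketch only: `count_multiword_footnotes` and the sample input are hypothetical, and `\s` and `\S` assume GNU grep.

```shell
# Pattern copied from the commands above; \s and \S are GNU grep extensions.
count_multiword_footnotes() {
    grep -c '^\[[[:digit:]]\+\]\(\s\+\S\+\)\{2\}'
}

# Only the second line matches: the first has a single word after the
# marker, the third marker is not numeric, and the fourth is indented.
printf '%s\n' \
    "[1] https://example.com/" \
    "[2] See the documentation." \
    "[a] See the documentation." \
    " [3] See the documentation." |
    count_multiword_footnotes
```

A reference consisting of nothing but a URL is a single word and is therefore excluded, which is what separates this count from the URL count.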

In Linux there is more variance but the errors come down to undetected literals and orphans.

Threats to validity

Replication ability does not translate directly into performance during active use. This is trivially true; for instance, tee(1) has perfect replication ability because it doesn’t transform its input but consequently would also be useless for wrapping text. The analysis proves that commitmsgfmt transforms all input and that most transformations can be reduced to the identity transformation. However, there are probably false positives: anything falsely recognized as a literal will skip wrapping and will not be flagged as an error. As an indication of how bad this problem could be I can calculate the volume of lines that could possibly be recognized as literals:

$ git -C git   log --no-merges --format=%B | grep --count '^\( \{4\}\|\t\)'
30312
$ git -C git   log --no-merges --format=%B | wc -l
472642
$ git -C linux log --no-merges --format=%B | grep --count '^\( \{4\}\|\t\)'
345258
$ git -C linux log --no-merges --format=%B | wc -l
8761305

Less than 7% and 4% respectively, which is non-trivial but hardly breaking. Ultimately there is no way to determine performance during active use other than active use.
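The shares can be recomputed from the four counts above. The helper name `pct` is hypothetical; it only divides the number of candidate literal lines by the total number of lines.

```shell
# pct PART TOTAL: print PART as a percentage of TOTAL with two decimals.
pct() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * part / total }'
}

pct 30312 472642     # Git:   prints 6.41
pct 345258 8761305   # Linux: prints 3.94
```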

Sampling and representativeness of samples throughout can be challenged. The Git sample is proportionally larger than the Linux sample, which could cause bias, while the Linux sample is perhaps unfairly small and for no other reason than my patience. The probability distribution of samples is basically undocumented, although that could be trivially resolved merely by using another tool. Even if these samples are fine, Git and Linux certainly are not implicitly representative of software projects in general, nor are they blessed in any way but for being perhaps the two oldest Git repositories.

Ignoring merge commits can be seen as discarding unfavourable data, thereby misrepresenting commitmsgfmt’s accuracy. Choosing projects that do not format merge commits specially would just be another form of that, but that form would be prone to overestimating commitmsgfmt’s accuracy by virtue of not distinguishing between non-merge commits and merge commits.

¹ Unless you consider having your message mangled to be a passive-aggressive reprimand.
² Contributed by Tim Pope to match a famous commit message guideline he wrote over 10 years ago.
³ Unfortunately you will probably have maximum integration difficulties.
⁴ If there is an integration mechanism and it is not documented I will happily work with you to change that.
⁵ I use head instead of sample for reproducibility. This could be said to cause overfitting but the original selection of commits is already random.

Tags: git commitmsgfmt ergonomics
