Format commit messages with commitmsgfmt
I made a tool to format commit messages better than fmt(1), Vim, and par(1). It’s called commitmsgfmt and in this post I will present it and analyse its performance.
Introducing commitmsgfmt
commitmsgfmt is a filter for formatting text, like fmt(1), but specialized for commit messages: its primary purpose is to wrap text and its secondary purpose is to know when not to do so. It is neither a linter[1] for commit messages nor a general-purpose text formatter.
commitmsgfmt was written first and foremost for use with Vim’s 'formatprg' option, which names an executable that takes input to format on standard input and produces the formatted result on standard output. commitmsgfmt’s default behaviour reflects the default Vim runtime settings for Git commit messages[2]; most importantly this implies a text width of 72 columns.
If you don’t use Vim or if you don’t want text wrapped at 72 characters you can probably still use commitmsgfmt; see Installation and usage for your options.
When to use it
commitmsgfmt was written first and foremost for use with Vim’s 'formatprg' option but works in other environments, too.
- If you have ever had any trouble formatting commit messages and your preferred text editor can delegate text formatting to a third-party, perhaps commitmsgfmt can alleviate that pain.
- If you have had trouble formatting commit messages but your text editor cannot delegate text formatting you can still use the alternative Git hook integration.
- If you have never had trouble formatting commit messages it is probably because you either write very simple messages or never spend effort formatting them. commitmsgfmt can help in both cases but yields highest return on investment for structured messages with complex constructs. Unstructured messages, such as one very long line, will be wrapped but won’t magically read well, and structured messages without complex content won’t benefit much over using fmt.
- If you think all of this is nonsense and that the solution to text overflowing terminals in the 21st century is to abandon terminals, commitmsgfmt is very much for you.[3] Or rather, it is for everyone you collaborate with. Even despite Git 2.1.0’s change to pager wrapping.
When not to use it
Despite all that, commitmsgfmt may not fit your needs.
- If neither your VCS nor your text editor of choice provide any means of integration, commitmsgfmt cannot help you, period.[4]
- commitmsgfmt does not care whether your commit is a merge or not and will gladly attempt to format either, but merge commit messages occasionally use non-standard constructs commitmsgfmt does not cope well with. If this does not apply to you, you don’t have a problem. If it does apply to you and you are not willing to leave commitmsgfmt disabled for merge commits the tool will cause friction. The Linux kernel and Git projects are affected by this limitation.
- If you need a general-purpose text formatter stay with fmt or try to work out par. commitmsgfmt should mostly work fine but makes assumptions that don’t apply in other contexts.
- commitmsgfmt uses the greedy minimum number of lines algorithm to wrap text, like fmt and Vim’s internal format function. The advantage of this is that it is extremely easy to write and the result is intuitive. The disadvantage is that it can produce text with a very ragged right-hand side. par uses a dynamic programming algorithm and is far better at giving the appearance of text justification, typically yielding a prettier result. It is not a design goal of commitmsgfmt to do one or the other but it was a design choice to mimic Vim and fmt for comparison purposes.
- If you write in non-Latin script I have no idea how well commitmsgfmt will work for you. It is Unicode-aware but only to count “characters” intuitively and not break between code points.
- commitmsgfmt is a third-party dependency, which introduces some practical problems even with ready-built binaries. It may just be easier to deal with not having commitmsgfmt than to deal with integrating it.
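
The greedy strategy mentioned in the list above can be sketched in a few lines of Python. This is only an illustration of the general technique, not commitmsgfmt's actual code; the function name greedy_wrap is made up for the example, and it counts Python string length rather than display width.

```python
def greedy_wrap(text: str, width: int) -> list[str]:
    """Wrap text by packing as many words as fit on each line."""
    lines: list[str] = []
    current = ""
    for word in text.split():
        if not current:
            # First word on a line; an overlong word still gets its own line.
            current = word
        elif len(current) + 1 + len(word) <= width:
            # The word fits after a single separating space.
            current += " " + word
        else:
            # The word does not fit: close this line and start the next.
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


if __name__ == "__main__":
    for line in greedy_wrap("aaa bb cc ddddd", 6):
        print(line)
```

With a width of 6 the sample text wraps to three lines, the middle one holding only "cc". That is the ragged right-hand side the greedy approach is prone to; an optimizing wrapper might prefer to balance the lines instead.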
Motivation
My reasons for writing commitmsgfmt include boredom, a desire to play with Rust, and an unhealthy capacity for obsessing over the immaterial. But more than anything I was fed up with adjusting various formatting related options in Vim and never getting exactly the desired behaviour. I wanted to be able to reliably format an entire buffer and not have it blow up. I had these requirements:
- Prose should wrap at the expected message width, ideally not leaving spurious trailing spaces (even though Git strips these before committing).
- Lists should retain their structure after wrapping, with continuation lines indented to align with the list item’s first text character.
- Both numbered and unnumbered lists should work.
- Comments absolutely should not be uncommented, which would be destructive, and there is no reason to reformat them at all even if wrapping were safe.
- Certain constructs should be exempt from formatting, such as literals and the IEEE-style references I use a lot.
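
The list requirement is the least obvious of these, so here is a rough Python sketch of what "continuation lines aligned with the item's first text character" means. It leans on the standard textwrap module and a hypothetical LIST_MARKER pattern chosen for this example; it does not reproduce how commitmsgfmt recognizes lists.

```python
import re
import textwrap

# Hypothetical pattern for illustration: a dash, an asterisk, or a number
# followed by "." or ")", then at least one space.
LIST_MARKER = re.compile(r"^(\s*(?:[-*]|\d+[.)])\s+)")


def wrap_list_item(item: str, width: int = 72) -> str:
    """Wrap one list item, indenting continuation lines under its text."""
    match = LIST_MARKER.match(item)
    if match is None:
        # Not a list item: wrap it as ordinary prose.
        return textwrap.fill(item, width=width)
    marker = match.group(1)
    body = item[len(marker):]
    return textwrap.fill(
        body,
        width=width,
        initial_indent=marker,
        # Pad continuation lines to the width of the marker.
        subsequent_indent=" " * len(marker),
    )


if __name__ == "__main__":
    print(wrap_list_item("- one two three four", width=12))
    print(wrap_list_item("10. alpha beta gamma", width=14))
```

Because the indent is derived from the marker's length, a two-character "- " and a four-character "10. " each get matching continuation indents without special cases.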
The most satisfactory Vim configuration I managed achieves most of the critical points above: it wraps prose and lists and does not uncomment comments but it is not capable of conditionally suppressing wrapping settings and does not fulfill the minor aesthetic requirements. This is good enough that gqip (format in paragraph) generally does the right thing but often requires a bit of manual correction. Here it is:
setlocal formatoptions+=n
setlocal formatoptions+=q
" Amend default number-list formatting to include dash- and bullet lists.
setlocal formatlistpat=^\\s*[-*0-9]\\+[\]:.)}\\t\ ]\\s*
My goal was to use only
setlocal formatprg=commitmsgfmt
and have gggqG (go to first line, format to last line) Just Work™.
I did achieve that goal but then learned that it wasn’t quite the right goal. Instead I ended up with
setlocal formatprg=commitmsgfmt
nnoremap <buffer> <silent> <LocalLeader>gq :let b:cursorpos=winsaveview()<CR>gggqG:call winrestview(b:cursorpos)<CR>
nmap <buffer> gqip <LocalLeader>gq
which looks possibly even more complex than the starting point. It actually isn’t, though, because it doesn’t require maintenance and gggqG really does Just Work™ now. Here’s what the three lines do:
1. As a minimally necessary step, set 'formatprg' to commitmsgfmt.
2. Map <LocalLeader>gq to store the current cursor position, format the entire buffer, and finally restore the original cursor position. This is a convenience hack to work around an in hindsight obvious limitation of gggqG: Vim dutifully moves the cursor to the end of the buffer as instructed, but in practice that is often not a terribly useful place for the cursor to end up. This mapping avoids a mental context switch and as an added bonus I am much less likely to get the number of g’s wrong. An equivalent mapping for visual mode may prove necessary over time but I don’t use visual mode often.
3. The gqip command is burned into my muscle memory but formatting a single paragraph doesn’t play nicely with a context sensitive filter: commitmsgfmt ends up interpreting the first line in the paragraph as the subject line and disconnects it from the remaining lines, breaking the paragraph into two. I could remove that rule from commitmsgfmt but I am not willing to; it serves a specific purpose. Instead I override the default behaviour with the mapping described above.
Installation and usage
Refer to the official project development repository at https://gitlab.com/mkjeldsen/commitmsgfmt for detailed, up-to-date installation and usage instructions.
In summary:
- Vim: use as 'formatprg'.
- Not Vim: either use or write something like Vim’s 'formatprg' or fall back to the provided Git hook.
- Message body width limit is configurable.
Performance analysis
I’ve repeatedly tested commitmsgfmt on selections of “interesting” commit messages from the Git and Linux kernel projects, both of which have massive histories and require good commit hygiene. In the end I randomly selected 5,000 non-merge commits from Git v2.16.0 (13.3% of non-merges) and 33,568 non-merge commits from Linux v4.15 (5% of non-merges), ran commitmsgfmt on their messages, and counted the line differences between the original and reformatted messages to determine commitmsgfmt’s accuracy. The result and analysis follow here.
The analysis is not meant to show how commitmsgfmt compares to, say, Vim. As a filter, commitmsgfmt is only intended to be on par with Vim’s internal format function while much easier to use.
Conclusion
In a randomly selected sample of 13.3% of 37,362 total non-merge commits from Git, commitmsgfmt was able to exactly reproduce 69% of the original commit messages, and a further 12% within a margin of 1 differing line. In a randomly selected sample of 5% of 671,355 total non-merge commits from the Linux kernel, commitmsgfmt was able to exactly reproduce 69% of the original commit messages, and a further 6% within a margin of 1 differing line.
The majority of the errors in the remaining commits were due to scenarios I did not build commitmsgfmt to handle, prime among them literals indented less than 4 spaces. Some opportunities for improvement remain but the inherent lack of formal structure means I don’t expect to be able to raise the volume of successful replication to 90%.
Replication ability does not translate directly into performance during active use but seems like an acceptable measure.
Method
This repository includes the following subtree:
r/commitmsgfmt/
├── bench
├── git-result
├── linux-result
├── main.R
├── plot-subject-period.R
├── subject-period
└── subjects-with-periods
For the rest of this analysis, paths will be relative to the r/commitmsgfmt/ directory.
- bench is a Bash script that can randomly select a number of commits from a local repository whose messages to reformat, and execute commitmsgfmt on those messages and compare the originals to the reformatted ones. For every unique line length w in a given message, bench executes commitmsgfmt --width=w once to find the minimal diffstat (“diffstat”) as detailed below. bench then prints to standard output the commit id, the number of lines in the message, the w that produced the diffstat, and said diffstat. It is necessary to try all widths because many messages are not wrapped to commitmsgfmt’s default of 72 characters. bench uses shuf(1) 8.28 from GNU coreutils, with /dev/urandom, to select commits; shuf uses an unspecified random number generator. A new benchmark can be performed with something like
  $ git clone https://git.kernel.org/pub/scm/git/git.git
  $ ./bench --sample git 5000 > git-sample
  $ ./bench git-sample > git-result
  /tmp/tmp.A1eSTxxgxu
- git-result and linux-result are the benchmarks on which this analysis is performed. To test commitmsgfmt on those same commits for reproducibility or regressions, these files can be turned back into commit samples for feeding into bench with something like
  $ awk 'NR == 1 { print "git" } NR > 1 { print $1 }' git-result > git-sample
- main.R is an R script that generates most of the analysis output referenced here.
- plot-subject-period.R plots a graph of commits whose subject lines end in a period, in proportion to all commits, per year.
- subject-period is the data fed into plot-subject-period.R, generated by subjects-with-periods.
- subjects-with-periods extracts the total number of commits per year and the number of commits whose subject lines end in a period.
The diffstat is the smallest total of added and removed lines in the diff between the original message and the reformatted messages. By corollary, the diffstat halved also gives an upper bound on the number of changed lines. For instance, given the two executions
Input | --width=4 | --width=8 |
---|---|---|
 | | |
the diffstat would be 3. I borrow the term from diffstat(1).
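
To make the definition concrete, here is a small Python sketch of the diffstat and of the "try every unique line length" search that bench performs. It is an approximation under stated assumptions: difflib does not guarantee a minimal diff, the helper names are invented for this example, and stand_in_formatter is a plain textwrap-based placeholder, not commitmsgfmt.

```python
import difflib
import textwrap
from typing import Callable, Iterable


def diffstat(original: str, reformatted: str) -> int:
    """Count removed plus added lines between two messages."""
    old, new = original.splitlines(), reformatted.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    total = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            total += (i2 - i1) + (j2 - j1)
    return total


def unique_line_lengths(message: str) -> list[int]:
    """Candidate widths: every distinct non-zero line length in the message."""
    return sorted({len(line) for line in message.splitlines() if line})


def minimal_diffstat(
    original: str,
    reformat: Callable[[str, int], str],
    widths: Iterable[int],
) -> tuple[int, int]:
    """Return (smallest diffstat, width that produced it)."""
    return min((diffstat(original, reformat(original, w)), w) for w in widths)


def stand_in_formatter(message: str, width: int) -> str:
    """Placeholder formatter: refill each blank-line-separated paragraph."""
    paragraphs = message.split("\n\n")
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)


if __name__ == "__main__":
    message = "aaa bb\ncc ddddd"
    widths = unique_line_lengths(message)
    print(widths, minimal_diffstat(message, stand_in_formatter, widths))
```

A line that is merely rewritten counts twice, once as removed and once as added, which is why halving the diffstat gives an upper bound on changed lines.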
As proof that commitmsgfmt is non-trivial, pass a piece of prose to commitmsgfmt twice, with different values for --width. The prose must be at least two lines and at least one line must not be a construct exempt from wrapping, such as a literal. Observe that commitmsgfmt produces different results. For example, take git@b6947af2294e and execute
$ for w in 42 72; do
git -C git show -s --format=format:%B b6947af2294e |
commitmsgfmt --width $w; done
Analysis
I’m interested in a general idea of how commitmsgfmt performs, as well as insight into its limitations. I’ll start by getting a bird’s-eye view of data point distributions with the summary function. Here is a table showing that output, rounded to two decimal places, arranged by factor for easy side-by-side comparison:
summary( ) | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s |
---|---|---|---|---|---|---|---|
git$lines | 0.00 | 6.00 | 9.00 | 11.55 | 14.00 | 191.00 | 1 |
linux$lines | 1.00 | 7.00 | 9.00 | 12.04 | 13.00 | 346.00 | |
git$width | 4.00 | 57.00 | 66.00 | 60.99 | 70.00 | 214.00 | 1 |
linux$width | 2.00 | 58.00 | 68.00 | 64.16 | 72.00 | 967.00 | |
git$diffstat | 0.00 | 0.00 | 0.00 | 1.71 | 2.00 | 225.00 | 1 |
linux$diffstat | 0.00 | 0.00 | 0.00 | 2.71 | 2.00 | 466.00 | |
Messages tend to be a little under a dozen lines long and wrapped to a little below 70 characters per line. Both Git and Linux have abnormally wide lines, probably caused by the inclusion of some kind of system output. commitmsgfmt’s default width of 72, inherited from the default Vim runtime, seems justified; the 3rd quartile values also happen to be the modes.
The diffstat rows look highly promising: more than 50% of messages were unchanged and 75% of messages deviate by only 2 changed lines. Reflowing a paragraph would produce a diffstat of at least 3 so that implies an in-line change with no added or removed breaks. The maxima are pretty bad but are hopefully outliers caused by pathological cases.
As for the NA: one commit, git@296fdc53bdd7, has no message. I’ll remove that with the na.omit function before further processing.
As mentioned, the diffstat counts every changed line at least twice so I can’t determine the precise proportion of changed and unchanged lines, but I can determine rough upper bounds from the halved total diffstat:
​ | Total lines | Diffstat upper bound | Proportion |
---|---|---|---|
Git | 57,742 | 4,270 | 7.39% |
Linux | 404,036 | 45,400 | 11.24% |
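
The proportions follow directly from the other two columns. A tiny Python check, using only the figures from the table above (the SAMPLES name and the helper are made up for the example):

```python
# Figures copied from the table above: total message lines and the halved
# total diffstat (the upper bound on changed lines).
SAMPLES = {
    "Git": (57_742, 4_270),
    "Linux": (404_036, 45_400),
}


def changed_proportion(total_lines: int, upper_bound: int) -> float:
    """Upper bound on the share of changed lines, as a percentage."""
    return 100 * upper_bound / total_lines


if __name__ == "__main__":
    for name, (total_lines, upper_bound) in SAMPLES.items():
        print(f"{name}: {changed_proportion(total_lines, upper_bound):.2f}%")
```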
Framing this as 1 out of 10 lines changing does not look very promising with the mean message being longer than 10 lines. Fortunately the diffstat quartiles further up show us that isn’t what’s really happening. Visualising the diffstat distribution as an empirical cumulative distribution function can aid comprehension but at this point doesn’t tell us much new—a large volume of good results and a few bad results:
Alternatively, drawn as a frequency polygon with geom_freqpoly(stat = "count") you can basically flip the graph upside down.
The signal seems to be somewhere in the last 10%. I can zoom in on that area with the quantile function:
Diffstat per quantile | 90 | 95 | 99 | 100 |
---|---|---|---|---|
Git | 5 | 8 | 19 | 225 |
Linux | 7 | 13 | 37 | 466 |
Extreme outliers account for only 1% but relative to the median message length the 95-99 quantiles seem a bit challenged.
I’ll return to the low-performing messages in a moment but I’m curious about how a line gets to wrap at over 900 characters. I’ll go on a tangent by inspecting the maximum line and width messages:
> git[c(which.max(git$lines), which.max(git$width)), ]
id lines width diffstat
2134 6d9617f4f77510b4fa76fbabae2a5f4a9604577f 191 60 0
2454 7d90095abe322f72820f334839afb75c23e009ff 9 214 2
> linux[c(which.max(linux$lines), which.max(linux$width)), ]
id lines width diffstat
16704 7fdd69c5af2160236e97668bc1fb7d70855c66ae 346 56 466
21125 a207f5937630dd35bd2550620bef416937a1365e 66 967 34
The longest message in Git, git@6d9617f4f775, is straight-forward: neatly wrapped at 60 characters, containing 6 tables of measurements indented 4 spaces but no other special constructs. Replicated perfectly, but ultimately uninteresting. The longest message in Linux, linux@7fdd69c5af21, is dominated by an embedded patch. The patch is not indented, explaining the high (in fact maximum) diffstat, and had it been, the message would have replicated perfectly at a width of 56 characters.
The widest message in Git, git@7d90095abe32, is not wrapped at all. The body comprises a single bullet list, using *, and replicates perfectly at the specified width. It also wraps neatly with lower width specifications. The widest message in Linux, linux@a207f5937630, is owed to an unindented trace, but in this case the prose has unbalanced widths and the best commitmsgfmt could have done is a diffstat of 4 at width 76.
Returning to the diffstat analysis: I’m going to examine a small number of the worst offenders. The selection criteria are largely arbitrary; I just need enough commits to identify limitations but few enough that examining them is feasible. The 15 worst results, for instance[5]:
> head(git[with(git, order(diffstat, decreasing = TRUE)), ], 15)
id lines width diffstat
3117 9e5972413b4873dc143c4046c6e74eb608ace32b 157 86 225
1070 36419c8ee41cecadf67dfeab2808ff2e5025ca52 94 70 84
2556 8262715b8eb157497ec1ee1cfcef778d526b2336 50 72 66
629 1f07c4e5cefec88d825045ade24eee71f6a2df47 78 70 62
3902 c755015f79ac03bb5afa0754c30e937887fc68ab 82 72 58
3396 adc446fe5d5f6dc1fb5edeaa9aa016ef94e70da1 54 66 52
3804 c251c83df276dc0bff4d008433268ad59b7a8df2 45 70 50
308 10450cf72b51baf3bac6a779fb4e47241af7ae5e 37 74 46
3136 9f50d32b9c20cc94b9882484ca9704af332a5622 87 68 46
4642 ed0b9d43097349f4d730472673c07f427480e14a 41 72 43
4479 e53e6b4433f264250c2e586167caf61721b0185c 50 73 42
1368 452320f1f53a579f891eba678993508e7cbf3339 85 60 41
1721 58babfffdeeecaa4d6edecaac1fb0c595218b801 102 70 40
3786 c189c4f2c42083b329605fb7b0583b29b73da086 40 68 40
4062 cf52b8f06389189bd32565c5c6adad75ac8a1a62 43 65 40
> head(linux[with(linux, order(diffstat, decreasing = TRUE)), ], 15)
id lines width diffstat
16704 7fdd69c5af2160236e97668bc1fb7d70855c66ae 346 56 466
15235 74472233233f577eaa0ca6d6e17d9017b6e53150 308 57 450
16430 7d910c054be42515cd3e43f2e1bec8c536632de2 194 74 263
15678 77d2720059618b9b6e827a8b73831eb6c6fad63c 234 74 256
24745 bd4cf0ed331a275e9bf5a49e6d0fd55dffc551b8 205 69 226
5121 272725c7db4da1fd3229d944fc76d2e98e3a144e 169 114 202
8928 449809a66c1d0b1563dee84493e14bf3104d2d7e 146 76 194
26342 c92e8c02fe664155ac4234516e32544bec0f113d 140 89 187
18061 8a1435880f452430b41374d27ac4a33e7bd381ea 115 105 179
20853 9faaff593404a9c4e5abc6839a641635d7b9d0cd 167 78 179
4220 209045adc2bbdb2b315fa5539cec54d01cd3e7db 199 224 171
15479 7630b3e28dd827fffad13cc0aada14b00ec524d9 151 113 170
15741 7854ea6c28c6076050e24773eeb78e2925bd7411 182 72 170
24932 beb0f0a9fba1fa98b378329a9a5b0a73f25097ae 118 84 147
15495 765fb2f181cad669f2beb87842a05d8071f2be85 113 74 142
- The poorest performing commit in Git, git@9e5972413b48, includes 9 tables of measurements, all at least 78 characters long and starting at column 0. Prose is actually wrapped at 70 characters and had the tables been indented the diffstat would have been 18 at width 70. The remaining errors are caused by setext-style headers whose tags were joined to their previous lines and could probably be fixed with some effort. I’ve already looked at the poorest performing commit in Linux; the runner-up, linux@74472233233f, also includes unindented output but in contrast the prose in that message would have replicated perfectly at width 54.
- Tables often are not indented even when code samples are (git@36419c8ee41c). Tables come in too many variations for any kind of half-way sophisticated support to be possible. This is potentially problematic: if such listings are intentionally not indented because indentation is stylistically undesirable, commitmsgfmt’s need to process the entire message makes it useless compared to something like Vim’s gqip.
- Some code samples are not indented (git@9f50d32b9c20, linux@449809a66c1d). This seems up to personal taste but there may be a case to be made for leading whitespace causing confusion, say, in code samples.
- Some commits are simply inconsistently wrapped (git@1f07c4e5cefe, linux@bd4cf0ed331a), or possibly wrapped with an algorithm other than commitmsgfmt’s.
- Subsequent paragraphs in list items with multiple paragraphs are not recognized as belonging to a list and have their indentation reset (linux@bd4cf0ed331a). Determining the impact of this limitation is… difficult. I would have to build a special tool that can preserve context to know whether a given indented block belongs to a list item or stands alone, at which point I would have essentially built support. The main obstacle is the lack of a clear indication of whether we’re formatting a list: an extra list item paragraph is indistinguishable from an improperly indented plain paragraph. A naive heuristic of list items consuming all paragraphs with leading spaces (resolving conflicts with literals) could be worth trying, though.
- Literals and list items not preceded by a blank line get folded into the previous paragraph (git@c189c4f2c420, linux@7854ea6c28c6).
- Tab-indented list items that span multiple lines have their continuation lines indented with spaces (derived from linux@7854ea6c28c6). Oops. This is a bug.
- Usenet quotes get mangled (git@adc446fe5d5f). It would be trivial to not mangle them but proper support would have them reformatted as prose, preserving the original quote prefix; that takes more work. These get used but not often:
  $ git -C git log --no-merges --oneline --grep '^\s*>' | wc -l
  108
  $ git -C linux log --no-merges --oneline --grep '^\s*>' | wc -l
  2130
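
For the last point, reformatting a quote as prose while keeping its prefix could look roughly like the Python sketch below. It is a thought experiment rather than anything commitmsgfmt does: the QUOTE_PREFIX pattern and the function name are assumptions, and the block is assumed to use the first line's prefix throughout.

```python
import re
import textwrap

# Hypothetical pattern: optional indentation, one or more ">" characters,
# and at most one following space.
QUOTE_PREFIX = re.compile(r"^(\s*>+\s?)")


def reflow_quote(block: str, width: int = 72) -> str:
    """Rewrap a quoted block, repeating the first line's quote prefix."""
    lines = block.splitlines()
    match = QUOTE_PREFIX.match(lines[0]) if lines else None
    if match is None:
        return block  # Not a quote: leave it untouched.
    prefix = match.group(1)
    marker = prefix.rstrip()
    words: list[str] = []
    for line in lines:
        # Drop the quote marker where present, then collect the words.
        text = line[len(marker):] if line.startswith(marker) else line
        words.extend(text.split())
    return textwrap.fill(
        " ".join(words),
        width=width,
        initial_indent=prefix,
        subsequent_indent=prefix,
    )


if __name__ == "__main__":
    print(reflow_quote("> one two\n> three", width=20))
```

Nested quotes and mixed prefixes within one block are deliberately ignored here, which hints at why proper support takes more work than merely leaving quotes alone.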
The last thing I want to look at is what causes just a few errors. In particular, there are surprisingly many data points with diffstats 1 or 2:
head(table( )) | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
git$diffstat | 3,475 | 12 | 594 | 80 | 316 | 44 |
linux$diffstat | 23,338 | 489 | 1,765 | 781 | 2,119 | 483 |
The commits where the diffstat is 1 are interesting because commitmsgfmt either inserted or removed a single (probably blank) line; it didn’t modify an existing line. Here are the first 15 of those:
> head(git[git$diffstat == 1, ], 15)
id lines width diffstat
807 28258afe912303f717fcdccaca124c1ae3e8e004 5 69 1
1269 4056c09114e66ce3c2368551f0122e83628750d6 4 55 1
1312 4280e5333354c6dddcd994bdacd3c6a11ac2da5e 4 19 1
1341 43bddeb43d86c2a8093aed0217137afd27eb821b 3 61 1
2280 74b46e32cb3907a4a062a0f11de5773054b7c71a 5 67 1
2680 886a39074be34d21afc6c1b8af1f7f4b3ef54dc5 4 38 1
2860 91af81a98ea5c5594c67a63abc933250e05c13c6 6 69 1
2896 93256315b2444601a35484f4fb76cd5723284201 5 66 1
3154 a028a98e9ae83231f0657fdb112f7d9c0cf0b98c 4 78 1
3956 ca016f0e4e7e508407fe17e121fcd520fbb7c865 5 67 1
4230 d7e3868cdfdc73c3de15296ecf32138a8308c07e 4 43 1
4359 df450923a2a08c50976f2d241a1c4992cf03b3a7 5 68 1
> head(linux[linux$diffstat == 1, ], 15)
id lines width diffstat
4 000725d56a196e72dc22328324c5ec5506265736 6 72 1
119 00ca79a04bef1a1b30ef8afd992d905b6d986caf 11 67 1
127 00db8a8ecc91abeb46d1128a788a194018c51e77 7 45 1
204 017c59c042d01fc84cae7a8ea475861e702c77ab 13 72 1
256 01e1b69cfcdcfdd5b405165eaba29428f8b18a7c 10 73 1
328 0279b4cd86685b5eea467c1b74ce94f0add2c0a3 9 48 1
334 0287e43dda1a425da662f879dd27352021b0ca63 10 78 1
380 02d90fc343411d6dff26bbd64f0895a243e6f608 5 37 1
521 03e10e5ab5ba6511ddaf80085cf08c62e9336fa5 9 67 1
533 0400e3134b03336617138f9ebf2cd0f117ceef20 7 62 1
540 040fa1b9620cd019349414505828b2ffbeded7f8 8 39 1
543 041509db390cf97b09df0f51024f5d40407938db 9 44 1
616 04bedfa542d90ac7a1bbf28287e9861d0da21576 10 65 1
644 0501a3816e5b778830fc2157a6d6bb11a965fc2c 11 69 1
872 06a24dec10bc4014fc0974670627efed68f5da27 14 67 1
In Git all the single-line differences come from manually wrapped (or run-on) subject lines. None of them are longer than commitmsgfmt’s somewhat arbitrarily chosen subject line limit, but because the parser operates on lines the continuation lines are interpreted as body text and a “missing” blank line gets inserted before the first continuation line. Despite knowing that git log’s %s format placeholder joins consecutive subject lines to form a single subject I failed to anticipate this scenario.
With the command
$ git -C git rev-list --no-merges HEAD |
while IFS= read c; do
git -C git show --format=format:%B -s $c |
awk -v c=$c 'NR == 2 { if ($0) print c }'; done
I find 101 messages whose second line—that is, the expected separator between the subject line and the message body—is non-empty. Most of those commits are from 2005, with none after 2007. This style evidently is a historical artifact not worth accommodating.
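
The check performed by the awk one-liner, and the way %s treats such messages, can be sketched in Python as follows. The function names are invented for this example and joined_subject only approximates Git's behaviour by joining lines up to the first blank one.

```python
def has_run_on_subject(message: str) -> bool:
    """True if the second line, the expected blank separator, is non-empty."""
    lines = message.splitlines()
    return len(lines) >= 2 and lines[1] != ""


def joined_subject(message: str) -> str:
    """Join the lines before the first blank line into one subject."""
    subject_lines = []
    for line in message.splitlines():
        if not line.strip():
            break
        subject_lines.append(line.strip())
    return " ".join(subject_lines)


if __name__ == "__main__":
    message = "Fix off-by-one\nin the parser\n\nBody text."
    print(has_run_on_subject(message))
    print(joined_subject(message))
```

A line-oriented formatter sees only the first of these lines as the subject, which is where the inserted blank line comes from.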
Relatedly, I can assess the sensibility of the choice of 90 columns as a hard upper limit on subject line length by inspecting actually long subject lines. The following command extracts the latest 50 commits with subject lines of at least 90 characters:
$ git -C git log --no-merges --format='%h %s' --abbrev=15 |
grep --perl-regexp '^.{15} .{90}' |
head -n 50 |
while IFS=' ' read id s; do git show -s $id; done
The output has a lot of overlap with that of the previous command; it shows that very long subject lines only happen a few times a year since 2007, where the latest peak was. The diffstat-1 commits above also mostly date back to 2005-2007, none later than 2010.
In Linux the story is a little different. Among the 15 commits singled out above are a few wrapped or run-on subject lines, like in Git, but most of them are a curious case of a redundant blank line at the end of the message, which commitmsgfmt filters out. It’s not clear where this extra line comes from; passing git show -s --format=format:%B for commits linux@000725d56a19 and linux@00ca79a04bef through xxd(1) clearly shows a second trailing 0A in the latter and git cat-file commit corroborates that, yet with git show -s --format=raw, which is supposed to show the stored value, there is no difference, and patches generated with git format-patch are structurally identical. It could be a change in Git but grep -C3 trailing git/Documentation/RelNotes/*, though a very coarse-grained inspection, turns up nothing relevant.
Next up, where the diffstat is 2 commitmsgfmt made an in-place modification to an existing line that did not cause it to be wrapped differently:
> head(git[git$diffstat == 2, ], 15)
id lines width diffstat
13 00889c8ca7017d7b9d07d47843e3a3e245a1de26 3 43 2
14 008c208c2c54f7bb97bfb7bc5dc606a4eb0d6837 7 57 2
39 01eea6f355f35098cc5038e94622e30ed31a9267 4 46 2
54 029443070a4e5b0290a2d09f3707bc486d84a961 13 71 2
59 02f3389d9647378ed864ff1cdfb6f0238b64ee91 8 66 2
67 034016391c475e98c38a9b715cd670b8b2d0c619 3 32 2
71 03a182107fdb36170a72b8a3d94de2b52e3f6668 3 46 2
78 03d311eda2d8c2d23855b9d3e904c7648925ab56 7 65 2
87 043d76050d3136b8684b5a3938e8bc0c1f8483fd 4 46 2
97 0499b24ad664d7f6a33d4cfe4a11912ae455b039 8 71 2
101 04d24455ceb0129195f60af6b62981542433ecf7 3 35 2
103 04f89259a67ba1ec291f023b70278d41ed665a13 10 72 2
104 05094f987c2f141ec9b873ad6d6303b0aca173a4 3 37 2
107 0532a5e46b88cd70c952a2bf5dc63681be32a2d9 7 64 2
110 056211053b7516a57ff7a6dd02f503ecef6fca70 7 63 2
> head(linux[linux$diffstat == 2, ], 15)
id lines width diffstat
39 004b4f2d4448cff7f13871c05d744b00a7c74d4a 8 62 2
68 0072032d7babc4347556c1863919f3c532d9cf5b 6 71 2
89 0095d58b4a91b9fb57aeb781909355b232517c64 4 36 2
100 00a6cae288138ce0444ab6f48a81da12afe557aa 7 70 2
120 00cde6748255a84beecfdea4caeaf7c9cd05a527 9 57 2
144 01062ad31b531f2a3bcbfd9f6267e4e0c1103010 7 67 2
168 013db7575bdfb57d295e9a27186f52c7547ef2d2 4 52 2
213 01945fa248ff6d34f5fdb8106118910b77b76f91 18 72 2
221 01a6221a6a51ec47b9ae3ed42c396f98dd488c7e 8 71 2
236 01bbf2c7ddc93479eecebf8495848c0f362130c5 9 70 2
238 01c37be40fc35137a8b97e5176017958d57e401c 7 62 2
248 01d883d44a1ca8dc77486635d428cba63e7fdadf 16 70 2
266 01f37f2d53e14a05b7fc3601d182f31ac3b35847 4 55 2
344 0293a509405dccecc30783a5d729d615b68d6a77 6 37 2
345 02947ecb0de7a011215568263fd48f3d5b0f8573 5 27 2
These are overwhelmingly dominated by the stripping of a terminating period in subject lines, interspersed with a few one-line literals that were not recognized. Retaining the offending periods would be trivial but removing them was a deliberate stylistic choice: aesthetically I prefer not ending titles and subjects with periods and treating commit subject lines the same way, and pragmatically I find that teaching authors to omit periods discourages excessively long subjects. Fortunately, this practice has also fallen into disuse:
Tangent: in extracting the data for the subject line period graph I made a curious discovery that can be illustrated with the command
$ git -C linux log --no-merges --date=iso --format='%ad %cd %h' | grep --extended-regexp '^(1970|20(0[1-4]|19|3.))'
Finally, where the diffstat is 3 commitmsgfmt probably added or removed a single line. What causes that?
> head(git[git$diffstat == 3, ], 15)
id lines width diffstat
128 0673bb28d0b96393e526030c96e7dbfac2fc33a2 8 53 3
155 0795805053ad89a24954684fca2e69fea2bf50b9 11 67 3
208 0a61779994aa3de41d57bb85bd88a2f56c7ba7d8 24 64 3
224 0b9dca434f5d9208a26f47f7ec11453f1cfdfae8 11 58 3
229 0bdb28c9ccd85b1c606664154b6f6d39a4c315fd 10 65 3
238 0ca2972345d6de3b9eb845af4b0e9b701af120bd 18 68 3
318 10f5c526561604ba9677dc27643b5c9bfad36458 28 68 3
324 1126b419d6835f6b8c45ccfffc0ada9b09e32d87 21 64 3
333 1170e8026ae507c510619fceff969b1cf2900a28 9 72 3
354 124d80928d2a89e78212d03eb9fa3ba1aaa4d56b 6 53 3
417 1568fea01eb25a293f8dae570a31ca34d41b5442 7 61 3
473 17a8b25005bb09b03bee7ddac5412c7d29675eef 18 74 3
523 1a66a489d09e7b8629fa7e4184c78703f4eed335 16 68 3
623 1e41827d2d5cf0e4c6ebff91958fa47d69b7ff42 12 67 3
673 21246dbb9e0af278bc671e32e45a6cae4ad38b6f 18 70 3
> head(linux[linux$diffstat == 3, ], 15)
id lines width diffstat
262 01ef66bbb65aa4db100b267778202d7657e244e4 9 66 3
274 0202775bc3a28f2436ea6ee13ef3eb0e8f237857 14 75 3
400 031005f78c8d0aebc17ddf7a34af9ffd48034d7d 7 65 3
406 0316fe8319ff62e527d0d91a3bc7df1c59eafae8 14 64 3
708 0571c7a4f58fc6070fb9d556e4896de864c2a593 20 26 3
788 06265442028254cce5a6f19dfa0bab11c88f06fe 8 44 3
895 06cd9a7dc8a58186060a91b6ddc031057435fd34 10 24 3
970 075db1502ffd4ff8c58020167484a6e123ae01a3 13 120 3
1170 0905bc94d5ad8a928eed26e0896857fb54dcb366 9 23 3
1227 0975b16274bad1f0bd5c5fd6ab759c5a9ee11949 12 55 3
1383 0abcc6df070687816b0ca0aefc3d64c62773063c 7 67 3
1429 0b0ef1d027008f019ced2d69e343bb1257326b12 10 79 3
1565 0c227c51b98c03c6e7fb4f342f930cf576292064 10 68 3
1646 0cb34dc2a37290f4069c5b01735c9725dc0a1b5c 12 128 3
1660 0ccff1a49def92d6b838a6da166c89004b3a4d0c 9 24 3
In Git it’s mainly orphan creation. This is a slightly unfortunate outcome. It’s impossible to say whether orphan elimination was deliberate in the offending (offended?) messages, such as git@0a61779994aa, but it is not a given that commitmsgfmt’s transformation is preferable and rather likely that it isn’t. The text limit could be relaxed for the last word of a paragraph but that is far from a reliable solution.
In the special case of git@0795805053ad the wrapped line is a single unindented URL. This result is not desirable either: a URL is more difficult to work with when wrapped, for instance when clicking on terminal output. Personally I work around that problem, and a related problem of ragged right text caused by not wrapping URLs in prose, by moving URLs into references, but that sometimes produces awkward sentences instead. An exception to never wrap inside URLs could remove this limitation and possibly upgrade references to general footnotes, but the problem is not very pronounced. At the 3rd quartile of message widths, Git has 50 commits with URLs susceptible to this problem and Linux has 1079:
$ git -C git log --oneline --grep '^\s\{0,3\}https\?://.\{64\}' | wc -l
50
$ git -C linux log --oneline --grep '^\s\{0,3\}https\?://.\{64\}' | wc -l
1079
The number of multi-word footnotes that would have been recognized by commitmsgfmt is less than half that, at 20 and 361 respectively:
$ git -C git log --oneline --grep '^\[[[:digit:]]\+\]\(\s\+\S\+\)\{2\}' |
wc -l
20
$ git -C linux log --oneline --grep '^\[[[:digit:]]\+\]\(\s\+\S\+\)\{2\}' |
wc -l
361
In Linux there is more variance but the errors come down to undetected literals and orphans.
Threats to validity
Replication ability does not translate directly into performance during active use. This is trivially true; for instance, tee(1) has perfect replication ability because it doesn’t transform its input but consequently would also be useless for wrapping text. The analysis proves that commitmsgfmt transforms all input and that most transformations can be reduced to the identity transformation; however, there are probably false positives: anything falsely recognized as a literal will skip wrapping and will not be flagged as an error. As an indication of how bad this problem could be I can calculate the volume of lines that could possibly be recognized as literals:
$ git -C git log --no-merges --format=%B | grep --count '^\( \{4\}\|\t\)'
30312
$ git -C git log --no-merges --format=%B | wc -l
472642
$ git -C linux log --no-merges --format=%B | grep --count '^\( \{4\}\|\t\)'
345258
$ git -C linux log --no-merges --format=%B | wc -l
8761305
Less than 7% and 4% respectively, which is non-trivial but hardly breaking. Ultimately there is no way to determine performance during active use other than active use.
Sampling and representativeness of samples throughout can be challenged. The Git sample is proportionally larger than the Linux sample, which could cause bias, while the Linux sample is perhaps unfairly small and for no other reason than my patience. The probability distribution of samples is basically undocumented, although that could be trivially resolved merely by using another tool. Even if these samples are fine, Git and Linux certainly are not implicitly representative of software projects in general, nor are they blessed in any way but for being perhaps the two oldest Git repositories.
Ignoring merge commits can be seen as discarding unfavourable data, thereby misrepresenting commitmsgfmt’s accuracy. Choosing projects that do not format merge commits specially would just be another form of that, but that form would be prone to overestimating commitmsgfmt’s accuracy by virtue of not distinguishing between non-merge commits and merge commits.
[1] Unless you consider having your message mangled to be a passive-aggressive reprimand.
[2] Contributed by Tim Pope to match a famous commit message guideline he wrote over 10 years ago.
[3] Unfortunately you will probably have maximum integration difficulties.
[4] If there is an integration mechanism and it is not documented I will happily work with you to change that.
[5] I use head instead of sample for reproducibility. This could be said to cause overfitting but the original selection of commits is already random.