In short, sort the names with this command (-k1,1 isn't needed because
the lines with names start with " -"):
LC_ALL=en_US.UTF-8 sort -k2,2 -k3,3 -k4,4 -k5,5
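To check that a list is already in this order without rewriting it,
sort's -c (--check) option can be combined with the same keys. This is
only a minimal sketch: the function name check_sorted and the two sample
names are made up for illustration, and it assumes the name lines have
already been extracted from THANKS.

```shell
# Hypothetical helper: reads name lines (" - Name ...") from stdin.
# Exits 0 if they are already in the order produced by the command
# above, non-zero otherwise (sort -c reports the first disorder on stderr).
check_sorted() {
    LC_ALL=en_US.UTF-8 sort -c -k2,2 -k3,3 -k4,4 -k5,5
}

# Made-up sample input that is already in order.
printf ' - Alice Example\n - Bob Example\n' | check_sorted && echo sorted
```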
When THANKS was created, I wrote the names as "First Last" and attempted
to keep them sorted by last name / surname / family name. This works
with many names in THANKS, but it becomes complicated with names that
don't fit that pattern. For example, names that are written as
"Last First" can be manually sorted by family name, but only if one
knows which part of the name is the family name.[*] And of course,
the concept of first/last name doesn't apply to all names.
[*] xz had a co-maintainer who could help me with such names,
but fortunately he isn't working on the project anymore.
Adding the names in chronological order could have worked too, although
if something is contributed by multiple people, one would still have to
decide how to sort the names within the batch. Another downside would
be that if THANKS is updated in more than one work-in-progress branch,
merge conflicts would occur more often.
Don't attempt to sort by last name. Let's be happy that people tend to
provide names that can be expressed in a reasonable number of printable
Unicode characters. In practice, people have been even nicer: if the
native language doesn't use a Latin script alphabet, people often provide
a transliterated name (only or in addition to the original spelling),
which is very much appreciated by those who don't know the native script.
Treat the names as opaque strings or space-separated strings for sorting
purposes. This means that most names will now be sorted by first name.
There are still many ways to sort:
(1) LC_ALL=en_US.UTF-8 sort
The project is in English, so this may sound like a logical choice.
However, spaces have a lower weight than letters, which results in
this order:
- A Ba
- Ab C
- A Bc
- A Bd
(2) LC_ALL=en_US.UTF-8 sort -k2,2
This first sorts by the first word and then by the rest of the
string. It's -k2,2 instead of -k1,1 to skip the leading dash.
- A Ba
- A Bc
- A Bd
- Ab C
I like this more than (1). One could add -k3,3 -k4,4 -k5,5 ... too.
With current THANKS it makes no difference but it might some day.
NOTE: The ordering in en_US.UTF-8 can differ between libc versions
and operating systems. Luckily it's not a big deal in THANKS.
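The behavior of (2) can be tried on the four sample strings. This is
a sketch only: the wrapper name sort_by_words is hypothetical, and the
exact collation still depends on the locale data that is installed.

```shell
# Hypothetical wrapper around the command from (2).
# Field 1 is the leading " -", so the first word of the name is field 2.
sort_by_words() {
    LC_ALL=en_US.UTF-8 sort -k2,2
}

# Feed the sample strings in shuffled order.
printf ' - Ab C\n - A Bd\n - A Ba\n - A Bc\n' | sort_by_words
```

All lines whose first word is "A" should come before "Ab C", matching
the order listed for (2).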
(3) LC_ALL=en_US.UTF-8 sort -f -k2,2
Passing -f (--ignore-case) to sort affects sorting of single-byte
characters but not multibyte characters (GNU coreutils 9.9):
No -f       With -f     LC_ALL=C
Aa          A.A         A.A
A.A         Aa          Aa
Ää          Ää          Ä.Ä
Ä.Ä         Ä.Ä         Ää
In GNU coreutils, the THANKS file is sorted using "sort -f -k1,1".
There is also a basic check that the en_US.UTF-8 locale is
behaving as expected.
(4) LC_ALL=C sort
This sorts by byte order, which in UTF-8 is the same as Unicode
code point order. With the strings in (1) and (2), this produces
the same result as in (2). The difference in (3) can be seen above.
The results differ from en_US.UTF-8 when a name component starts
with a lower case ASCII letter (like "von" or "de"). Worse, any
non-ASCII characters sort after ASCII chars. These properties might
look weird in English language text, although it's good to remember
that en_US.UTF-8 sorting can appear weird too if one's native
language isn't English.
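The byte-order properties of (4) can be seen with a few strings. This
is a sketch only: the wrapper name sort_bytes is hypothetical, and the
"von"/"Zed" lines are made-up placeholders rather than names from THANKS.

```shell
# Hypothetical wrapper around the command from (4).
sort_bytes() {
    LC_ALL=C sort
}

# The four strings from the comparison in (3). "." (0x2E) sorts before
# "a" (0x61), and every byte of "Ä" or "ä" is above 0x7F, so the
# non-ASCII strings come last.
printf '%s\n' 'Aa' 'A.A' 'Ää' 'Ä.Ä' | sort_bytes

# Upper case ASCII letters sort before lower case ones, so "Zed"
# comes before "von" here.
printf '%s\n' ' - A von B' ' - A Zed' | sort_bytes
```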
The choice between (2) and (4) was difficult but I went with (2).
;-)