1 @c GNU Version-sort ordering documentation
3 @c Copyright (C) 2019--2023 Free Software Foundation, Inc.
5 @c Permission is granted to copy, distribute and/or modify this document
6 @c under the terms of the GNU Free Documentation License, Version 1.3 or
7 @c any later version published by the Free Software Foundation; with no
8 @c Invariant Sections, no Front-Cover Texts, and no Back-Cover
9 @c Texts. A copy of the license is included in the ``GNU Free
10 @c Documentation License'' file as part of this distribution.
12 @c Written by Assaf Gordon
14 @node Version sort ordering
15 @chapter Version sort ordering
19 @node Version sort overview
20 @section Version sort overview
22 @dfn{Version sort} puts items such as file names and lines of
23 text in an order that feels natural to people, when the text
24 contains a mixture of letters and digits.
26 Lexicographic sorting usually does not produce the order that one expects
27 because comparisons are made on a character-by-character basis.
29 Compare the sorting of the following items:
32 Lexicographic sort: Version Sort:
40 Version sort functionality in GNU Coreutils is available in the @samp{ls -v},
41 @samp{ls --sort=version}, @samp{sort -V}, and
42 @samp{sort --version-sort} commands.
46 @node Using version sort in GNU Coreutils
47 @subsection Using version sort in GNU Coreutils
49 Two GNU Coreutils programs use version sort: @command{ls} and @command{sort}.
51 To list files in version sort order, use @command{ls}
52 with the @option{-v} or @option{--sort=version} option:
55 default sort: version sort:
66 To sort text files in version sort order, use @command{sort} with
67 the @option{-V} or @option{--version-sort} option:
77 lexicographic order: version sort order:
79 $ sort input $ sort -V input
86 To sort a specific field in a file, use @option{-k/--key} with
87 @samp{V} type sorting, which is often combined with @samp{b} to
88 ignore leading blanks in the field:
96 $ sort -k 2bV,2 input2
103 @node Version sort and natural sort
104 @subsection Version sort and natural sort
106 In GNU Coreutils, the name @dfn{version sort} was chosen because it is based
107 on Debian GNU/Linux's algorithm of sorting packages' versions.
109 Its goal is to answer questions like
110 ``Which package is newer, @file{firefox-60.7.2} or @file{firefox-60.12.3}?''
112 In Coreutils this algorithm was slightly modified to work on more
113 general input such as textual strings and file names
114 (see @ref{Differences from Debian version sort}).
116 In other contexts, such as other programs and other programming
117 languages, a similar sorting functionality is called
118 @uref{https://en.wikipedia.org/wiki/Natural_sort_order,natural sort}.
121 @node Variations in version sort order
122 @subsection Variations in version sort order
124 Currently there is no standard for version sort.
126 That is: there is no one correct way or universally agreed-upon way to
127 order items. Each program and each programming language can decide its
128 own ordering algorithm and call it ``version sort'', ``natural sort'',
131 See @ref{Other version/natural sort implementations} for many examples of
132 differing sorting possibilities, each with its own rules and variations.
134 If you find a bug in the Coreutils implementation of version-sort, please
135 report it. @xref{Reporting version sort bugs}.
138 @node Version sort implementation
139 @section Version sort implementation
141 GNU Coreutils version sort is based on the ``upstream version''
143 @uref{https://www.debian.org/doc/debian-policy/ch-controlfields.html#version,
144 Debian's versioning scheme}.
146 This section describes the GNU Coreutils sort ordering rules.
148 The next section (@ref{Differences from Debian version
149 sort}) describes some differences between GNU Coreutils
150 and Debian version sort.
153 @node Version-sort ordering rules
154 @subsection Version-sort ordering rules
156 The version sort ordering rules are:
160 The strings are compared from left to right.
163 First the initial part of each string consisting entirely of non-digit
168 These two parts (either of which may be empty) are compared lexically.
169 If a difference is found it is returned.
172 The lexical comparison is a lexicographic comparison of byte strings,
177 ASCII letters sort before other bytes.
179 A tilde sorts before anything, even an empty string.
184 Then the initial part of the remainder of each string that contains
185 all the leading digits is determined. The numerical values represented by
186 these two parts are compared, and any difference found is returned as
187 the result of the comparison.
191 For these purposes an empty string (which can only occur at the end of
192 one or both version strings being compared) counts as zero.
195 Because the numerical value is used, non-identical strings can compare
196 equal. For example, @samp{123} compares equal to @samp{00123}, and
197 the empty string compares equal to @samp{0}.
201 These two steps (comparing and removing initial non-digit strings and
202 initial digit strings) are repeated until a difference is found or
203 both strings are exhausted.
206 Consider the version-sort comparison of two file names:
207 @file{foo07.7z} and @file{foo7a.7z}. The two strings will be broken
208 down to the following parts, and the parts compared respectively from
212 foo @r{vs} foo @r{(rule 2, non-digits)}
213 07 @r{vs} 7 @r{(rule 3, digits)}
214 . @r{vs} a. @r{(rule 2)}
215 7 @r{vs} 7 @r{(rule 3)}
216 z @r{vs} z @r{(rule 2)}
219 Comparison flow based on above algorithm:
223 The first parts (@samp{foo}) are identical.
226 The second parts (@samp{07} and @samp{7}) are compared numerically,
230 The third parts (@samp{.} vs @samp{a.}) are compared
231 lexically by ASCII value (rule 2.B).
234 The first byte of the first string (@samp{.}) is compared
235 to the first byte of the second string (@samp{a}).
238 Rule 2.B.a says letters sorts before non-letters.
239 Hence, @samp{a} comes before @samp{.}.
242 The returned result is that @file{foo7a.7z} comes before @file{foo07.7z}.
245 Result when using sort:
256 See @ref{Differences from Debian version sort} for
257 additional rules that extend the Debian algorithm in Coreutils.
260 @node Version sort is not the same as numeric sort
261 @subsection Version sort is not the same as numeric sort
263 Consider the following text file:
275 Numerical Sort: Version Sort:
277 $ sort -n input4 $ sort -V input4
287 Numeric sort (@samp{sort -n}) treats the entire string as a single numeric
288 value, and compares it to other values. For example, @samp{8.1}, @samp{8.10} and
289 @samp{8.100} are numerically equivalent, and are ordered together. Similarly,
290 @samp{8.49} is numerically less than @samp{8.5}, and appears before first.
292 Version sort (@samp{sort -V}) first breaks down the string into digit and
293 non-digit parts, and only then compares each part (see annotated
294 example in @ref{Version-sort ordering rules}).
296 Comparing the string @samp{8.1} to @samp{8.01}, first the
297 @samp{8}s are compared (and are identical), then the
298 dots (@samp{.}) are compared and are identical, and lastly the
299 remaining digits are compared numerically (@samp{1} and @samp{01}) --
300 which are numerically equal. Hence, @samp{8.01} and @samp{8.1}
301 are grouped together.
303 Similarly, comparing @samp{8.5} to @samp{8.49} -- the @samp{8}
304 and @samp{.} parts are identical, then the numeric values @samp{5} and
305 @samp{49} are compared. The resulting @samp{5} appears before @samp{49}.
307 This sorting order (where @samp{8.5} comes before @samp{8.49}) is common when
308 assigning versions to computer programs (while perhaps not intuitive
309 or ``natural'' for people).
311 @node Version sort punctuation
312 @subsection Version sort punctuation
314 Punctuation is sorted by ASCII order (rule 2.B).
317 $ touch 1.0.5_src.tar.gz 1.0_src.tar.gz
323 Why is @file{1.0.5_src.tar.gz} listed before @file{1.0_src.tar.gz}?
325 Based on the version-sort ordering rules, the strings are broken down
326 into the following parts:
329 1 @r{vs} 1 @r{(rule 3, all digits)}
330 . @r{vs} . @r{(rule 2, all non-digits)}
331 0 @r{vs} 0 @r{(rule 3)}
332 . @r{vs} _src.tar.gz @r{(rule 2)}
333 5 @r{vs} empty string @r{(no more bytes in the file name)}
334 _src.tar.gz @r{vs} empty string
337 The fourth parts (@samp{.} and @samp{_src.tar.gz}) are compared
338 lexically by ASCII order. The @samp{.} (ASCII value 46) is
339 less than @samp{_} (ASCII value 95) -- and should be listed before it.
341 Hence, @file{1.0.5_src.tar.gz} is listed first.
343 If a different byte appears instead of the underscore (for
344 example, percent sign @samp{%} ASCII value 37, which is less
345 than dot's ASCII value of 46), that file will be listed first:
348 $ touch 1.0.5_src.tar.gz 1.0%zzzzz.gz
353 The same reasoning applies to the following example, as @samp{.} with
354 ASCII value 46 is less than @samp{/} with ASCII value 47:
366 @node Punctuation vs letters
367 @subsection Punctuation vs letters
369 Rule 2.B.a says letters sort before non-letters
370 (after breaking down a string to digit and non-digit parts).
381 The input strings consist entirely of non-digits, and based on the
382 above algorithm have only one part, all non-digits
383 (@samp{a%} vs @samp{az}).
385 Each part is then compared lexically,
386 byte-by-byte; @samp{a} compares identically in both
389 Rule 2.B.a says a letter like @samp{z} sorts before
390 a non-letter like @samp{%} -- hence @samp{az} appears first (despite
391 @samp{z} having ASCII value of 122, much larger than @samp{%}
392 with ASCII value 37).
394 @node The tilde @samp{~}
395 @subsection The tilde @samp{~}
397 Rule 2.B.b says the tilde @samp{~} (ASCII 126) sorts
398 before other bytes, and before an empty string.
415 The sorting algorithm starts by breaking down the string into
416 non-digit (rule 2) and digit parts (rule 3).
418 In the above input file, only the last line in the input file starts
419 with a non-digit (@samp{~}). This is the first part. All other lines
420 in the input file start with a digit -- their first non-digit part is
423 Based on rule 2.B.b, tilde @samp{~} sorts before other bytes
424 and before the empty string -- hence it comes before all other strings,
425 and is listed first in the sorted output.
427 The remaining lines (@samp{1}, @samp{1%}, @samp{1.2}, @samp{1~})
428 follow similar logic: The digit part is extracted (1 for all strings)
429 and compares equal. The following extracted parts for the remaining
430 input lines are: empty part, @samp{%}, @samp{.}, @samp{~}.
432 Tilde sorts before all others, hence the line @samp{1~} appears next.
434 The remaining lines (@samp{1}, @samp{1%}, @samp{1.2}) are sorted based
435 on previously explained rules.
437 @node Version sort ignores locale
438 @subsection Version sort ignores locale
440 In version sort, Unicode characters are compared byte-by-byte according
441 to their binary representation, ignoring their Unicode value or the
444 Most commonly, Unicode characters are encoded as UTF-8 bytes; for
445 example, GREEK SMALL LETTER ALPHA (U+03B1, @samp{α}) is encoded as the
446 UTF-8 sequence @samp{0xCE 0xB1}). The encoding is compared
447 byte-by-byte, e.g., first @samp{0xCE} (decimal value 206) then
448 @samp{0xB1} (decimal value 177).
451 $ touch aa az "a%" "aα"
459 Ignoring the first letter (@samp{a}) which is identical in all
460 strings, the compared values are:
462 @samp{a} and @samp{z} are letters, and sort before
463 all other non-digits.
465 Then, percent sign @samp{%} (ASCII value 37) is compared to the
466 first byte of the UTF-8 sequence of @samp{α}, which is 0xCE or 206). The
467 value 37 is smaller, hence @samp{a%} is listed before @samp{aα}.
469 @node Differences from Debian version sort
470 @section Differences from Debian version sort
472 GNU Coreutils version sort differs slightly from the
473 official Debian algorithm, in order to accommodate more general usage
474 and file name listing.
477 @node Hyphen-minus and colon
478 @subsection Hyphen-minus @samp{-} and colon @samp{:}
480 In Debian's version string syntax the version consists of three parts:
482 [epoch:]upstream_version[-debian_revision]
484 The @samp{epoch} and @samp{debian_revision} parts are optional.
486 Example of such version strings:
497 If the @samp{debian_revision part} is not present,
498 hyphens @samp{-} are not allowed.
499 If epoch is not present, colons @samp{:} are not allowed.
501 If these parts are present, hyphen and/or colons can appear only once
502 in valid Debian version strings.
504 In GNU Coreutils, such restrictions are not reasonable (a file name can
505 have many hyphens, a line of text can have many colons).
507 As a result, in GNU Coreutils hyphens and colons are treated exactly
508 like all other punctuation, i.e., they are sorted after
509 letters. @xref{Version sort punctuation}.
511 In Debian, these characters are treated differently than in Coreutils:
512 a version string with hyphen will sort before similar strings without
522 $ if dpkg --compare-versions 1abb lt 1ab-cd
524 > else echo out of order
529 For further details, see @ref{Comparing two strings using Debian's
530 algorithm} and @uref{https://bugs.gnu.org/35939,GNU Bug 35939}.
532 @node Special priority in GNU Coreutils version sort
533 @subsection Special priority in GNU Coreutils version sort
535 In GNU Coreutils version sort, the following items have
536 special priority and sort before all other strings (listed in order):
539 @item The empty string
541 @item The string @samp{.} (a single dot, ASCII 46)
543 @item The string @samp{..} (two dots)
545 @item Strings starting with dot (@samp{.}) sort before
546 strings starting with any other byte.
552 $ printf '%s\n' a "" b "." c ".." ".d20" ".d3" | sort -V
562 These priorities make perfect sense for @samp{ls -v}: The special
563 files dot @samp{.} and dot-dot @samp{..} will be listed
564 first, followed by any hidden files (files starting with a dot),
565 followed by non-hidden files.
567 For @samp{sort -V} these priorities might seem arbitrary. However,
568 because the sorting code is shared between the @command{ls} and @command{sort}
569 program, the ordering rules are the same.
571 @node Special handling of file extensions
572 @subsection Special handling of file extensions
574 GNU Coreutils version sort implements specialized handling
575 of strings that look like file names with extensions.
576 This enables slightly more natural ordering of file
579 The following additional rules apply when comparing two strings where
580 both begin with non-@samp{.}. They also apply when comparing two
581 strings where both begin with @samp{.} but neither is @samp{.} or @samp{..}.
585 A suffix (i.e., a file extension) is defined as: a dot, followed by an
586 ASCII letter or tilde, followed by zero or more ASCII letters, digits,
587 or tildes; all repeated zero or more times, and ending at string end.
588 This is equivalent to matching the extended regular expression
589 @code{(\.[A-Za-z~][A-Za-z0-9~]*)*$} in the C locale.
590 The longest such match is used, except that a suffix is not
591 allowed to match an entire nonempty string.
594 The suffixes are temporarily removed, and the strings are compared
595 without them, using version sort (see @ref{Version-sort ordering
596 rules}) without special priority (see @ref{Special priority in GNU
597 Coreutils version sort}).
600 If the suffix-less strings do not compare equal, this comparison
601 result is used and the suffixes are effectively ignored.
604 If the suffix-less strings compare equal, the suffixes are restored
605 and the entire strings are compared using version sort.
612 @samp{hello-8.txt}: the suffix is @samp{.txt}
615 @samp{hello-8.2.txt}: the suffix is @samp{.txt}
616 (@samp{.2} is not included because the dot is not followed by a letter)
619 @samp{hello-8.0.12.tar.gz}: the suffix is @samp{.tar.gz} (@samp{.0.12}
623 @samp{hello-8.2}: no suffix (suffix is an empty string)
626 @samp{hello.foobar65}: the suffix is @samp{.foobar65}
629 @samp{gcc-c++-10.8.12-0.7rc2.fc9.tar.bz2}: the suffix is
630 @samp{.fc9.tar.bz2} (@samp{.7rc2} is not included as it begins with a digit)
633 @samp{.autom4te.cfg}: the suffix is the entire string.
640 Comparing @samp{hello-8.txt} to @samp{hello-8.2.12.txt}, the
641 @samp{.txt} suffix is temporarily removed from both strings.
644 Comparing @samp{foo-10.3.tar.gz} to @samp{foo-10.tar.xz}, the suffixes
645 @samp{.tar.gz} and @samp{.tar.xz} are temporarily removed from the
653 Comparing @samp{hello.foobar65} to @samp{hello.foobar4}, the suffixes
654 (@samp{.foobar65} and @samp{.foobar4}) are temporarily removed. The
655 remaining strings are identical (@samp{hello}). The suffixes are then
656 restored, and the entire strings are compared (@samp{hello.foobar4} comes
664 When comparing the strings @samp{hello-8.2.txt} and @samp{hello-8.10.txt}, the
665 suffixes (@samp{.txt}) are temporarily removed. The remaining strings
666 (@samp{hello-8.2} and @samp{hello-8.10}) are compared as previously described
667 (@samp{hello-8.2} comes first).
668 @slanted{(In this case the suffix removal algorithm
669 does not have a noticeable effect on the resulting order.)}
672 @b{How does the suffix-removal algorithm effect ordering results?}
674 Consider the comparison of hello-8.txt and hello-8.2.txt.
676 Without the suffix-removal algorithm, the strings will be broken down
677 to the following parts:
680 hello- @r{vs} hello- @r{(rule 2, all non-digits)}
681 8 @r{vs} 8 @r{(rule 3, all digits)}
682 .txt @r{vs} . @r{(rule 2)}
687 The comparison of the third parts (@samp{.} vs
688 @samp{.txt}) will determine that the shorter string comes first --
689 resulting in @file{hello-8.2.txt} appearing first.
691 Indeed this is the order in which Debian's @command{dpkg} compares the strings.
693 A more natural result is that @file{hello-8.txt} should come before
694 @file{hello-8.2.txt}, and this is where the suffix-removal comes into play:
696 The suffixes (@samp{.txt}) are removed, and the remaining strings are
697 broken down into the following parts:
700 hello- @r{vs} hello- @r{(rule 2, all non-digits)}
701 8 @r{vs} 8 @r{(rule 3, all digits)}
702 empty @r{vs} . @r{(rule 2)}
706 As empty strings sort before non-empty strings, the result is @samp{hello-8}
709 A real-world example would be listing files such as:
710 @file{gcc_10.fc9.tar.gz}
711 and @file{gcc_10.8.12.7rc2.fc9.tar.bz2}: Debian's algorithm would list
712 @file{gcc_10.8.12.7rc2.fc9.tar.bz2} first, while @samp{ls -v} will list
713 @file{gcc_10.fc9.tar.gz} first.
715 These priorities make sense for @samp{ls -v}:
716 Versioned files will be listed in a more natural order.
718 For @samp{sort -V} these priorities might seem arbitrary. However,
719 because the sorting code is shared between the @command{ls} and @command{sort}
720 program, the ordering rules are the same.
723 @node Comparing two strings using Debian's algorithm
724 @subsection Comparing two strings using Debian's algorithm
726 The Debian program @command{dpkg} (available on all Debian and Ubuntu
727 installations) can compare two strings using the @option{--compare-versions}
730 To use it, create a helper shell function (simply copy & paste the
731 following snippet to your shell command-prompt):
735 if dpkg --compare-versions "$1" lt "$2"
736 then printf '%s\n' "$1" "$2"
737 else printf '%s\n' "$2" "$1"
742 Then compare two strings by calling @command{compver}:
750 Note that @command{dpkg} will warn if the strings have invalid syntax:
753 $ compver "foo07.7z" "foo7a.7z"
754 dpkg: warning: version 'foo07.7z' has bad syntax:
755 version number does not start with digit
756 dpkg: warning: version 'foo7a.7z' has bad syntax:
757 version number does not start with digit
760 $ compver "3.0/" "3.0.5"
761 dpkg: warning: version '3.0/' has bad syntax:
762 invalid character in version number
767 To illustrate the different handling of hyphens between Debian and
768 Coreutils algorithms (see
769 @ref{Hyphen-minus and colon}):
772 $ compver abb ab-cd 2>/dev/null $ printf 'abb\nab-cd\n' | sort -V
777 To illustrate the different handling of file extension: (see @ref{Special
778 handling of file extensions}):
781 $ compver hello-8.txt hello-8.2.txt 2>/dev/null
784 $ printf '%s\n' hello-8.txt hello-8.2.txt | sort -V
790 @node Advanced version sort topics
791 @section Advanced Topics
794 @node Reporting version sort bugs
795 @subsection Reporting version sort bugs
797 If you suspect a bug in GNU Coreutils version sort (i.e., in the
798 output of @samp{ls -v} or @samp{sort -V}), please first check the following:
802 Is the result consistent with Debian's own ordering (using @command{dpkg}, see
803 @ref{Comparing two strings using Debian's algorithm})? If it is, then this
804 is not a bug -- please do not report it.
807 If the result differs from Debian's, is it explained by one of the
808 sections in @ref{Differences from Debian version sort}? If it is,
809 then this is not a bug -- please do not report it.
812 If you have a question about specific ordering which is not explained
813 here, please write to @email{coreutils@@gnu.org}, and provide a
814 concise example that will help us diagnose the issue.
817 If you still suspect a bug which is not explained by the above, please
818 write to @email{bug-coreutils@@gnu.org} with a concrete example of the
819 suspected incorrect output, with details on why you think it is
824 @node Other version/natural sort implementations
825 @subsection Other version/natural sort implementations
827 As previously mentioned, there are multiple variations on
828 version/natural sort, each with its own rules. Some examples are:
833 Natural Sorting variants in
834 @uref{https://rosettacode.org/wiki/Natural_sorting,Rosetta Code}.
837 Python's @uref{https://pypi.org/project/natsort/,natsort package}
838 (includes detailed description of their sorting rules:
839 @uref{https://natsort.readthedocs.io/en/master/howitworks.html,
840 natsort -- how it works}).
843 Ruby's @uref{https://github.com/github/version_sorter,version_sorter}.
846 Perl has multiple packages for natural and version sorts
847 (each likely with its own rules and nuances):
848 @uref{https://metacpan.org/pod/Sort::Naturally,Sort::Naturally},
849 @uref{https://metacpan.org/pod/Sort::Versions,Sort::Versions},
850 @uref{https://metacpan.org/pod/CPAN::Version,CPAN::Version}.
853 PHP has a built-in function
854 @uref{https://www.php.net/manual/en/function.natsort.php,natsort}.
857 NodeJS's @uref{https://www.npmjs.com/package/natural-sort,natural-sort package}.
861 @uref{http://zsh.sourceforge.net/Doc/Release/Expansion.html#Glob-Qualifiers,
862 glob modifier} @samp{*(n)} will expand to files in natural sort order.
865 When writing C programs, the GNU libc library (@samp{glibc})
867 @uref{https://man7.org/linux/man-pages/man3/strverscmp.3.html,
868 strverscmp(3)} function to compare two strings, and
869 @uref{https://man7.org/linux/man-pages/man3/versionsort.3.html,versionsort(3)}
870 function to compare two directory entries (despite the names, they are
871 not identical to GNU Coreutils version sort ordering).
874 Using Debian's sorting algorithm in:
878 python: @uref{https://stackoverflow.com/a/4957741,
879 Stack Overflow Example #4957741}.
882 NodeJS: @uref{https://www.npmjs.com/package/deb-version-compare,
883 deb-version-compare}.
889 @node Related source code
890 @subsection Related source code
895 Debian's code which splits a version string into
896 @code{epoch/upstream_version/debian_revision} parts:
897 @uref{https://git.dpkg.org/cgit/dpkg/dpkg.git/tree/lib/dpkg/parsehelp.c#n191,
898 parsehelp.c:parseversion()}.
901 Debian's code which performs the @code{upstream_version} comparison:
902 @uref{https://git.dpkg.org/cgit/dpkg/dpkg.git/tree/lib/dpkg/version.c#n140,
906 Gnulib code (used by GNU Coreutils) which performs the version comparison:
907 @uref{https://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/filevercmp.c,