Update: Okay, duh: I shouldn’t have called it “Bash”. What I meant was, “whatever the sort utility is in my default terminal.” Which, as Bryan points out in a comment below, has nothing to do with Bash: it’s GNU Sort. More updates below.
I have a little text file with “Hello World” in lots of languages, which I often use for testing. I extracted a few lines in various scripts (writing systems, that is) and saved them as helloworld.txt.
$ cat helloworld.txt
สวัสดีราคาถูก! Thai
Habari dunia! Kiswahili
Halló heimur! Icelandic
Saluton Mondo! Esperanto
Sveika, pasaule! Latvian
Привет, мир! Russian
ሠላም ዓለም! Amharic
안녕, 세상! Korean
Chào thế giới! Vietnamese
Hallo, wrâld Frisian
Hallo verden! Norwegian/Bokmal
Laba ryta, pasauli! Lithuanian
For my first amazing trick, I sort the file with the Bash shell built-in:
$ sort helloworld.txt
ሠላም ዓለም! Amharic
Chào thế giới! Vietnamese
Habari dunia! Kiswahili
Halló heimur! Icelandic
Hallo verden! Norwegian/Bokmal
Hallo, wrâld Frisian
안녕, 세상! Korean
Laba ryta, pasauli! Lithuanian
Saluton Mondo! Esperanto
Sveika, pasaule! Latvian
สวัสดีราคาถูก! Thai
Привет, мир! Russian
…which sucks. Because obviously Bash is ignoring anything fancy (Amharic, Korean, Thai) and sorting strictly by whatever ASCII shows up in the line. (Hard to say whether the «ó» in Icelandic is being considered, but shouldn’t it come after «o» anyway?)
I also installed and tried another terminal called rxvt-unicode, which supposedly has better Unicode support. I got the same results as in Bash under gnome-terminal, which suggests to me that the problem is Bash, or somewhere deeper, and not the terminal itself.
Then I tried the same sort in Python:
$ python
>>> lines = open('helloworld.txt').read().decode('utf-8').splitlines()
>>> for line in sorted(lines): print line
...
Chào thế giới! Vietnamese
Habari dunia! Kiswahili
Hallo verden! Norwegian/Bokmal
Hallo, wrâld Frisian
Halló heimur! Icelandic
Laba ryta, pasauli! Lithuanian
Saluton Mondo! Esperanto
Sveika, pasaule! Latvian
Привет, мир! Russian
สวัสดีราคาถูก! Thai
ሠላም ዓለም! Amharic
안녕, 세상! Korean
Python does better; clearly things are being sorted according to their Unicode code points. Which of course is a far cry from following UTS #10: Unicode Collation Algorithm, but that has to do with locales and all that.
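For anyone on Python 3, here is a rough sketch of the same experiment, plus a locale-aware variant. The function names and the inlined list are mine, not part of the session above. `locale.strxfrm` defers to the C library's collation tables, so the result depends on which locales are installed, and it is not a full UTS #10 implementation.

```python
import locale

# The same lines as helloworld.txt, inlined so the sketch is self-contained.
HELLO_LINES = [
    "สวัสดีราคาถูก! Thai",
    "Habari dunia! Kiswahili",
    "Halló heimur! Icelandic",
    "Saluton Mondo! Esperanto",
    "Sveika, pasaule! Latvian",
    "Привет, мир! Russian",
    "ሠላም ዓለም! Amharic",
    "안녕, 세상! Korean",
    "Chào thế giới! Vietnamese",
    "Hallo, wrâld Frisian",
    "Hallo verden! Norwegian/Bokmal",
    "Laba ryta, pasauli! Lithuanian",
]


def sort_by_code_point(lines):
    """Plain sorted(): Python 3 compares str values code point by code point."""
    return sorted(lines)


def sort_by_locale(lines, locale_name=""):
    """Sort with the C library's collation rules for locale_name.

    An empty name means "whatever the environment says". If that locale
    is not available, fall back to code point order (an assumption of
    this sketch, not something the original post does).
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return sorted(lines)
    try:
        return sorted(lines, key=locale.strxfrm)
    finally:
        # Restore the collation locale so the rest of the program is unaffected.
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    for line in sort_by_code_point(HELLO_LINES):
        print(line)
```

Swapping `sort_by_code_point` for `sort_by_locale` in the last loop is a quick way to see how much the order depends on the active locale.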
In any case, I won’t be trusting Bash to sort Unicode files any more.
(I’d be interested to know what the default sort does to the initial input in various other programming languages, comments welcome.)
Update:
After Bryan’s comment pointed out that I wasn’t even dealing with Bash but with GNU sort, I read through the manual and discovered the following trick in a footnote:
$ export LC_ALL=C; sort helloworld.txt
Chào thế giới! Vietnamese
Habari dunia! Kiswahili
Hallo verden! Norwegian/Bokmal
Hallo, wrâld Frisian
Halló heimur! Icelandic
Laba ryta, pasauli! Lithuanian
Saluton Mondo! Esperanto
Sveika, pasaule! Latvian
Привет, мир! Russian
สวัสดีราคาถูก! Thai
ሠላም ዓለም! Amharic
안녕, 세상! Korean
Which seems to be what I was looking for.
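Why this matches the Python output: with LC_ALL=C, sort compares lines as raw bytes, and UTF-8 is designed so that byte order and code point order agree. A small sketch of that equivalence, assuming the file is UTF-8 (the function name and sample list are made up for illustration):

```python
# A few of the lines from helloworld.txt, inlined for the demonstration.
SAMPLE = [
    "สวัสดีราคาถูก! Thai",
    "Halló heimur! Icelandic",
    "Привет, мир! Russian",
    "ሠላም ዓለም! Amharic",
    "안녕, 세상! Korean",
    "Hallo verden! Norwegian/Bokmal",
]


def sort_like_c_locale(lines):
    """Order lines by their UTF-8 bytes, roughly what a byte-wise sort does."""
    return sorted(lines, key=lambda s: s.encode("utf-8"))


if __name__ == "__main__":
    by_bytes = sort_like_c_locale(SAMPLE)
    by_code_point = sorted(SAMPLE)
    # For UTF-8 input the two orders are identical.
    print(by_bytes == by_code_point)
    for line in by_bytes:
        print(line)
```

One caveat on the shell side: `export` changes the locale for the rest of the session. Writing `LC_ALL=C sort helloworld.txt` instead sets it for that one command only.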