Tech Trivia: Linux: Unusually slow performance of grep/sort/other text processing commands

The grep manual does provide a subtle hint, but it’s easily missed by the English-speaking community (since they seldom need multibyte strings, or MBS).

The multibyte setting can have a disastrous impact on the performance of text processing utilities that rely on the operating system’s built-in regular expression support. In plain English (and to skip the gory details behind it), performance depends a lot on the LC_* environment variables (in regex terms, read that as “any environment variable whose name starts with LC_”).

If you are sure that you don’t need multibyte processing in the processes you are running, just set LC_ALL=C (or LC_ALL=POSIX) and then run the grep/sort/text processing command that you want. This should do the trick.
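
As a rough sketch of the tip above, the override can be applied to a single command by prefixing it with the variable assignment, which leaves the rest of the shell session on its usual locale. The sample file and its contents here are made up purely for illustration.

```shell
# Hypothetical sample file, created only for this illustration.
tmpfile=$(mktemp)
printf 'alpha\nbeta error\ngamma\nerror delta\n' > "$tmpfile"

# Prefixing the command with LC_ALL=C overrides the locale for that
# one command only; the shell's own environment is not changed.
count=$(LC_ALL=C grep -c 'error' "$tmpfile")   # counts matching lines
sorted=$(LC_ALL=C sort "$tmpfile")             # sorts in plain byte order

echo "$count"
echo "$sorted"

rm -f "$tmpfile"
```

To apply the setting to everything that follows in a script or session, `export LC_ALL=C` works too. Keep in mind that the C locale also changes how sort orders its output (plain byte order, so uppercase letters come before lowercase ones), so this is only a safe swap when that ordering is acceptable.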

And if you do need multibyte processing, well… life isn’t half as rosy, or so it seems: the bug filed to fix this in grep (at least) is still open!

For those who want to dig deeper, here’s a thread that can be of help. There’s also a bug filed against GNU grep on this.


