gay robot noises

cut -c considered harmful

GNU coreutils still don't support UTF-8.

Stepping on the rake

Like many window manager nerds, I have a status bar (in my case, waybar) that shows the artist and title of the song I'm currently listening to. cmus has a command-line tool that (among other things) will format track information for you, so I thought it was as easy as cmus-remote -C "format_print '%a - %t'". And that does, in fact, work... as long as I don't listen to any Godspeed You! Black Emperor, because

Godspeed You! Black Emperor - A Military Alphabet (five eyes all blind) (4521.0kHz 6730.0kHz 4109.09kHz) / Job’s Lament / First of the Last Glaciers / where we break how we shine (ROCKETS FOR MARY)

is a bit long to put on my status bar; it'd take up almost all the available space by itself. And unfortunately, even though format_print takes an argument that looks like a standard printf-style format string, there's no 'truncate to this length' modifier for strings like the precision in printf proper (as in %.60s).

Fortunately, Unix tools to the rescue! cut is generally used for splitting lines on delimiters, but the -c argument will select a range of characters. So you can just do

> cmus-remote -C "format_print '%a - %t'" | cut -c-60
Godspeed You! Black Emperor - A Military Alphabet (five eyes

and while the unbalanced parenthesis is kind of visually offensive, and an ellipsis might be nice, it's good enough to use.

Until I noticed that while I was listening to the FFXIV soundtrack, sometimes the status bar wouldn't update when I changed songs. Looking at the waybar output showed me this error:

(waybar:2854654): Gtk-WARNING **: 11:04:07.422: Failed to set text 'Masayoshi Soken - Voidal Manifest / ヴォイドの棺 ~\xe9\xad' from markup due to error parsing markup: Error on line 1 char 69: Invalid UTF-8 encoded text in name — not valid “Masayoshi Soken - Voidal Manifest / ヴォイドの棺 ~\xe9\xad”

At first I thought this was a weird encoding error, since if I remember correctly the ID3 information was originally in Shift-JIS and I had to do some hackery to get it into UTF-8. But it wouldn't make sense for only some songs to have this problem. And then I realized that's not actually 60 characters:

> python3 -c 'print(len("Masayoshi Soken - Voidal Manifest / ヴォイドの棺 ~"))'
44

The conclusion is that -c/--characters, despite saying it operates on characters, and despite the presence of a -b/--bytes option, actually operates on bytes. And sure enough, the documentation on the GNU website says:

Select for printing only the characters in positions listed in character-list. The same as -b for now, but internationalization will change that.

This isn't present in the manpage (at least, not on my install), but it is in the info page. And since I never check the info page (does anyone?), I had no clue.
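To make the bytes-versus-characters difference concrete, here is a minimal Python sketch (not the actual cut implementation) that truncates the same Japanese text both ways. The variable names are just for illustration.

```python
# Contrast byte-based and character-based truncation of a UTF-8 string.
title = "ヴォイドの棺"           # 6 characters, 3 bytes each in UTF-8
raw = title.encode("utf-8")

print(len(title), len(raw))      # character count vs. byte count

# Keeping the first 4 bytes keeps all of the first character plus one
# stray byte of the second one, which is what a byte-based cut does.
by_bytes = raw[:4]
print(by_bytes)

try:
    by_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8:", err.reason)

# Slicing the decoded string counts characters instead of bytes.
by_chars = title[:4]
print(by_chars)
```

The byte slice ends in the middle of a character, which is the same kind of dangling partial sequence that shows up at the end of the waybar warning.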

Hilariously, there's a -n option to not split multi-byte characters... but it's a no-op. As far as I can tell it's there for compatibility with BSD's cut, which does support it. BSD's cut also appears to implement -c correctly: its documentation has no mention of internationalization bugs, and explicitly says the behavior depends on locale variables.
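For reference, "cut by bytes but never split a character" is not hard to express. The following is a rough Python sketch of that idea for the prefix-only case, assuming valid UTF-8 input; truncate_bytes is a hypothetical helper name, not anything from coreutils or BSD.

```python
def truncate_bytes(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    raw = text.encode("utf-8")[:max_bytes]
    # The cut can leave at most one incomplete sequence at the very end;
    # errors="ignore" drops it instead of raising UnicodeDecodeError.
    return raw.decode("utf-8", errors="ignore")


print(truncate_bytes("ヴォイドの棺", 4))
```

With a budget of 4 bytes only the first 3-byte character fits, so the partial second character is dropped rather than emitted as garbage.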

Disappointment

I figured someone had already reported this to GNU, so I looked. I found this bug report from 2010. In it, someone mentions that several distros have patches to fix this, but that they duplicate code and/or slow things down in the single-byte locale case. Someone else adds that

We're working on it. Something is coming soon.

There hasn't been any activity since then aside from changing the severity and title. This collection of bugs across GNU coreutils shows this has been going on for a while. I'm not sure if any of those bugs have been fixed, or if anybody's written partial support into coreutils. But, honestly, it's 2021. GNU coreutils is supposed to be the solid foundation of Linux; I see people talking all the time about how powerful these tools are. And they break as soon as you give them anything that isn't a single-byte encoding... which, in the era where UTF-8 has become the dominant encoding and emojis are common, is going to be a lot of text.

You can hack around this by using iconv to convert your input to UTF-32, which uses exactly four bytes for every code point, cutting at a multiple of four bytes, and then converting back afterwards. And GNU awk will just do the right thing automatically if your locale is set up right:

> echo "ヴォイドの棺 ~" | awk '{ print substr($0, 1, 4) }'
ヴォイド
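The UTF-32 trick works because a fixed-width encoding turns "N characters" into "4 × N bytes". Here is a small Python sketch of that idea, standing in for the iconv round trip rather than reproducing it; truncate_via_utf32 is a made-up name.

```python
def truncate_via_utf32(text: str, max_chars: int) -> str:
    """Truncate by slicing bytes in a fixed-width encoding."""
    # Python's "utf-32-le" codec writes no byte order mark, so the payload
    # is exactly 4 bytes per code point.
    fixed = text.encode("utf-32-le")
    # Any cut at a multiple of 4 bytes lands on a code point boundary.
    return fixed[: 4 * max_chars].decode("utf-32-le")


print(truncate_via_utf32("ヴォイドの棺 ~", 4))
```

Note that Python's plain "utf-32" codec prepends a byte order mark when encoding, which would shift every offset by four bytes; that is the kind of detail that makes this approach fiddly in a pipeline.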

But it's still extremely silly that this is a thing you have to do. It's not 2000 any more; the world is Unicode and UTF-8 now, and you can't get away with assuming single-byte encodings in general-purpose tools.

So if you're doing text processing, unless you're sure you'll never run into multibyte characters, don't use GNU cut. Use awk or something else instead.
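As one example of "something else", here is a minimal Python sketch of a character-aware truncation that also adds the ellipsis I wanted earlier. The shorten function and its defaults are my own illustration, not part of cmus or waybar, and it counts code points, so combining marks and multi-code-point emoji can still be split.

```python
def shorten(line: str, width: int = 60, ellipsis: str = "…") -> str:
    """Truncate line to at most width characters, marking any cut with an ellipsis."""
    if len(line) <= width:
        return line
    if width <= len(ellipsis):
        # No room for the marker; fall back to a plain cut.
        return line[:width]
    return line[: width - len(ellipsis)] + ellipsis


sample = "Godspeed You! Black Emperor - A Military Alphabet (five eyes all blind)"
print(shorten(sample))
print(shorten("ヴォイドの棺", 4))
```

To use it in a status bar pipeline, the same function could be applied to each line read from sys.stdin, with the streams decoded as UTF-8.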
