How can I use unix tools with Cyrillic text?

I recently started processing Cyrillic text, and it’s been really difficult.

I couldn’t get my Python scripts to work with it at all. And I tried.

PHP worked well, but I don’t know PHP. I just managed to hack a few things together, and I still don’t feel comfortable in it. (It may become a bit of a mainstay, though, as it’s proven unexpectedly useful.)

Of course, grep is out of the question.

Or is it?

That’s what this question is about.

I wanted to do this:

alec@ROOROO:~/$ grep '\w\{4\}' cyrillicstuff

…and came up empty handed.

But is there a way I could have returned all words of 4 characters or more, given that they’re all in Cyrillic, using good ol’ grep?

Answer

I believe you need to use the locale-aware POSIX character classes instead. The class closest to word characters is [:alnum:], and it has to be written inside a bracket expression, so the command would be

grep '[[:alnum:]]\{4\}' cyrillicstuff

Also make sure your locale uses the encoding the file is actually in. You can check with the locale command and look at the value it reports for the LC_CTYPE category.

This syntax is supported by all tools that use POSIX basic or extended regular expressions, such as sed and awk, and also by Perl and the “Perl-compatible regular expressions” (PCRE) used by PHP. Python’s built-in re module is not PCRE-based and supports neither the [:alnum:]-style classes nor \p; in Python 3, though, \w already matches Unicode word characters in text patterns. Perl and PCRE have an additional syntax, \pY and \p{YYY}, where Y or YYY is a Unicode property name, so \pL is roughly the same as [:alpha:] and \p{Uppercase} should be the same as [:upper:]. All Unicode categories should be usable.
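For comparison, here is a minimal Python 3 sketch of the same "4 or more characters" search. It relies on \w being Unicode-aware for str patterns in Python 3's re module rather than on the [:alnum:] syntax. The sample sentence and the helper names long_words and long_letter_runs are made up for illustration.

```python
import re

# Hypothetical sample text; any Cyrillic input would do.
sample = "Я иду домой через старый парк и лес"

# In Python 3, \w in a str pattern matches Unicode word characters
# (letters, digits and underscore), so Cyrillic letters are included.
WORD_4_PLUS = re.compile(r"\w{4,}")

# [^\W\d_] is "a word character that is neither a digit nor an underscore",
# which is a common way to say "letter" with the built-in re module.
LETTERS_4_PLUS = re.compile(r"[^\W\d_]{4,}")


def long_words(text):
    """Return all runs of four or more word characters in text."""
    return WORD_4_PLUS.findall(text)


def long_letter_runs(text):
    """Return all runs of four or more letters in text."""
    return LETTERS_4_PLUS.findall(text)


result = long_words(sample)  # ['домой', 'через', 'старый', 'парк']
```

Unlike grep, findall returns the matching words themselves rather than the lines that contain them, which is closer to what grep -o would give.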


Regarding Python: Python is perfectly Unicode-aware too. In Python 3 it should work out of the box, since opening files in the locale encoding seems to be the default there (but I just looked it up, not tested). In Python 2, however, you have to specify the encodings manually. They should already be set for stdin, stdout and stderr, but for all other files you have to use the codecs.open function and pass the encoding you get from locale.getpreferredencoding(). You also have to initialize the locale, as in C, with locale.setlocale(locale.LC_ALL, '').
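The steps above could be sketched roughly as follows for Python 3, where the built-in open accepts an encoding argument so codecs.open is not needed. The function name words_in_file and its min_len and encoding parameters are illustrative assumptions, not part of any library.

```python
import locale
import re


def words_in_file(path, min_len=4, encoding=None):
    """Collect runs of at least min_len word characters from a text file.

    If no encoding is given, fall back to the one implied by the locale.
    """
    if encoding is None:
        try:
            # Initialise the locale from the environment, as a C program would.
            locale.setlocale(locale.LC_ALL, "")
        except locale.Error:
            # The environment names a locale this system does not provide;
            # carry on with whatever default is in effect.
            pass
        encoding = locale.getpreferredencoding(False)

    pattern = re.compile(r"\w{%d,}" % min_len)
    found = []
    with open(path, encoding=encoding) as handle:
        for line in handle:
            found.extend(pattern.findall(line))
    return found
```

Calling words_in_file("cyrillicstuff") would then use the locale's encoding, while words_in_file("cyrillicstuff", encoding="utf-8") overrides it when the file is known to be in a different encoding than the locale suggests.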

Attribution
Source : Link , Question Author : ixtmixilix , Answer Author : Jan Hudec
