tinyapps.org / blog

Detect the character encoding of a file

The aforementioned Perl module Unicode::Japanese includes ujguess, which attempts to detect the character encoding of a given file. The Unix program file is often suggested on forums for this purpose, but it primarily identifies the file type, and its guess at the encoding is unreliable for Japanese text. Here's an illustration of the difference, using a Shift JIS-encoded file:
$ file foo
foo: UTF-8 Unicode text, with no line terminators

$ ujguess foo

And with an EUC-encoded one:
$ file bar
bar: ISO-8859 text, with CRLF line terminators

$ ujguess bar
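
For readers without Perl handy, the general idea of guessing an encoding can be sketched in Python by trial decoding: try each candidate codec in turn and report the first one that decodes the bytes without error. This is only a rough illustration, not how ujguess itself works. The function names, the candidate list, and its ordering are assumptions made for this example, and short or unusual inputs can be valid in more than one encoding, so the result is a guess.

```python
# Candidate codecs, tried in order (an assumption for this sketch).
# Stricter encodings come first so they are not shadowed by looser ones.
CANDIDATES = ("ascii", "utf-8", "euc_jp", "shift_jis")


def guess_encoding(data, candidates=CANDIDATES):
    """Return the first candidate codec that decodes data cleanly, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return name
    return None


def guess_file_encoding(path, candidates=CANDIDATES):
    """Read a file as raw bytes and guess its encoding."""
    with open(path, "rb") as handle:
        return guess_encoding(handle.read(), candidates)
```

Calling guess_file_encoding("foo") on a file like the ones above returns a Python codec name such as "shift_jis" or "euc_jp", or None if no candidate fits.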

/nix | Jan 03, 2010
