[Namazu-users-en] Re: Malformed UTF-8 character

Earl Hood earl at earlhood.com
Sat Jun 18 02:04:21 JST 2005


On June 17, 2005 at 14:42, NOKUBI Takatsugu wrote:

> > >    Japanese use.  Now it can handle English and other latin
> > >    languages encoded in ISO-8859-* character sets.
> > >    ---------------------^^^^^^^^^^

> > Oh..... README is wrong. 
> 
> I think, in the past time, it was true because former perl was simple
> so we didn't care about encodings.
> Nowaday, perl become very complex and consider about many encodings
> and locales.

Perl became locale aware.  Setting the LC_ALL environment variable (or
a similar method) is generally sufficient.  The problem I encountered
caused Perl to tag a scalar string as utf-8.

Later versions of Perl are Unicode-aware, and Perl uses utf-8
internally for such strings.  Internally, Perl has a flag on a scalar
indicating if it is utf-8 or just a string of bytes.
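
As a small illustration of that flag (not taken from mknmz), the core
utf8::is_utf8() function reports whether a scalar is marked as utf-8
internally.  The variable names are only for this sketch:

```perl
use strict;
use warnings;

# A string made only of code points <= 255 is stored as plain bytes.
my $bytes = "caf\xE9";             # "cafe" with an ISO-8859-1 e-acute

# A string holding a code point > 255 has to be stored as utf-8.
my $wide  = "snowman \x{2603}";

printf "bytes flagged: %s\n", utf8::is_utf8($bytes) ? "yes" : "no";
printf "wide flagged:  %s\n", utf8::is_utf8($wide)  ? "yes" : "no";
```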

Wrt mknmz, with a locale of "C" or "en_US", by default, the strings are
_not_ utf-8.  The mknmz code even invokes binmode() on filehandles to
prevent Perl from applying any character encoding semantics (Perl 5.8.x
supports character encoding/decoding on file handles, similar to Java).
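
A minimal sketch of what binmode() buys you, assuming nothing about the
actual mknmz code: the in-memory filehandle and the variable names are
only stand-ins so the example needs no real file.

```perl
use strict;
use warnings;

# Hypothetical in-memory "file" so the sketch needs no real path.
my $buf = '';
open(my $fh, '>', \$buf) or die "open failed: $!";
binmode($fh);                      # raw bytes, no encoding layer

print {$fh} "na\xEFve\n";          # byte 0xEF is written unchanged
close($fh);

printf "%d bytes written\n", length($buf);
```

With the handle in binary mode, an 8-bit string goes out byte for byte,
whatever character set those bytes happen to be in.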

The problem trigger is in decode_numbered_entity() in html.pl and
the statement:

  sprintf("%c",$num);

If $num is > 255, Perl ends up creating a utf-8 sequence (because
of the "%c" format), and the string that receives the decoded entity
gets its utf-8 flag set (regardless of the current locale setting).
Subsequently, any character-based operations (like regexes or file
writes) cause Perl to generate warnings.  It also causes misbehavior
and probably corruption in Namazu.
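
The effect can be reproduced outside of Namazu.  This is only a sketch
with made-up values, not the html.pl code; it shows that the "%c"
result is flagged once the code point no longer fits in a byte, and
that appending it upgrades the string it is appended to:

```perl
use strict;
use warnings;

my $latin1 = sprintf("%c", 233);     # e-acute, fits in one byte
my $wide   = sprintf("%c", 8364);    # euro sign, needs utf-8

# Appending the wide character upgrades the whole string.
my $text   = "price: \xA3 or ";      # byte string with a Latin-1 pound sign
my $before = utf8::is_utf8($text) ? 1 : 0;
$text .= $wide;
my $after  = utf8::is_utf8($text) ? 1 : 0;

print "latin1 flagged: ", (utf8::is_utf8($latin1) ? 1 : 0), "\n";
print "wide flagged:   ", (utf8::is_utf8($wide)   ? 1 : 0), "\n";
print "text flagged before/after append: $before/$after\n";
```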

Therefore, my initial fix was to drop any $num >= 255.  This would
preserve the 8-bit agnostic behavior of namazu.
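
Roughly, the idea looks like the sketch below.  The function name and
the sample input are hypothetical, and this is not the actual patch.
The cutoff used here (keep 0..255, drop anything larger) is an
assumption based on what still fits in a single byte.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the entity decoder; not the real html.pl code.
sub decode_numbered_entity_sketch {
    my ($num) = @_;
    return '' if $num > 255;         # would force a utf-8 string, so drop it
    return sprintf("%c", $num);      # single byte, string stays 8-bit
}

my $html = "caf&#233; costs 3&#8364;";
(my $decoded = $html) =~ s/&#(\d+);/decode_numbered_entity_sketch($1)/ge;

print "flagged: ", (utf8::is_utf8($decoded) ? "yes" : "no"), "\n";
```

The obvious cost is that entities outside the 8-bit range are silently
lost from the indexed text.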

Mr. Teranishi claims that encodings like ISO-8859-* were never
supported; however, it is clear that there are users using namazu
for such encodings with no reported problems.  If Mr. Teranishi can
provide specifics on which tests are supposedly failing in namazu
for such encodings, others may be able to figure out the proper fixes.

Since someone did add to the Namazu README that ISO-8859-* is supported,
apparently there was belief among at least one namazu developer
that the statement was justified.

--ewh

