[Namazu-users-en] Re: Malformed UTF-8 character

NOKUBI Takatsugu knok at daionet.gr.jp
Mon Jun 20 12:43:18 JST 2005

Thank you for describing the issue.

At Fri, 17 Jun 2005 12:04:21 -0500,
Earl Hood wrote:
> Wrt mknmz, with a locale of "C" or "en_US", by default, the strings are
> _not_ utf-8.  Even the mknmz code invokes binmode() on filehandles to
> prevent Perl from applying any character encoding semantics (Perl 5.8.x
> supports character encoding/decoding on file handles similar to Java).

binmode was originally used for Win32; I did not know it had such a side effect.

> The problem trigger is in decode_numbered_entity() in html.pl and
> the statement:
>   sprintf("%c",$num);
> If $num is > 255, Perl ends up creating a utf-8 sequence (because
> of the "%c" format), causing the string that holds the decoded entity
> to get its utf-8 flag set (regardless of the current locale setting).
> Subsequently, any character-based operations (like regexes or file
> writes) cause Perl to generate warnings.  It also causes mis-behavior
> and probably corruption in Namazu.
> Therefore, my initial fix was to drop any $num >= 255.  This would
> preserve the 8-bit agnostic behavior of namazu.
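
A minimal sketch of the behavior described above, for illustration only.
The function name decode_numbered_entity_sketch is hypothetical; it is not
the actual body of decode_numbered_entity() in html.pl, which is not shown
in this thread. The cutoff used here ($num > 255) is an assumption based on
where Perl's "%c" starts returning a utf-8-flagged string; the patch under
discussion may use a different cutoff.

```perl
use strict;
use warnings;

# Hypothetical stand-in for decode_numbered_entity() in html.pl.
sub decode_numbered_entity_sketch {
    my ($num) = @_;
    # Assumption: code points above 255 are dropped, because
    # sprintf("%c") would otherwise return a utf-8-flagged string.
    return '' if $num > 255;
    return sprintf("%c", $num);
}

my $latin   = decode_numbered_entity_sketch(233);     # stays a byte string
my $wide    = sprintf("%c", 0x3042);                  # unguarded: gets flagged
my $dropped = decode_numbered_entity_sketch(0x3042);  # guarded: dropped

# utf8::is_utf8() reports whether Perl's internal utf-8 flag is set.
printf "233 flagged:       %s\n", utf8::is_utf8($latin) ? "yes" : "no";
printf "0x3042 flagged:    %s\n", utf8::is_utf8($wide)  ? "yes" : "no";
printf "0x3042 after drop: '%s'\n", $dropped;
```

Once a flagged string such as $wide is concatenated with byte strings, the
result is flagged as well, which is how a single entity can affect the rest
of the text being indexed.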

Hmm, it seems sufficient to me. I want to apply it to the stable
branch and HEAD.

Do you have any objection to it, Teranishi-san?
NOKUBI Takatsugu
E-mail: knok at daionet.gr.jp
	knok at namazu.org / knok at debian.org

