[Namazu-users-en] Re: Malformed UTF-8 character
NOKUBI Takatsugu
knok at daionet.gr.jp
Mon Jun 20 12:43:18 JST 2005
Thank you for describing the issue.
At Fri, 17 Jun 2005 12:04:21 -0500,
Earl Hood wrote:
> Wrt mknmz, with a locale of "C" or "en_US", by default, the strings are
> _not_ utf-8. Even the mknmz code invokes binmode() on filehandles to
> prevent Perl from applying any character encoding semantics (Perl 5.8.x
> supports character encoding/decoding on file handles similar to Java).
binmode() was originally added for Win32; I had not known about this side effect.
> The problem trigger is in decode_numbered_entity() in html.pl and
> the statement:
>
> sprintf("%c",$num);
>
> If $num is > 255, Perl ends up creating a utf-8 sequence (because
> of the "%c" format), causing the string having the entity decoded
> get its utf-8 flag set (regardless of the current locale setting).
> Subsequently, any character-based operations (like regexes or file
> writes) cause Perl to generate warnings. It also causes mis-behavior
> and probably corruption in Namazu.
>
> Therefore, my initial fix was to drop any $num >= 255. This would
> preserve the 8-bit agnostic behavior of namazu.
Hmm, that seems sufficient to me. I would like to apply it to the stable
branch and HEAD.
Do you have any objection to that, Teranishi-san?
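
For reference, the kind of fix discussed above might look roughly like the
sketch below. It is not the actual html.pl code: the function name
decode_numbered_entity_sketch and the sample string are hypothetical. The
cutoff is also an assumption. The quoted message drops any $num >= 255; the
sketch keeps 255 because it still fits in a single byte.

```perl
use strict;
use warnings;

# Hypothetical stand-in for decode_numbered_entity() in html.pl; the real
# function's signature and surrounding code are not shown in this thread.
sub decode_numbered_entity_sketch {
    my ($num) = @_;
    # Code points above 255 would make sprintf("%c") return a string with
    # the utf-8 flag set, so drop them to keep the data 8-bit.
    return '' if $num > 255;
    return sprintf("%c", $num);
}

# Sample input (made up): one entity that fits in a byte, one that does not.
my $text = "caf&#233; &#8364;5";
$text =~ s/&#(\d+);/decode_numbered_entity_sketch($1)/ge;

# utf8::is_utf8() reports whether Perl has flagged the string as utf-8.
print utf8::is_utf8($text) ? "flagged\n" : "not flagged\n";
```

With this approach, entities above the cutoff are silently discarded rather
than converted, which loses information but keeps the strings 8-bit.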
--
NOKUBI Takatsugu
E-mail: knok at daionet.gr.jp
knok at namazu.org / knok at debian.org