[Namazu-users-en] Re: Problems with mknmz and Perl 5.8.6

Earl Hood earl at earlhood.com
Mon Jun 13 23:54:32 JST 2005


On June 13, 2005 at 06:09, Tadamasa Teranishi wrote:

> Even if it is ASCII character, it is not good according to the 
> character entity references. 

This is what I discovered.

> Namazu corresponds to a pure ASCII-only text alone without the 
> character entity references. 
> 
> Please use it by pure ASCII text-only.

I understand this.  What I am trying to say is that this is not
necessarily an easy task for users.

Namazu should handle, with grace, cases where code-points exceed what
Namazu will support.  Otherwise, users will get incorrect behaviour
and not understand why.  As it stands, you require all users to
pre-filter their data, which is something Namazu should do itself.

> The decode_numbered_entity subroutine of filter/html.pl is rewritten 
> as follows. 
> 
> sub decode_numbered_entity ($) {
>     my ($num) = @_;
>     return "?";
> }

I believe this is a bad implementation, because it neutralizes all
character entity references that namazu does support.

I recommend the following:

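sub decode_numbered_entity ($) {
    my ($num) = @_;
    # Reject control characters (C0, DEL, C1) and anything beyond 8 bits.
    return "?"
        if ($num >= 0 && $num <= 31) || ($num >= 127 && $num <= 159) ||
           ($num > 255);
    # Japanese text uses a multibyte encoding, so allow only ASCII there.
    return "?"
        if $num >= 127 && util::islang('ja');
    # A code point of 255 or less yields a string without the utf-8 flag.
    return sprintf("%c", $num);
}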

This allows 8-bit character entity references, which are needed
for 8-bit character sets (e.g. the ISO-8859-* family).
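
To illustrate the behaviour, here is a minimal, self-contained sketch.
It restates the checks (treating 255, the top of the 8-bit range, as
allowed) and applies them to a string.  The util::islang stub, the
$LANG variable, the decode_entities_in helper, and the substitution
pattern are all assumptions made for this example; the real
filter/html.pl may match entities differently.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Namazu's util::islang, so the sketch runs alone.
our $LANG = 'en';
sub util::islang { my ($lang) = @_; return $LANG eq $lang; }

sub decode_numbered_entity ($) {
    my ($num) = @_;
    # Control characters and anything beyond 8 bits become "?".
    return "?"
        if ($num >= 0 && $num <= 31) || ($num >= 127 && $num <= 159) ||
           ($num > 255);
    # For Japanese, only ASCII is passed through.
    return "?"
        if $num >= 127 && util::islang('ja');
    return sprintf("%c", $num);
}

# Hypothetical helper: decode decimal references such as &#233; in a string.
sub decode_entities_in {
    my ($text) = @_;
    $text =~ s/&#(\d+);/decode_numbered_entity($1)/ge;
    return $text;
}

# &#233; (e-acute in ISO-8859-1) is kept as one byte; &#8364; and &#7;
# are replaced by "?".
print decode_entities_in("caf&#233; &#8364;5 &#7;"), "\n";
```

With $LANG set to 'ja', the same call would also turn &#233; into "?".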

The above version also avoids the problem of Perl auto-flagging
text with the utf-8 flag.

If you are not familiar with how Perl handles Unicode, see the
perlunicode and related manual pages.  Namazu needs to be coded
to avoid causing Perl (v5.8.x and later) to set the utf-8 flag
on strings.  Just setting the LC_ALL=C environment variable
is NOT enough.
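
As a rough illustration of the flag behaviour (a sketch, not Namazu
code), the following shows a string picking up the utf-8 flag once a
code point above 255 is involved.  The flagged helper and the variable
names are made up for this example; utf8::is_utf8 is the standard
function for inspecting the flag.

```perl
use strict;
use warnings;

# utf8::is_utf8 reports whether Perl has set the internal utf-8 flag.
sub flagged { return utf8::is_utf8($_[0]) ? 1 : 0 }

my $bytes  = "caf" . sprintf("%c", 233);   # code point <= 255: no flag
my $wide   = sprintf("%c", 8364);          # code point > 255: flag is set
my $joined = $bytes . $wide;               # joining flags the whole string

printf "bytes=%d wide=%d joined=%d\n",
    flagged($bytes), flagged($wide), flagged($joined);
```

Only the first string stays unflagged.  Once a flagged string is
joined with ordinary byte data, the result is flagged as well, which
is why a single out-of-range entity can affect a whole document.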

--ewh


