[Namazu-users-en] Re: Problems with mknmz and Perl 5.8.6

Tadamasa Teranishi yw3t-trns at asahi-net.or.jp
Tue Jun 14 00:46:27 JST 2005


Earl Hood wrote:
> 
> I believe this is a bad implementation, because it neutralizes all
> character entity references that namazu does support.

First of all, limit the input text to an appropriate character set. 

If you cannot do that, please modify the code yourself and use it. 
However, because this modification is not recommended, it will not 
be reflected in stable-2-0 either.

> I recommend the following:
> 
> sub decode_numbered_entity ($) {
>     my ($num) = @_;
>     return "?"
>         if ($num >= 0 && $num <= 31) || ($num >= 127 && $num <= 159) ||
>            ($num >= 255);
>     return "?"
>         if $num >=127 && util::islang('ja');
>     sprintf ("%c",$num);
> }
> 
> This allows 8-bit character entity references, which is needed
> for 8-bit character sets (e.g. ISO-8859-* family).

No. 
Please examine NMZ.field.summary, etc. 
You will find multibyte or wide characters there that are invalid or 
incomplete. 
It would be no surprise if there are also other problems that you 
have not noticed yet. 
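
To illustrate the kind of breakage meant here, the following is a 
minimal sketch, not Namazu's actual code. It assumes the indexed text 
is EUC-JP and that a reference such as &#233; is expanded with 
sprintf("%c", 233). The sample bytes and the variable names are made 
up for illustration. In EUC-JP the byte 0xE9 is the first byte of a 
two-byte character, so the expanded text is no longer a valid byte 
sequence. 

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical summary text in EUC-JP: bytes 0xA4 0xA2 (HIRAGANA LETTER A).
my $summary = "\xA4\xA2";

# Expand &#233; as one raw 8-bit byte, then append some ASCII text.
$summary .= sprintf("%c", 233) . "abc";

# decode() with a CHECK argument may modify its input, so work on a copy.
my $copy = $summary;

# FB_CROAK makes decode() die on a malformed sequence instead of
# silently substituting a replacement character.
my $ok = eval { decode('euc-jp', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;

print $ok ? "valid EUC-JP\n" : "invalid EUC-JP\n";
```

The lone 0xE9 byte is followed by an ASCII byte, which cannot complete 
a two-byte EUC-JP character. 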

Supporting the ISO-8859-* family and the like only halfway is a 
problem. 
Whether the text is in one of the ISO-8859-* family, or in UTF-8, 
EUC-JP, Shift_JIS, etc. cannot easily be determined from the 8-bit 
codes alone. 
That is what causes the problem. 
The problem is avoided by limiting the output to 7-bit ASCII 
characters. 
If 8-bit characters are permitted, many more corrections are needed. 
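
As a small sketch of why 8-bit codes alone do not identify the 
character set, the same two bytes can be decoded under three different 
assumptions. The byte values and variable names are chosen only for 
illustration. Each decoding succeeds, but each gives a different 
result. 

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two bytes with no character set label attached.
my $bytes = "\xA4\xA2";

# Decode the same bytes under three different assumptions.
my $as_latin1 = decode('iso-8859-1', $bytes);  # two Latin-1 symbols
my $as_eucjp  = decode('euc-jp',     $bytes);  # one hiragana character
my $as_sjis   = decode('shiftjis',   $bytes);  # two halfwidth symbols

printf "iso-8859-1: %d character(s)\n", length $as_latin1;
printf "euc-jp:     %d character(s)\n", length $as_eucjp;
printf "shiftjis:   %d character(s)\n", length $as_sjis;
```

Nothing in the bytes themselves says which of the three readings is 
the intended one. 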

The ISO-8859-* family should not be permitted unless the character 
set of the input text can be determined. 
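
For comparison, restricting numeric character references to 7-bit 
ASCII might look like the sketch below. The function name 
decode_numbered_entity_ascii is hypothetical and this is not the code 
shipped with Namazu. It only illustrates the idea of never emitting an 
8-bit byte. 

```perl
use strict;
use warnings;

# Hypothetical helper, not the function shipped with Namazu.
sub decode_numbered_entity_ascii {
    my ($num) = @_;
    # Expand only printable 7-bit ASCII; replace everything else with "?".
    return "?" if $num < 32 || $num > 126;
    return chr($num);
}

print decode_numbered_entity_ascii(65),  "\n";  # prints "A"
print decode_numbered_entity_ascii(233), "\n";  # prints "?"
```

Because the output never contains a byte above 126, it cannot collide 
with a multibyte sequence in EUC-JP, Shift_JIS, or UTF-8 text. 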

> If you are not familiar with how Perl handles Unicode, see the
> perlunicode and related manual pages.  Namazu needs to be coded
> to avoid causing Perl (v5.8.x and later) to set the utf-8 flag
> on strings.  Just setting the LC_ALL=C environment variable
> is NOT enough.

Before that, you need to study multibyte characters and wide 
characters carefully. 
-- 
=====================================================================
TADAMASA TERANISHI
http://www.asahi-net.or.jp/~yw3t-trns/index.htm
Key fingerprint =  474E 4D93 8E97 11F6 662D  8A42 17F5 52F4 10E7 D14E


