[Namazu-users-en] Re: Malformed UTF-8 character

Tadamasa Teranishi yw3t-trns at asahi-net.or.jp
Wed Jun 15 03:00:21 JST 2005

Earl Hood wrote:
> I have been told that namazu is not designed for 8-bit charsets.
> I find this odd since it is known that there are users of Namazu in
> locales with 8-bit sets (e.g. DE/German and PL/Polish).

To repeat: Namazu is not designed for 8-bit charsets, and it has not
been tested with them.
The fact that some people use it with 8-bit charsets does not change
that.

Namazu does ship message translations in 8-bit charsets, contributed
by volunteers.
However, message translation and text processing are separate matters.

> If 8-bit chars are a problem, you could use the following version of
> the routine:
> sub decode_numbered_entity ($) {
>     my ($num) = @_;
>     return "?"
>         if ($num >= 0 && $num <= 31) || ($num >= 127 && $num <= 159) ||
>            ($num >= 127);
>     sprintf ("%c",$num);
> }

I don't mind that version, but the condition can be shortened as follows.

sub decode_numbered_entity ($) {
    my ($num) = @_;
    return "?"
        if $num >= 0 && $num <= 31 || $num >= 127;
    sprintf ("%c",$num);
}

Or should only values of 127 or more be turned into "?", like this?

sub decode_numbered_entity ($) {
    my ($num) = @_;
    return ""
        if $num >= 0 && $num <= 31;
    return "?"
        if $num >= 127;
    sprintf ("%c",$num);
}
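
To compare the two variants above side by side, here is a small
self-contained sketch. The names decode_replace_all and
decode_drop_controls are made up for this example so that both can
live in one script; they are not Namazu functions, and the sample
code points are arbitrary.

```perl
use strict;
use warnings;

# Variant A: anything outside printable ASCII (32..126) becomes "?".
sub decode_replace_all {
    my ($num) = @_;
    return "?"
        if $num >= 0 && $num <= 31 || $num >= 127;
    return sprintf("%c", $num);
}

# Variant B: control characters (0..31) are dropped,
# and only 127 or more becomes "?".
sub decode_drop_controls {
    my ($num) = @_;
    return ""
        if $num >= 0 && $num <= 31;
    return "?"
        if $num >= 127;
    return sprintf("%c", $num);
}

# Print both results for a few sample code points.
for my $num (9, 65, 126, 127, 233) {
    printf "%3d  A=[%s]  B=[%s]\n",
        $num, decode_replace_all($num), decode_drop_controls($num);
}
```

For 9 (tab) the first variant returns "?" and the second returns an
empty string. Both return "A" for 65 and "?" for 233.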

> It is interesting that the original version of the routine did not
> exclude 8-bit character entity references, only for the locale of JA.
> So if 8-bit chars are not desirable, why did decode_numbered_entity()
> allow it initially?

That is a problem in the program.
decode_numbered_entity() was implemented with 8-bit charsets in mind.
However, Namazu itself is not designed for 8-bit charsets.

See tips.html.

What is written in tips.html is the basic design, although what is
written there and the actual implementation differ.

Problems can occur if the input text is not restricted.
Namazu's text processing is written on the assumption that any 8-bit
character is EUC-JP.
Japanese-specific processing should run only in a Japanese
environment, but some parts that do not follow this rule may still
remain.
(It is already known that chomp_eucjp() is called even in a
non-Japanese environment, which is unintended. This is corrected in
Namazu 2.0.15.)
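
As a rough illustration of what restricting the input text could look
like, here is a minimal sketch that detects or replaces anything
outside 7-bit ASCII before the text is processed further. The helpers
has_non_ascii and force_ascii are hypothetical names for this example
only; they are not part of Namazu, and the sketch assumes the input is
a plain byte string.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text contains any character
# outside 7-bit ASCII (0x00..0x7F).
sub has_non_ascii {
    my ($text) = @_;
    return $text =~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Hypothetical helper: return a copy with every non-ASCII
# character replaced by "?".
sub force_ascii {
    my ($text) = @_;
    (my $copy = $text) =~ s/[^\x00-\x7F]/?/g;
    return $copy;
}

# Sample input: "caf" followed by byte 0xE9 (e-acute in ISO-8859-1).
my $sample = "caf\xE9";
print "needs cleanup\n" if has_non_ascii($sample);
print force_ascii($sample), "\n";
```

Replacing with "?" loses information, so this only fits cases where
the text really must stay 7-bit.
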
Key fingerprint =  474E 4D93 8E97 11F6 662D  8A42 17F5 52F4 10E7 D14E
