[Namazu-users-en] Re: Malformed UTF-8 character

Earl Hood earl at earlhood.com
Wed Jun 15 04:04:06 JST 2005


On June 15, 2005 at 03:00, Tadamasa Teranishi wrote:

> Only 127 or more is whether it makes it to "?" or. 
> 
> sub decode_numbered_entity ($) {
>     my ($num) = @_;
>     return ""
>         if $num >= 0 && $num <= 31;
>     return "?"
>         if $num >= 127;
>     sprintf ("%c",$num);
> }

So non-printable characters and some whitespace characters do not
constitute word boundaries?  You realize that characters like tab
(ASCII 9) and form-feed (ASCII 12) are not being treated as word
boundaries.  I think this is a mistake.

The code you have will combine two words into one.  For example:

  hello&#9;there

Will get filtered to:

  hellothere

Using '?' as the replacement would give:

  hello?there

which, hopefully, will cause mknmz to treat "hello" and "there"
as two separate words.
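
To make the suggestion concrete, here is a minimal sketch of the variant
I have in mind. The name decode_numbered_entity_sketch and the s///
substitution that drives it are my own illustration, not code taken
from Namazu; the only change from your function is that control
characters become '?' instead of the empty string.

```perl
use strict;
use warnings;

# Hypothetical variant of the quoted decode_numbered_entity():
# control characters map to '?' rather than '', so the words on
# either side of them are not glued together.
sub decode_numbered_entity_sketch {
    my ($num) = @_;
    return '?' if $num >= 0 && $num <= 31;   # tab, form-feed, etc.
    return '?' if $num >= 127;               # DEL and 8-bit values
    return chr($num);                        # printable ASCII
}

# Assumed driver: replace each decimal numeric entity in the text.
my $text = 'hello&#9;there';
(my $filtered = $text) =~ s/&#(\d+);/decode_numbered_entity_sketch($1)/ge;
print "$filtered\n";    # prints "hello?there"
```

Whether '?' or a plain space is the better replacement depends on how
mknmz splits words, which I have not checked.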


> There is a possibility of causing the problem if the input text has 
> not been limited. 
> As for 8bit character, the program is being written in the 
> processing of Namazu on the assumption that it is EUC-JP. 

If I understand you correctly, Namazu uses EUC-JP internally, even
if the locale is not JP?  Am I correct?

If so, note that EUC-JP has code point equivalents for many of the
characters in the ISO-8859-* charsets.  Examining the ucm file for
euc-jp, I see encodings for Greek, Cyrillic, and Latin characters.
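
A rough way to check this for a given character, assuming Perl's Encode
module and its "euc-jp" encoding (the helper euc_jp_bytes and the three
sample characters are just my illustration):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: return the EUC-JP bytes for one character,
# or undef if the encoding has no mapping for it.
sub euc_jp_bytes {
    my ($char) = @_;
    my $copy = $char;    # encode() may modify its input when CHECK is set
    return eval { encode('euc-jp', $copy, Encode::FB_CROAK()) };
}

# Sample characters: Greek small alpha, Cyrillic capital A,
# Latin small e with acute.
for my $pair ([Greek => "\x{03B1}"], [Cyrillic => "\x{0410}"],
              [Latin => "\x{00E9}"]) {
    my ($name, $char) = @$pair;
    my $bytes = euc_jp_bytes($char);
    printf "%-8s U+%04X: %s\n", $name, ord($char),
        defined $bytes
            ? join(' ', map { sprintf '%02X', ord } split //, $bytes)
            : 'no mapping';
}
```

If a numbered entity has a mapping, it could in principle be converted
to the EUC-JP bytes rather than replaced with '?'.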

--ewh

