[Namazu-users-en] Re: Malformed UTF-8 character
Earl Hood
earl at earlhood.com
Wed Jun 15 04:04:06 JST 2005
On June 15, 2005 at 03:00, Tadamasa Teranishi wrote:
> Only 127 or more is whether it makes it to "?" or.
>
> sub decode_numbered_entity ($) {
>     my ($num) = @_;
>     return ""
>         if $num >= 0 && $num <= 31;
>     return "?"
>         if $num >= 127;
>     sprintf ("%c",$num);
> }
So non-printable characters and some whitespace characters do not
constitute word boundaries? You realize that characters like tab
(ASCII 9) and form-feed (ASCII 12) are not being treated as word
boundaries. I think this is a mistake.
The code you have will combine two words into one. For example:
    hello	there
will get filtered to:
    hellothere
Using '?' for the replacement will give:
    hello?there
which, hopefully, will cause mknmz to treat "hello" and "there"
as two separate words.
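
As a minimal sketch of that suggestion (not the actual Namazu code), the
quoted routine could return '?' for the control range as well. The function
name decode_numbered_entity_keep_boundary is hypothetical, and the input is
assumed to be a decimal numeric character reference such as &#9;.

```perl
use strict;
use warnings;

# Hypothetical variant of the quoted routine: control characters map to "?"
# instead of the empty string, so neighbouring words stay separated.
sub decode_numbered_entity_keep_boundary {
    my ($num) = @_;
    return "?" if $num >= 0 && $num <= 31;   # control chars, incl. tab (9) and form-feed (12)
    return "?" if $num >= 127;               # DEL and everything above 7-bit ASCII
    return sprintf("%c", $num);              # printable ASCII passes through unchanged
}

# Assumed input form: decimal numeric character references.
my $text = "hello&#9;there";
(my $decoded = $text) =~ s/&#(\d+);/decode_numbered_entity_keep_boundary($1)/ge;
print "$decoded\n";   # prints "hello?there"
```

Whether '?' or a plain space is the better replacement depends on how mknmz
splits words, which this sketch does not try to model.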
> There is a possibility of causing the problem if the input text has
> not been limited.
> As for 8bit character, the program is being written in the
> processing of Namazu on the assumption that it is EUC-JP.
If I understand you correctly, Namazu uses EUC-JP internally, even
if the locale is not JP? Am I correct?
If so, EUC-JP has code point equivalents for many of the characters
in the ISO-8859-* charsets. Examining the ucm file for euc-jp, I see
encodings for Greek, Cyrillic, and Latin characters.
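
One way to spot-check this without reading the ucm file is to ask Perl's
Encode module whether a given character has an EUC-JP mapping. This is only
a sketch: the helper name euc_jp_hex is hypothetical, and the three sample
code points (Greek alpha, Cyrillic a, Latin e-acute) are my own picks.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns the EUC-JP bytes of a character as a hex
# string, or undef if the character has no EUC-JP mapping.
sub euc_jp_hex {
    my ($char) = @_;
    # FB_CROAK makes encode() die on unmappable input; LEAVE_SRC keeps $char intact.
    my $bytes = eval { encode("euc-jp", $char, Encode::FB_CROAK | Encode::LEAVE_SRC) };
    return defined $bytes ? uc(unpack("H*", $bytes)) : undef;
}

# Greek small alpha, Cyrillic small a, Latin small e with acute.
for my $cp (0x03B1, 0x0430, 0x00E9) {
    my $hex = euc_jp_hex(chr $cp);
    printf "U+%04X -> %s\n", $cp, defined $hex ? $hex : "(no mapping)";
}
```

A character that prints "(no mapping)" would still need a fallback such as
the '?' replacement discussed above.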
--ewh