[Namazu-users-en] Re: Problems with mknmz and Perl 5.8.6

Earl Hood earl at earlhood.com
Mon Jun 13 04:34:44 JST 2005


On June 12, 2005 at 12:23, Tadamasa Teranishi wrote:

> To our regret, Namazu supports ASCII text-only input. 
> (However, Japanese text can be used for a Japanese environment. )
> For instance,
> namazu-users-en/2000-07/msg00000.html is a Japanese text. 
> namazu-users-en/2003-06/msg00000.html is non-ASCII text. 
> ...
> And there are many others.
> 
> Please use ASCII-only text.

That is problematic for email archives, since it is hard to control
which character encodings are used.  In this case, the mailing list is
supposed to be English only, but someone posted an iso-2022-jp message
to it.

It is worth noting that namazu-users-en/2000-07/msg00000.html is
actually in ASCII in the raw form (something mhonarc does by default).
The problem is with the Unicode character entity references >= 256.
That is, MHonArc, by default, converts the iso-2022-jp character data
into raw ASCII, using Unicode character entity references for Japanese
characters.  So the raw HTML input is ASCII-only.
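
To illustrate the conversion described above, here is a minimal Python
sketch.  It is not MHonArc's actual code; the function names
to_ascii_html and has_wide_refs and the sample string are made up for
the example, and it does not escape HTML markup characters such as "&"
or "<".  It only shows how iso-2022-jp data can end up as ASCII-only
HTML containing numeric references >= 256.

```python
import re

# Hypothetical helper, not part of MHonArc or namazu: decode the raw
# bytes from their charset, then re-encode as ASCII, replacing every
# non-ASCII character with a decimal reference of the form &#NNNN;.
def to_ascii_html(raw: bytes, charset: str = "iso-2022-jp") -> str:
    text = raw.decode(charset)
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

NUMERIC_REF = re.compile(r"&#(\d+);")

# Hypothetical check: does the ASCII-only HTML contain any decimal
# reference to a code point at or above 256?
def has_wide_refs(html: str) -> bool:
    return any(int(m.group(1)) >= 256 for m in NUMERIC_REF.finditer(html))

# Sample input: three Japanese characters (U+65E5, U+672C, U+8A9E)
# followed by ASCII text, encoded as iso-2022-jp.
raw = "\u65e5\u672c\u8a9e text".encode("iso-2022-jp")
html = to_ascii_html(raw)
print(html)                 # &#26085;&#26412;&#35486; text
print(html.isascii())       # True
print(has_wide_refs(html))  # True
```

A check like has_wide_refs could be used to find archive pages that are
ASCII in raw form but still carry non-ASCII content for an indexer.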

> > Also, the "Malformed UTF-8 ..." warnings are popping up, regardless
> > of what LANG or LC_ALL are set to.  I had to add a 'use bytes' pragma
> > to mailnews.pl at line 212 to get rid of the warnings.
> 
> 'use bytes' only erases the warning; it does not solve the root
> of the problem.

See my previous message for where the problem is.  The 'use bytes'
will suppress the warnings, but not necessarily fix the real problem.
It has nothing to do with locale, but with Perl tagging some string
data as UTF-8.  In that message I show where in the namazu code this
happens and give a temporary fix to avoid the problem.

Some fix must be incorporated because the problem causes incorrect
behavior by namazu.

Note, I periodically get "inconsistent" errors from mknmz on some
private archives I maintain:

ASSERTION ERROR!: NMZ.r (9869) and NMZ.t (9868) are not consistent! at
    /usr/local/share/namazu/pl/util.pl line 257.

and the UTF-8 issue could be the culprit.  I am not sure yet, but
with at least one problem addressed by a local fix, I can see if
the inconsistent errors keep occurring, and if so, see if I can create
a reproducible case.

What kind of consideration has there been for supporting Unicode
(UTF-8) in namazu?

--ewh

