[Namazu-users-en] Re: namazu stopped working

Tadamasa Teranishi yw3t-trns at asahi-net.or.jp
Fri Nov 25 14:33:17 JST 2005

IEM - network operating center wrote:
> international mailinglist, language is english (but there are a lot of 
> subscribers with special characters in their names, especial spanish ones)
> the list in question is the one with the most traffic: the archive 
> starts in 1998 and by now there are about 37850 files in it, without 
> attachments (which i exclude from indexing via the 
> "exclude-pattern"-flag) there are 33419.

Namazu supports only English and Japanese.
Spanish text may not be processed correctly.
In short, correct operation on Spanish input is not guaranteed.

> i guess it is a problem with some multi-byte characters.

The cause might be something else.
If you can identify the document that causes the trouble and
make that file available, it should be possible to pinpoint
the cause.
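
One way to narrow down candidate files, assuming the multi-byte guess is
worth checking, is to list the lines that contain bytes outside the ASCII
range. The following is only a sketch: lines_with_high_bytes is a
hypothetical helper, not part of Namazu, and a hit does not prove that the
file is the one that breaks indexing.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Namazu): returns the 1-based numbers of
# the lines in $path that contain at least one byte in the range 0x80-0xff.
sub lines_with_high_bytes {
    my ($path) = @_;
    # Read raw bytes so that no encoding layer changes the data.
    open(my $fh, '<:raw', $path) or die "cannot open $path: $!";
    my @hits;
    while (my $line = <$fh>) {
        push @hits, $. if $line =~ /[\x80-\xff]/;
    }
    close($fh);
    return @hits;
}

# Print "file: line numbers" for every file given on the command line
# that has at least one such line.
for my $path (@ARGV) {
    my @hits = lines_with_high_bytes($path);
    print "$path: @hits\n" if @hits;
}
```

Saved under an assumed name such as scan.pl, it could be run as
"perl scan.pl FILE..." over the archive files.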

By the way,
I think the warning quoted below can be suppressed with the
following change (no guarantee).

> (which reminds me that when i build the index i get some warnings:
> "Wide character in print at /usr/bin/mknmz line 2447, <GEN7162> line
> 158600.")

--- namazu-2.0.14/scripts/mknmz.in      2004-04-08 17:34:42.000000000
+++ mknmz.in    2005-11-25 14:21:26.000000000 +0900
@@ -2250,7 +2250,7 @@ sub count_words ($$$$) {
     $$contref =~ tr/A-Z/a-z/;

     # Remove control char.
-    $$contref =~ tr/\x00-\x08\x0b-\x0c\x0e-\x1a/ /;
+    $$contref =~ tr/\x00-\x08\x0b-\x0c\x0e-\x1a\x80-\xff/ /;

     # Do wakatigaki if necessary.
     if (util::islang("ja")) {
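
To show what the changed tr/// line does on its own, here is a small
standalone sketch. strip_ctrl_and_high_bytes is a hypothetical name used
only for this illustration; it is not a function in mknmz. It applies the
same transliteration to a copy of a byte string: the control characters
listed in the original line and, with the added range, every byte from
0x80 to 0xff are replaced by a space.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of mknmz): applies the same tr/// as the
# patched line to a copy of the input and returns the result.
sub strip_ctrl_and_high_bytes {
    my ($text) = @_;
    # The replacement list is a single space; Perl repeats its last
    # character, so every matched byte becomes a space.
    $text =~ tr/\x00-\x08\x0b-\x0c\x0e-\x1a\x80-\xff/ /;
    return $text;
}

# "n with tilde" encoded in UTF-8 is the two bytes 0xC3 0xB1.
my $sample = "se\xc3\xb1or\x01ok";
my $clean  = strip_ctrl_and_high_bytes($sample);
print "$clean\n";   # prints "se  or ok"
```

Note the consequence: non-ASCII bytes are blanked out rather than indexed,
so a word such as the Spanish one above is split into two fragments. Tab,
LF and CR are left untouched, as in the original line.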
