From ot at w3.org Tue Sep 7 17:35:58 2004
From: ot at w3.org (Olivier Thereaux)
Date: Tue Sep 7 17:36:02 2004
Subject: [Namazu-devel-en] making filter/mailnews.pl understand "no archive" directive
Message-ID: <20040907083558.GA6256@w3.mag.keio.ac.jp>

Hello Namazu developers,

I have been doing some research on how indexing/archiving software can
ignore certain rfc(2)822/mailnews documents. A similar idea is already
implemented for HTML documents in the form of the "noindex, nofollow"
directives in the robots exclusion protocol.

Most of the mail archiving software I could find (e.g. MHonArc [1],
hypermail [2]) implements an "X-no-archive" directive, which makes it
ignore the specific message.

[1] http://www.mhonarc.org/MHonArc/doc/resources/checknoarchive.html
[2] http://www.hypermail.org/source/docs/hmrc.html#deleted

Namazu appears to already nicely implement the robots exclusion
protocol (for HTML) (as seen in the filter() subroutine in
filter/html.pl), and I am planning to hack namazu to make it behave
similarly (for mailnews), and I was wondering if you would be able to
answer these few questions.

* If I understood correctly, if the filter() subroutine of any given
  filter returns an (error) string, then indexing of this file aborts.
  Could you confirm this?

* There is no consensus on whether "X-no-archive: " is the only header
  that should trigger such a mechanism. Arguably, for indexing, it could
  also be "X-no-index:", and other headers such as "X-Spam-Status: Yes"
  would be nice, too. Would it be OK if the name(s) of the trigger
  header(s) were made an option in mknmzrc?

And finally, when I am done with this patch, would you be interested in
including it in the namazu distribution?

Thank you very much.
-- 
olivier

From earl at earlhood.com Sat Sep 11 00:24:26 2004
From: earl at earlhood.com (Earl Hood)
Date: Sat Sep 11 00:24:29 2004
Subject: [Namazu-devel-en] Re: making filter/mailnews.pl understand "no archive" directive
In-Reply-To: <20040907083558.GA6256@w3.mag.keio.ac.jp>
References: <20040907083558.GA6256@w3.mag.keio.ac.jp>
Message-ID: <200409101524.i8AFOQF07497@gator.earlhood.com>

On September 7, 2004 at 17:35, Olivier Thereaux wrote:

> Namazu appears to already nicely implement the robots exclusion
> protocol (for HTML) (as seen in the filter() subroutine in
> filter/html.pl), and I am planning to hack namazu to make it behave
> similarly (for mailnews), and I was wondering if you would be able to
> answer these few questions.

You do encounter some semantic problems with adding no-archive support
directly into namazu. Namazu is a search indexing tool, so the concept
of an "archive" is separate from namazu, even though many use it to
index mail archives.

An example of where no-archive support is not desired is for those
that use namazu to index their personal mail folders. In this case,
the user wants messages with no-archive indicator to be indexed.

IMHO, if one does not want messages with a no-archive designator to be
indexed, those messages should not be part of namazu's input. For
example, if namazu is being used to index a mail archive, messages with
a no-archive indicator should not be placed in the archive in the first
place.

It may help if you provide some context on why you desire such a
feature in namazu to see if patching namazu is the best solution for
your problem.

> * If I understood correctly, if the filter() subroutine of any given
> filter returns an (error) string, then indexing of this file aborts.
> Could you confirm this?

I cannot answer this one. Of course, a simple test can be done to
confirm the behavior. Unfortunately, the filtering aspects of namazu
are not documented that well.

> * There is no consensus on whether "X-no-archive: " is the only header
> that should trigger such a mechanism. Arguably, for indexing, it could
> also be "X-no-index:", and other headers such as "X-Spam-Status: Yes"
> would be nice, too.

MHonArc also looks for:

  Restrict: no-external-archive

The Restrict: header field is the formal way of indicating no
archiving, but X-no-archive is still widely used.

--ewh

From filip.molcan at blue-point.cz Sun Sep 12 21:46:39 2004
From: filip.molcan at blue-point.cz (Filip Molčan)
Date: Mon Sep 13 00:50:34 2004
Subject: [Namazu-devel-en] other languages
Message-ID:

Hello,

I am now trying Namazu with my friend on our servers, and we have
problems with some characters from the Czech language. When do you plan
to support other languages? There are problems with indexing and
searching in Czech documents...

Have a nice day!

Filip Molcan
OpenOffice.org Development Team
http://cs.openoffice.og

From knok at daionet.gr.jp Mon Sep 13 11:33:38 2004
From: knok at daionet.gr.jp (knok@daionet.gr.jp)
Date: Mon Sep 13 11:33:47 2004
Subject: [Namazu-devel-en] Re: other languages
In-Reply-To:
References:
Message-ID: <87fz5mnbi5.wl@knok.daionet.gr.jp>

At Sun, 12 Sep 2004 14:46:39 +0200,
Filip Molčan wrote:

> I am now trying Namazu with my friend on our servers and we have
> problems with some characters from czech language. When do you plan to
> support other languages? There are problems with indexing and
> searching in czech documents...

In the past, I got a report that a French user could use Namazu, so I
think it is not so difficult to support it.

Could you describe the details of the problems?
-- 
NOKUBI Takatsugu
E-mail: knok@daionet.gr.jp
        knok@namazu.org / knok@debian.org

From ot at w3.org Tue Sep 21 11:32:41 2004
From: ot at w3.org (olivier Thereaux)
Date: Tue Sep 21 11:32:40 2004
Subject: [Namazu-devel-en] Re: making filter/mailnews.pl understand "no archive" directive
In-Reply-To: <200409101524.i8AFOQF07497@gator.earlhood.com>
References: <20040907083558.GA6256@w3.mag.keio.ac.jp> <200409101524.i8AFOQF07497@gator.earlhood.com>
Message-ID: <83009394-0B76-11D9-858B-000393A80896@w3.org>

On Sep 11, 2004, at 12:24 AM, Earl Hood wrote:

> On September 7, 2004 at 17:35, Olivier Thereaux wrote:
>> Namazu appears to already nicely implement the robots exclusion
>> protocol (for HTML) (as seen in the filter() subroutine in
>> filter/html.pl), and I am planning to hack namazu to make it behave
>> similarly (for mailnews)

> You do encounter some semantic problems with adding no-archive support
> directly into namazu.

Indeed, and as I stated in my original message, I was not sure whether
no-archive was the proper model to follow. On the other hand, "noindex"
is very well suited.

> An example of where no-archive support is not desired is for those
> that use namazu to index their personal mail folders. In this case,
> the user wants messages with no-archive indicator to be indexed.

Agreed, hence the idea of making it an option (possibly off by default).

> It may help if you provide some context on why you desire such a
> feature in namazu to see if patching namazu is the best solution for
> your problem.

Good idea. The system used is a rather simple mailing list + HTML
archive + search engine combination, with the one peculiarity that the
search engine indexes (and searches) the raw mails, even though it
serves as the search for the HTML archive.

We've developed a system called annospam.
The inner mechanisms are a bit complicated, but the purpose is to be
able to annotate a document in the HTML archive as being spam, have the
original message marked, and have the HTML archives regenerated without
that particular document.

Instead of "marking" the original message as being a rotten apple, the
system could indeed remove it or remove its content, but then:

- marking a message is harmless and can be reverted more easily than
  removal of content, which makes automating the marking (through e.g.
  a Bayesian spam filtering loop) safer
- if the whole message is removed, the sequence of the HTML archive
  would be off. That is not acceptable.
- if the content of the message is removed, there are still a lot of
  "empty" messages left in the archive and search index

I hope this explanation is clear.

>> * If I understood correctly, if the filter() subroutine of any given
>> filter returns an (error) string, then indexing of this file aborts.

> I cannot answer this one. Of course, a simple test can be done to
> confirm the behavior. Unfortunately, the filtering aspects of namazu
> are not documented that well.

I was hoping that someone could stop me if that was a stupid idea or if
there was clear documentation somewhere. I will test.

> MHonArc also looks for:
> Restrict: no-external-archive

Good to know.

Thanks a lot, Earl.

-- 
olivier
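
The header check discussed in this thread could be sketched roughly as
follows. This is only an illustration in Python, not namazu code
(namazu's filters are Perl, and no patch appears in the thread). The
function name skip_reason, the DEFAULT_TRIGGERS table and the enabled
flag are all hypothetical. The convention of returning a string to mean
"skip this message" mirrors the filter() behavior asked about above,
which nobody in the thread has confirmed.

```python
import re
from email import message_from_string

# Hypothetical trigger table: header name -> token that must appear in the
# header value. The header names are the ones mentioned in the thread; the
# table and its format are assumptions, not an existing mknmzrc option.
DEFAULT_TRIGGERS = {
    "X-No-Archive": "yes",
    "X-No-Index": "yes",
    "Restrict": "no-external-archive",
    "X-Spam-Status": "yes",
}


def skip_reason(raw_message, triggers=None, enabled=False):
    """Return a short reason string if the message should be skipped,
    or None if it should be indexed. Off by default, as suggested in
    the thread for users indexing their personal mail folders."""
    if not enabled:
        return None
    if triggers is None:
        triggers = DEFAULT_TRIGGERS
    msg = message_from_string(raw_message)
    for name, token in triggers.items():
        # get_all() matches header names case-insensitively and returns
        # every occurrence of the header, or None when it is absent.
        for value in msg.get_all(name) or []:
            # Compare whole tokens so that "yes" does not match inside
            # an unrelated word such as "BAYES_99".
            tokens = re.split(r"[,;\s]+", value.strip().lower())
            if token.lower() in tokens:
                return f"{name}: {value.strip()}"
    return None
```

With the check enabled, a message carrying "X-No-Archive: yes" yields
the reason string "X-No-Archive: yes", while the same message with the
check disabled yields None and would be indexed as usual.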