[Namazu-devel-en] Re: making filter/mailnews.pl understand "no archive" directive

Olivier Thereaux ot at w3.org
Tue Sep 21 11:32:41 JST 2004

On Sep 11, 2004, at 12:24 AM, Earl Hood wrote:

> On September 7, 2004 at 17:35, Olivier Thereaux wrote:
>> Namazu appears to already nicely implement the robots exclusion
>> protocol (for HTML) (as seen in the filter() subroutine in
>> filter/html.pl), and I am planning to hack namazu to make it behave
>> similarly (for mailnews)
> You do encounter some semantic problems with adding no-archive support
> directly into namazu.

Indeed, and as I stated in my original message, I was not sure whether
no-archive was the proper model to follow. On the other hand, "noindex"
is a much better fit.

> An example of where no-archive support is not desired is for those
> that use namazu to index their personal mail folders.  In this case,
> the user wants messages with no-archive indicator to be indexed.

Agreed, hence the idea of making it an option (possibly off by default).

> It may help if you provide some context on why you desire such a
> feature in namazu to see if patching namazu is the best solution for
> your problem.

Good idea. The system used is a rather simple mailing-list + HTML
archive + search engine combination, with the one peculiarity that the
search engine indexes (and searches) the raw mails, while the search
results point to the HTML archive.

We've developed a system called annospam. The inner mechanisms are a 
bit complicated, but the purpose is to be able to annotate a document 
in the HTML archive as being spam, have the original message marked,
and the HTML archives regenerated without that particular document.
Instead of "marking" the original message as being a rotten apple, the 
system could indeed remove it or remove its content, but then:

- marking a message is harmless and can be reverted more easily than
removal of content, which makes automating the marking (through e.g. a
Bayesian spam filtering loop) safer
- if the whole message is removed, the sequence of the HTML archive 
would be off. That is not acceptable.
- if the content of the message is removed, there are still a lot of 
"empty" messages left in the archive and search index
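To illustrate why marking is easy to revert, here is a minimal sketch
(in Python, not the actual annospam code). The marker header name is an
assumption made for the example; the point is only that the mark can be
added and removed without touching the message body.

```python
from email import message_from_string

# Assumption: the header used as a marker is hypothetical; the real
# annospam marker is not described in this message.
MARK_HEADER = "X-No-Archive"


def mark(raw: str) -> str:
    """Return the message with the marker header set (body untouched)."""
    msg = message_from_string(raw)
    del msg[MARK_HEADER]          # no error if absent; avoids duplicates
    msg[MARK_HEADER] = "yes"
    return msg.as_string()


def unmark(raw: str) -> str:
    """Return the message with the marker header removed (body untouched)."""
    msg = message_from_string(raw)
    del msg[MARK_HEADER]
    return msg.as_string()
```

Reverting a wrong classification is then just a call to unmark(),
whereas removed content would have to be restored from a backup.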

I hope this explanation is clear.

>> * If I understood correctly, if the filter() subroutine of any given
>> filter returns an (error) string, then indexing of this file aborts.
> I cannot answer this one.  Of course, a simple test can be done to
> confirm the behavior.  Unfortunately, the filtering aspects of namazu
> are not documented that well.

I was hoping someone would stop me if this was a stupid idea, or point
me to clear documentation if any exists. I will test.

> MHonArc also looks for:
>   Restrict: no-external-archive

Good to know.
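For the record, the check I have in mind could be sketched roughly as
follows. This is Python for illustration only, not Namazu's Perl filter
code. The function name and the honor_no_archive option are hypothetical,
and returning a string to make the indexer skip the message mirrors the
behavior I still need to confirm by testing. The two headers are the
ones MHonArc is said to look for: "X-No-Archive: yes" and
"Restrict: no-external-archive".

```python
from email import message_from_string
from typing import Optional


def no_archive_reason(raw: str, honor_no_archive: bool = False) -> Optional[str]:
    """Return a reason string if the message should be skipped, else None.

    honor_no_archive is a hypothetical option, off by default, so that
    people indexing their personal mail folders are not affected.
    """
    if not honor_no_archive:
        return None
    msg = message_from_string(raw)
    # Header names are matched case-insensitively by the email module.
    for value in msg.get_all("X-No-Archive", []):
        if value.strip().lower() == "yes":
            return "skipped: X-No-Archive: yes"
    for value in msg.get_all("Restrict", []):
        if "no-external-archive" in value.lower():
            return "skipped: Restrict: no-external-archive"
    return None
```

With the option left off, every message is indexed as before; with it
on, only messages carrying one of the two headers are reported as
skipped.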

Thanks a lot, Earl.


