[Namazu-devel-en] Re: making filter/mailnews.pl understand "no
ot at w3.org
Tue Sep 21 11:32:41 JST 2004
On Sep 11, 2004, at 12:24 AM, Earl Hood wrote:
> On September 7, 2004 at 17:35, Olivier Thereaux wrote:
>> Namazu appears to already nicely implement the robots exclusion
>> protocol (for HTML) (as seen in the filter() subroutine in
>> filter/html.pl), and I am planning to hack namazu to make it behave
>> similarly (for mailnews)
> You do encounter some semantic problems with adding no-archive support
> directly into namazu.
Indeed, and as I stated in my original message, I was not sure whether
no-archive was the proper model to follow. On the other hand, "noindex"
is a very good fit.
> An example of where no-archive support is not desired is for those
> that use namazu to index their personal mail folders. In this case,
> the user wants messages with no-archive indicator to be indexed.
Agreed, hence the idea of making it an option (possibly off by default).
> It may help if you provide some context on why you desire such a
> feature in namazu to see if patching namazu is the best solution for
> your problem.
Good idea. The system used is a rather simple mailing-list + HTML
archive + search engine combination, with the one peculiarity that the
search engine indexes (and searches) the raw mails, yet its results
point to the HTML archive.
We've developed a system called annospam. The inner mechanisms are a
bit complicated, but the purpose is to be able to annotate a document
in the HTML archive as being spam, have the original message marked,
and the HTML archives regenerated without that particular document.
Instead of "marking" the original message as being a rotten apple, the
system could indeed remove it or remove its content, but then:
- marking a message is harmless and can be reverted more easily than
removal of content, which makes automating the marking (through e.g. a
Bayesian spam filtering loop) safer
- if the whole message is removed, the sequence of the HTML archive
would be off. That is not acceptable.
- if the content of the message is removed, there are still a lot of
"empty" messages left in the archive and search index
I hope this explanation is clear.
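
To illustrate why marking is easy to revert, here is a minimal sketch in
Python. It is not annospam's actual mechanism: the header name
X-Annotated-Spam and the function names are made up for this example.
The mark is just an extra header, so the message body and its position
in the archive sequence are left alone.

```python
from email import message_from_string

# Hypothetical header name, chosen for this sketch only.
MARK_HEADER = "X-Annotated-Spam"

def mark_as_spam(raw: str) -> str:
    """Return the message with the marker header set (idempotent)."""
    msg = message_from_string(raw)
    del msg[MARK_HEADER]          # deleting a missing header is a no-op
    msg[MARK_HEADER] = "yes"
    return msg.as_string()

def unmark(raw: str) -> str:
    """Return the message with the marker header removed."""
    msg = message_from_string(raw)
    del msg[MARK_HEADER]
    return msg.as_string()

def is_marked(raw: str) -> bool:
    """True if the message carries the marker header."""
    msg = message_from_string(raw)
    return str(msg.get(MARK_HEADER, "")).strip().lower() == "yes"
```

An archiver or indexer would then only need to call is_marked() and
skip the message, and a wrong annotation is undone with unmark().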
>> * If I understood correctly, if the filter() subroutine of any given
>> filter returns an (error) string, then indexing of this file aborts.
> I cannot answer this one. Of course, a simple test can be done to
> confirm the behavior. Unfortunately, the filtering aspects of namazu
> are not documented that well.
I was hoping that someone could stop me if that was a stupid idea or if
there was clear documentation somewhere. I will test.
> MHonArc also looks for:
> Restrict: no-external-archive
Good to know.
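
For the record, the kind of check I have in mind could look like the
sketch below. It is written in Python rather than as a namazu filter,
and the function names and the honor_no_archive option are invented for
illustration. The Restrict value is the one Earl quotes above; the
X-No-Archive: yes test is my assumption about the usual no-archive
header. The option defaults to off, so people indexing their personal
mail folders are not affected.

```python
from email import message_from_string

def requests_no_archive(raw: str) -> bool:
    """True if the message headers ask not to be archived."""
    msg = message_from_string(raw)
    # Assumed convention: "X-No-Archive: yes"
    x_no_archive = str(msg.get("X-No-Archive", "")).strip().lower()
    # Quoted in this thread: "Restrict: no-external-archive"
    restrict = str(msg.get("Restrict", "")).strip().lower()
    return x_no_archive == "yes" or restrict == "no-external-archive"

def should_index(raw: str, honor_no_archive: bool = False) -> bool:
    """Decide whether to index a message.

    With honor_no_archive left at False (the default), every message
    is indexed, whatever its headers say.
    """
    if not honor_no_archive:
        return True
    return not requests_no_archive(raw)
```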
Thanks a lot, Earl.