[Namazu-devel-en] Re: making filter/mailnews.pl understand "no archive" directive

Earl Hood earl at earlhood.com
Sat Sep 11 00:24:26 JST 2004

On September 7, 2004 at 17:35, Olivier Thereaux wrote:

> Namazu appears to already nicely implement the robots exclusion
> protocol (for HTML) (as seen in the filter() subroutine in
> filter/html.pl), and I am planning to hack namazu to make it behave
> similarly (for mailnews), and I was wondering if you would be able to
> answer these few questions.

You do encounter some semantic problems with adding no-archive support
directly into namazu.  Namazu is a search indexing tool, so the
concept of an "archive" is separate from namazu, even though many use
it to index mail archives.

An example of where no-archive support is not desired is for those
that use namazu to index their personal mail folders.  In this case,
the user wants messages with a no-archive indicator to be indexed.

IMHO, if one does not want messages with a no-archive designator to be
indexed, those messages should not be part of namazu's input.
For example, if namazu is being used to index a mail archive, messages
with a no-archive indicator should not be placed in the archive in
the first place.

It may help if you provide some context on why you desire such a
feature in namazu to see if patching namazu is the best solution for
your problem.

> * If I understood correctly, if the filter() subroutine of any given
> filter returns an (error) string, then indexing of this file aborts.
> Could you confirm this?

I cannot answer this one.  Of course, a simple test can be done to
confirm the behavior.  Unfortunately, the filtering aspects of namazu
are not documented that well.

> * There is no consensus on whether "X-no-archive: " is the only header
> that should trigger such a mechanism. Arguably, for indexing, it could
> also be "X-no-index:", and other headers such as "X-Spam-Status: Yes"
> would be nice, too.

MHonArc also looks for:

  Restrict: no-external-archive

The Restrict: header field is the formal way of indicating no archiving,
but X-no-archive: is still widely used.
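
As a rough illustration of checking both header fields before a message
ever reaches the indexer, here is a minimal Python sketch. The function
name is_no_archive is made up for this example, and the value matching
is an assumption: it treats "X-No-Archive: yes" and a Restrict: field
containing the token "no-external-archive" (possibly in a comma-separated
list) as no-archive requests. It is not how namazu or MHonArc implement
the check.

```python
from email import message_from_string


def is_no_archive(raw_message: str) -> bool:
    """Return True if the message asks not to be archived.

    Hypothetical helper: the accepted values below are assumptions,
    not taken from namazu or MHonArc source code.
    """
    msg = message_from_string(raw_message)

    # Header names are matched case-insensitively by the email package.
    # Assumption: a value beginning with "yes" means no archiving.
    for value in msg.get_all("X-No-Archive", []):
        if value.strip().lower().startswith("yes"):
            return True

    # Assumption: Restrict: may carry a comma-separated list of tokens.
    for value in msg.get_all("Restrict", []):
        tokens = [token.strip().lower() for token in value.split(",")]
        if "no-external-archive" in tokens:
            return True

    return False
```

A script that populates the archive could call this on each incoming
message and skip the ones that return True, so the indexer never sees
them and personal-folder users are unaffected.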


