[Namazu-devel-en] making filter/mailnews.pl understand "no archive" directive

Olivier Thereaux ot at w3.org
Tue Sep 7 17:35:58 JST 2004

Hello Namazu developers,

I have been doing some research on how indexing/archiving software can
be told to ignore certain RFC (2)822 mail/news messages.

A similar idea is already implemented for HTML documents in the form of
"noindex, nofollow" directives in the robots exclusion protocol.  Most
of the mail archiving software I could find (e.g. MHonArc[1],
hypermail[2]) implements an "X-no-archive" directive, which makes it
ignore the message carrying that header.

[1] http://www.mhonarc.org/MHonArc/doc/resources/checknoarchive.html
[2] http://www.hypermail.org/source/docs/hmrc.html#deleted

Namazu already appears to implement the robots exclusion protocol
nicely for HTML (as seen in the filter() subroutine in
filter/html.pl). I am planning to hack Namazu to make it behave
similarly for mailnews, and I was wondering if you would be able to
answer a few questions.

* If I understand correctly, when the filter() subroutine of any given
filter returns an (error) string, indexing of that file is aborted.
Could you confirm this?

* There is no consensus on whether "X-no-archive:" is the only header
that should trigger such a mechanism. Arguably, for indexing, it could
also be "X-no-index:", and other headers such as "X-Spam-Status: Yes"
would be nice, too. Would it be OK if the names of the trigger headers
were made an option in mknmzrc?
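
To make the idea concrete, here is a minimal sketch of the check I have
in mind. It is written in Python only for illustration; it is not
Namazu's Perl and not a patch. The names skip_reason and
DEFAULT_TRIGGERS, and the value patterns, are my own assumptions. The
convention of returning a string to skip the message mirrors my
(unconfirmed) understanding of filter() above.

```python
import re
from email import message_from_string

# Hypothetical defaults: header name -> pattern its value must match.
# In a real patch this table would come from a configuration option.
DEFAULT_TRIGGERS = {
    "X-No-Archive": r"^\s*yes\b",
    "X-No-Index": r"^\s*yes\b",
    "X-Spam-Status": r"^\s*yes\b",
}


def skip_reason(raw_message, triggers=DEFAULT_TRIGGERS):
    """Return a reason string if the message asks not to be indexed,
    or None if it should be indexed normally."""
    msg = message_from_string(raw_message)
    for name, pattern in triggers.items():
        # Header lookup is case-insensitive; a header may occur more than once.
        for value in msg.get_all(name, []):
            if re.search(pattern, value, re.IGNORECASE):
                return f"skipped: {name}: {value.strip()}"
    return None
```

A caller would skip the message whenever skip_reason() returns a
string. Keeping the triggers in a table rather than hard-coding
"X-no-archive" is what would let the header list be overridden from
the configuration file.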

And finally... when I am done with this patch, would you be interested
in including it in the Namazu distribution?

Thank you very much.


