How to make a Namazu document filter
- for Namazu 2.0  -

2001/9/21  Kenji Suzuki 
2001/7/7   Kenji Suzuki 
version 0.0.3

-----------------------------------------------
This document is under construction.
The description of add_magic() is not exact.
If you find errors, omissions, or unclear points,
please let me know.
-----------------------------------------------


** What is a document filter?
A document filter is a module (a Perl script) that extracts information
(text) from the files to be indexed.
Namazu can handle various kinds of files by providing a filter for each
kind of file.
"Weighted scoring" and "making summary" can also be done in a document
filter.


** Where document filters are installed
Document filters are installed into {prefix}/share/namazu/filter/.
By default, this is /usr/local/share/namazu/filter/.
A new document filter installed into this directory becomes
available automatically.


** Interface of document filters
The following subroutines must be defined in a document filter:

mediatype()
status()
recursive()
pre_codeconv()
post_codeconv()
add_magic($)
filter($$$$$)
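
As a rough illustration of the interface listed above, here is a minimal
skeleton that defines all seven subroutines. The package name "myfilter"
and the media type "text/x-example" are hypothetical. This document does
not describe the five arguments of filter(), so the argument names, the
assumption that the content arrives as a scalar reference, and the
"return undef on success" convention are all assumptions, not a
definition of Namazu's behaviour.

```perl
package myfilter;    # hypothetical filter name
use strict;
use warnings;

# Media type(s) this filter accepts (hypothetical type for illustration).
sub mediatype () { return ('text/x-example'); }

# 'yes' when the filter is usable on this system, 'no' otherwise.
sub status () { return 'yes'; }

# 1 if the output must be passed through another filter, else 0.
sub recursive () { return 0; }

# 1 to request encoding conversion before / after filter(), else 0.
sub pre_codeconv ()  { return 1; }
sub post_codeconv () { return 0; }

# Register extra file-type detection rules on the object passed in.
sub add_magic ($) {
    my ($magic) = @_;
    # e.g. $magic->addSpecials(...), see the add_magic() section
    return;
}

# ASSUMPTION: argument names, and that $cont is a reference to the text.
sub filter ($$$$$) {
    my ($orig_cfile, $cont, $weighted_str, $headings, $fields) = @_;

    # Placeholder processing: squeeze runs of spaces and tabs.
    $$cont =~ s/[ \t]+/ /g;

    return undef;    # ASSUMPTION: undef signals success
}

1;
```

A real filter would replace the placeholder processing in filter() with
the extraction logic for its own file format.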


* mediatype()
Returns the media type of the files this filter processes, for example:

text/x-hdml
application/postscript
application/x-compress
etc.

If a filter can handle several media types, return all of them
as an array (e.g. mailnews.pl).
We recommend returning IANA-registered media types.


* status()
Normally returns yes. If a document filter uses an external command and
that command is not installed on the system, the filter cannot handle
documents correctly. In that case, return no.
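
The check described above could be sketched as follows. This is only one
possible way to do it, using the core module File::Spec to search PATH by
hand. The package name "mycmdfilter", the helper find_command(), and the
choice of "gzip" as the external command are hypothetical; Namazu's own
filters may use a different helper.

```perl
package mycmdfilter;    # hypothetical filter name
use strict;
use warnings;
use File::Spec;

# Return the full path of $cmd if an executable file of that name
# exists in one of the PATH directories, otherwise return nothing.
sub find_command {
    my ($cmd) = @_;
    for my $dir (File::Spec->path()) {
        my $candidate = File::Spec->catfile($dir, $cmd);
        return $candidate if -f $candidate && -x _;
    }
    return;
}

# Hypothetical external command this filter depends on.
my $gzip_path = find_command('gzip');

# 'yes' only when the external command was found.
sub status () {
    return defined $gzip_path ? 'yes' : 'no';
}

1;
```

Looking the command up once, when the filter is loaded, keeps status()
cheap if it is called repeatedly.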


* recursive()
Suppose an HTML file is compressed with gzip. It must first be handled
as application/x-gzip and then, after decompression, as text/html.
If the output of a filter needs to be processed again by another filter
in this way, return 1. Otherwise return 0.
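
To make the gzip case concrete, here is a sketch of a filter that only
decompresses and leaves further processing to a second pass. It is not
Namazu's actual gzip filter: the package name "mygzip" is hypothetical,
the core module IO::Uncompress::Gunzip is used for the decompression, and
the filter() argument names, the content being passed by reference, and
the return convention are assumptions.

```perl
package mygzip;    # hypothetical filter name
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

sub mediatype () { return ('application/x-gzip'); }

# The decompressed data still needs its own filter (e.g. text/html),
# so ask for another round of processing.
sub recursive () { return 1; }

# ASSUMPTION: $cont is a reference to the raw file content.
sub filter ($$$$$) {
    my ($orig_cfile, $cont, $weighted_str, $headings, $fields) = @_;

    my $plain;
    gunzip($cont => \$plain)
        or return "gunzip failed: $GunzipError";

    $$cont = $plain;    # hand the decompressed data back
    return undef;       # ASSUMPTION: undef signals success
}

1;
```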


* pre_codeconv()
If you want the Japanese character encoding (Kanji code) of a document
to be converted before filter() is called,
return 1. Otherwise return 0. Namazu uses EUC internally.


* post_codeconv()
If you want the Japanese character encoding (Kanji code) of a document
to be converted after filter() is called,
return 1. Otherwise return 0. Namazu uses EUC internally.


* add_magic()
If File::MMagic fails to recognize a file type,
you can add recognition rules with the File::MMagic methods below.

$magic->addSpecials

  eg:
  $magic->addSpecials('text/x-hdml', '<[Hh][Dd][Mm][Ll][^>]*>');
  $magic->addSpecials("text/plain; x-type=rfc",
                        "^Network Working Group",
                        "^Request [fF]or Comments",
                        "^Obsoletes:",
                        "^Category:",
                        "^Updates:");

$magic->addFileExts
  Specify a file extension. This seems to be intended for formats such
  as Microsoft Office documents, for which a correct magic entry
  cannot be written?

  eg:
  $magic->addFileExts('^rfc\d+\.txt$', 'text/plain; x-type=rfc');
  $magic->addFileExts('\\.tex$', 'application/x-tex');
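
  The two patterns above can be tried out on their own. The sketch below
  is not File::MMagic's implementation. It is a hypothetical lookup table
  (type_for_name() is an invented helper) that applies the same regular
  expressions to bare file names. Whether File::MMagic matches against
  the bare name or the full path is not stated here. Note that in a
  single-quoted Perl string '\\.' and '\.' both yield a backslash
  followed by a dot, so the two styles in the example are equivalent.

```perl
use strict;
use warnings;

# Same pattern/type pairs as in the addFileExts example.
my @rules = (
    [ '^rfc\d+\.txt$', 'text/plain; x-type=rfc' ],
    [ '\\.tex$',       'application/x-tex' ],
);

# Hypothetical helper: return the type of the first rule whose
# pattern matches the file name, or nothing if no rule matches.
sub type_for_name {
    my ($name) = @_;
    for my $rule (@rules) {
        my ($pattern, $type) = @$rule;
        return $type if $name =~ /$pattern/;
    }
    return;
}

for my $name ('rfc2616.txt', 'paper.tex', 'notes.txt') {
    my $type = type_for_name($name);
    printf "%-12s %s\n", $name, defined $type ? $type : '(no match)';
}
```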

$magic->addMagicEntry
  Specify a magic entry.

  eg:
  $magic->addMagicEntry('0    string          \