Namazu-devel-ja (old)



NMZ.i ( Re: http://www.namazu.org/doc/nmz.html )



>                                            Nagasu, Chuo-ku, Chiba City
>                                                    Makoto Fujiwara
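This continues the earlier discussion on namazu-users-ja about how
to describe the format of NMZ.i:
   http://www.namazu.org/ml/namazu-users-ja/msg03313.html

The same description also appears in
stable-2-0/namazu/doc/ja/nmz.html
so I think it should be corrected there as well. Two questions: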
(1) I believe Nokubi-san wrote earlier that the Web page had been
    fixed, but the Web version does not seem to be fixed either?
    http://www.namazu.org/doc/nmz.html
   (Has it been reverted?)

(2) I think the same document exists as
   stable-2-0/namazu/doc/ja/nmz.html
   but how is it related to the data on the Web?

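So I prepared the English version first. I have the feeling that the
first data item could simply be called "Record length"
(while noting that in many cases it equals the number of documents * 2).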
-------------------------------------------------------------
The file is simply a sequence of BER compressed integers.
It can, however, be viewed as a sequence of records, each of
which holds the data for one particular word.

For the first word, the record is:
 +---------------+------------+-------+------------+-------+
 | Record length | documentID | score | documentID | score |....
 +---------------+------------+-------+------------+-------+
Record length is the byte count of this record, so records are of
variable length. Each documentID that follows identifies a document
that contains the first word, and the score after it is the score of
that word-documentID pair:
                 +------------+-------+
                 | documentID | score |
                 +------------+-------+
Note: 
(1) documentIDs are stored in increasing order.
(2) Each documentID is actually recorded as the difference from the
    previous one. Say, 1, 5, 29, 34 -> 1, 4, 24, 5
(3) Each value is BER compressed (pack 'w' in Perl).
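
Note (2) can be illustrated with a small Python sketch. The function
names to_gaps and from_gaps are hypothetical and are not taken from
Namazu; the sketch also assumes that the first documentID is stored
as its difference from 0, which matches the 1 -> 1 in the example.

```python
def to_gaps(doc_ids):
    """Turn increasing document IDs into differences from the previous ID.

    Assumption: the first ID is stored as its difference from 0.
    """
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Rebuild the document IDs by accumulating the stored differences."""
    doc_ids = []
    current = 0
    for gap in gaps:
        current += gap
        doc_ids.append(current)
    return doc_ids


print(to_gaps([1, 5, 29, 34]))   # [1, 4, 24, 5]
print(from_gaps([1, 4, 24, 5]))  # [1, 5, 29, 34]
```

Storing differences keeps the numbers small, so more of them fit in
a single BER compressed byte.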

For the second word, a record in the same format follows:
 +---------------+------------+-------+------------+-------+
 | Record length | documentID | score | documentID | score |....
 +---------------+------------+-------+------------+-------+

Note:
(4) BER compression:
BER stands for Basic Encoding Rules; the compressed integer format is
the one produced by pack 'w' in Perl. Each compressed value occupies
at most two bytes, assuming the original is in the 0-16383 range
(one byte for 0-127, two bytes for 128-16383).

(5) Record length:
When every value fits in one byte, 'Record length' is twice the
number of documents that contain the respective word; this is the
usual case.
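
The record layout and the two notes on BER compression and record
length can be sketched in Python as follows. This is only an
illustration, not Namazu's own code: the helper names (ber_encode,
ber_decode_all, build_record, parse_record) are hypothetical, and the
sketch assumes that 'Record length' counts the bytes that follow the
length field, not the length field itself.

```python
def ber_encode(value):
    """Encode a non-negative integer like Perl's pack 'w' (base 128, big-endian)."""
    if value < 0:
        raise ValueError("BER compressed integers must be non-negative")
    out = [value & 0x7F]                   # last byte: high bit clear
    value >>= 7
    while value:
        out.append((value & 0x7F) | 0x80)  # earlier bytes: high bit set
        value >>= 7
    return bytes(reversed(out))


def ber_decode_all(data):
    """Decode a byte string that is a plain sequence of BER compressed integers."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if not byte & 0x80:                # high bit clear ends one integer
            values.append(current)
            current = 0
    return values


def build_record(postings):
    """Build one record from (documentID, score) pairs with increasing IDs."""
    body = bytearray()
    previous = 0
    for doc_id, score in postings:
        body += ber_encode(doc_id - previous)  # difference from previous ID
        body += ber_encode(score)
        previous = doc_id
    # Assumption: the length field counts only the bytes after it.
    return ber_encode(len(body)) + bytes(body)


def parse_record(data, offset=0):
    """Return (postings, next_offset) for the record starting at offset."""
    length = 0
    while True:                            # read the BER compressed length
        byte = data[offset]
        offset += 1
        length = (length << 7) | (byte & 0x7F)
        if not byte & 0x80:
            break
    values = ber_decode_all(data[offset:offset + length])
    postings, doc_id = [], 0
    for i in range(0, len(values), 2):
        doc_id += values[i]                # undo the difference encoding
        postings.append((doc_id, values[i + 1]))
    return postings, offset + length


record = build_record([(1, 3), (5, 7), (29, 2), (34, 11)])
print(list(record))          # [8, 1, 3, 4, 7, 24, 2, 5, 11]
print(parse_record(record))  # ([(1, 3), (5, 7), (29, 2), (34, 11)], 9)
print(len(ber_encode(127)), len(ber_encode(128)), len(ber_encode(16383)))  # 1 2 2
```

In this example every value is below 128, so the record length is 8,
twice the four documents. As soon as a difference or a score reaches
128, that value takes two bytes and the length is no longer exactly
twice the number of documents.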

---
(Fujiwara)