Namazu-users-en(old)



Re: Hello..



On Sat, Mar 18, 2000 at 01:31:43AM +0900, Satoru Takabayashi wrote:
> Peter Marelas <maral@xxxxxxxxxxxxxxxx> wrote:
> 
> >> Thank you for your information.  It sounds great.  Since
> >> Namazu's indexer, called mknmz, is written in Perl, indexing
> >> takes rather a long time.  Ryuji Abe has a plan to rewrite
> >> mknmz in C. It would be great if we could employ mifluz for
> >> the task.
> >
> >Certainly mifluz is up to the task. You may have read already
> >that mifluz is designed to index a large (10 million+) number of words.
> >Mifluz relies on a modified version of Berkeley DB's B+tree
> >(we added compression) for storing its index. The structure
> >employed makes updates very fast. There is some work going on
> >to improve the structure.
> 
> Speaking of Namazu, as the README says "for a small or medium
> scale Web search engine", Namazu is not designed to index
> a large number of documents.  As far as I know, the largest
> Namazu index ever made is as follows:
> 
>   Documents:  878,914 files
>   Total size: 2,167,480,108 bytes
 
I'm curious: how fast is query performance on that index?

> On the other hand, mifluz Web site says:
> 
> <http://www.senga.org/mifluz/html/description.html>
> |   mifluz has been designed with the further upper limits in mind : 500
> |   million documents, 50 giga words, 20 million document updates per day.
> 
> It is terrific!
>
> >I would be interested if the people who designed Namazu's
> >index structure critiqued the mifluz structure, as the
> >structure is the key to fast updates and query performance.
> 
> I am the designer of Namazu's index structure.  The
> structure is a very simple inverted index.  It is easy to
> implement both indexer and search engine, but it is not fast
> to update.  See the following page for details.
> <http://www.namazu.org/doc/nmz.html.en>
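
For readers unfamiliar with the term, a simple inverted index can be
sketched as follows. This is a generic in-memory illustration, not
Namazu's actual file format (see the page above for that); the names
build_index and search_and are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


def search_and(index, words):
    """Return the IDs of documents that contain every query word."""
    result = None
    for word in words:
        ids = set(index.get(word.lower(), ()))
        result = ids if result is None else result & ids
    return sorted(result or ())


docs = {
    1: "namazu is a search engine",
    2: "mifluz is an index library",
    3: "namazu index structure",
}
index = build_index(docs)
hits = search_and(index, ["namazu", "index"])
```

Both functions are short, which matches the "easy to implement" remark.
The update cost shows up once the posting lists are stored contiguously
on disk: adding one document touches the list of every word it contains.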
 
The major difference is that your index grows outwards,
i.e. from left to right. You also build different
indexes to answer different kinds of queries, e.g. phrase queries.
The mifluz index grows downwards, from top to bottom.
All key/value pairs are stored sorted in a B+tree.
The B+tree applies prefix compression, which saves
space. Currently only the key is used to store data.
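
To illustrate why sorted keys compress well, here is a minimal sketch of
prefix (front) compression: each key is stored as the number of leading
bytes it shares with the previous key plus the remaining suffix. This is
the generic technique only, not the actual Berkeley DB or mifluz on-disk
layout; front_encode and front_decode are hypothetical names.

```python
def front_encode(sorted_keys):
    """Encode sorted byte-string keys as (shared_prefix_len, suffix) pairs."""
    out = []
    prev = b""
    for key in sorted_keys:
        n = 0
        limit = min(len(prev), len(key))
        # Count the leading bytes shared with the previous key.
        while n < limit and prev[n] == key[n]:
            n += 1
        out.append((n, key[n:]))
        prev = key
    return out


def front_decode(encoded):
    """Rebuild the original keys from (shared_prefix_len, suffix) pairs."""
    keys = []
    prev = b""
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


keys = sorted([b"index", b"indexer", b"indexing", b"inverted"])
encoded = front_encode(keys)
```

Because neighbouring keys in a sorted structure tend to share long
prefixes, most entries shrink to a small count plus a short suffix.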

There have been many discussions regarding the structure
and its design on the mailing list. If you like, the archives
are at http://www.egroups.com/list/sengamifluz/info.html

> I just printed out mifluz.texinfo and read it.  I notice
> that it is really a high-performance library.  But at the
> moment, I don't know whether or not it is good to employ
> mifluz for Namazu.

Mifluz is generalised enough (at the moment) that it will
cater for most requirements. The fact that it is generalised
can be a problem, though. In fact, I use my own indexer derived from
the mifluz structure. If I can prove my optimizations are
worthwhile, they should end up in mifluz.

> Since Namazu is an easy-to-use search system, features which
> mifluz provides are perhaps too much. 

I don't think mifluz was designed to provide many features.
In fact, I would say it's the opposite.

It provides an API to put words and other data into, and take them
out of, an index in a user-defined sort order. That's pretty much what
mifluz gives you. There are other products produced by Senga that
use mifluz for indexing, like the crawler and catalog system.

> We mainly use Namazu
> for intranet or personal use.  In my opinion, the latter
> will become more important because people get a large number of
> emails nowadays.  That's why Namazu emphasizes mail/news and
> MHonArc support.
> 
> For the present, we at the Namazu project are concentrating on
> the development of Namazu 2.x.  TODOs are:
> 
>   * Support index compression with zlib.

Mifluz uses zlib and some bit compression. Coupled with
sorted B+trees, it achieves 1/8th compression on disk.

>   * Improve index merging.  O(n^2) -> O(n log n)
>   * Rewrite query operations with lex and yacc.
>   * Make the source code clear.  Throw legacy code away.
> 
> When the above TODOs are completed, we will change over to 3.0
> development and decide whether to employ mifluz.  I hope
> mifluz's APIs will be fixed and well documented by that
> time. :-)

I've asked the main developer to join this
list, and he has. I'm sure he will pass on his thoughts as
well.

Regards
Peter Marelas