namazu-dev(ring)



stopword (Re: How to get tf value?)



Hiroshi KOMATSU <sui_feng@xxxxxxxxxxxxx> wrote:

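> One thing I am struggling with is the removal of 'stop words'.
>
> In gnmz for 1.3.x, after reading the index, I check whether $word is a
> stop word and, if it is a hit, do next;. Because of this, I could not
> reuse the index-reading functions provided for 1.3.x as they are, and
> ended up reinventing the wheel. kwnmz seems to handle this by excluding
> "too many" words, but 「第」 (ordinal prefix) and 「条」 (article), for
> example, are "too many" in legal texts and not much use for searching,
> yet they are very useful for telling legal provisions apart from other
> documents. In short, I want to switch stop words depending on the case.
>
> For a search engine for English text, I think the problem of stop-word
> removal will sooner or later be unavoidable, so I would be grateful if
> you could consider it.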

Namazu's basic policy is not to remove stopwords, because removing a
word by mistake lowers recall.

  Modern Information Retrieval
  <http://www1.fatbrain.com/asp/bookinfo/bookinfo.asp?theisbn=020139829x>

According to Modern Information Retrieval, many Web search engines
reportedly perform neither stopword removal nor stemming.

# The book gives the example that removing stopwords makes
# "to be or not to be" unsearchable
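
The "to be or not to be" case can be reproduced with a few lines. This is only a minimal sketch in Python, not Namazu code: the function name remove_stopwords and the small STOPWORDS set are assumptions made for illustration (the set is a subset of the list attached at the end of this mail).

```python
# Hypothetical mini stoplist: a subset of the 425-word list attached below.
STOPWORDS = {"to", "be", "or", "not", "the", "of", "and"}

def remove_stopwords(query, stopwords=STOPWORDS):
    """Return the query terms that survive stopword removal."""
    return [w for w in query.lower().split() if w not in stopwords]

hamlet = remove_stopwords("to be or not to be")
other = remove_stopwords("the history of Namazu")
print(hamlet)  # []
print(other)   # ['history', 'namazu']
```

Every term of the first query is a stopword, so nothing is left to look up in the index.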

I looked into stopwords briefly, and the book

  Information Retrieval Data Structures and Algorithms
  <http://www1.fatbrain.com/asp/bookinfo/bookinfo.asp?theisbn=0134638379>

contains a list of 425 English stopwords. For reference, I have
attached it at the end of this mail. (I had someone with spare time
type it in :-)

Perhaps it would be good to add an option to mknmz that removes
stopwords using this list?

# A commercial search service called ORBIT reportedly uses only 8
# stopwords: and, an, by, from, of, the, with
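
As a rough sketch of what an optional, switchable stoplist at indexing time could look like (in Python, not the Perl of mknmz): index_terms and its stopwords parameter are hypothetical names, not an existing mknmz option, and the stoplist is simply the ORBIT words quoted above. When no list is passed, every word is indexed, which matches the current policy.

```python
from collections import Counter

# The ORBIT words quoted above, used here as a small example stoplist.
ORBIT_STOPWORDS = frozenset(["and", "an", "by", "from", "of", "the", "with"])

def index_terms(text, stopwords=None):
    """Count term frequencies, skipping stopwords only when a list is given."""
    words = text.lower().split()
    if stopwords:
        words = [w for w in words if w not in stopwords]
    return Counter(words)

doc = "the design of the index and the search engine"
full = index_terms(doc)                       # default: nothing removed
filtered = index_terms(doc, ORBIT_STOPWORDS)  # stoplist chosen by the caller
print(full["the"])       # 3
print(sorted(filtered))  # ['design', 'engine', 'index', 'search']
```

Passing the list as an argument means a different stoplist, or none at all, can be chosen per collection, for example keeping 「第」 and 「条」 when classifying legal texts.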

For Japanese, specifying the -m option to mknmz makes it
"register only nouns in the index".
By modifying pl/wakati.pl around line 55, you can specify any part of
speech.
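
The idea of indexing only selected parts of speech can be sketched as follows. This is an illustration in Python under stated assumptions and does not reproduce pl/wakati.pl: the (surface, part of speech) pair format, the tag names, and the function keep_parts_of_speech are all hypothetical.

```python
# Hypothetical analyzer output: (surface form, part of speech) pairs.
tagged = [
    ("法律", "noun"),
    ("の", "particle"),
    ("条文", "noun"),
    ("を", "particle"),
    ("読む", "verb"),
]

def keep_parts_of_speech(tagged_tokens, allowed=("noun",)):
    """Keep only the tokens whose part of speech is in `allowed`."""
    return [surface for surface, pos in tagged_tokens if pos in allowed]

nouns = keep_parts_of_speech(tagged)
nouns_and_verbs = keep_parts_of_speech(tagged, allowed=("noun", "verb"))
print(nouns)            # ['法律', '条文']
print(nouns_and_verbs)  # ['法律', '条文', '読む']
```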

If needed, I will add a feature for handling stopwords. What kind of
feature would you like?

-- Satoru Takabayashi

a
about
above
across
after
again
against
all
almost
alone
along
already
also
although
always
among
an
and
another
any
anybody
anyone
anything
anywhere
are
area
areas
around
as
ask
asked
asking
asks
at
away
b
back
backed
backing
backs
be
because
became
become
becomes
been
before
began
behind
being
beings
best
better
between
big
both
but
by
c
came
can
cannot
case
cases
certain
certainly
clear
clearly
come
could
d
did
differ
different
differently
do
does
done
down
downed
downing
downs
during
e
each
early
either
end
ended
ending
ends
enough
even
evenly
ever
every
everybody
everyone
everything
everywhere
f
face
faces
fact
facts
far
felt
few
find
finds
first
for
four
from
full
fully
further
furthered
furthering
furthers
g
gave
general
generally
get
gets
give
given
gives
go
going
good
goods
got
great
greater
greatest
group
grouped
grouping
groups
h
had
has
have
having
he
her
herself
here
high
higher
highest
him
himself
his
how
however
i
if
important
in
interest
interested
interesting
interests
into
is
it
its
itself
j
just
k
keep
keeps
kind
knew
know
known
knows
l
large
largely
last
later
latest
least
less
let
lets
like
likely
long
longer
longest
m
made
make
making
man
many
may
me
member
members
men
might
more
most
mostly
mr
mrs
much
must
my
myself
n
necessary
need
needed
needing
needs
never
new
newer
newest
next
no
non
not
nobody
noone
nothing
now
nowhere
number
numbered
numbering
numbers
o
of
off
often
old
older
oldest
on
once
one
only
open
opened
opening
opens
or
order
ordered
ordering
orders
other
others
our
out
over
p
part
parted
parting
parts
per
perhaps
place
places
point
pointed
pointing
points
possible
present
presented
presenting
presents
problem
problems
put
puts
q
quite
r
rather
really
right
room
rooms
s
said
same
saw
say
says
second
seconds
see
seem
seemed
seeming
seems
sees
several
shall
she
should
show
showed
showing
shows
side
sides
since
small
smaller
smallest
so
some
somebody
someone
something
somewhere
state
states
still
such
sure
t
take
taken
than
that
the
their
them
then
there
therefore
these
they
thing
things
think
thinks
this
those
though
thought
thoughts
three
through
thus
to
today
together
too
took
toward
turn
turned
turning
turns
two
u
under
until
up
upon
us
use
uses
used
v
very
w
want
wanted
wanting
wants
was
way
ways
we
well
wells
went
were
what
when
where
whether
which
while
who
whole
whose
why
will
with
within
without
work
worked
working
works
would
x
y
year
years
yet
you
young
younger
youngest
your
yours
z