namazu-dev(ring)


Re: tolower()

From: Satoru Takabayashi <satoru-t@xxxxxxxxxxxxxxxxxx>
Date: Mon, 03 Jan 2000 23:02:52 +0900
References: <200001011904.EAA15866@ring.etl.go.jp> <200001031043.TAA29967@ring.etl.go.jp> <200001031206.VAA02273@ring.etl.go.jp>

Ryuji Abe <raeva@xxxxxxxxxxxx> wrote:

>> The indexing stage (mknmz) converts uppercase to lowercase
>> without regard to the locale, so the search side (namazu)
>> should match it and also convert uppercase to lowercase
>> without regard to the locale.
>
>I see. So mknmz does
>    # Normalize into small letter.
>    $$contref =~ tr/A-Z/a-z/;
>this. In that case an ASCII-dependent conversion is the better choice.
>I had been thinking it would be nice if tolower also worked for ISO-8859-1 and the like.
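
A minimal sketch of what a locale-independent conversion on the search side
could look like in C. The function names ascii_tolower and ascii_strlower are
hypothetical and are not taken from the namazu source; the sketch also assumes
an ASCII-compatible execution character set.

```c
#include <stddef.h>

/* Hypothetical helper: lowercase one byte using ASCII rules only.
   Unlike tolower(), it never consults the current locale. */
static int ascii_tolower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Hypothetical helper: lowercase a NUL-terminated string in place.
   Only the bytes 'A'..'Z' change; every other byte is left as-is. */
static void ascii_strlower(char *s)
{
    if (s == NULL)
        return;
    for (; *s != '\0'; s++)
        *s = (char)ascii_tolower((unsigned char)*s);
}
```

Applied to a query string before the index lookup, this leaves bytes outside
A-Z untouched (for example the ISO-8859-1 byte 0xC9), which is the same
behavior as the tr/A-Z/a-z/ in mknmz quoted above.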

To convert uppercase to lowercase in Perl with locale awareness,
you write code like the following.

  use locale;
  use POSIX qw(locale_h);
  setlocale(LC_CTYPE, "fr_CA.ISO8859-1");
  $foo = lc($foo);

The key point is the use of lc(). A plain tr/A-Z/a-z/ will not
do the job. Incidentally, it seems that "an uppercase letter" is
written as [^\W0-9a-z_] in a regular expression.

# The Perl Cookbook covers this in detail. I don't have it at hand
# right now, so I can't check.


>By the way, grep-2.4 has now been released, and looking through
>its regex.c I found
>
>/* For platform which support the ISO C amendement 1 functionality we
>   support user defined character classes.  */
>#if defined _LIBC || WIDE_CHAR_SUPPORT
>/* Solaris 2.5 has a bug: <wchar.h> must be included before <wctype.h>.  */
># include <wchar.h>
># include <wctype.h>
>#endif
>
(snip)
>this part. Has it been there all along?

I checked. It is not in grep 2.0. It has been there since 2.2.

Incidentally, in 2.1 the includes were

  # include <wctype.h>
  # include <wchar.h>

in that order. :-)

-- Satoru Takabayashi
