[Namazu-users-en] Re: Problems with date sorts

Lindsay Haisley fmouse-namazu at fmp.com
Sat Jan 13 08:19:00 JST 2007


Thus spake Tadamasa Teranishi on Fri, Jan 12, 2007 at 09:18:26AM CST
> Lindsay Haisley wrote:
> > 
> > I'm running into a rather nasty problem with date sorting on Mailman pipermail
> > archives.  When I sort on date:early or date:late there appears to be some
> > other sort being applied, although if I do a date sort in the reverse order the
> > order of the messages is indeed reversed, indicating that the sort is working,
> > albeit with an incorrect algorithm.
> 
> Does date information accurately follow the form of RFC2822 by 
> all documents of MailMan?
> 
> Is there mail with an illegal Date: field ?
> 
> Please show the Date: field of the mail.

OK, here is an example.  I used the following query:

http://www.kca-tx.org/mailman/kca/namazu.cgi?query=Laptop&submit=Search%21&idxname=kca&max=100&result=short&sort=date%3Aearly

Here's the result:

1. win 98SE (score: 2)
    /pipermail/kca/2002-September/000192.html (4,152 bytes)

2. Linux install plus a note on jedit (score: 2)
    /pipermail/kca/2002-July/000052.html (4,432 bytes)

3. Canon BJC-2100, Restart in DOS mode (score: 2)
    /pipermail/kca/2002-August/000103.html (3,073 bytes)

4. March Newscard (score: 2)
    /pipermail/kca/2003-March/000353.html (4,296 bytes)

5. New TurboTax "feature" (score: 2)
    /pipermail/kca/2003-January/000331.html (6,865 bytes)

You can see from the path names that these are out of order.  Here are the
dates shown in each of these, copied and pasted from the files themselves:

Sun Sep 22 14:07:51 CDT 2002

Fri Jul 19 11:54:00 CDT 2002

Fri Aug 23 08:44:57 CDT 2002

Mon Mar  3 10:14:12 CST 2003

Tue Jan 14 17:37:11 CST 2003
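
For comparison, here is a minimal sketch of the order a date:early sort
would be expected to produce for these five messages.  It is not Namazu
code.  The CST/CDT offsets, the helper name parse_archive_date and the
shortened paths are assumptions made for illustration, and it assumes
English month abbreviations.

```python
from datetime import datetime, timedelta, timezone

# Assumed UTC offsets for the two zone abbreviations in the listing above.
TZ_OFFSETS = {"CST": -6, "CDT": -5}

def parse_archive_date(text):
    """Parse e.g. 'Mon Mar  3 10:14:12 CST 2003' into an aware datetime."""
    # split() collapses the double space before single-digit days.
    weekday, month, day, clock, zone, year = text.split()
    naive = datetime.strptime(f"{month} {day} {clock} {year}",
                              "%b %d %H:%M:%S %Y")
    return naive.replace(tzinfo=timezone(timedelta(hours=TZ_OFFSETS[zone])))

# (path, date) pairs in the order the search returned them.
results = [
    ("2002-September/000192.html", "Sun Sep 22 14:07:51 CDT 2002"),
    ("2002-July/000052.html", "Fri Jul 19 11:54:00 CDT 2002"),
    ("2002-August/000103.html", "Fri Aug 23 08:44:57 CDT 2002"),
    ("2003-March/000353.html", "Mon Mar  3 10:14:12 CST 2003"),
    ("2003-January/000331.html", "Tue Jan 14 17:37:11 CST 2003"),
]

# Earliest first, which is what date:early should give.
expected = sorted(results, key=lambda item: parse_archive_date(item[1]))
for path, date in expected:
    print(date, path)
```

Sorted this way the messages come out as July, August, September 2002,
then January and March 2003, which is not the order returned above.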


The date information isn't in a standard RFC2822 header format once the files 
are in a pipermail archive, but embedded in HTML markup, e.g.:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
 <HEAD>
   <TITLE> New TurboTax &quot;feature&quot;
   </TITLE>
   <LINK REL="Index" HREF="index.html" >
   <LINK REL="made" HREF="mailto:kca%40lists.kca-tx.org?Subject=New%20TurboTax%20%22feature%22&In-Reply-To=20030114.100528.-140453.1.bstrohm%40juno.com">
   <META NAME="robots" CONTENT="index,nofollow">
   <META http-equiv="Content-Type" content="text/html; charset=us-ascii">
   <LINK REL="Previous"  HREF="000330.html">
   <LINK REL="Next"  HREF="000332.html">
 </HEAD>
 <BODY BGCOLOR="#ffffff">
   <H1>New TurboTax &quot;feature&quot;</H1>
    <B>Dale Cockle</B> <A HREF="mailto:kca%40lists.kca-tx.org?Subject=New%20TurboTax%20%22feature%22&In-Reply-To=20030114.100528.-140453.1.bstrohm%40juno.com"
       TITLE="New TurboTax &quot;feature&quot;">k5jic at kca-tx.org
       </A><BR>
    <I>Tue Jan 14 17:37:11 CST 2003</I>

etc....
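
To make the difference concrete, the following sketch pulls the date out
of the <I>...</I> line of a pipermail page and prints it in RFC 2822 form.
It is only an illustration: the function name rfc2822_date, the regular
expression and the CST/CDT offsets are assumptions, and it says nothing
about how Namazu itself reads dates from HTML files.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Assumed UTC offsets for the zone abbreviations seen in this archive.
TZ_OFFSETS = {"CST": -6, "CDT": -5}

# The date line as it appears in the page, e.g.
# <I>Tue Jan 14 17:37:11 CST 2003</I>
DATE_RE = re.compile(
    r"<I>\w{3} (\w{3}) +(\d{1,2}) (\d\d:\d\d:\d\d) (\w+) (\d{4})</I>"
)

def rfc2822_date(html):
    """Return the page's date as an RFC 2822 string, or None if not found."""
    match = DATE_RE.search(html)
    if match is None:
        return None
    month, day, clock, zone, year = match.groups()
    if zone not in TZ_OFFSETS:
        return None  # unknown abbreviation: do not guess an offset
    naive = datetime.strptime(f"{month} {day} {clock} {year}",
                              "%b %d %H:%M:%S %Y")
    aware = naive.replace(tzinfo=timezone(timedelta(hours=TZ_OFFSETS[zone])))
    return format_datetime(aware)

sample = "    <I>Tue Jan 14 17:37:11 CST 2003</I>\n"
print(rfc2822_date(sample))  # Tue, 14 Jan 2003 17:37:11 -0600
```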

Could that be a problem?  Should I perhaps be indexing the mbox file?  Would 
namazu understand that better?

-- 
Lindsay Haisley       | "Fighting against human |     PGP public key
FMP Computer Services |    creativity is like   |      available at
512-259-1190          |    trying to eradicate  | <http://pubkeys.fmp.com>
http://www.fmp.com    |        dandelions"      |
                      |      (Pamela Jones)     |

