From ot at w3.org  Mon Feb 21 13:09:32 2005
From: ot at w3.org (Olivier Thereaux)
Date: Mon Feb 21 13:09:38 2005
Subject: [Namazu-devel-en] Parsing of MIME multipart messages' boundary too strict
Message-ID: <20050221040932.GA2223@w3.mag.keio.ac.jp>

Hello Namazu developers,

I have recently noticed that Namazu (using the mailnews.pl filter
through the use of the -h option at indexing time) failed to properly
index some messages, in particular messages sent with Apple's Mail.app
with attachments. Searching for any kind of content from these messages
gives no result at all.

I tracked the issue down to the fact that when sending messages with
attachments, Mail.app (like most mail clients) sends a multipart/mixed
MIME message, with a boundary declaration. What Mail.app does not do
(that most other mailers do) is enclose the boundary in double quotes,
e.g. it uses
  Content-Type: multipart/mixed; boundary=gc0p4Jq0M2Yt08j34c0p
instead of
  Content-Type: multipart/mixed; boundary="gc0p4Jq0M2Yt08j34c0p"

Both are actually perfectly legitimate, even though the latter is
considered safer. Quoting RFC 2046:
[[
   WARNING TO IMPLEMENTORS:  The grammar for parameters on the
   Content-type field is such that it is often necessary to enclose
   the boundary parameter values in quotes on the Content-type line.
   This is not always necessary, but never hurts. Implementors should
   be sure to study the grammar carefully in order to avoid producing
   invalid Content-type fields.
]] -- http://www.faqs.org/rfcs/rfc2046.html

Namazu's filter/mailnews.pl is therefore too "safe" in its parsing of
the boundary, in effect ignoring some multipart messages it should not.
A very simple patch (from current HEAD) would be something like:

--- mailnews_orig.pl    Mon Feb 21 12:44:47 2005
+++ mailnews.pl Mon Feb 21 12:54:02 2005
@@ -203,8 +203,9 @@
        if ($contenttype =~ m!text/plain!){
            $$contref .= $body;
        } elsif ($contenttype =~ m!multipart/alternative!){
-           if ($head =~ /boundary="(.*?)"/i){
+           if ($head =~ /boundary=(.*?)/i){
                my $boundary2 = $1;
+               $boundary2 =~ s/"(.*?)"/$1/;
                util::dprint("((boundary: $boundary2))\n");
                $boundary2 =~ s/(\W)/\\$1/g;
                multipart_process(\$body, $boundary2, $weighted_str, $fields);

Would you mind checking the proposed patch and applying if it looks OK?

Thank you,
-- 
olivier Thereaux
http://www.w3.org/People/olivier/
http://yoda.zoy.org/

From knok at daionet.gr.jp  Tue Feb 22 08:35:00 2005
From: knok at daionet.gr.jp (NOKUBI Takatsugu)
Date: Tue Feb 22 08:35:02 2005
Subject: [Namazu-devel-en] Re: Parsing of MIME multipart messages' boundary too strict
In-Reply-To: <20050221040932.GA2223@w3.mag.keio.ac.jp>
References: <20050221040932.GA2223@w3.mag.keio.ac.jp>
Message-ID: <87ll9hbi6z.wl@knok.daionet.gr.jp>

At Mon, 21 Feb 2005 13:09:32 +0900,
Olivier Thereaux wrote:
>   Content-Type: multipart/mixed; boundary=gc0p4Jq0M2Yt08j34c0p
> instead of
>   Content-Type: multipart/mixed; boundary="gc0p4Jq0M2Yt08j34c0p"
>
> Both are actually perfectly legitimate, even though the latter is
> considered safer.

Yes, you are right.

> Would you mind checking the proposed patch and applying if it looks OK?

I'll try to apply and check it. Thank you.
-- 
NOKUBI Takatsugu
E-mail: knok@daionet.gr.jp
        knok@namazu.org / knok@debian.org
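
Editorial note on the proposed pattern: a lazy group such as (.*?) that is the last element of a regular expression is satisfied by matching zero characters, so /boundary=(.*?)/ as written would likely capture an empty string rather than the boundary value. The sketch below illustrates this, along with one possible pattern that accepts both the quoted and the bare form. It is written in Python rather than Perl only so that it runs standalone; lazy quantifiers behave the same way in both engines for these patterns. The names PATCHED, TOLERANT and extract_boundary are hypothetical and do not come from mailnews.pl, and the bare-token character class is a simplification of the RFC 2045 token grammar, not a full implementation of it.

```python
import re

# Pattern as written in the proposed patch: the lazy group is the last
# element of the regex, so it is satisfied by matching zero characters.
PATCHED = re.compile(r'boundary=(.*?)', re.IGNORECASE)

# Hypothetical alternative (not from the thread): accept either a quoted
# value, or a bare value that ends at whitespace, ';' or '"'.
TOLERANT = re.compile(r'boundary=(?:"([^"]*)"|([^\s;"]+))', re.IGNORECASE)


def extract_boundary(head):
    """Return the boundary parameter found in a header block, or None."""
    m = TOLERANT.search(head)
    if m is None:
        return None
    # Group 1 is set for the quoted form, group 2 for the bare form.
    return m.group(1) if m.group(1) is not None else m.group(2)


quoted = 'Content-Type: multipart/mixed; boundary="gc0p4Jq0M2Yt08j34c0p"'
bare = 'Content-Type: multipart/mixed; boundary=gc0p4Jq0M2Yt08j34c0p'

# The patched pattern matches, but its capture group is empty.
lazy_capture = PATCHED.search(bare).group(1)
print(repr(lazy_capture))

# The tolerant pattern yields the same boundary for both header styles.
print(extract_boundary(quoted))
print(extract_boundary(bare))
```

If this reading is right, an equivalent change on the Perl side would be to anchor or bound the capture (for example by stopping at whitespace or a semicolon) instead of ending the pattern with a lazy group; whether that matches what was eventually committed to Namazu is not stated in this thread.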