Improve email address parsing #14

dracos · 2016-05-13T21:32:19Z

This PR:

Adds a variable to allow non-ASCII email address parsing (RFC5335/6532), fixing "Doesn't parse non-ASCII in addresses" #12;
Updates the regular expressions to be more in line with RFC5322;
Tries to prevent backtracking regular expression explosions, fixing "parse method hangs indefinitley" #10;
Switches to regex recursion for comment nesting, fixing "DoS attack through Email-Address perl module related to nested comments" #11;
Fixes comments being stripped in a quoted string;
Fixes multiple quoted strings in the display name.

Sorry, I realise this covers a few things that could perhaps be separate PRs, but the changes to regexes were all overlapping and I didn't think it'd work well to split them out.

The first three commits I think should still be fine in 5.8, but the others need 5.10 due to the use of regex recursion and named references. The tests all pass; I'm not sure if that's enough to be sure these changes don't affect any parsing or have any backward incompatibilities, but hopefully this is along the right lines for improving this module.
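For readers unfamiliar with the 5.10 features mentioned above, here is a minimal sketch of matching nested comments with a recursive named group. The pattern and the is_comment helper are illustrative assumptions, not the regexes used in this PR.

```perl
use strict;
use warnings;
use 5.010;    # named groups, recursion and possessive quantifiers need 5.10

# Hypothetical pattern: a comment is a parenthesised run of ordinary
# characters, backslash escapes, or further nested comments.
my $comment = qr/
    (?<comment>
        \(
        (?: [^()\\]++      # ordinary text, possessive so it is never retried
          | \\.            # a backslash followed by any one character
          | (?&comment)    # recurse into a nested comment
        )*+
        \)
    )
/xs;

# Hypothetical helper: true if the whole string is one balanced comment.
sub is_comment {
    my ($text) = @_;
    return $text =~ /\A$comment\z/ ? 1 : 0;
}
```

With these definitions, is_comment('(a (b) c)') is true, while the unbalanced '(a (b c)' is rejected. The possessive quantifiers are one way to keep a failed match from retrying every split of the text.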

Keep the default behaviour of previous versions, add a UNICODE variable to switch it on. This is from RFC 5335 / 6532, "Internationalized Email Headers". Fixes Perl-Email-Project#12.

FWS is still brought in as \s+ for simplicity. The obs_phrase test in comment passes fine, so remove that code; space backtracking problems will be fixed in the next commit. This commit actually fixes the original example in Perl-Email-Project#10 as $comment is much simplified. Update test to pass; test is actually incorrect in that the "comment" should not be removed (RT#80665).

COLLAPSE_SPACES is no longer necessary. Fixes the second example given in Perl-Email-Project#10.

Due to use of recursive regex and named backpatterns, this can not support perl 5.8. Fixes Perl-Email-Project#11.

Fixes https://rt.cpan.org/Public/Bug/Display.html?id=80665

Tidy up handling of phrase (display name) to be consistent throughout; a phrase will be treated as is unless it starts and ends with double quote marks, in which case it will be treated as a quoted string, unquoted and unescaped. Fixes comment in Perl-Email-Project#13.
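The phrase rule described in that commit message can be sketched roughly as follows. The display_phrase helper is a hypothetical name used only for illustration, and it applies the rule literally: any phrase that starts and ends with a double quote is treated as one quoted string.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code.
sub display_phrase {
    my ($phrase) = @_;

    # Only a phrase wrapped in double quotes is treated as a quoted string.
    if ($phrase =~ /\A"(.*)"\z/s) {
        my $inner = $1;
        $inner =~ s/\\(.)/$1/gs;    # unescape backslash-escaped characters
        return $inner;
    }

    return $phrase;                 # anything else is returned untouched
}
```

Under this sketch, parentheses inside a quoted phrase survive as ordinary text rather than being pulled out as a comment.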

nwellnhof · 2016-05-14T13:38:45Z

lib/Email/Address.pm

  $comment  = qr/\($ccontent*\)/;
 }
-my $cfws           = qr/$comment|\s+/;
+my $cfws           = qr/(?>$comment|\s+)/;


I think this construct is only supported from 5.10 onwards.

dracos · 2016-05-14

http://perldoc.perl.org/5.8.8/perlre.html#(%3f%3epattern) says it is supported but experimental. I haven't got perl 5.8 installed to test, I'm afraid; later commits in this PR are definitely 5.10 only, but I hoped this one would be okay so that 5.8 could get the space backtracking improvements, if not the comment backtracking.
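As a small illustration of what the (?>...) construct in the diff changes, the sketch below compares an ordinary group with an atomic one. The patterns and the matches helper are made up for this example and are much simpler than the module's $cfws.

```perl
use strict;
use warnings;

# Illustrative patterns only; these are not the module's regexes.
my $plain  = qr/(?:a+)ab/;    # ordinary group: a+ may give back an "a"
my $atomic = qr/(?>a+)ab/;    # atomic group: a+ keeps everything it matched

# Hypothetical helper: 1 if $text matches $re, 0 otherwise.
sub matches {
    my ($re, $text) = @_;
    return $text =~ $re ? 1 : 0;
}
```

Here matches($plain, 'aaab') succeeds because the engine backtracks into a+, while matches($atomic, 'aaab') fails because the atomic group never gives characters back. That refusal to backtrack is what limits the number of ways a run of whitespace or comments can be re-split after a later failure.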

pali · 2018-02-02T14:23:00Z

lib/Email/Address.pm

+      unless ($UNICODE) {
+          next if $user =~ /\P{ASCII}/;
+          next if $host =~ /\P{ASCII}/;
+      }


This is incorrect. RFC 6532 talks about UTF-8, which is a subset of 8-bit sequences, but \P{ASCII} also matches ordinals above 8 bits. UTF-8 != \P{ASCII}, and UTF-8 != $UNICODE. This code does not match the documentation written above, and it does not conform to RFC 6532.

Matching UTF-8 sequences with Perl regexes is not easy, and there is no \P shorthand for it. Even [^\x00-\xFF] is incorrect, as it would not match invalid UTF-8 sequences either.

dracos · 2018-02-02

I agree. I am not saying that with this change the code would validate the data as UTF-8; it would not. But currently it is impossible to use this module with UTF-8 addresses at all, and this change would at least allow you to use them (assuming you check yourself that it is valid UTF-8 data). This option allows you to remove some code that currently exists in the module – as an option for backwards compatibility (I assumed I could not simply remove the two lines, plus #12 said some people may want the current behaviour). Please don't let the perfect be the enemy of the good :)
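To make the distinction in this thread concrete, here is a rough sketch that separates "contains non-ASCII characters" from "is well-formed UTF-8", using the core Encode module. The is_valid_utf8 and has_non_ascii helpers are hypothetical names; they show the kind of check a caller would have to do themselves and are not part of this PR.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: true if $octets is a well-formed UTF-8 byte string.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $ok = eval {
        # FB_CROAK makes decode die on malformed input;
        # LEAVE_SRC stops it from modifying $octets.
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# Hypothetical helper: the same test the diff above applies to $user and $host.
sub has_non_ascii {
    my ($text) = @_;
    return $text =~ /\P{ASCII}/ ? 1 : 0;
}

my $octets = encode('UTF-8', "j\x{fc}rgen");    # U+00FC becomes two octets
```

With these helpers, $octets passes both checks, whereas the single byte "\xFF" is reported as non-ASCII by has_non_ascii but rejected by is_valid_utf8. A character such as "\x{263A}" is also non-ASCII even though its ordinal does not fit in 8 bits.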

pali · 2018-02-02T14:28:57Z

t/tests.t

    [
      [
-        'Greg Norris',
+        'Greg Norris ',


This change is incorrect. The string "Greg Norris (humble visionary genius)" <nextrightmove-- ATAT --bang.example.net> (after replacing -- ATAT -- with @) must be parsed as:

phrase: Greg Norris (humble visionary genius)

mailbox: nextrightmove

domain: bang.example.net

You are missing the parenthesised part after the space from the phrase. Inside a quoted string, the content of parentheses is not a comment, which you probably missed.

Ah, this is fixed in the next commit. In this commit it is just test noise.

dracos · 2018-02-02

Yes, I tried to explain that in the commit message, sorry if it wasn't clear enough.

dracos added 7 commits May 13, 2016 20:36

Allow non-ASCII characters in email addresses.

d77b859

Keep the default behaviour of previous versions, add a UNICODE variable to switch it on. This is from RFC 5335 / 6532, "Internationalized Email Headers". Fixes Perl-Email-Project#12.

Try and prevent backtracking regex explosions.

b142fb6

COLLAPSE_SPACES is no longer necessary. Fixes the second example given in Perl-Email-Project#10.

Use regex recursion for comment nesting.

9066167

Due to use of recursive regex and named backpatterns, this can not support perl 5.8. Fixes Perl-Email-Project#11.

Don't extract "comments" from in quoted strings.

7205fc9

Fixes https://rt.cpan.org/Public/Bug/Display.html?id=80665

Fetch phrase/user/host from main regex.

c61bb28

dracos mentioned this pull request May 14, 2016

Several parsing and formatting bugs #13

Open

nwellnhof reviewed May 14, 2016

dracos mentioned this pull request Sep 23, 2016

Since 1.197, with Net::DNS installed, mx no longer passes A records Perl-Email-Project/Email-Valid#30

Closed

pali reviewed Feb 2, 2018
