Skip to content

Conversation

@dracos
Copy link

@dracos dracos commented May 13, 2016

This PR:

Sorry, I realise this covers a few things that could perhaps be separate PRs, but the changes to regexes were all overlapping and I didn't think it'd work well to split them out.

The first three commits I think should still be fine in 5.8, but the others need 5.10 due to the use of regex recursion and named references. The tests all pass; I'm not sure if that's enough to be sure these changes don't affect any parsing or have any backward incompatibilities, but hopefully this is along the right lines for improving this module.

dracos added 7 commits May 13, 2016 20:36
Keep the default behaviour of previous versions, add a UNICODE variable
to switch it on.

This is from RFC 5335 / 6532, "Internationalized Email Headers".

Fixes Perl-Email-Project#12.
FWS is still brought in as \s+ for simplicity. The obs_phrase test in
comment passes fine, so remove that code; space backtracking problems
will be fixed in the next commit.

This commit actually fixes the original example in Perl-Email-Project#10 as $comment is
much simplified.

Update test to pass; test is actually incorrect in that the "comment"
should not be removed (RT#80665).
COLLAPSE_SPACES is no longer necessary.

Fixes the second example given in Perl-Email-Project#10.
Due to use of recursive regex and named backpatterns, this can not
support perl 5.8.

Fixes Perl-Email-Project#11.
Tidy up handling of phrase (display name) to be consistent throughout; a
phrase will be treated as is unless it starts and ends with double quote
marks, in which case it will be treated as a quoted string, unquoted and
unescaped.

Fixes comment in Perl-Email-Project#13.
$comment = qr/\($ccontent*\)/;
}
my $cfws = qr/$comment|\s+/;
my $cfws = qr/(?>$comment|\s+)/;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this construct is only supported from 5.10 onwards.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http://perldoc.perl.org/5.8.8/perlre.html#(%3f%3epattern) says it is supported but experimental. I haven't got perl 5.8 installed to test, I'm afraid; later commits in this PR are definitely 5.10 only, but I hoped this one would be okay so that 5.8 could get the space backtracking improvements, if not the comment backtracking.

unless ($UNICODE) {
next if $user =~ /\P{ASCII}/;
next if $host =~ /\P{ASCII}/;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect, RFC 6532 talks about UTF-8 which is subspace of 8bit sequences. But \P{ASCII} matches also ordinals above 8bits. UTF-8 != \P{ASCII} and also UTF-8 != $UNICODE. This code absolutely does not match documentation which is written above and also does not confirm to the RFC 6532.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matching UTF-8 sequences by perl regexes is not easy and there is no \P abbrev for it. Even [^\x00-\xFF] is incorrect as it would not match invalid UTF-8 sequences too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I am not saying that with this change, the code would validate the data as UTF-8, it would not. But currently, it is impossible to use this module with UTF-8 addresses at all, and this change would at least allow you to use them (assuming you would do such checking that it is valid UTF-8 data yourself). This option allows you to remove some code that currently exists in the module – as an option for backwards compatibility (I assumed I could not simply remove the two lines, plus #12 said some people may want the current behaviour). Please don't let the perfect be the enemy of the good :)

t/tests.t Outdated
[
[
'Greg Norris',
'Greg Norris ',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect change. String "Greg Norris (humble visionary genius)" <nextrightmove-- ATAT --bang.example.net> (after replacing -- ATAT -- with @) must be parsed as:

  • phrase: Greg Norris (humble visionary genius)
  • mailbox: nextrightmove
  • domain: bang.example.net

You are missing part after space in parenthesis in phrase part. In quoted string content of parenthesis is not comment which you probably missed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is fixed in next commit. In this commit is just test noise.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I tried to explain that in the commit message, sorry if it wasn't clear enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Several parsing and formatting bugs

3 participants