-
Notifications
You must be signed in to change notification settings - Fork 12
Improve email address parsing #14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Keep the default behaviour of previous versions, add a UNICODE variable to switch it on. This is from RFC 5335 / 6532, "Internationalized Email Headers". Fixes Perl-Email-Project#12.
FWS is still brought in as \s+ for simplicity. The obs_phrase test in comment passes fine, so remove that code; space backtracking problems will be fixed in the next commit. This commit actually fixes the original example in Perl-Email-Project#10 as $comment is much simplified. Update test to pass; test is actually incorrect in that the "comment" should not be removed (RT#80665).
COLLAPSE_SPACES is no longer necessary. Fixes the second example given in Perl-Email-Project#10.
Due to use of recursive regex and named backpatterns, this can not support perl 5.8. Fixes Perl-Email-Project#11.
Tidy up handling of phrase (display name) to be consistent throughout; a phrase will be treated as is unless it starts and ends with double quote marks, in which case it will be treated as a quoted string, unquoted and unescaped. Fixes comment in Perl-Email-Project#13.
| $comment = qr/\($ccontent*\)/; | ||
| } | ||
| my $cfws = qr/$comment|\s+/; | ||
| my $cfws = qr/(?>$comment|\s+)/; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this construct is only supported from 5.10 onwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
http://perldoc.perl.org/5.8.8/perlre.html#(%3f%3epattern) says it is supported but experimental. I haven't got perl 5.8 installed to test, I'm afraid; later commits in this PR are definitely 5.10 only, but I hoped this one would be okay so that 5.8 could get the space backtracking improvements, if not the comment backtracking.
| unless ($UNICODE) { | ||
| next if $user =~ /\P{ASCII}/; | ||
| next if $host =~ /\P{ASCII}/; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect, RFC 6532 talks about UTF-8 which is subspace of 8bit sequences. But \P{ASCII} matches also ordinals above 8bits. UTF-8 != \P{ASCII} and also UTF-8 != $UNICODE. This code absolutely does not match documentation which is written above and also does not confirm to the RFC 6532.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Matching UTF-8 sequences by perl regexes is not easy and there is no \P abbrev for it. Even [^\x00-\xFF] is incorrect as it would not match invalid UTF-8 sequences too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I am not saying that with this change, the code would validate the data as UTF-8, it would not. But currently, it is impossible to use this module with UTF-8 addresses at all, and this change would at least allow you to use them (assuming you would do such checking that it is valid UTF-8 data yourself). This option allows you to remove some code that currently exists in the module – as an option for backwards compatibility (I assumed I could not simply remove the two lines, plus #12 said some people may want the current behaviour). Please don't let the perfect be the enemy of the good :)
t/tests.t
Outdated
| [ | ||
| [ | ||
| 'Greg Norris', | ||
| 'Greg Norris ', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect change. String "Greg Norris (humble visionary genius)" <nextrightmove-- ATAT --bang.example.net> (after replacing -- ATAT -- with @) must be parsed as:
- phrase:
Greg Norris (humble visionary genius) - mailbox:
nextrightmove - domain:
bang.example.net
You are missing part after space in parenthesis in phrase part. In quoted string content of parenthesis is not comment which you probably missed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, this is fixed in next commit. In this commit is just test noise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I tried to explain that in the commit message, sorry if it wasn't clear enough.
This PR:
Sorry, I realise this covers a few things that could perhaps be separate PRs, but the changes to regexes were all overlapping and I didn't think it'd work well to split them out.
The first three commits I think should still be fine in 5.8, but the others need 5.10 due to the use of regex recursion and named references. The tests all pass; I'm not sure if that's enough to be sure these changes don't affect any parsing or have any backward incompatibilities, but hopefully this is along the right lines for improving this module.