-
Notifications
You must be signed in to change notification settings - Fork 38
Fix 128bit SimHash, and add 256bit and 512bit SimHash #5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
add 256 and 512 bit SimHash
tgalopin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @nicolaichuk,
Thanks for your PR!
I'm not sure what you want to fix in 128 bits version and why you need all the modifications you did. Unfortunately, I can't accept such a PR without some explanations about your work and fixes on the coding style and BC breaks.
| * @param Fingerprint $fp1 | ||
| * @param Fingerprint $fp2 | ||
| * @return int | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure this method is slower than the current one, why do you need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks :) .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't use scalar type hint, we support php 5.3+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove scalar type hint
| $this->deviation = $deviation; | ||
| } | ||
|
|
||
| /** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Count differences between fingerprints
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
lib/Tga/SimHash/Fingerprint.php
Outdated
| * @var float | ||
| */ | ||
| protected $decimalValue; | ||
| protected $value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a BC break without any added value, could you revert this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"decimalValue" save value as number.
"value" save value as string.
These are two different variables, so they are called differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you make this property private as well? I don't think it should be exposed anyway.
| public function tokenize($element) | ||
| { | ||
| return str_pad(base_convert(md5($element), 16, 2), 128, '0', STR_PAD_LEFT); | ||
| $hash = md5($element); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need to split this in parts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For more comfort debugging
| class String128Tokenizer implements TokenizerInterface | ||
| { | ||
|
|
||
| protected static $search = array('0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't add protected variables like this without a specific need as they will need to be handled by BC promise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
| { | ||
| return $size === 256; | ||
| } | ||
| } No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add an ending line to this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
| { | ||
| return $size === 512; | ||
| } | ||
| } No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add an ending line to this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
* add an ending line to this file
first: second: |
tgalopin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work, I understand it now :) . A few comments and we should be okay :) !
| * @param Fingerprint $fp1 | ||
| * @param Fingerprint $fp2 | ||
| * @return int | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks :) .
lib/Tga/SimHash/Fingerprint.php
Outdated
| * @var float | ||
| */ | ||
| protected $decimalValue; | ||
| protected $value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you make this property private as well? I don't think it should be exposed anyway.
| */ | ||
| class String128Tokenizer implements TokenizerInterface | ||
| { | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this extra line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
| */ | ||
| class String256Tokenizer implements TokenizerInterface | ||
| { | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this extra line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
| */ | ||
| class String512Tokenizer implements TokenizerInterface | ||
| { | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this extra line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
|
@tgalopin |
tgalopin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Travis is failing due to scalar type hints.
| return str_pad(base_convert(md5($element), 16, 2), 128, '0', STR_PAD_LEFT); | ||
| $hash = md5($element); | ||
| $hash = str_replace(self::$search, self::$replace, $hash); | ||
| $hash = str_pad($hash, 512, '0', STR_PAD_LEFT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a typo here. Shouldn't you write:
$hash = str_pad($hash, 128, '0', STR_PAD_LEFT);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bbalet, fixed.
|
Hi guys, I know what is to maintain an OSS project, but this PR looks good to me except a typo and the scalar type hint (except if you want to stop to maintain compatibility with older projects). It is a good lib that helped me in the past, but now I am stuck with the error in 128 hashes. How can I help? |
|
Hello @bbalet ! The issue is that as @nicolaichuk didn't gave me access to code modification in this PR, I can't update his code, and using it without crediting him does not feel right in open-source :) . If you want to open a new PR based on his code, fixing the issues and adding a note in your commit message about @nicolaichuk, please don't hesitate to do so and I will review it happily :) ! |
Option "Allow edits from maintainers" is selected in this PR. |
by code review @bbalet tgalopin#5 (comment)
@tgalopin, hi.
Please review and merge to upstream.