Description
As an aid to extracting likely base words, it would be very useful to split strings on character class. I've been calling the resulting strings 'tokens', but I think there's probably a better word. :)
An optional flag that treats a change in case as a character-class change could also be useful.
For example, this list:
Hello123
PaSsWoRd$
hashes4evar
... might produce the following output, if case were treated as a character-class change:
Hello
123
Pa
Ss
Wo
Rd
$
hashes
4
evar
... and might produce this output, if case were not treated as a character-class change:
Hello
123
PaSsWoRd
$
hashes
4
evar
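To make the two behaviours above concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration, not an existing tool: the names `char_class` and `split_tokens`, the three classes (letters, digits, everything else), and the rule that only a lower-to-upper transition counts as a case boundary (which is what keeps `Hello` in one piece while `PaSsWoRd` splits into four).

```python
def char_class(ch):
    """Coarse character class (assumed: letters, digits, everything else)."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    return "other"


def split_tokens(s, case_sensitive=False):
    """Split s wherever the character class changes.

    With case_sensitive=True, a lower-case letter followed by an
    upper-case letter is also treated as a boundary.
    """
    tokens = []
    start = 0
    for i in range(1, len(s)):
        prev, cur = s[i - 1], s[i]
        boundary = char_class(prev) != char_class(cur)
        if case_sensitive and not boundary:
            boundary = prev.islower() and cur.isupper()
        if boundary:
            tokens.append(s[start:i])
            start = i
    if s:
        tokens.append(s[start:])
    return tokens


print(split_tokens("PaSsWoRd$", case_sensitive=True))  # ['Pa', 'Ss', 'Wo', 'Rd', '$']
print(split_tokens("PaSsWoRd$"))                       # ['PaSsWoRd', '$']
```

A real implementation would probably want finer classes (for example, separating whitespace from punctuation), but the single pass over each string is the important part.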
I'd argue that optionally normalizing the strings on the fly would also be useful: alongside each token, emit the unsplit run and its lower-case form whenever they differ, which might produce the output below. This somewhat artificially inflates the significance of the lower-case version of the word, but since the lower-case form is likely the most "basic" / "proto" version of a given base word, it could be argued that this is a feature, not a bug :)
Hello
hello
123
Pa
Ss
Wo
Rd
PaSsWoRd
password
$
hashes
4
evar
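One possible reading of that normalized output, as a rough Python sketch: for each run of same-class characters, emit the case-split pieces (if any), then the whole run, then its lower-case form if that differs. The names `RUNS`, `CASE_BOUNDARY`, and `expand` are hypothetical, the case boundary is ASCII-only, and the letter/digit/other classes are approximated with regular expressions.

```python
import re

# Runs of letters, digits, or anything else (approximate Unicode classes).
RUNS = re.compile(r"[^\W\d_]+|\d+|[\W_]+")
# Zero-width boundary between an ASCII lower-case and upper-case letter.
# Splitting on an empty match requires Python 3.7 or later.
CASE_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")


def expand(s):
    """Return case-split pieces, the whole run, and its lower-case form."""
    out = []
    for run in RUNS.findall(s):
        parts = CASE_BOUNDARY.split(run)
        if len(parts) > 1:
            out.extend(parts)      # e.g. Pa, Ss, Wo, Rd
        out.append(run)            # e.g. PaSsWoRd
        lowered = run.lower()
        if lowered != run:
            out.append(lowered)    # e.g. password
    return out


for word in ["Hello123", "PaSsWoRd$", "hashes4evar"]:
    print("\n".join(expand(word)))
```

Run on the three sample strings, this prints the same thirteen lines as the list above, in the same order.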
Since a common use case for this is to obtain frequency counts, an optional flag to accumulate those counts in the same pass would be ideal. Counting should stay optional, though, so that larger data sets can still be processed without holding every distinct token in memory.
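A rough sketch of how the optional counting could look in Python, assuming a hypothetical `process` function with a `count` flag: without the flag it returns a lazy generator, so nothing is accumulated; with it, the tokens are fed into a `collections.Counter`. The regex-based tokenizer is only a stand-in.

```python
import re
from collections import Counter

# Stand-in tokenizer: runs of letters, digits, or anything else.
RUNS = re.compile(r"[^\W\d_]+|\d+|[\W_]+")


def iter_tokens(lines):
    """Lazily yield tokens from an iterable of lines."""
    for line in lines:
        yield from RUNS.findall(line.rstrip("\r\n"))


def process(lines, count=False):
    """Return a Counter if count is set, otherwise a token generator."""
    tokens = iter_tokens(lines)
    return Counter(tokens) if count else tokens


counts = process(["Hello123", "hello123", "hashes4evar"], count=True)
print(counts.most_common(1))  # [('123', 2)]
```

In streaming mode the output can be piped to external tools such as `sort | uniq -c`, which cope with data sets too large to count in memory.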
Either way, an implementation that is very efficient in both memory and speed would be highly useful.
How to handle the long tail of non-ASCII / non-Unicode strings is up for discussion.
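As one option for that discussion (not a recommendation), here is a Python sketch that assumes the input is bytes that are mostly UTF-8. Decoding with the `surrogateescape` error handler maps each undecodable byte to a lone surrogate, which falls into the "other" class, and encoding with the same handler restores the original bytes exactly. The name `split_bytes` is hypothetical.

```python
import re

# Runs of letters, digits, or anything else (approximate Unicode classes).
RUNS = re.compile(r"[^\W\d_]+|\d+|[\W_]+")


def split_bytes(data):
    """Tokenize raw bytes; invalid UTF-8 bytes survive the round trip."""
    text = data.decode("utf-8", errors="surrogateescape")
    return [tok.encode("utf-8", errors="surrogateescape")
            for tok in RUNS.findall(text)]


# 0xE9 is Latin-1 "é", which is not valid UTF-8 on its own.
print(split_bytes(b"caf\xe9123"))  # [b'caf', b'\xe9', b'123']
```

The trade-off is that a legacy-encoded letter such as the `\xe9` above is split off as a symbol instead of staying attached to its word. Guessing the encoding per line would avoid that, at a cost in speed.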