new tool: analyze and split strings on character-class changes #13

@roycewilliams

Description

To help extract likely base words, it would be very useful to have a tool that splits strings wherever the character class changes (e.g. between letters, digits, and symbols). I've been calling the resulting substrings 'tokens', but there's probably a better word. :)

An optional flag to consider changes in case to be significant could be useful.

For example, this list:

Hello123
PaSsWoRd$
hashes4evar

... might produce the following output, if case were treated as a character-class change:

Hello
123
Pa
Ss
Wo
Rd
$
hashes
4
evar

... and might produce this output, if case were not treated as a character-class change:

Hello
123
PaSsWoRd
$
hashes
4
evar
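
To make the two modes concrete, here is a minimal Python sketch. The names `char_class` and `split_on_class` and the `case_sensitive` parameter are hypothetical, not part of any existing tool. It also assumes, based only on the example above (where `Hello` stays whole), that in case-sensitive mode a split happens when a lower-case letter is followed by an upper-case one, and not the other way around.

```python
def char_class(ch):
    """Coarse character class: letters, digits, or everything else."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    return "other"


def split_on_class(s, case_sensitive=False):
    """Split s into tokens wherever the character class changes.

    Assumption: with case_sensitive=True, a lower-case letter followed
    by an upper-case letter also starts a new token (so "Hello" stays
    whole, but "PaSs" becomes "Pa", "Ss").
    """
    tokens = []
    start = 0
    for i in range(1, len(s)):
        prev, cur = s[i - 1], s[i]
        boundary = char_class(prev) != char_class(cur)
        if case_sensitive and prev.islower() and cur.isupper():
            boundary = True
        if boundary:
            tokens.append(s[start:i])
            start = i
    if s:
        tokens.append(s[start:])
    return tokens


for word in ["Hello123", "PaSsWoRd$", "hashes4evar"]:
    print(split_on_class(word, case_sensitive=True))
```

Other boundary rules (for example, also splitting on upper-to-lower transitions) would be an easy change to the `boundary` test; which rule is most useful is open.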

I'd argue that optionally normalizing the strings on the fly would also be useful: emit a lower-cased copy of each word alongside the original, so that the output might look like the list below. This somewhat artificially inflates the count of the lower-case version of the word, but since the lower-case form is likely the most "basic" / "proto" version of a given base word, it could be argued that this is a feature, not a bug :)

Hello
hello
123
Pa
Ss
Wo
Rd
PaSsWoRd
password
$
hashes
4
evar
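
One possible reading of that output, sketched in Python: for each run of same-class characters, emit the case-split pieces (if there is more than one), then the whole run, then its lower-cased form if that differs. The function names are hypothetical, and the `[a-z]`/`[A-Z]` case split is ASCII-only for brevity. The ordering rule is just what reproduces the list above, not a settled design.

```python
import re
from itertools import groupby


def classify(ch):
    """Coarse character class used to find token boundaries."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    return "other"


def tokens_with_normalized(s):
    """Emit case-split pieces, the whole token, and a lower-cased copy."""
    out = []
    for cls, group in groupby(s, key=classify):
        token = "".join(group)
        if cls == "alpha":
            # Zero-width split between a lower-case and an upper-case
            # ASCII letter (needs Python 3.7+ for empty-match splits).
            parts = re.split(r"(?<=[a-z])(?=[A-Z])", token)
            if len(parts) > 1:
                out.extend(parts)
        out.append(token)
        lowered = token.lower()
        if lowered != token:
            out.append(lowered)
    return out


for word in ["Hello123", "PaSsWoRd$", "hashes4evar"]:
    print(tokens_with_normalized(word))
```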

Since a common use case for this is obtaining frequency counts, an optional flag to accumulate frequency counts in the same pass would be ideal. It should stay optional, though, so that larger data sets can still be processed.
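
A rough sketch of how the two modes could coexist, assuming Python and a deliberately simplified ASCII-only tokenizer: a generator streams tokens with constant memory, and counting is an optional layer on top. `iter_tokens` and `count_tokens` are hypothetical names.

```python
import re
from collections import Counter

# Simplified ASCII-only tokenizer: runs of letters, digits, or anything else.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]+")


def iter_tokens(lines):
    """Yield tokens one at a time, so nothing is accumulated in memory."""
    for line in lines:
        yield from TOKEN_RE.findall(line.rstrip("\r\n"))


def count_tokens(lines):
    """Optional counting mode: memory grows with the number of distinct tokens."""
    return Counter(iter_tokens(lines))


sample = ["Hello123\n", "hello123\n", "Hello!\n"]
for token, n in count_tokens(sample).most_common():
    print(n, token)
```

Without the counting layer, the streamed tokens could instead be piped to something like `sort | uniq -c`, which keeps the tool's own memory use flat for very large inputs.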

Either way, an implementation that is very efficient in terms of both memory and speed would be highly useful.

How to handle the long tail of non-ASCII / non-Unicode strings is up for discussion.
