new tool: analyze and split strings on character-class changes

As an aid to extracting likely base words, it would be very useful to split strings wherever the character class changes (letters, digits, symbols). I've been calling the resulting strings 'tokens', but I think there's probably a better word. :)

An optional flag that treats changes in letter case as significant could also be useful.

For example, this list:

Hello123
PaSsWoRd$
hashes4evar

... might produce the following output, if case were treated as a character-class change:

Hello
123
Pa
Ss
Wo
Rd
$
hashes
4
evar

... and might produce this output, if case were *not* treated as a character-class change:

Hello
123
PaSsWoRd
$
hashes
4
evar
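
To make the two modes concrete, here is a minimal Python sketch. The names `char_class`, `split_on_class`, and `split_case` are hypothetical, and the case rule is an assumption inferred from the examples above: a boundary is placed only where a lower-case letter is followed by an upper-case one, which keeps "Hello" whole while splitting "PaSsWoRd" into four pieces. Everything that is neither a letter nor a digit is lumped into a single "other" class here, which may well be too coarse.

```python
def char_class(ch):
    """Coarse class of a single character: letter, digit, or anything else."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    return "other"


def split_on_class(s, split_case=False):
    """Split s into runs of characters that share a character class.

    With split_case=True, a lower-case letter followed by an upper-case
    letter also starts a new token (assumed rule, see the lead-in).
    """
    tokens = []
    prev_cls = None
    prev_ch = ""
    for ch in s:
        cls = char_class(ch)
        boundary = cls != prev_cls
        if split_case and not boundary and cls == "alpha":
            boundary = prev_ch.islower() and ch.isupper()
        if boundary:
            tokens.append(ch)   # start a new token
        else:
            tokens[-1] += ch    # extend the current token
        prev_cls, prev_ch = cls, ch
    return tokens
```

Calling `split_on_class("PaSsWoRd$", split_case=True)` gives the case-sensitive split shown earlier, and leaving `split_case` at its default gives the case-insensitive one. Other case rules (for example, what to do with a string like "ABCdef") are left open.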

I'd argue that optionally normalizing the strings on the fly would also be useful, so that the case-split pieces, the unsplit word, and its lower-case form are all emitted. This somewhat artificially inflates the significance of the lower-case version of the word, but since the lower-case form is likely the most "basic" / "proto" version of a given base word, it could be argued that this is a feature, not a bug :) With normalization enabled, the same list might produce this output:

Hello
hello
123
Pa
Ss
Wo
Rd
PaSsWoRd
password
$
hashes
4
evar
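
One possible reading of that output, sketched in Python: for each run of letters, emit the case-split pieces (only if there is more than one), then the unsplit run, then its lower-case form if that differs. The function name `tokens_with_normalized` and both regular expressions are hypothetical, and the sketch is ASCII-only for brevity.

```python
import re

# Runs of ASCII letters (captured), digits, or anything else.
RUNS = re.compile(r"([A-Za-z]+)|[0-9]+|[^A-Za-z0-9]+")
# Zero-width boundary between a lower-case and an upper-case letter.
# Splitting on an empty match needs Python 3.7 or later.
CASE_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")


def tokens_with_normalized(s):
    """Return tokens for s, adding unsplit and lower-cased forms of words."""
    out = []
    for m in RUNS.finditer(s):
        run = m.group(0)
        if m.group(1) is None:      # digits or symbols: emit unchanged
            out.append(run)
            continue
        pieces = CASE_BOUNDARY.split(run)
        if len(pieces) > 1:         # e.g. Pa, Ss, Wo, Rd
            out.extend(pieces)
        out.append(run)             # the unsplit word
        if run != run.lower():      # add the lower-case form once
            out.append(run.lower())
    return out
```

Run over the three example strings in order, this reproduces the list above. Whether the individual pieces ("Pa", "Ss", ...) should also be lower-cased is a separate design question that the example output does not settle.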

Since a common use case for this is obtaining frequency counts, an optional flag to accumulate frequency counts in the same pass would be ideal. Counting should stay optional, though, so that larger data sets that would not fit in memory can still be processed.

Either way, an implementation that is very efficient in both memory and speed would be highly useful.
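
As a rough illustration of keeping counting optional, the Python sketch below separates a streaming token generator from an in-memory counter. The names `iter_tokens` and `count_tokens` are hypothetical, the splitter is a simple ASCII-only regular expression without the case option, and nothing here is tuned for speed.

```python
import re
from collections import Counter

# Runs of ASCII letters, digits, or anything else (no case splitting).
RUNS = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]+")


def iter_tokens(lines):
    """Yield tokens one at a time, so memory use stays flat for large inputs."""
    for line in lines:
        yield from RUNS.findall(line.rstrip("\n"))


def count_tokens(lines):
    """Accumulate frequency counts in memory (one entry per distinct token)."""
    return Counter(iter_tokens(lines))
```

With counting switched off, the output of `iter_tokens` could simply be written out line by line and counted externally, for example with `sort | uniq -c`. With counting on, `count_tokens(lines).most_common(20)` would list the twenty most frequent tokens.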

How to handle the long tail of non-ASCII / non-Unicode strings is up for discussion.
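
As one option to seed that discussion (an assumption, not a proposal for the final behavior): classify characters by the first letter of their Unicode general category, and decode input as UTF-8 with Python's `surrogateescape` error handler so that undecodable bytes survive as their own tokens and can be written back out unchanged. The names `major_category` and `split_bytes` are hypothetical.

```python
import unicodedata
from itertools import groupby


def major_category(ch):
    """First letter of the Unicode general category: L, N, P, S, Z, C, M."""
    return unicodedata.category(ch)[0]


def split_bytes(raw):
    """Split a byte string on changes of major Unicode category.

    Bytes that are not valid UTF-8 are mapped to lone surrogates
    (category Cs), so they form tokens of their own.
    """
    text = raw.decode("utf-8", errors="surrogateescape")
    return ["".join(group) for _, group in groupby(text, key=major_category)]
```

A token can be turned back into its original bytes with `token.encode("utf-8", errors="surrogateescape")`. One known weakness of this approach is that combining marks (category M) would be split away from the letters they modify.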
