Description
We currently lack dask bag examples in this repository. Two come to mind:
- Read JSON data, and do some groupby aggregation with both Bag.groupby and Bag.foldby
- Read text data and do some basic wordcount
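As a rough sketch of what the first example might contrast, the snippet below totals a field per key with both methods. The records and field names are made up for illustration, and the synchronous scheduler is used only to keep the sketch simple.

```python
import dask.bag as db

# Hypothetical records, invented for this sketch.
records = [
    {"name": "Alice", "amount": 100},
    {"name": "Bob", "amount": 200},
    {"name": "Alice", "amount": 300},
    {"name": "Bob", "amount": 400},
    {"name": "Charlie", "amount": 500},
]
bag = db.from_sequence(records, npartitions=2)

# groupby: shuffle full records by key, then reduce each group.
grouped = bag.groupby(lambda r: r["name"]).map(
    lambda kv: (kv[0], sum(r["amount"] for r in kv[1]))
)
totals_groupby = dict(grouped.compute(scheduler="synchronous"))

# foldby: reduce within each partition first, then combine the partial results.
folded = bag.foldby(
    key=lambda r: r["name"],
    binop=lambda total, r: total + r["amount"],
    initial=0,
    combine=lambda a, b: a + b,
    combine_initial=0,
)
totals_foldby = dict(folded.compute(scheduler="synchronous"))

print(totals_groupby)
print(totals_foldby)
```

Both calls should give the same totals; the point of showing them side by side would be that foldby avoids the full shuffle that groupby needs.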
For the JSON data it might make sense to add a dataset generation tool for nested record data, similar to dask.datasets.timeseries, and then use that to write JSON data to disk, similar to how we generate CSV data in http://examples.dask.org/dataframes/01-data-access.html#Create-artificial-dataset.
We would then read the JSON data, and do some minimal processing.
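A minimal sketch of that workflow, assuming a stand-in generator: make_record below is hypothetical (no such dataset tool exists yet), and the record layout is invented. It writes line-delimited JSON to a temporary directory, reads it back, and does some minimal processing.

```python
import json
import os
import random
import tempfile

import dask
import dask.bag as db


def make_record(rng, i):
    """Hypothetical generator of one nested record (stand-in for a future dataset tool)."""
    return {
        "id": i,
        "name": rng.choice(["Alice", "Bob", "Charlie"]),
        "account": {
            "balance": rng.randint(0, 1000),
            "transactions": [rng.randint(-50, 50) for _ in range(3)],
        },
    }


rng = random.Random(0)
records = [make_record(rng, i) for i in range(100)]
out_dir = tempfile.mkdtemp()
pattern = os.path.join(out_dir, "*.json")

with dask.config.set(scheduler="synchronous"):
    # One JSON document per line, one file per partition (0.json, 1.json, ...).
    db.from_sequence(records, npartitions=4).map(json.dumps).to_textfiles(pattern)

    # Read the files back as lines of text and parse each line.
    loaded = db.read_text(pattern).map(json.loads)
    n = loaded.count().compute()
    mean_balance = loaded.pluck("account").pluck("balance").mean().compute()

print(n, mean_balance)
```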
For the text data I wonder if there is an online dataset we can download. I suspect that the complete works of Shakespeare are available somewhere. We might do a simple thing like read, split, frequencies. Or we might do more complex work afterwards by bringing in NLTK, stemming words, removing stopwords, etc.
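The simple read, split, frequencies version could look roughly like this. Two in-memory lines stand in for a downloaded file (a real example would presumably use db.read_text on whatever dataset is chosen), and the tokenize helper is an assumption of this sketch.

```python
import re

import dask.bag as db

# Stand-in for a downloaded text file; a real example would read from disk.
lines = [
    "To be, or not to be, that is the question:",
    "Whether 'tis nobler in the mind to suffer",
]


def tokenize(line):
    """Lowercase a line and split it into words (hypothetical helper)."""
    return re.findall(r"[a-z']+", line.lower())


words = db.from_sequence(lines, npartitions=2).map(tokenize).flatten()

# frequencies() yields (word, count) pairs.
counts = dict(words.frequencies().compute(scheduler="synchronous"))
most_common = max(counts.items(), key=lambda kv: kv[1])

print(most_common)
```

Stemming or stopword removal with NLTK would slot in as extra map or filter steps before frequencies.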