80_Excerpts

Harper Strickland, 2025

Description:

This repository contains a speech corpus of 320 recordings (80 transcripts x 4 speakers). The 80 selections are taken from 38 public domain source texts, all with full-chapter audio recordings available from LibriVox.org. All LibriVox audio recordings are in the public domain.

30 excerpts are from recordings available in the LJ Speech Dataset, and 30 are from recordings represented in the LJ2 Corpus. For each set of 30, 20 are from source texts that appear in only one of these corpora, and 10 are from sources with chapters existing in both. Both LJ Speech and the LJ2 Corpus contain only nonfiction texts. The final 20 excerpts are from fiction source chapters available on LibriVox from the same reader.

Each excerpt is read by four adult American readers: two women, one man, and one nonbinary reader. 'LJ' (Linda Johnson) is the voice of LJ Speech and the LJ2 Corpus.

Summary Statistics:

Transcript Details:

source subset	excerpts	exactly 1 sentence	< 1 sentence	> 1 sentence	mean words	min words	max words
LJ Speech unique	20	14	5	1	18.5	10	30
LJ Speech shared	10	8	2	0	19.9	14	27
LJ2 Corpus unique	20	12	6	2	18.3	6	28
LJ2 Corpus shared	10	8	2	0	16.9	5	25
LibriVox fiction	20	8	9	3	18.6	3	31
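
The rows above can be cross-checked against the corpus size given in the description. The sketch below copies the excerpt counts and mean word counts from the table, assumes four readers per excerpt, and derives the totals. Because the per-subset means are rounded to one decimal place, the overall mean it computes is approximate.

```python
# (subset, excerpts, mean words) copied from the Transcript Details table.
SUBSETS = [
    ("LJ Speech unique", 20, 18.5),
    ("LJ Speech shared", 10, 19.9),
    ("LJ2 Corpus unique", 20, 18.3),
    ("LJ2 Corpus shared", 10, 16.9),
    ("LibriVox fiction", 20, 18.6),
]
READERS = 4  # LJ, MB, WS, HS

total_excerpts = sum(count for _, count, _ in SUBSETS)
total_recordings = total_excerpts * READERS

# Mean words per excerpt, weighting each subset by its excerpt count.
weighted_mean_words = (
    sum(count * mean for _, count, mean in SUBSETS) / total_excerpts
)

print(total_excerpts, total_recordings, round(weighted_mean_words, 2))
```

The totals come out to 80 excerpts and 320 recordings, matching the description, with an overall mean of roughly 18.45 words per excerpt.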

Voice Characteristics:

reader	gender	mean duration	words per minute
LJ	woman	6.35 sec	160
MB	woman	6.57 sec	159
WS	man	5.13 sec	203
HS	nonbinary	5.43 sec	184
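
The table does not say how words per minute was derived. As a rough sketch, a speaking rate can be computed per recording and averaged, or computed once from total words and total duration; the two generally give different numbers. The function names and the sample values below are illustrative and are not taken from the corpus.

```python
def words_per_minute(word_count, duration_sec):
    """Speaking rate for a single recording."""
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    return word_count / duration_sec * 60.0


def mean_of_rates(recordings):
    """Average the per-recording rates (each recording counts equally)."""
    rates = [words_per_minute(words, secs) for words, secs in recordings]
    return sum(rates) / len(rates)


def ratio_of_totals(recordings):
    """Total words over total time (longer recordings count more)."""
    total_words = sum(words for words, _ in recordings)
    total_secs = sum(secs for _, secs in recordings)
    return words_per_minute(total_words, total_secs)


# Hypothetical (word count, duration in seconds) pairs, not corpus data.
sample = [(10, 4.0), (30, 9.0)]
print(mean_of_rates(sample))
print(ratio_of_totals(sample))
```

Which definition is appropriate depends on whether short and long excerpts should carry equal weight.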

Files:

metadata_80.csv

For each excerpt: identifying number, transcript, subset, source corpus, LibriVox source id number (foreign key), and wav file duration for each reader.

LV_books_data_80.json

Original source metadata available via LibriVox's API: id number, title, description, language, copyright year, section/chapter count, URLs (text source, RSS, zip file, project, LibriVox), total time (h:m:s and seconds), authors (name and birth/death years).

books_enrichment_80.csv

Additional source metadata: LibriVox id number (primary key), updated title (if incomplete in LibriVox API), LibriVox recordings upload date, count and duration of sections read by Linda Johnson ('LJ').
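
Since metadata_80.csv carries the LibriVox id as a foreign key and books_enrichment_80.csv uses it as a primary key, the two files can be joined on that id. The sketch below uses only the standard library and in-memory stand-ins for the files. The column names (`excerpt_id`, `librivox_id`, `transcript`, `updated_title`) and all row values are placeholders, since the actual headers are not listed here; check the real CSV headers before adapting it.

```python
import csv
import io

# Stand-ins for the two CSV files; headers and values are hypothetical.
METADATA_CSV = """excerpt_id,librivox_id,transcript
1,1001,placeholder transcript one
2,1002,placeholder transcript two
"""
ENRICHMENT_CSV = """librivox_id,updated_title
1001,Placeholder Title A
1002,Placeholder Title B
"""


def attach_source(excerpts, sources, key):
    """Left-join excerpt rows to source rows on `key`.

    Source columns are prefixed with 'source_' to avoid name clashes.
    Excerpts with no matching source row are kept unchanged.
    """
    by_id = {row[key]: row for row in sources}
    joined = []
    for row in excerpts:
        merged = dict(row)
        for column, value in by_id.get(row[key], {}).items():
            if column != key:
                merged[f"source_{column}"] = value
        joined.append(merged)
    return joined


excerpts = list(csv.DictReader(io.StringIO(METADATA_CSV)))
sources = list(csv.DictReader(io.StringIO(ENRICHMENT_CSV)))
joined = attach_source(excerpts, sources, key="librivox_id")
print(joined[0]["source_updated_title"])
```

To run it against the repository files, replace the `io.StringIO(...)` handles with `open("metadata_80.csv", newline="")` and `open("books_enrichment_80.csv", newline="")`, and set `key` to the actual id column name.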

Audio recordings:
- Audio files for LJ, WS, and HS can be downloaded from the /wavs directory, which contains a subdirectory for each reader.
- Access to MB recordings is limited based on intended use; for more information, contact speakingofdata@gmail.com with project details and requirements.
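
Given the one-subdirectory-per-reader layout described above, the durations of downloaded files can be recomputed and compared with the values in metadata_80.csv. This is a minimal sketch that assumes the files are uncompressed PCM WAV (the only kind Python's `wave` module reads) and that they carry a `.wav` extension. The helper names and the local `wavs` path are illustrative.

```python
import wave
from pathlib import Path


def wav_duration_sec(path):
    """Duration of a PCM WAV file: frame count divided by sample rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def durations_by_reader(wavs_root):
    """Map each reader subdirectory name to {file name: duration in seconds}."""
    result = {}
    for reader_dir in sorted(Path(wavs_root).iterdir()):
        if reader_dir.is_dir():
            result[reader_dir.name] = {
                wav_path.name: wav_duration_sec(wav_path)
                for wav_path in sorted(reader_dir.glob("*.wav"))
            }
    return result


# Assumes the repository's wavs directory is in the current working directory.
if Path("wavs").is_dir():
    for reader, files in durations_by_reader("wavs").items():
        if files:
            mean_sec = sum(files.values()) / len(files)
            print(reader, len(files), round(mean_sec, 2))
```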
