speakingofdata/LJ2_Corpus

LJ2_Corpus

Harper Strickland, 2025

Description:

This repository contains a speech corpus in the format of the popular LJ Speech Dataset. It includes 26,200 audio files, available in 12-, 24-, and 48-hour segments. It can be combined with LJ Speech to triple that dataset's size, compared against LJ Speech, or substituted directly into any model designed to use LJ Speech.

The single speaker is the same as in LJ Speech (Linda Johnson, an adult American woman). All texts are public domain nonfiction with audio recordings available from LibriVox.org. Texts were selected to maximize the variety of subjects and authors, and terms over-represented in the texts were downsampled.

Audio file naming convention:

  • /wavs1 subdirectory: LJ1##-####.wav
  • /wavs2 subdirectory: LJ2##-####.wav
  • /wavs3 subdirectory: LJ3##-####.wav
  • /wavs4 subdirectory: LJ4##-####.wav

The first three digits identify the source text chapter: all files partitioned from the same chapter share them. (Files in the LJ Speech Dataset are named LJ0##-####.wav.)
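
The naming convention above can be unpacked programmatically. The following is a minimal sketch, not part of the repository: the function name `parse_file_id` and the returned dictionary keys are hypothetical, and the pattern is inferred only from the filename templates listed above.

```python
import re

# Pattern inferred from the templates above: "LJ", a three-digit chapter index
# whose first digit also names the wavs subdirectory, a hyphen, and a
# four-digit file number. The ".wav" suffix is optional so that metadata ids
# (which omit it) parse as well.
FILE_ID = re.compile(
    r"^LJ(?P<chapter>(?P<subdir>[0-4])\d{2})-(?P<number>\d{4})(?:\.wav)?$"
)


def parse_file_id(name):
    """Split a file id or filename into chapter index, file number, directory."""
    match = FILE_ID.match(name)
    if match is None:
        raise ValueError(f"not an LJ-style file id: {name!r}")
    subdir = int(match.group("subdir"))
    return {
        "chapter_index": match.group("chapter"),
        "file_number": int(match.group("number")),
        # A leading 0 marks the original LJ Speech Dataset, which is not
        # stored in this repository, so no directory is reported for it.
        "directory": f"wavs{subdir}" if subdir > 0 else None,
    }


print(parse_file_id("LJ205-0013.wav"))
```

The chapter index is kept as a string so that leading digits are preserved when it is compared with the index column of sections_details_LJ2.csv.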

Segments:

  • 12-hour: /wavs1
  • 24-hour: /wavs1 + /wavs2
  • 48-hour: /wavs1 + /wavs2 + /wavs3 + /wavs4
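
Because each segment is a union of subdirectories, collecting its audio files is a matter of listing those directories. This sketch assumes the repository has been cloned locally; `SEGMENT_DIRS` and `segment_wavs` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Subdirectories that make up each segment, as listed above.
SEGMENT_DIRS = {
    "12-hour": ["wavs1"],
    "24-hour": ["wavs1", "wavs2"],
    "48-hour": ["wavs1", "wavs2", "wavs3", "wavs4"],
}


def segment_wavs(root, segment):
    """Return the sorted .wav paths belonging to one segment under `root`."""
    if segment not in SEGMENT_DIRS:
        raise KeyError(f"unknown segment: {segment!r}")
    paths = []
    for name in SEGMENT_DIRS[segment]:
        # Directories are visited in order, and files are sorted within each.
        paths.extend(sorted((Path(root) / name).glob("*.wav")))
    return paths
```

For example, `segment_wavs("LJ2_Corpus", "24-hour")` would return the files under wavs1 followed by those under wavs2, where `"LJ2_Corpus"` stands for wherever the repository was cloned.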

Summary Statistics Compared to LJ Speech:

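| segment | files | duration (h:mm) | mean (sec) | min (sec) | max (sec) | sources | chapters | authors | max source (h:mm) |
|---|---|---|---|---|---|---|---|---|---|
| LJ Speech | 13,100 | 23:55 | 6.57 | 1.11 | 10.10 | 7 | 50 | 7 | 10:34 |
| LJ2 12-hour | 6,550 | 11:57 | 6.58 | 1.13 | 10.10 | 59 | 68 | 59 | 0:47 |
| LJ2 24-hour | 13,100 | 23:55 | 6.58 | 1.11 | 10.10 | 61 | 135 | 74 | 1:01 |
| LJ2 48-hour | 26,200 | 47:51 | 6.58 | 1.11 | 10.10 | 61 | 288 | 82 | 2:53 |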

Files:

  • metadata.csv

    Includes file id (same as audio filename without '.wav') and transcript for each selection, in the same format as LJ Speech.

  • metadata.zip

    Compressed version of metadata.csv.

  • LV_books_data_LJ2.json

    Original source metadata available via LibriVox's API: id number, title, description, language, copyright year, section/chapter count, urls (text source, rss, zip file, project, LibriVox), total time (h:m:s and seconds), authors (name and birth/death years).

  • books_enrichment_LJ2.csv

    Additional source metadata: LibriVox id number (primary key), updated title (if incomplete in LibriVox API), LibriVox recordings upload date, count and duration of sections read by Linda Johnson ('LJ').

  • sections_details_LJ2.csv

    Additional details about each source chapter represented in the corpus: index (first 3 digits in filename), LibriVox id number (foreign key), section number, authors (some texts are compilations with different authors for each chapter), count of files in corpus, total duration of corpus audio files (minutes), count of files included in '80 Excerpts' Corpus.
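
As a rough illustration of how metadata.csv pairs with the audio files, the sketch below reads it the way LJ Speech's metadata is laid out: one record per line, fields separated by `|`, no header row, and no quoting. Whether this corpus carries two or three fields per line is not stated above, so only the first two (file id and transcript) are read. The names `read_metadata` and `wav_path` are hypothetical.

```python
import csv
from pathlib import Path


def read_metadata(path):
    """Return (file_id, transcript) pairs from an LJ Speech-style metadata file."""
    rows = []
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
        reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            rows.append((fields[0], fields[1]))
    return rows


def wav_path(root, file_id):
    """Locate the audio file for an id such as 'LJ205-0013' under `root`."""
    # The digit after "LJ" names the subdirectory (wavs1 to wavs4).
    return Path(root) / f"wavs{file_id[2]}" / f"{file_id}.wav"
```

The first three digits of each file id can then be matched against the index column of sections_details_LJ2.csv to look up the source chapter.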
