Skip to content

Petlja/data-preprocessing

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

data-preprocessing

Tools to fetch and lightly preprocess Sphinx-based documentation repositories.

Project goals

  • Fetch repositories: clone or update repositories listed in config.json.
  • Detect project type: identify the Sphinx project type (if applicable).
  • Extract sources: collect activity/source files referenced in _sources/index.yaml or source/index.mdand convert them to normalized Markdown.

Setup

If you use Poetry (recommended):

# Install Poetry if needed
pip install poetry

# Install dependencies and create the virtual environment
poetry install

# Run the bootstrap command (installs pandoc, syncs repos, prepares dataset)
python -m scripts.cli bootstrap --config config.json --base-dir repos --output-dir dataset

If you prefer venv + pip:

# Create and activate a venv
python -m venv .venv
.\.venv\Scripts\activate.bat

# Install runtime dependencies
pip install -r requirements.txt

# Run the bootstrap command
python -m scripts.cli bootstrap --config config.json --base-dir repos --output-dir dataset

Commands

get-pandoc

Installs Pandoc (using pypandoc) if it is not already available, or reports the installed version.

Basic usage:

python -m scripts.cli get-pandoc

Options examples:

git-sync

Clone or update repositories listed in a config.json file.

Basic usage:

python -m scripts.cli git-sync

Options examples:

python -m scripts.cli git-sync --config my-config.json  # use alternate config file
python -m scripts.cli git-sync --base-dir repos         # change repos directory

The config file is a JSON object with a top-level repos array. Each item may be either a string (the repo URL) or an object with keys url and optional path.

Example config.json:

{
  "repos": [
    { "url": "https://example.com/repo1.git" },
    { "url": "https://example.com/repo2.git" }
  ]
}

prepare-dataset

Convert activity/source files from each repository into a normalized Markdown dataset. This command detects the project type (when possible), collects the list of files from _sources/index.yaml or _sources/index.md, and converts them to Markdown using Pandoc.

Basic usage:

python -m scripts.cli prepare-dataset

Options examples:

python -m scripts.cli prepare-dataset --base-dir repos --output-dir dataset

Additional options:

  • --jobs: control the number of worker threads for conversions. Use --jobs 1 to force single-threaded (serial) conversion; omit the option to use the default (number of CPUs), or pass --jobs N to use N workers explicitly.

Examples:

# Force serial conversion (useful for debugging or on constrained systems)
python -m scripts.cli prepare-dataset --base-dir repos --output-dir dataset --jobs 1

# Use the default number of workers (number of CPUs)
python -m scripts.cli prepare-dataset --base-dir repos --output-dir dataset

# Use 4 workers explicitly
python -m scripts.cli prepare-dataset --base-dir repos --output-dir dataset --jobs 4

Notes:

  • Ensure pandoc is available (use get-pandoc to install if needed).

bootstrap

Convenience command that runs the full local setup flow: ensures Pandoc is installed, syncs repositories listed in config.json, then prepares the dataset by converting activity files into normalized Markdown.

Basic usage:

python -m scripts.cli bootstrap

Options examples:

python -m scripts.cli bootstrap --config my-config.json --base-dir repos --output-dir dataset

What it does:

  • Runs get-pandoc (installs Pandoc via pypandoc if missing).
  • Runs git-sync to clone or update repositories from config.json into --base-dir.
  • Runs prepare-dataset to collect activity files and convert them into markdown under --output-dir.

Notes:

  • Useful for initial environment setup on a fresh machine or CI job.
  • The command calls the same underlying code as the three individual commands, so you can still run steps separately when you need more control.

Notes about parallel conversions:

  • The bootstrap command forwards the --jobs option to prepare-dataset. If you experience issues with parallel pandoc conversions or pandoc filters, retry with --jobs 1 to run conversions serially.

Example (force serial conversion during bootstrap):

python -m scripts.cli bootstrap --config my-config.json --base-dir repos --output-dir dataset --jobs 1

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages