Migrate file_formats.py to spec + code generation #182

@rtibbles

This issue is not open for contribution. Visit Contributing guidelines to learn about the contributing process and how to find suitable issues.

Overview

This is the FOUNDATION issue for the constants migration project. It establishes the infrastructure and pattern that Issues #2-5 will follow. This issue must be completed before the other migration issues can proceed.

Context

Currently, le_utils/constants/file_formats.py uses the legacy approach:

  • Loads resources/formatlookup.json at runtime with pkgutil.get_data()
  • Manual Python constants (MP4 = "mp4", PDF = "pdf", etc.) must be kept in sync
  • Manual _FORMATLOOKUP dict and getformat() helper function
  • No JavaScript export available
  • Tests verify Python/JSON sync

This issue migrates it to the modern spec + code generation approach used by 8 other modules.

Scope

This issue will:

  1. Enhance generate_from_specs.py to support namedtuple-based constants (the key infrastructure work)
  2. Create spec/constants-file_formats.json following the new format
  3. Generate Python and JavaScript files via make build
  4. Update tests to verify against the spec
  5. Delete resources/formatlookup.json
  6. Document the spec format for subsequent tasks

Current Structure

File: le_utils/resources/formatlookup.json (only has 20 formats)

{
  "mp4": {"mimetype": "video/mp4"},
  "webm": {"mimetype": "video/webm"},
  "vtt": {"mimetype": ".vtt"},
  "pdf": {"mimetype": "application/pdf"},
  ...
}

Python module (file_formats.py) currently has 40+ manual constants including:

  • Formats in JSON: MP4, WEBM, VTT, PDF, EPUB, MP3, JPG, JPEG, PNG, GIF, JSON, SVG, GRAPHIE, PERSEUS, H5P, ZIM, HTML5 (zip), BLOOMPUB, BLOOMD, HTML5_ARTICLE (kpub)
  • Formats NOT in JSON (these need to be added to spec): AVI, MOV, MPG, WMV, MKV, FLV, OGV, M4V, SRT, TTML, SAMI, SCC, DFXP
  • Namedtuple: class Format(namedtuple("Format", ["id", "mimetype"])): pass
  • LIST, choices tuple, helper function getformat()

Target Spec Format

Create spec/constants-file_formats.json with ALL formats including those currently missing from JSON:

{
  "namedtuple": {
    "name": "Format",
    "fields": ["id", "mimetype"]
  },
  "constants": {
    "mp4": {"mimetype": "video/mp4"},
    "webm": {"mimetype": "video/webm"},
    "avi": {"mimetype": "video/x-msvideo"},
    "mov": {"mimetype": "video/quicktime"},
    "mpg": {"mimetype": "video/mpeg"},
    "wmv": {"mimetype": "video/x-ms-wmv"},
    "mkv": {"mimetype": "video/x-matroska"},
    "flv": {"mimetype": "video/x-flv"},
    "ogv": {"mimetype": "video/ogg"},
    "m4v": {"mimetype": "video/x-m4v"},
    "vtt": {"mimetype": "text/vtt"},
    "srt": {"mimetype": "application/x-subrip"},
    "ttml": {"mimetype": "application/ttml+xml"},
    "sami": {"mimetype": "application/x-sami"},
    "scc": {"mimetype": "text/x-scc"},
    "dfxp": {"mimetype": "application/ttaf+xml"},
    "mp3": {"mimetype": "audio/mpeg"},
    "pdf": {"mimetype": "application/pdf"},
    "epub": {"mimetype": "application/epub+zip"},
    "jpg": {"mimetype": "image/jpeg"},
    "jpeg": {"mimetype": "image/jpeg"},
    "png": {"mimetype": "image/png"},
    "gif": {"mimetype": "image/gif"},
    "json": {"mimetype": "application/json"},
    "svg": {"mimetype": "image/svg+xml"},
    "graphie": {"mimetype": "application/graphie"},
    "perseus": {"mimetype": "application/perseus+zip"},
    "h5p": {"mimetype": "application/h5p+zip"},
    "zim": {"mimetype": "application/zim"},
    "zip": {"mimetype": "application/zip"},
    "bloompub": {"mimetype": "application/bloompub+zip"},
    "bloomd": {"mimetype": "application/bloompub+zip"},
    "kpub": {"mimetype": "application/kpub+zip"}
  }
}
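A minimal sketch of how a spec in this shape could be loaded and validated. It assumes the first entry in "fields" is the key of each "constants" entry and that every entry supplies exactly the remaining fields; `load_namedtuple_spec` is a hypothetical helper, not an existing function in the repository, and the spec is inlined (and trimmed) so the example runs on its own.

```python
import json
from collections import namedtuple

# Trimmed copy of the spec above, inlined so the sketch is self-contained.
SPEC_TEXT = """
{
  "namedtuple": {"name": "Format", "fields": ["id", "mimetype"]},
  "constants": {
    "mp4": {"mimetype": "video/mp4"},
    "pdf": {"mimetype": "application/pdf"}
  }
}
"""


def load_namedtuple_spec(text):
    """Hypothetical helper: parse a namedtuple-style spec into a class and instances."""
    spec = json.loads(text)
    definition = spec["namedtuple"]
    fields = definition["fields"]
    # Assumption: the first field is filled from the key of each "constants" entry.
    key_field, value_fields = fields[0], fields[1:]
    cls = namedtuple(definition["name"], fields)
    items = []
    for key, values in spec["constants"].items():
        missing = [f for f in value_fields if f not in values]
        extra = [f for f in values if f not in value_fields]
        if missing or extra:
            raise ValueError(
                "{}: missing {}, unexpected {}".format(key, missing, extra)
            )
        items.append(cls(**dict(values, **{key_field: key})))
    return cls, items


Format, FORMATLIST = load_namedtuple_spec(SPEC_TEXT)
```

Rejecting entries with missing or unexpected fields at load time would surface typos in the spec before any file is generated.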

How to determine mimetypes for missing formats: look each one up in the MDN and IANA media type resources.
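One possible first pass, offered only as a starting point: Python's standard `mimetypes` module can suggest a type for some extensions. Its table varies by Python version and does not cover every format listed here, so each guess would still need to be checked against MDN/IANA. `guess_mimetype` is a hypothetical helper name.

```python
import mimetypes

# A fresh MimeTypes instance starts from the table bundled with Python.
_DB = mimetypes.MimeTypes()


def guess_mimetype(extension):
    """Return the stdlib's guess for a bare extension such as "mov", or None."""
    mimetype, _encoding = _DB.guess_type("file." + extension)
    return mimetype


# Print the guesses for a few of the formats that are missing from the JSON.
for ext in ["avi", "mov", "mpg", "srt", "dfxp"]:
    print(ext, guess_mimetype(ext))
```

Extensions for which this returns None have to be resolved by hand.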

Generation Script Enhancement

Update scripts/generate_from_specs.py to handle the namedtuple format:

  1. Modify read_constants_specs() to detect and handle namedtuple format:

    • Check if spec has namedtuple key
    • If yes, extract namedtuple definition and constants
    • If no, use existing simple constant handling
  2. Update write_python_file() to support namedtuples:

    • Add from collections import namedtuple import when needed
    • Generate namedtuple class definition
    • Generate {MODULE}LIST with namedtuple instances
    • Generate uppercase constants from keys (e.g., MP4 = "mp4")
    • Generate _MIMETYPE constants (e.g., MP4_MIMETYPE = "video/mp4") for each format
    • Generate choices tuple with custom display names (from spec or title-cased)
    • Generate lookup dict: _{MODULE}LOOKUP = {item.id: item for item in {MODULE}LIST}
    • Generate helper function (e.g., getformat())
  3. Update write_js_file() to export the full namedtuple data under PascalCase export names:

    • Export constant name → id mapping (default export, e.g., MP4: "mp4")
    • Export FormatsList - full namedtuple data as array
    • Export FormatsMap - Map for efficient lookups

Generated Output Example

Python (le_utils/constants/file_formats.py):

# -*- coding: utf-8 -*-
# Generated by scripts/generate_from_specs.py
from __future__ import unicode_literals
from collections import namedtuple

# FileFormats

class Format(namedtuple("Format", ["id", "mimetype"])):
    pass

# Format constants
MP4 = "mp4"
WEBM = "webm"
AVI = "avi"
PDF = "pdf"
# ... (all formats)

# Mimetype constants
MP4_MIMETYPE = "video/mp4"
WEBM_MIMETYPE = "video/webm"
AVI_MIMETYPE = "video/x-msvideo"
PDF_MIMETYPE = "application/pdf"
# ...

choices = (
    (MP4, "Mp4"),
    (WEBM, "Webm"),
    (AVI, "Avi"),
    (PDF, "Pdf"),
    # ...
)

FORMATLIST = [
    Format(id="mp4", mimetype="video/mp4"),
    Format(id="webm", mimetype="video/webm"),
    Format(id="avi", mimetype="video/x-msvideo"),
    # ...
]

_FORMATLOOKUP = {f.id: f for f in FORMATLIST}

def getformat(id, default=None):
    """
    Look up the Format object for `id`.
    Returns `default` (None unless specified) if no format has that id.
    """
    return _FORMATLOOKUP.get(id, default)

JavaScript (js/FileFormats.js):

// Generated by scripts/generate_from_specs.py

// Format constants
export default {
    MP4: "mp4",
    WEBM: "webm",
    AVI: "avi",
    PDF: "pdf",
    // ...
};

// Full format data with mimetypes
export const FormatsList = [
    { id: "mp4", mimetype: "video/mp4" },
    { id: "webm", mimetype: "video/webm" },
    { id: "avi", mimetype: "video/x-msvideo" },
    { id: "pdf", mimetype: "application/pdf" },
    // ...
];

// Lookup Map
export const FormatsMap = new Map(
    FormatsList.map(format => [format.id, format])
);

This way JavaScript code can:

  • Use constants: import FileFormats from './FileFormats'; if (ext === FileFormats.MP4) ...
  • Access full data: import { FormatsList } from './FileFormats';
  • Look up by id: import { FormatsMap } from './FileFormats'; const format = FormatsMap.get('pdf');
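The JavaScript file shown above could be rendered from the same spec along these lines. As with the Python side, this is a sketch rather than the actual `write_js_file()`: `render_js` and its parameters are hypothetical, and the first namedtuple field is assumed to be the key.

```python
import json


def render_js(spec, list_export="FormatsList", map_export="FormatsMap"):
    """Hypothetical renderer: build the source of the JavaScript constants module."""
    key_field = spec["namedtuple"]["fields"][0]  # assumption: first field is the key
    constants = spec["constants"]
    out = ["// Generated by scripts/generate_from_specs.py", "", "export default {"]
    # Constant name -> id mapping for the default export
    for key in constants:
        out.append("    {}: {},".format(key.upper(), json.dumps(key)))
    out += ["};", "", "export const {} = [".format(list_export)]
    # One object literal per spec entry, with the key field listed first
    for key, values in constants.items():
        record = dict({key_field: key}, **values)
        pairs = ", ".join(
            "{}: {}".format(f, json.dumps(v)) for f, v in record.items()
        )
        out.append("    {{ {} }},".format(pairs))
    out += [
        "];",
        "",
        "export const {} = new Map(".format(map_export),
        "    {}.map(item => [item.{}, item])".format(list_export, key_field),
        ");",
    ]
    return "\n".join(out) + "\n"


SPEC = {
    "namedtuple": {"name": "Format", "fields": ["id", "mimetype"]},
    "constants": {
        "mp4": {"mimetype": "video/mp4"},
        "pdf": {"mimetype": "application/pdf"},
    },
}
js_source = render_js(SPEC)
print(js_source)
```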

Testing Updates

File: tests/test_formats.py

Update to test against spec instead of old JSON:

import os
import json

spec_path = os.path.join(os.path.dirname(__file__), "..", "spec", "constants-file_formats.json")
with open(spec_path) as f:
    spec = json.load(f)
    formatlookup = spec["constants"]

# Verify all constants in Python module match spec
# Verify FORMATLIST namedtuples match spec data
# Test getformat() helper
# Verify _MIMETYPE constants
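The placeholder comments above could be filled in along these lines. `check_against_spec` is a hypothetical helper, and a stand-in namespace takes the place of the generated `le_utils.constants.file_formats` module so the sketch runs on its own; in `tests/test_formats.py` the real module and `spec["constants"]` would be passed instead.

```python
from collections import namedtuple
from types import SimpleNamespace


def check_against_spec(module, constants):
    """Hypothetical helper: list mismatches between a constants module and the spec."""
    problems = []
    for key, values in constants.items():
        const = key.upper()
        # Uppercase id constant, e.g. MP4 == "mp4"
        if getattr(module, const, None) != key:
            problems.append("{} != {!r}".format(const, key))
        # Matching _MIMETYPE constant
        if getattr(module, const + "_MIMETYPE", None) != values["mimetype"]:
            problems.append("{}_MIMETYPE mismatch".format(const))
        # getformat() must return a namedtuple carrying the spec's mimetype
        fmt = module.getformat(key)
        if fmt is None or fmt.mimetype != values["mimetype"]:
            problems.append("getformat({!r}) mismatch".format(key))
    # FORMATLIST must cover exactly the ids in the spec
    if {f.id for f in module.FORMATLIST} != set(constants):
        problems.append("FORMATLIST ids differ from spec keys")
    return problems


# Stand-in for the generated module, only so this sketch is self-contained.
Format = namedtuple("Format", ["id", "mimetype"])
_LIST = [Format("mp4", "video/mp4"), Format("pdf", "application/pdf")]
_LOOKUP = {f.id: f for f in _LIST}
fake_module = SimpleNamespace(
    MP4="mp4",
    MP4_MIMETYPE="video/mp4",
    PDF="pdf",
    PDF_MIMETYPE="application/pdf",
    FORMATLIST=_LIST,
    getformat=lambda id, default=None: _LOOKUP.get(id, default),
)
constants = {
    "mp4": {"mimetype": "video/mp4"},
    "pdf": {"mimetype": "application/pdf"},
}
assert check_against_spec(fake_module, constants) == []
```

Returning a list of problems instead of failing on the first one makes a single test run report every constant that has drifted from the spec.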

How to Run Tests

# Run file formats tests
pytest tests/test_formats.py -v

# Run all tests to ensure nothing broke
pytest tests/ -v

Acceptance Criteria

  • scripts/generate_from_specs.py enhanced to support namedtuple specs
  • spec/constants-file_formats.json created with ALL formats (including AVI, MOV, SRT, etc. currently missing)
  • Mimetypes determined for all missing formats (using MDN/IANA resources)
  • make build successfully generates Python and JavaScript files
  • Generated le_utils/constants/file_formats.py has:
    • Namedtuple class definition
    • Uppercase format constants for ALL formats
    • _MIMETYPE constants for each format
    • choices tuple
    • FORMATLIST with namedtuple instances
    • _FORMATLOOKUP dict
    • getformat() helper function
  • Generated js/FileFormats.js has:
    • Default export with constant name mappings
    • FormatsList export (PascalCase) with full data
    • FormatsMap export (PascalCase) as Map
  • tests/test_formats.py updated to test against spec
  • All tests pass: pytest tests/ -v
  • resources/formatlookup.json deleted
  • Generated files start with the "Generated by scripts/generate_from_specs.py" header comment

Disclosure

🤖 This issue was written by Claude Code, under supervision, review and final edits by @rtibbles 🤖
