Migrate languages.py to spec + code generation #186

@rtibbles

Overview

Migrate le_utils/constants/languages.py from the legacy JSON-as-data approach to the modern spec + code generation system. This is the most complex module to migrate: it has 1,141 lines of language data, custom namedtuple properties, and multiple helper functions.

Context

Currently, le_utils/constants/languages.py uses the legacy approach:

  • Loads resources/languagelookup.json (22,820 bytes, 1,141 lines!)
  • Custom Language namedtuple with code, id, and first_native_name properties
  • Multiple helper functions: getlang(), getlang_by_name(), getlang_by_native_name(), getlang_by_alpha2()
  • RTL language list: RTL_LANG_CODES
  • No JavaScript export available

Current Structure

File: le_utils/resources/languagelookup.json (1,141 language entries)

{
  "aa": {
    "name": "Afar",
    "native_name": "Afaraf"
  },
  "en": {
    "name": "English",
    "native_name": "English"
  },
  "es-MX": {
    "name": "Spanish (Mexico)",
    "native_name": "Español (México)"
  },
  ...
}

The Python module has:

  • Custom Language namedtuple with properties:
    • code property: combines primary_code and subcode (e.g., "en-US")
    • id property: alias for code
    • first_native_name property: first name from comma-separated list
  • Helper functions for lookups by various criteria
  • RTL language codes list

Target Spec Format

Create spec/constants-languages.json with all language data:

{
  "namedtuple": {
    "name": "Language",
    "fields": ["native_name", "primary_code", "subcode", "name"],
    "properties": {
      "code": "return '{}-{}'.format(self.primary_code, self.subcode) if self.subcode else self.primary_code",
      "id": "return self.code",
      "first_native_name": "return self.native_name.split(',')[0]"
    }
  },
  "rtl_codes": ["ar", "arq", "dv", "he", "fa", "ps", "ur", "yi"],
  "constants": {
    "aa": {
      "name": "Afar",
      "native_name": "Afaraf"
    },
    "en": {
      "name": "English",
      "native_name": "English"
    },
    "es-MX": {
      "name": "Spanish (Mexico)",
      "native_name": "Español (México)"
    }
  }
}

Copy all 1,141 entries from languagelookup.json. The generation script will parse language codes (e.g., "es-MX") into primary_code="es" and subcode="MX".
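
The code-splitting step described above could be sketched as follows. This is a minimal illustration, not the actual generation script: the helper name split_language_code is hypothetical, and splitting on the first hyphen only is an assumption about how keys with more than one hyphen should be handled.

```python
def split_language_code(code):
    """Split a spec key such as "es-MX" into (primary_code, subcode)."""
    # Assumption: split on the first hyphen only, so a key with more than
    # one hyphen keeps everything after the first hyphen in the subcode.
    primary_code, _, subcode = code.partition("-")
    # Keys without a hyphen get subcode=None, matching the generated output.
    return primary_code, subcode or None


print(split_language_code("es-MX"))  # ('es', 'MX')
print(split_language_code("en"))     # ('en', None)
```

With this split, the code property ("{}-{}".format(primary_code, subcode)) reproduces the original spec key.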

Note: The properties metadata tells the generation script to add @property methods to the namedtuple class.
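
One way the generation script might turn the properties metadata into @property methods is sketched below. The function name render_namedtuple_class is hypothetical and the real scripts/generate_from_specs.py may be structured differently; the sketch assumes each property body in the spec is a single Python statement.

```python
def render_namedtuple_class(spec):
    """Return Python source for the namedtuple subclass described by the spec."""
    meta = spec["namedtuple"]
    lines = [
        "from collections import namedtuple",
        "",
        "",
        "class {name}(namedtuple({name!r}, {fields!r})):".format(
            name=meta["name"], fields=meta["fields"]
        ),
    ]
    properties = meta.get("properties", {})
    if not properties:
        # Keep the class body syntactically valid when no properties exist.
        lines.append("    pass")
    for prop_name, body in properties.items():
        # Assumption: each property body is a single Python statement.
        lines += [
            "    @property",
            "    def {}(self):".format(prop_name),
            "        " + body,
            "",
        ]
    return "\n".join(lines)


spec = {
    "namedtuple": {
        "name": "Language",
        "fields": ["native_name", "primary_code", "subcode", "name"],
        "properties": {
            "code": "return '{}-{}'.format(self.primary_code, self.subcode) if self.subcode else self.primary_code",
            "id": "return self.code",
            "first_native_name": "return self.native_name.split(',')[0]",
        },
    }
}

source = render_namedtuple_class(spec)

# Execute the generated source to check that it defines a working class.
namespace = {}
exec(source, namespace)
Language = namespace["Language"]

lang = Language(
    native_name="Español (México)", primary_code="es", subcode="MX", name="Spanish (Mexico)"
)
print(lang.code)               # es-MX
print(lang.first_native_name)  # Español (México)
```

Because the property bodies are emitted verbatim, the spec effectively contains executable Python, so the spec file should be treated as trusted source code.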

Generation Script Enhancement

Update scripts/generate_from_specs.py to handle:

  1. Namedtuple properties from properties metadata
  2. RTL codes list from rtl_codes metadata
  3. Helper functions for language lookups:
    • getlang(code) - lookup by full language code (e.g., "en" or "es-MX")
    • getlang_by_name(name) - case-insensitive lookup by English name
    • getlang_by_native_name(native_name) - case-insensitive lookup by native name
    • getlang_by_alpha2(alpha2) - lookup by 2-letter code (only entries without a subcode)

Generated Output Example

Python (le_utils/constants/languages.py):

# Generated by scripts/generate_from_specs.py
from collections import namedtuple

class Language(namedtuple("Language", ["native_name", "primary_code", "subcode", "name"])):
    @property
    def code(self):
        return "{}-{}".format(self.primary_code, self.subcode) if self.subcode else self.primary_code
    
    @property
    def id(self):
        return self.code
    
    @property
    def first_native_name(self):
        return self.native_name.split(",")[0]

RTL_LANG_CODES = ["ar", "arq", "dv", "he", "fa", "ps", "ur", "yi"]

LANGUAGELIST = [
    Language(native_name="Afaraf", primary_code="aa", subcode=None, name="Afar"),
    Language(native_name="English", primary_code="en", subcode=None, name="English"),
    Language(native_name="Español (México)", primary_code="es", subcode="MX", name="Spanish (Mexico)"),
    # ... (1,141 total)
]

_LANGUAGELOOKUP = {lang.code: lang for lang in LANGUAGELIST}
_LANGUAGELOOKUP_BY_NAME = {lang.name.lower(): lang for lang in LANGUAGELIST}
_LANGUAGELOOKUP_BY_NATIVE_NAME = {lang.native_name.lower(): lang for lang in LANGUAGELIST}
_LANGUAGELOOKUP_BY_ALPHA2 = {lang.primary_code: lang for lang in LANGUAGELIST if not lang.subcode}

def getlang(code, default=None):
    return _LANGUAGELOOKUP.get(code) or default

def getlang_by_name(name, default=None):
    return _LANGUAGELOOKUP_BY_NAME.get(name.lower()) or default

def getlang_by_native_name(native_name, default=None):
    return _LANGUAGELOOKUP_BY_NATIVE_NAME.get(native_name.lower()) or default

def getlang_by_alpha2(alpha2, default=None):
    return _LANGUAGELOOKUP_BY_ALPHA2.get(alpha2) or default
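
The behaviour of the lookups on edge cases can be checked with a small stand-in. The sketch below uses a three-entry list rather than the full generated LANGUAGELIST, and the shortened dict names are illustrative only; it shows that getlang_by_alpha2() matches only entries without a subcode and that a missing code falls back to default.

```python
from collections import namedtuple


class Language(namedtuple("Language", ["native_name", "primary_code", "subcode", "name"])):
    @property
    def code(self):
        if self.subcode:
            return "{}-{}".format(self.primary_code, self.subcode)
        return self.primary_code


# Three-entry stand-in for the generated LANGUAGELIST.
LANGUAGELIST = [
    Language("Afaraf", "aa", None, "Afar"),
    Language("English", "en", None, "English"),
    Language("Español (México)", "es", "MX", "Spanish (Mexico)"),
]

_BY_CODE = {lang.code: lang for lang in LANGUAGELIST}
# Entries with a subcode are excluded from the alpha2 lookup.
_BY_ALPHA2 = {lang.primary_code: lang for lang in LANGUAGELIST if not lang.subcode}


def getlang(code, default=None):
    return _BY_CODE.get(code) or default


def getlang_by_alpha2(alpha2, default=None):
    return _BY_ALPHA2.get(alpha2) or default


print(getlang("es-MX").name)         # Spanish (Mexico)
print(getlang_by_alpha2("en").code)  # en
print(getlang_by_alpha2("es"))       # None: only "es-MX" is in this stand-in list
print(getlang("xx", default="n/a"))  # n/a
```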

JavaScript (js/Languages.js):

// Generated by scripts/generate_from_specs.py

export const RTL_LANG_CODES = ["ar", "arq", "dv", "he", "fa", "ps", "ur", "yi"];

export const LanguagesList = [
    { native_name: "Afaraf", primary_code: "aa", subcode: null, name: "Afar", code: "aa", first_native_name: "Afaraf" },
    { native_name: "English", primary_code: "en", subcode: null, name: "English", code: "en", first_native_name: "English" },
    { native_name: "Español (México)", primary_code: "es", subcode: "MX", name: "Spanish (Mexico)", code: "es-MX", first_native_name: "Español (México)" },
    // ...
];

export const LanguagesMap = new Map(
    LanguagesList.map(lang => [lang.code, lang])
);

export function getLanguage(code) {
    return LanguagesMap.get(code) || null;
}

export function getLanguageByName(name) {
    return LanguagesList.find(lang => lang.name.toLowerCase() === name.toLowerCase()) || null;
}

export function getLanguageByNativeName(nativeName) {
    return LanguagesList.find(lang => lang.native_name.toLowerCase() === nativeName.toLowerCase()) || null;
}

export function getLanguageByAlpha2(alpha2) {
    return LanguagesList.find(lang => lang.primary_code === alpha2 && !lang.subcode) || null;
}

Testing Updates

Files: tests/test_languages.py and tests/test_getlangs.py

Update to test against spec:

import json
import os

spec_path = os.path.join(os.path.dirname(__file__), "..", "spec", "constants-languages.json")
with open(spec_path) as f:
    spec = json.load(f)
    languagelookup = spec["constants"]

# Verify all 1,141 languages
# Test helper functions
# Test Language properties (code, id, first_native_name)
# Test RTL_LANG_CODES list

How to Run Tests

pytest tests/test_languages.py -v
pytest tests/test_getlangs.py -v
pytest tests/ -v

Acceptance Criteria

  • spec/constants-languages.json created with all 1,141 language entries
  • Added properties metadata for code, id, first_native_name
  • Added rtl_codes metadata
  • scripts/generate_from_specs.py enhanced to generate namedtuple properties
  • make build successfully generates Python and JavaScript files
  • Generated le_utils/constants/languages.py has:
    • Language namedtuple with 4 fields and 3 properties
    • RTL_LANG_CODES list
    • LANGUAGELIST with all 1,141 languages
    • Helper functions (getlang, getlang_by_name, etc.)
    • Lookup dicts
  • Generated js/Languages.js has:
    • RTL_LANG_CODES export
    • LanguagesList with computed properties (code, first_native_name)
    • LanguagesMap for lookups
    • Helper functions (getLanguage, getLanguageByName, etc.)
  • Tests updated to test against spec
  • All tests pass
  • resources/languagelookup.json deleted

Disclosure

🤖 This issue was written by Claude Code, under supervision, review and final edits by @rtibbles 🤖
