A flexible, configurable Python tool for extracting fields from XML files and converting them to rectangular data formats like CSV or Excel. Designed to handle complex XML formats like METS, LIDO, PREMIS, and more.
🚨 This project was developed entirely with the assistance of GitHub Copilot. It is currently at a prototype stage — use at your own risk. Contributions, reviews, and testing are highly appreciated.
- 🔍 Flexible XPath-based extraction - Extract any field using XPath expressions
- 🏷️ Automatic namespace detection - No need to manually define namespaces (but you can if needed)
- 🎯 Advanced filtering - Filter extracted values using regex, startswith, or contains patterns
- 🔬 Record filtering - Pre-filter records before extraction based on any field criteria (dates, licenses, etc.)
- 📊 Multiple value handling - Concatenate multiple values with custom separators
- 🚀 Batch processing - Process entire directories of XML files
- 📈 Progress tracking - Beautiful progress bars and statistics with Rich
- 📝 Comprehensive logging - Detailed logs with Loguru
- 🛡️ Error resilience - Skip invalid XML files and continue processing
- 🎨 Modern CLI - Built with Click for an intuitive command-line interface
This project uses uv for fast, reliable package management.
# Clone the repository
git clone <your-repo-url>
cd xml_extractor
# Install dependencies with uv
uv sync
1. Prepare your XML files - place them in a directory (e.g., `./example_data`)
2. Configure extraction - edit `config.yaml` to define:
   - Root XPath for record elements
   - Field mappings with XPath expressions
   - Namespaces (optional - auto-detected if omitted)
   - Filters for specific values
3. Run extraction:
uv run python xml_extractor.py

# Input/Output
input_directory: "./example_data"
output_file: "output.csv"
# XPath to root element (each match = one CSV row)
root_xpath: ".//oai_dc:dc"
# Namespace definitions (optional)
namespaces:
oai_dc: "http://www.openarchives.org/OAI/2.0/oai_dc/"
dc: "http://purl.org/dc/elements/1.1/"
# Field mappings
fields:
- column: "Title"
xpath: ".//dc:title/text()"
- column: "Creators"
xpath: ".//dc:creator/text()"
separator: " | " # For multiple values
- column: "URN"
xpath: ".//dc:identifier/text()"
filter:
type: "startswith"
pattern: "urn:nbn"

Each field can have the following properties:

| Property | Required | Description |
|---|---|---|
| `column` | ✅ Yes | Name of the CSV column |
| `xpath` | ✅ Yes | XPath expression (relative to `root_xpath`) |
| `separator` | ❌ No | Separator for multiple values (default: `" \| "`) |
| `filter` | ❌ No | Filter configuration for value selection |
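To illustrate how a mapping with a separator and a filter might be applied, here is a hypothetical sketch using lxml (the tool's XML library); the function `extract_field` and its exact semantics are assumptions, not the tool's actual API:

```python
# Hypothetical sketch of applying one field mapping with lxml; the real
# logic lives in xml_extractor.py and may differ in detail.
from lxml import etree

NS = {"dc": "http://purl.org/dc/elements/1.1/"}

record = etree.fromstring(
    b'<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b"<dc:creator>Author One</dc:creator>"
    b"<dc:creator>Author Two</dc:creator>"
    b"<dc:identifier>urn:nbn:de:hbz:6-123</dc:identifier>"
    b"</record>"
)

def extract_field(rec, xpath, separator=" | ", startswith=None):
    """Evaluate an XPath relative to the record, optionally filter the
    values with a startswith pattern, and join the survivors."""
    values = [str(v).strip() for v in rec.xpath(xpath, namespaces=NS)]
    if startswith is not None:
        values = [v for v in values if v.startswith(startswith)]
    return separator.join(values)

print(extract_field(record, ".//dc:creator/text()"))   # Author One | Author Two
print(extract_field(record, ".//dc:identifier/text()",
                    startswith="urn:nbn"))             # urn:nbn:de:hbz:6-123
```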
Apply filters to select specific values when an XPath matches multiple elements:
Regex Filter:

filter:
type: "regex"
pattern: "^urn:nbn" # Matches URNs starting with "urn:nbn"

StartsWith Filter:

filter:
type: "startswith"
pattern: "10." # Matches DOIs starting with "10."

Contains Filter:

filter:
type: "contains"
pattern: "miami.uni-muenster.de" # Matches URLs containing this domain

Extract specific parts from text using regex patterns with capture groups:
# Extract date from MARC 008 field and reformat: "180830e20180830||..." → "2018-08-30"
- column: "Publication_Date"
xpath: ".//marcxml:controlfield[@tag='008']/text()"
transform:
regex: "\\d{6}e(\\d{4})(\\d{2})(\\d{2})"
format: "{0}-{1}-{2}" # Reformat using capture groups
# Extract language code: "...ger||||||" → "ger"
- column: "Language_Code"
xpath: ".//marcxml:controlfield[@tag='008']/text()"
transform:
regex: "([a-z]{3})\\|{6}$"
group: 1

Transform options:

- `regex`: Regular expression pattern (use `\\` for backslashes in YAML)
- `group`: Which capture group to extract (default: 0 = full match, 1+ = capture groups)
- `format`: Optional format string to reformat the matched data
  - Use `{0}`, `{1}`, `{2}` to reference capture groups
  - Example: `"{0}-{1}-{2}"` converts `20180830` to `2018-08-30`
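A rough sketch of how such a transform could behave in plain Python (`apply_transform` is a hypothetical helper for illustration, not the tool's actual function):

```python
import re

def apply_transform(value, regex, group=0, fmt=None):
    """Hypothetical transform: apply the regex and either return one
    capture group or reformat all capture groups with a format string."""
    m = re.search(regex, value)
    if m is None:
        return None  # no match -> no value extracted
    if fmt is not None:
        return fmt.format(*m.groups())
    return m.group(group)

# Sample MARC 008 control field value (illustrative, not real data):
raw_008 = "180830e20180830       ger||||||"

print(apply_transform(raw_008, r"\d{6}e(\d{4})(\d{2})(\d{2})",
                      fmt="{0}-{1}-{2}"))                      # 2018-08-30
print(apply_transform(raw_008, r"([a-z]{3})\|{6}$", group=1))  # ger
```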
Automatically prepend a URL to extracted values:
- column: "DOI"
xpath: ".//datafield[@tag='024']/subfield[@code='a']/text()"
url_prefix: "https://doi.org/" # Converts "10.1234/..." to "https://doi.org/10.1234/..."
Pre-filter records before extraction to only process records matching specific criteria:
# Only extract records from the 1920s with Public Domain licenses
record_filters:
- xpath: ".//mods:dateCreated/text()"
condition: "matches"
value: "192\\d"
- xpath: ".//mods:accessCondition/text()"
condition: "contains"
value: "publicdomain"

Available filter conditions: `exists`, `not_exists`, `equals`, `not_equals`, `contains`, `not_contains`, `matches`, `not_matches`, `date_after`, `date_before`, `in`, `not_in`
📖 See RECORD_FILTERS.md for complete documentation and examples
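As a rough illustration of how record filters might be evaluated (`condition_holds` is a hypothetical helper sketching only three of the conditions, and the assumption that all filters must pass, i.e. AND semantics, is mine):

```python
import re

def condition_holds(values, condition, expected=None):
    """Evaluate one filter condition against the values an XPath returned.
    Only a subset of the documented conditions is sketched here."""
    if condition == "exists":
        return bool(values)
    if condition == "contains":
        return any(expected in v for v in values)
    if condition == "matches":
        return any(re.search(expected, v) for v in values)
    raise ValueError(f"condition not sketched: {condition}")

# One record's field values, keyed by the filter XPath (illustrative data):
record = {
    ".//mods:dateCreated/text()": ["1923"],
    ".//mods:accessCondition/text()": ["http://creativecommons.org/publicdomain/mark/1.0/"],
}
filters = [
    (".//mods:dateCreated/text()", "matches", r"192\d"),
    (".//mods:accessCondition/text()", "contains", "publicdomain"),
]

# Assumed AND semantics: the record is kept only if every filter passes.
keep = all(condition_holds(record[xp], cond, val) for xp, cond, val in filters)
print(keep)  # True
```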
# Use default config.yaml
uv run python xml_extractor.py
# Specify custom config
uv run python xml_extractor.py --config my_config.yaml
# Override input/output
uv run python xml_extractor.py --input ./my_xmls --output results.csv
# Enable debug mode
uv run python xml_extractor.py --debug

Options:
-c, --config PATH Path to configuration YAML file [default: config.yaml]
-i, --input PATH Input directory containing XML files (overrides config)
-o, --output PATH Output CSV file path (overrides config)
--debug Enable debug mode with verbose logging
--log-file PATH Log file path (overrides config)
--help Show this message and exit
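For reference, a minimal Click skeleton wiring up options like these could look as follows (a hypothetical sketch, not the actual entry point in xml_extractor.py):

```python
# Hypothetical CLI skeleton with Click; the real entry point also loads
# the YAML config and runs the extraction.
import click

@click.command()
@click.option("-c", "--config", "config_path", default="config.yaml",
              show_default=True, help="Path to configuration YAML file.")
@click.option("-i", "--input", "input_dir", type=click.Path(), default=None,
              help="Input directory containing XML files (overrides config).")
@click.option("-o", "--output", "output_path", type=click.Path(), default=None,
              help="Output CSV file path (overrides config).")
@click.option("--debug", is_flag=True,
              help="Enable debug mode with verbose logging.")
def main(config_path, input_dir, output_path, debug):
    """Extract fields from XML files into a rectangular CSV."""
    click.echo(f"config={config_path} debug={debug}")
```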
Your XML files can have different structures:
Single record per file:
<OAI-PMH xmlns="...">
<GetRecord>
<record>
<metadata>
<oai_dc:dc xmlns:oai_dc="..." xmlns:dc="...">
<dc:title>My Title</dc:title>
<dc:creator>Author Name</dc:creator>
<!-- more fields -->
</oai_dc:dc>
</metadata>
</record>
</GetRecord>
</OAI-PMH>

Multiple records per file:
<records>
<record>
<metadata>
<oai_dc:dc xmlns:oai_dc="..." xmlns:dc="...">
<!-- fields -->
</oai_dc:dc>
</metadata>
</record>
<record>
<metadata>
<oai_dc:dc xmlns:oai_dc="..." xmlns:dc="...">
<!-- fields -->
</oai_dc:dc>
</metadata>
</record>
</records>

Both formats work seamlessly - the tool finds all oai_dc:dc elements across all files.
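This works because a descendant XPath like `.//oai_dc:dc` matches at any nesting depth. A small sketch with the standard library (the sample documents are illustrative):

```python
import xml.etree.ElementTree as ET

NS = {"oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/"}

single = (
    '<OAI-PMH xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">'
    "<GetRecord><record><metadata><oai_dc:dc/></metadata></record></GetRecord>"
    "</OAI-PMH>"
)
multi = (
    '<records xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">'
    "<record><metadata><oai_dc:dc/></metadata></record>"
    "<record><metadata><oai_dc:dc/></metadata></record>"
    "</records>"
)

# ".//oai_dc:dc" finds the record root wherever it is nested, so the same
# root_xpath covers single-record and multi-record files alike.
for doc in (single, multi):
    records = ET.fromstring(doc).findall(".//oai_dc:dc", NS)
    print(len(records))  # prints 1, then 2
```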
Title,Creators,Date,URN,DocType
"Religiöse Traditionen in...","Dam, P. (Peter) van",2018-08-30,urn:nbn:de:hbz:6-87159515859,doc-type:article
"Urinary Dickkopf 3...","Jehn, U. (Ulrich) | Altuner, U. (Ugur)",2024-08-21,urn:nbn:de:hbz:6-05978733071,doc-type:article
root_xpath: ".//mets:mets"
namespaces:
mets: "http://www.loc.gov/METS/"
mods: "http://www.loc.gov/mods/v3"
xlink: "http://www.w3.org/1999/xlink"
fields:
- column: "Title"
xpath: ".//mets:dmdSec/mets:mdWrap/mets:xmlData/mods:mods/mods:titleInfo/mods:title/text()"
- column: "FileID"
xpath: ".//mets:file/@ID"

root_xpath: ".//lido:lido"
namespaces:
lido: "http://www.lido-schema.org"
fields:
- column: "ObjectTitle"
xpath: ".//lido:titleWrap/lido:titleSet/lido:appellationValue/text()"
- column: "ObjectType"
xpath: ".//lido:objectWorkType/lido:term/text()"

After extraction, you'll see:
- ✅ Files processed
- ⚠️ Files skipped (due to errors)
- 📝 Total records extracted
- ⚠️ Fields with missing data
- ❌ Errors encountered
All details are logged to extraction.log (configurable).
xml_extractor/
├── xml_extractor.py # Main application
├── config.yaml # Configuration file
├── pyproject.toml # Project dependencies
├── README.md # This file
└── extraction.log # Generated log file
- lxml - Fast XML processing with XPath support
- rich - Beautiful terminal output
- click - CLI framework
- loguru - Simple, powerful logging
- pyyaml - YAML configuration parsing
Contributions are welcome! Please feel free to submit issues or pull requests.
MIT License - feel free to use this tool for your projects!
- Start with debug mode (`--debug`) when creating a new configuration to see what's happening
- Test with a small subset of files first before processing large batches
- Use filters to extract specific identifiers (URNs, DOIs, etc.) from multi-value fields
- Check the log file (`extraction.log`) for warnings about missing XPath matches
- Namespaces are auto-detected - you only need to define them manually if auto-detection fails
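One plausible way namespace auto-detection can be implemented is by collecting every namespace declaration while parsing; this is a sketch, not necessarily how xml_extractor.py actually does it:

```python
import io
import xml.etree.ElementTree as ET

def detect_namespaces(source):
    """Collect every prefix -> URI pair declared anywhere in the document
    by listening for the parser's 'start-ns' events."""
    return {prefix: uri
            for _event, (prefix, uri)
            in ET.iterparse(source, events=("start-ns",))}

xml_data = (
    '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"'
    ' xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>My Title</dc:title>"
    "</oai_dc:dc>"
)

ns = detect_namespaces(io.StringIO(xml_data))
print(ns["dc"])  # http://purl.org/dc/elements/1.1/
```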
- Check your `root_xpath` expression
- Verify namespace prefixes match your XML
- Try without a namespace prefix: `.//dc` instead of `.//oai_dc:dc`
- Ensure XPath syntax is correct
- Check that namespace prefixes are defined
- Use `.//` for descendant search, `/` for direct children
- Enable `skip_invalid_xml: true` in config to skip bad files
- Check XML file encoding (should be UTF-8)
- Validate XML structure with an XML validator