Odette extracts as much information as possible from a word processor odt file into semantic XML (TEI).
Doc (in French): http://resultats.hypotheses.org/267
Demo: https://obvil.huma-num.fr/odette/
Odette may be used from the command line:
~/myrepos $ sudo apt install git php php-cli php-xml
~/myrepos $ git clone https://github.com/oeuvres/odette.git
~/myrepos $ cd odette
~/myrepos/odette $ php odette.php
php odette.php (options)? "teidir/*.xml"
Export odt files with styles as XML (ex: TEI)
Parameters:
globs : 1-n files or globs
Options:
-h, --help : show this help message
-f, --force : force deletion of destination file
-d destdir : destination directory for generated files
-t template : a specific template for export among:
delacroix, desc_chine, dramabib, galien, hauy, hurlus, merveilles17, rougemont
--tei : default, export odt as XML/TEI
--html : export odt as html
--odtx : export native odt xml (for debug)

Odette transposes some word processor direct formatting at paragraph level (left, right, center) and character level (italic, small caps…), but most of the information is conveyed by user styles.
Word processor styles may be paragraph level (¶) or character level (@). You must make sure each style is defined at the right level in your word processor if you want Odette to work well. Microsoft Office may create linked styles, for example a single style named Quote that can be applied either to a full paragraph or to a few quoted words inline. This may confuse an automated tool. It is a good idea to design your style template in LibreOffice; you can save the template in docx format and edit texts with MS Word (but you need to save the files as odt at the end to transform them with Odette).
For example, if you use the paragraph style <ab>, the paragraph will be transformed into this XML:
<ab type="ornament">My para</ab>

Below is a list of the known normalized style names and their XML/TEI transposition. Unknown styles are kept in a @rend attribute. Styles are shown here normalized as ASCII lowercase letters, but real-life style names may contain capitals, accents, spaces, or punctuation. For example, quotesalute could appear as <Quote, Salute> to the user in their word processor (a style for the salutation of a quoted letter).
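The normalization described above can be illustrated with a minimal Python sketch. This is not Odette's own code (Odette is written in PHP); the function name `normalize_style_name` is hypothetical, and the exact rules (strip accents, lowercase, keep ASCII letters only) are an assumption based on the description of normalized names as ASCII lowercase letters.

```python
import unicodedata


def normalize_style_name(name: str) -> str:
    """Reduce a word processor style name to ASCII lowercase letters.

    Assumed rules, inferred from the README: accents, capitals, spaces
    and punctuation are all dropped.
    """
    # Decompose accented characters so each accent becomes a separate combining mark
    decomposed = unicodedata.normalize("NFKD", name)
    # Lowercase, then keep only ASCII letters (combining marks, spaces, commas are dropped)
    return "".join(ch for ch in decomposed.lower() if ch.isascii() and ch.isalpha())


if __name__ == "__main__":
    for raw in ["Quote, Salute", "Label Speaker", "Épigraphe L"]:
        print(raw, "->", normalize_style_name(raw))
```

Under these assumptions, a style displayed as "Quote, Salute" in the word processor is looked up as quotesalute in the list below.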
ab
<ab type="ornament">content ¶</ab>

address
<address>
  <addrLine>content ¶</addrLine>
</address>

argument
<argument>
  <p>content ¶</p>
</argument>

bibl
<bibl>content ¶</bibl>

byline
<byline>content ¶</byline>

camera
<camera>content ¶</camera>

caption
<caption>content ¶</caption>

castitem
<castList>
  <castItem>content ¶</castItem>
</castList>

castlist
<castList>content ¶</castList>

closer
<closer>content ¶</closer>

dateline
<dateline>content ¶</dateline>

def
<entryFree>
  <def>content ¶</def>
</entryFree>

desc
<desc>content ¶</desc>

docauthor
<docAuthor>content</docAuthor>

docimprint
<docImprint>content ¶</docImprint>

docdate
<docDate>content ¶</docDate>

eg
<eg>content ¶</eg>

epigraph
<epigraph>
  <p rend="right italic…">content ¶</p>
</epigraph>

epigraphl
<epigraph>
  <l>content ¶</l>
</epigraph>

entry
<entry>content ¶</entry>

fw
<fw>content ¶</fw>

index
<index>
  <item>content ¶</item>
</index>

l
<l rend="center italic…">content ¶</l>

label
<label>content ¶</label>

labeldateline
<label type="dateline">content ¶</label>

labelhead
<label type="head">content ¶</label>

labelsalute
<label type="salute">content ¶</label>

labelspeaker
<label type="speaker">content ¶</label>

lg
<lg>
  <l>content ¶</l>
</lg>

opener
<opener>content ¶</opener>

p
<p rend="right italic…">content ¶</p>

pb
<pb n="…"/>

postscript
<postscript>
  <p>content ¶</p>
</postscript>

q
<q>content ¶</q>

quote
<quote>
  <p rend="right, italic…">content ¶</p>
</quote>

quotedateline
<quote>
  <dateline>content ¶</dateline>
</quote>

quotel
<quote>
  <l>content ¶</l>
</quote>

quotesalute
<quote>
  <salute>content ¶</salute>
</quote>

quotesigned
<quote>
  <signed>content ¶</signed>
</quote>

role
<castItem>
  <role>content ¶</role>
</castItem>

roledesc
<castItem>
  <roleDesc>content ¶</roleDesc>
</castItem>

said
<said>content ¶</said>

salute
<salute>content ¶</salute>

salutation
<salute>content ¶</salute>

set
<set>
  <p>content ¶</p>
</set>

signed
<signed>content ¶</signed>

speaker
<speaker>content ¶</speaker>

stage
<stage>content ¶</stage>

term
<index>
  <term>content ¶</term>
</index>

trailer
<trailer>content ¶</trailer>

abbr
blah… <abbr>@ level</abbr> …blah

add
blah… <add>@ level</add> …blah

actor
blah… <actor>@ level</actor> …blah

author
blah… <author>@ level</author> …blah

affiliation
blah… <affiliation>@ level</affiliation> …blah

age
blah… <age>@ level</age> …blah

bibl
blah… <bibl>@ level</bibl> …blah

c
blah… <c>@ level</c> …blah

code
blah… <code>@ level</code> …blah

corr
blah… <corr>@ level</corr> …blah

date
blah… <date>@ level</date> …blah

del
blah… <del>@ level</del> …blah

distinct
blah… <distinct>@ level</distinct> …blah

email
blah… <email>@ level</email> …blah

emph
blah… <emph>@ level</emph> …blah

geogname
blah… <geogName>@ level</geogName> …blah

gloss
blah… <gloss>@ level</gloss> …blah

name
blah… <name>@ level</name> …blah

num
blah… <num>@ level</num> …blah

pb
blah… <pb>@ level</pb> …blah

persname
blah… <persName>@ level</persName> …blah

placename
blah… <placeName>@ level</placeName> …blah

stage
blah… <stage>@ level</stage> …blah

title
blah… <title>@ level</title> …blah
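
To make the paragraph-level mapping concrete, here is a small Python sketch of a style lookup. It is an illustration only, not Odette's implementation: the table `PARA_STYLES` holds a hypothetical excerpt of the list above, `render_paragraph` is an invented name, and falling back to `<p rend="…">` for an unknown paragraph style is an assumption drawn from the statement that unknown styles are kept in a @rend attribute.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical excerpt of the style list:
# normalized style name -> (wrapper element or None, element, attributes)
PARA_STYLES = {
    "ab": (None, "ab", {"type": "ornament"}),
    "address": ("address", "addrLine", {}),
    "quotel": ("quote", "l", {}),
    "labelspeaker": (None, "label", {"type": "speaker"}),
}


def render_paragraph(style: str, text: str) -> str:
    """Return an XML string for one paragraph with the given normalized style."""
    # Assumption: an unknown style becomes a plain <p> carrying the style name in @rend
    wrapper, element, attrs = PARA_STYLES.get(style, (None, "p", {"rend": style}))
    attr_str = "".join(f" {key}={quoteattr(value)}" for key, value in attrs.items())
    xml = f"<{element}{attr_str}>{escape(text)}</{element}>"
    if wrapper:
        # Some styles are nested in a parent element, e.g. <quote><l>…</l></quote>
        xml = f"<{wrapper}>{xml}</{wrapper}>"
    return xml


if __name__ == "__main__":
    print(render_paragraph("ab", "My para"))
    print(render_paragraph("quotel", "A quoted verse line"))
    print(render_paragraph("mystyle", "Unknown style"))
```

A real converter would also merge consecutive paragraphs that share a wrapper into a single parent element; this sketch wraps each paragraph separately to stay short.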