XeLaTeX to Word/OpenOffice - the state of the art?

BPJ bpj at melroch.se
Fri Mar 15 14:10:53 CET 2019


On 2019-03-15 at 13:51, BPJ wrote:
> On 2019-03-15 at 12:31, Zdenek Wagner wrote:
>> I am also interested in how you do it. I have tried with one of my
>> documents (I do not need this conversion, it was just a test). The
>> document contains 5 tables and 50 math equations. The first equation
>> is OK, the remaining equations are total garbage; they will have to
>> be entered manually from scratch. The tables are total garbage as
>> well, they do not even look like tables. The table of contents is
>> garbage, but this is not a major issue. The problem is that in the
>> middle of the first page, probably as an effect of math, the text
>> becomes garbage as well. In this situation copy&paste and manual
>> conversion will be faster unless there is a special (hidden) trick
>> which I do not know.
> 
> As I said in the howto I just posted, your best bet if you have the
> original LaTeX file is to redefine commands etc. in *TeX so that
> the results become less garbagey and easier to correct by hand.
> I don't know about math because I don't do math, so for me
> not-so-simple tables are the biggest problem.  If you or anyone
> else comes up with a *TeX hack which makes column boundaries
> "visible", as in inserting pipe characters or some such, it will be
> much easier to tidy things up after conversion to a text format
> with Pandoc.
> 
> You may also want to try Pandoc's direct LaTeX-->Anything
> conversion. Although it is rather lossy for more advanced stuff,
> it does lists, tables, small caps and surely math quite OK.
> 
> I only use this PDF-->DOCX trick for PDFs I get from my clients 
> where the source is not included or may not exist.
> I have yet to encounter a client handing me a *TeX file... :-(

BTW you can "tame" recalcitrant LaTeX commands when converting
with Pandoc by including `\renewcommand`s which restate them in
terms of simpler LaTeX constructs that Pandoc can handle; Pandoc
will then use those definitions.  IIRC there is a feature request
out (or I'll make one!) for getting some/all LaTeX commands
"unknown" to Pandoc as Pandoc's native Div/Span syntax with
`custom-style` attributes, which you could then hook into when
converting to DOCX with Pandoc.  It will probably become reality
sooner rather than later.  The problem is how to handle
commands/environments with multiple arguments (which argument is
"the" text?).  You can already have Pandoc preserve "unknown"
LaTeX as raw LaTeX, which you can then massage into something
suitable with a Pandoc filter (written in Lua, Python, Perl, your
language of choice).

UTF-8 is no problem, as Pandoc uses it natively.  It also
understands standard babel/polyglossia commands, giving you a
native span or div with a `lang` attribute, which it then handles
correctly when converting to other formats.  There are some warts,
like `\textgreek` giving `lang="el"` rather than "grc", but that
can be fixed with a Pandoc filter.  DOCX's (lack of) math
capabilities may be another story though, but Pandoc surely does
its best.

/bpj

