XeLaTeX to Word/OpenOffice - the state of the art?
BPJ
bpj at melroch.se
Fri Mar 15 14:10:53 CET 2019
On 2019-03-15 at 13:51, BPJ wrote:
> On 2019-03-15 at 12:31, Zdenek Wagner wrote:
>> I am also interested in how you do it. I have tried with one of my
>> documents (I do not need this conversion, it was just a test). The
>> document contains 5 tables and 50 math equations. The first equation
>> is OK, the remaining equations are total garbage, they will have to
>> be entered manually from scratch. The tables are total garbage as
>> well, they do not even look like tables. The table of contents is
>> garbage but this is not a major issue. The problem is that in the
>> middle of the first page, probably as an effect of math, the text
>> becomes garbage as well. In this situation copy&paste and manual
>> conversion will be faster unless there is a special (hidden) trick
>> which I do not know.
>
> As I said in the howto I just posted, your best bet if you have the
> original LaTeX file is to redefine commands etc. in *TeX so that
> the results become less garbagey and easier to correct by hand.
> I don't know about math because I don't do math, so for me
> not-so-simple tables are the biggest problem. If you or anyone
> else comes up with a *TeX hack which makes column boundaries
> "visible", as in inserting pipe characters or some such, it will be
> much easier to tidy things up after conversion to a text format
> with Pandoc.
>
> You may also want to try Pandoc's direct LaTeX-->Anything
> conversion. Although it is rather lossy for more advanced stuff,
> it does lists, tables, small caps and surely math quite OK.
>
> I only use this PDF-->DOCX trick for PDFs I get from my clients
> where the source is not included or may not exist.
> I have yet to encounter a client handing me a *TeX file... :-(
BTW you can "tame" recalcitrant LaTeX commands when converting
with Pandoc by including `\renewcommand`s which restate them in
terms of simpler LaTeX constructs that Pandoc can handle; Pandoc
will then use those definitions. IIRC there is a feature request
out (or I'll make one!) for getting some/all LaTeX commands
"unknown" to Pandoc as Pandoc's native Div/Span syntax with
`custom-style` attributes, which you could then hook into when
converting to DOCX with Pandoc. It will probably become reality
sooner rather than later. The problem is how to handle
commands/environments with multiple arguments (which argument is
"the" text?).

You can already have Pandoc preserve "unknown" LaTeX as raw LaTeX,
which you can then massage into something suitable with a Pandoc
filter (written in Lua, Python, Perl, or your language of choice).

UTF-8 is no problem, as Pandoc uses it natively. It also
understands standard babel/polyglossia commands, giving you a
native span or div with a `lang` attribute, which it then handles
correctly when converting to other formats. There are some warts,
like `\textgreek` giving `lang="el"` rather than "grc", but that
can be fixed with a Pandoc filter. DOCX's (lack of) math
capabilities may be another story, but Pandoc surely does its best.
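
To illustrate the `lang="el"` fix, here is a minimal sketch of such a
filter in Python, using only the standard library and working directly
on Pandoc's JSON AST. It assumes that every span or div tagged "el" in
your document is really Ancient Greek; the function name `fix_lang`
and the script name `fix_greek.py` are made up for this example.

```python
#!/usr/bin/env python3
"""Pandoc JSON filter sketch: rewrite lang="el" to lang="grc" on Span/Div."""
import json
import sys

# Assumption: every "el" span/div in the document is really Ancient Greek.
OLD_LANG = "el"
NEW_LANG = "grc"


def fix_lang(node):
    """Recursively walk a Pandoc JSON AST fragment, editing it in place."""
    if isinstance(node, list):
        for child in node:
            fix_lang(child)
    elif isinstance(node, dict):
        if node.get("t") in ("Span", "Div"):
            # Content is [attr, children]; attr is [id, classes, key-value pairs].
            attr = node["c"][0]
            attr[2] = [
                [key, NEW_LANG] if key == "lang" and value == OLD_LANG else [key, value]
                for key, value in attr[2]
            ]
        for child in node.values():
            fix_lang(child)
    return node


def main():
    # Pandoc pipes the document to a JSON filter on stdin and reads it back on stdout.
    raw = sys.stdin.read()
    if raw.strip():  # nothing to do when no document was piped in
        json.dump(fix_lang(json.loads(raw)), sys.stdout)


if __name__ == "__main__":
    main()
```

If saved as `fix_greek.py`, it should be usable along the lines of
`pandoc -f latex --filter ./fix_greek.py input.tex -o output.docx`.
A Lua filter would do the same job without the JSON round trip.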
/bpj