How to convert HTML with MathJax into LaTeX using Pandoc?

I have some HTML documents with MathJax equations, and I want to convert them to LaTeX and then to PDF. I'd like to use Pandoc.

However, Pandoc replaces $ with \$ and it replaces \ in formulas with \textbackslash{}.

Is it possible to get Pandoc to pass MathJax formulas literally from HTML to LaTeX?

Finedraw answered 5/7, 2012 at 5:0 Comment(0)
20

As of pandoc 1.12.2 (the latest version at the time of writing), you can do this:

pandoc -f html+tex_math_dollars+tex_math_single_backslash -t latex

Much nicer! If you don't want to convert math delimited by \( and \), just do

pandoc -f html+tex_math_dollars -t latex
Darryldarryn answered 10/12, 2013 at 17:58 Comment(0)
10

It's not an easy task. Here's a solution that should work, provided you only use $ and $$ as math delimiters, and assuming your document doesn't contain any other uses of $. (If you can't assume that, you can try adjusting the perl regex in what follows.)

Step 1: Install the Haskell Platform, if you don't have it already, and 'cabal install pandoc' to get the pandoc library. (If you installed pandoc with the binary installer, you only have the executable, not the Haskell library.)

Step 2: Now write a small Haskell script -- we'll call it fixmath.hs:

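-- fixmath.hs: turn the <!--MATH...--> comments made by the perl step
-- into raw TeX that pandoc passes through untouched.
import Text.Pandoc

main :: IO ()
main = toJsonFilter fixmath

-- Apply the inline rewrite, then the block rewrite, throughout the tree.
fixmath :: Block -> Block
fixmath = bottomUp fixmathBlock . bottomUp fixmathInline

-- Inline case: the pattern match drops the "<!--MATH" prefix,
-- and take (length xs - 3) drops the trailing "-->".
fixmathInline :: Inline -> Inline
fixmathInline (RawInline "html" ('<':'!':'-':'-':'M':'A':'T':'H':xs)) =
  RawInline "tex" $ take (length xs - 3) xs
fixmathInline x = x

-- Block case: same rewrite for comments that stand alone as blocks.
fixmathBlock :: Block -> Block
fixmathBlock (RawBlock "html" ('<':'!':'-':'-':'M':'A':'T':'H':xs)) =
  RawBlock "tex" $ take (length xs - 3) xs
fixmathBlock x = x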

Compile this:

ghc --make fixmath.hs

This will give you an executable fixmath. Now, assuming your input file is input.html, the following command should convert it to LaTeX with the math intact, putting the result in output.tex:

cat input.html | \
perl -0pe 's/(\$\$?[^\$]+\$\$?)/\<!--MATH$1-->/gm' | \
pandoc -s --parse-raw -f html -t json | \
./fixmath | \
pandoc -f json -t latex -s > output.tex

The first part is a perl one-liner that wraps your math bits in special HTML comments marked "MATH". The second part parses the HTML into a JSON representation of the Pandoc data structure corresponding to the document. Then fixmath transforms this structure, changing the special HTML comments into raw LaTeX blocks and inlines. (See Scripting with pandoc for an explanation.) Finally, pandoc converts the JSON to LaTeX.

Darryldarryn answered 12/7, 2012 at 21:59 Comment(3)
Is there some way to make the executable fixmath work with pandoc-ruby? (Lenoir)
And how should the Haskell script be written to not convert math which is delimited by \(\)? #20493482 (Lenoir)
See my latest answer. (Darryldarryn)
