Parsing markdown with Java

Markdown is a simple and convenient format for writing documentation as plain text. It is commonly used by platforms such as GitHub.

In this post, we describe how to parse markdown content and use it to produce other formats. For this purpose, we will use the Pegdown library, available at https://github.com/sirthias/pegdown.

Installing Pegdown

We use version 1.2.1 of Pegdown, which is based on Parboiled (https://github.com/sirthias/parboiled). The jar files of these tools are available at https://github.com/sirthias/pegdown/downloads and https://github.com/sirthias/parboiled/downloads respectively. Note that the asm library is also required.

After installing these tools, we have the following jar files in our classpath:

  • asm-all-4.1.jar: the asm tool
  • parboiled-core-1.1.4.jar: the parboiled core jar
  • parboiled-java-1.1.4.jar: the parboiled jar specific to Java
  • pegdown-1.2.1.jar: the pegdown jar
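
Before going further, it can help to confirm that all four jars are really visible at runtime. Here is a minimal sketch that tries to load one class per jar. The ClasspathCheck helper is hypothetical (it is not part of Pegdown), and the class picked for each jar is an assumption about what that jar contains:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Pegdown: reports which required libraries can be loaded.
final class ClasspathCheck {

    // Returns true if the named class can be located on the current classpath.
    static boolean isAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // One representative class per jar (assumed to live in that jar).
        Map<String, String> required = new LinkedHashMap<>();
        required.put("asm-all", "org.objectweb.asm.ClassReader");
        required.put("parboiled-core", "org.parboiled.Rule");
        required.put("parboiled-java", "org.parboiled.Parboiled");
        required.put("pegdown", "org.pegdown.PegDownProcessor");

        for (Map.Entry<String, String> entry : required.entrySet()) {
            String status = isAvailable(entry.getValue()) ? "found" : "MISSING";
            System.out.println(entry.getKey() + ": " + status);
        }
    }
}
```

Running the main method with the same classpath as your application prints one line per jar, which makes a missing dependency easy to spot.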

Let's now dive into how to handle markdown content.

Parsing markdown with Pegdown

Pegdown provides a processor that parses the markdown content given as input. The following code shows how to parse markdown:

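String fileName = (...)
// Create a processor with all Pegdown extensions enabled
PegDownProcessor processor = new PegDownProcessor(Extensions.ALL);
// Load the whole file and fail early if it could not be read
char[] markdown = FileUtils.readAllChars(fileName);
Preconditions.checkNotNull(markdown, "The specified file isn't found - " + fileName);

// Parse the markdown into a tree of nodes
RootNode rootNode = processor.parseMarkdown(markdown);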

The parseMarkdown method parses the content and returns a RootNode, the root of an object tree that represents the document. You can then walk this tree to generate other formats (XML, and so on).

Using markdown content

We want to parse the following markdown-formatted text. The rest of the post is based on it.

An introduction sentence. Another introduction sentence.

An introduction sentence.

# First header title

Some content. Some content.

* List item 1: some description
* List item 2: some description

Some content. Some content.

    SomeClass clazz = new SomeClass();
    clazz.test();

Some content.

Here is the node structure that the PegDown processor produces for this markdown content:

ParaNode
    SuperNode
        TextNode
        SpecialTextNode
        TextNode
        SpecialTextNode
ParaNode
    SuperNode
        TextNode
        SpecialTextNode
HeaderNode
    TextNode
ParaNode
    SuperNode
        TextNode
        SpecialTextNode
        TextNode
        SpecialTextNode
BulletListNode
    ListItemNode
        RootNode
            SuperNode
                TextNode
                SpecialTextNode
                TextNode
    ListItemNode
        RootNode
            SuperNode
                TextNode
                SpecialTextNode
                TextNode
ParaNode
    SuperNode
        TextNode
        SpecialTextNode
        TextNode
        SpecialTextNode
VerbatimNode
ParaNode
    SuperNode
        TextNode
        SpecialTextNode
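
A dump like the one above can be produced with a short recursive walk over the tree. The sketch below uses a stand-in DemoNode class (hypothetical, not a Pegdown type) so that it runs without Pegdown on the classpath. Only the getChildren accessor mirrors the real Node type:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for org.pegdown.ast.Node so the sketch runs without Pegdown.
// Only the getChildren() accessor used in this post is mirrored here.
class DemoNode {
    private final String typeName;
    private final List<DemoNode> children = new ArrayList<>();

    DemoNode(String typeName, DemoNode... kids) {
        this.typeName = typeName;
        for (DemoNode kid : kids) {
            children.add(kid);
        }
    }

    String getTypeName() { return typeName; }

    List<DemoNode> getChildren() { return children; }
}

final class TreeDump {

    // Appends one line per node, indented four spaces per nesting level.
    static void dump(DemoNode node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("    ");
        }
        out.append(node.getTypeName()).append('\n');
        for (DemoNode child : node.getChildren()) {
            dump(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        DemoNode header = new DemoNode("HeaderNode", new DemoNode("TextNode"));
        StringBuilder out = new StringBuilder();
        dump(header, 0, out);
        System.out.print(out);
    }
}
```

With the real library, the same walk would presumably take a Node, use node.getClass().getSimpleName() for the type name, and be called once for each child of the root node.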

Starting from the root node returned by the parser, we can iterate over the top-level nodes using the getChildren method of the Node type.

Node rootNode = (...)
List<Node> nodes = rootNode.getChildren();
StringBuilder content = new StringBuilder();
for (Node node : nodes) {
    if (node instanceof HeaderNode) {
        HeaderNode headerNode = (HeaderNode) node;
        String text = getTextContent(node);
        (...)
    } else if (node instanceof ParaNode) {
        ParaNode paraNode = (ParaNode) node;
        String text = getTextContent(node);
        (...)
    } else if (node instanceof VerbatimNode) {
        VerbatimNode verbatimNode = (VerbatimNode) node;
        String text = getTextContent(node);
        (...)
    } else if (node instanceof BulletListNode) {
        BulletListNode bulletListNode = (BulletListNode) node;
        content.append("<ul>");
        List<Node> listItemNodes = bulletListNode.getChildren();
        for (Node childNode : listItemNodes) {
            if (childNode instanceof ListItemNode) {
                ListItemNode listItemNode = (ListItemNode) childNode;
                String text = getTextContent(childNode);
                (...)
            }
        }
        content.append("</ul>");
    }
}

The getTextContent methods extract the text from the different kinds of blocks, such as headers, paragraphs and code listings:

private String getTextContent(Node node) {
    if (node instanceof TextNode) {
        return getTextContent((TextNode)node);
    } else if (node instanceof HeaderNode) {
        HeaderNode headerNode = (HeaderNode) node;
        return getTextContent((TextNode) headerNode.getChildren().get(0));
    } else if (node instanceof ParaNode) {
        ParaNode paraNode = (ParaNode) node;
        Node firstChildNode = paraNode.getChildren().get(0);
        if (firstChildNode instanceof SuperNode) {
            return getTextContent((SuperNode) firstChildNode);
        } else if (firstChildNode instanceof TextNode) {
            return getTextContent((TextNode) firstChildNode);
        }
    } else if (node instanceof ListItemNode) {
        ListItemNode listItemNode = (ListItemNode) node;
        RootNode rootNode = (RootNode) listItemNode.getChildren().get(0);
        Node firstChildNode = rootNode.getChildren().get(0);
        if (firstChildNode instanceof SuperNode) {
            return getTextContent((SuperNode) firstChildNode);
        } else if (firstChildNode instanceof TextNode) {
            return getTextContent((TextNode) firstChildNode);
        }
    }
    return null;
}

private String getTextContent(SuperNode node) {
    List<Node> nodes = node.getChildren();
    StringBuilder content = new StringBuilder();
    for (Node child : nodes) {
        if (child instanceof TextNode) {
            content.append(getTextContent((TextNode)child));
        } else if (child instanceof SpecialTextNode) {
            content.append(getTextContent((SpecialTextNode)child));
        }
    }
    return content.toString();
}

private String getTextContent(TextNode node) {
    return node.getText();
}
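
The methods above read only the first child of a paragraph or list item, which matches the tree shown for our sample. A more general variant could collect the text of every text leaf, whatever the nesting. Here is a minimal sketch of that idea. SketchNode, SketchText and SketchContainer are hypothetical stand-ins for Pegdown's node types, used so that the example runs on its own:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Stand-ins for Pegdown's node types (hypothetical, for illustration only).
interface SketchNode {
    List<SketchNode> getChildren();
}

// Plays the role of a text leaf such as TextNode.
final class SketchText implements SketchNode {
    private final String text;

    SketchText(String text) { this.text = text; }

    String getText() { return text; }

    public List<SketchNode> getChildren() { return Collections.emptyList(); }
}

// Plays the role of a container such as ParaNode or SuperNode.
final class SketchContainer implements SketchNode {
    private final List<SketchNode> children;

    SketchContainer(SketchNode... children) { this.children = Arrays.asList(children); }

    public List<SketchNode> getChildren() { return children; }
}

final class TextCollector {

    // Depth-first walk: concatenates the text of every text leaf under the node.
    static String collect(SketchNode node) {
        StringBuilder out = new StringBuilder();
        collectInto(node, out);
        return out.toString();
    }

    private static void collectInto(SketchNode node, StringBuilder out) {
        if (node instanceof SketchText) {
            out.append(((SketchText) node).getText());
            return;
        }
        for (SketchNode child : node.getChildren()) {
            collectInto(child, out);
        }
    }
}
```

The trade-off is that this walk no longer cares which block it is in, so any block-specific handling has to stay in the calling loop.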

Generating new content

Now we can implement the complete transformation of our markdown content into a pseudo-HTML format. We wrap headers in an h2 tag, program listings in a code tag, and bullet lists in ul and li tags. We leave paragraphs as they are, without any wrapping. The following code shows this approach:

Node rootNode = (...)
List<Node> nodes = rootNode.getChildren();
StringBuilder content = new StringBuilder();
for (Node node : nodes) {
    if (node instanceof HeaderNode) {
        HeaderNode headerNode = (HeaderNode) node;
        content.append("<h2>");
        String text = getTextContent(node);
        if (text!=null) {
            content.append(text);
        }
        content.append("</h2>");
        content.append("\n\n");
    } else if (node instanceof ParaNode) {
        ParaNode paraNode = (ParaNode) node;
        String text = getTextContent(node);
        if (text!=null) {
            content.append(text);
        }
        content.append("\n\n");
    } else if (node instanceof VerbatimNode) {
        VerbatimNode verbatimNode = (VerbatimNode) node;
        content.append("<code>");
        String text = getTextContent(node);
        if (text!=null) {
            content.append(text);
        }
        content.append("</code>");
        content.append("\n\n");
    } else if (node instanceof BulletListNode) {
        BulletListNode bulletListNode = (BulletListNode) node;
        content.append("<ul>\n");
        List<Node> listItemNodes = bulletListNode.getChildren();
        for (Node childNode : listItemNodes) {
            if (childNode instanceof ListItemNode) {
                ListItemNode listItemNode = (ListItemNode) childNode;
                content.append("<li>");
                String text = getTextContent(childNode);
                if (text!=null) {
                    content.append(text);
                }
                // One list item per line
                content.append("</li>\n");
            }
        }
        content.append("</ul>");
        // Blank line after the list, as for the other blocks
        content.append("\n\n");
    }
}
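
The loop above appends the extracted text as-is. That works for our sample, but a code listing such as List<Node> contains characters that HTML treats as markup. If the target is real HTML rather than a pseudo format, the text would need escaping first. Here is a minimal sketch, where the HtmlText helper is hypothetical:

```java
// Hypothetical helper: escapes text before it is appended between HTML tags.
final class HtmlText {

    // Replaces the characters that have a special meaning in HTML text content.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

In the loop, this would mean calling content.append(HtmlText.escape(text)) instead of content.append(text).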

Here is the final output:

An introduction sentence. Another introduction sentence.

An introduction sentence.

<h2>First header title</h2>

Some content. Some content.

<ul>
<li>List item 1: some description</li>
<li>List item 2: some description</li>
</ul>

Some content. Some content.

<code>
SomeClass clazz = new SomeClass();
clazz.test();
</code>

Some content.
