Introduction #

This tutorial is a deep dive in a real Markdown parser written using nom in Rust. This MD Parser is part of the r3bl_tui crate, which is part of the r3bl-open-core repo. It goes over the architecture of thinking about building complex parsers and the nitty gritty details the runtime nature and behavior when combining nom parsers.

The r3bl_tui crate is a Text User Interface (TUI) crate that is used in the R3BL suite of products. It is a very powerful and flexible TUI crate that is used to build a variety of different applications. It comes with a full featured Markdown editor component, and the parser that’s the focus on this tutorial is used by that editor component to parse an input string slice into a Markdown document model (AST representation in memory).

nom crate review #

nom is a parser combinator library for Rust. You can write small functions that parse a specific part of your input, and then combine them to build a parser that parses the whole input. nom is very efficient and fast, it does not allocate memory when parsing if it doesn’t have to, and it makes it very easy for you to do the same. nom uses streaming mode or complete mode, and in this tutorial & code examples provided we will be using complete mode.

Roughly the way it works is that you tell nom how to parse a bunch of bytes in a way that matches some pattern that is valid for your data. It will try to parse as much as it can from the input, and the rest of the input will be returned to you.

You express the pattern that you’re looking for by combining parsers. nom has a whole bunch of these that come out of the box. And a huge part of learning nom is figuring out what these built in parsers are and how to combine them to build a parser that does what you want.

Errors are a key part of it being able to apply a variety of different parsers to the same input. If a parser fails, nom will return an error, and the rest of the input will be returned to you. This allows you to combine parsers in a way that you can try to parse a bunch of different things, and if one of them fails, you can try the next one. This is very useful when you are trying to parse a bunch of different things, and you don’t know which one you are going to get.

We have a video and article on developerlife where you can learn more about nom and how to use it.

A real production grade Markdown parser example #

The production md_parser module in the r3bl-open-core repo contains a fully functional Markdown parser (that you can use in your projects that need a Markdown parser). This parser supports standard Markdown syntax as well as some extensions that are added to make it work w/ R3BL products. It makes a great starting point to study how a relatively complex parser is written. There are lots of tests that you can follow along to understand what the code is doing.

πŸ’‘ You can get the source code for the production Markdown parser used in r3bl_tui from the r3bl-open-core repo.

🌟 Please star this repo on github if you like it πŸ™.

The main entry point (function) for this Markdown parsing module is parse_markdown().

  • It takes a string slice.
  • And returns a vector of MdBlocks.

Here are some entry points into the codebase.

  1. The main function parse_markdown() that does the parsing of a string slice into a MdDocument. The tests are provided alongside the code itself. And you can follow along to see how other smaller parsers are used to build up this big one that parses the whole of the Markdown document.
  2. The types module contain all the types that are used to represent the Markdown document model, such as MdDocument, MdBlock, MdLineFragment and all the other intermediate types & enums required for parsing.
  3. All the parsers related to parsing metadata specific for R3BL applications which are not standard Markdown can be found in parse_metadata_kv and parse_metadata_kcsv.
  4. All the parsers that are related to parsing the main β€œblocks” of Markdown, such as order lists, unordered lists, code blocks, text blocks, heading blocks, can be found block.
  5. All the parsers that are related to parsing a single line of Markdown text, such as links, bold, italic, etc. can be found fragment.

If you like to consume content via video, then you can watch this video that covers the same content as this article, but in a live coding format.

πŸ’‘ You can get the source code for the production Markdown parser used in r3bl_tui from the r3bl-open-core repo.

Architecture and parsing order #

This diagram showcases the order in which the parsers are called and how they are composed together to parse a Markdown document.

priority β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  high   β”‚ parse_markdown() {           map to the correct                    β”‚
    β”‚    β”‚   many0(                     ───────────────────►  MdBlock variant β”‚
    β”‚    β”‚     parse_title_value()                              Title         β”‚
    β”‚    β”‚     parse_tags_list()                                Tags          β”‚
    β”‚    β”‚     parse_authors_list()                             Authors       β”‚
    β”‚    β”‚     parse_date_value()                               Date          β”‚
    β”‚    β”‚     parse_block_heading_opt_eol()                    Heading       β”‚
    β”‚    β”‚     parse_block_smart_list()                         SmartList     β”‚
    β”‚    β”‚     parse_block_code()                               CodeBlock     β”‚
    β”‚    β”‚     parse_block_m..n_text_with_or_without_new_line() Text          β”‚
    β”‚    β”‚   )                                                                β”‚
    β–Ό    β”‚ }                                                                  β”‚
priority β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  low

The parsing strategy in most cases is to parse the most specific thing first and then parse the more general thing later. We often use the existence of \n (or eol) to decide how far forwards we need to go into the input. And sometimes \n doesn’t exist and we simply use the entire input (or end of input or eoi). You might see functions that have these suffixes in their names. Another term you might see is with_or_without_new_line which makes the parsing strategy explicit in the name.

The nature of nom parsers is to simply error out when they don’t match. And leave the input untouched, so that another parser have a go at it again. The nature of these parsing functions is kind of recursive in nature. So it’s important identify edge and exit cases up front before diving into the parsing logic. You will see this used in parsers which look for something more specific, if its not found, they error out, and allow less specific parsers to have a go at it, and so on.

The priority of parsers #

As we drill down into the implementation further, we see that the parsers are prioritized in the order of their specificity. The most specific parsers are called first and the least specific parsers are called last. This is done to ensure that the most specific parsers get a chance to parse the input first. And if they fail, then the less specific parsers get a chance to parse the input.

parse_block_markdown_text_with_or_without_new_line() {
  many0(
    parse_inline_fragments_until_eol_or_eoi()
       )   β”‚
}          β”‚                                           ──map to the correct──►
           └─► alt(                                     MdLineFragment variant

             β–² p..e_f..t_s..s_with_underscore_err_on_new_line()  Italic
             β”‚ p..e_f..t_s..s_with_star_err_on_new_line()        Bold
specialized  β”‚ p..e_f..t_s..s_with_backtick_err_on_new_line()    InlineCode
parsers ────►│ p..e_f..t_s..s_with_left_image_err_on_new_line()  Image
             β”‚ p..e_f..t_s..s_with_left_link_err_on_new_line()   Link
             β”‚ p..e_f..t_s..s_with_checkbox_into_str()           Plain
             β–Ό p..e_f..t_s..s_with_checkbox_checkbox_into_bool() Checkbox
catch all────► p..e_f..t_plain_text_no_new_line()                Plain
parser
               )

The last one on the list in the diagram above is parse_block_markdown_text_with_or_without_new_line(). Let’s zoom into this function and see how it is composed.

The β€œcatch all” parser, which is the most complicated, and the lowest priority #

The most complicated parser is the β€œcatch all” parser or the β€œplain text” parser. This parser is the last one in the chain and it simply consumes the rest of the input and turns it into a MdBlock::Text. This parser is the most complicated because it has to deal with all the edge cases and exit cases that other parsers have not dealt with. Such as special characters like `, *, _, etc. They are all listed here:

  • If the input does not start with a special char in this get_sp_char_set_2(), then this is the β€œNormal case”. In this case the input is split at the first occurrence of a special char in get_sp_char_set_3(). The β€œbefore” part is MdLineFragment::Plain and the β€œafter” part is parsed again by a more specific parser.
  • If the input starts with a special char in this get_sp_char_set_2() and it is not in the get_sp_char_set_1() with only 1 occurrence, then the behavior is different β€œEdge case -> Normal case”. Otherwise the behavior is β€œEdge case -> Special case”.
    • β€œEdge case -> Normal case” takes all the characters until \n or end of input and turns it into a MdLineFragment::Plain.
    • β€œEdge case -> Special case” splits the input before and after the special char. The β€œbefore” part is turned into a MdLineFragment::Plain and the β€œafter” part is parsed again by a more specific parser.

The reason this parser gets called repeatedly is because it is the last one in the chain. Its the lowest priority parser called by parse_inline_fragments_until_eol_or_eoi(), which itself is called:

  1. Repeatedly in a loop by parse_block_markdown_text_with_or_without_new_line().
  2. And by parse_block_markdown_text_with_checkbox_policy_with_or_without_new_line().

Visualize the parsers running on real input #

Let’s run some tests from the md_parser module with the DEBUG_MD_PARSER_STDOUT flag set to true.

This will allow us to see the output of the parsers as they run on real input. This is a great way to understand how the parsers are working and what they are doing. This helps build an intuition around what happens at runtime which might not match what you think is happening when you read the code.

  1. The test we will run are in this file: tui/src/tui/md_parser/block/parse_block_markdown_text_until_eol_or_eoi.rs.
  2. The test suite itself is called tests_parse_block_markdown_text_with_or_without_new_line.
  3. And the function under test is parse_block_markdown_text_with_or_without_new_line().

For convenience, here’s a copy of the test that we will run (in this file):

#[test]
fn test_parse_hyperlink_markdown_text_1() {
    let input = "This is a _hyperlink: [foo](http://google.com).";
    let it = parse_block_markdown_text_with_or_without_new_line(input);
    assert_eq2!(
        it,
        Ok((
            "",
            list![
                MdLineFragment::Plain("This is a ",),
                MdLineFragment::Plain("_",),
                MdLineFragment::Plain("hyperlink: ",),
                MdLineFragment::Link(HyperlinkData {
                    text: "foo",
                    url: "http://google.com",
                },),
                MdLineFragment::Plain(".",),
            ],
        ))
    );
}

You can see from the assert_eq2!() statements that the input "This is a _hyperlink: [foo](http://google.com)." is turned into a abstract syntax tree (AST) which looks like this:

[
    MdLineFragment::Plain("This is a ",),
    MdLineFragment::Plain("_",),
    MdLineFragment::Plain("hyperlink: ",),
    MdLineFragment::Link(HyperlinkData {
        text: "foo",
        url: "http://google.com",
    },),
    MdLineFragment::Plain(".",),
]

Note the β€œstrange” way in which "_" is handled. Instead of what we might expect Plain("This is a _ hyperlink: "). But we get 3 fragments instead of one. This is because of the lowest priority parser handles special characters so that more specific parsers (higher priority) can have a go at it. So it doesn’t prematurely mark them as Plain.

Here are the commands to run one of the tests (make sure to run this in the tui subfolder):

cargo test -- --nocapture test_parse_hyperlink_markdown_text_1

Here’s the output, which you can walk through to see the parsing algorithms in action:

β– β–  specialized parser _:
input: "This is a _hyperlink: [foo](http://google.com).", delim: "_"
count: 1, starts_w: false, input=delim: false
β¬’β¬’ parser error out for input: "This is a _hyperlink: [foo](http://google.com)."

β– β–  specialized parser *:
input: "This is a _hyperlink: [foo](http://google.com).", delim: "*"
count: 0, starts_w: false, input=delim: false
β¬’β¬’ parser error out for input: "This is a _hyperlink: [foo](http://google.com)."

β– β–  specialized parser `:
input: "This is a _hyperlink: [foo](http://google.com).", delim: "`"
count: 0, starts_w: false, input=delim: false
β¬’β¬’ parser error out for input: "This is a _hyperlink: [foo](http://google.com)."

β– β–  specialized parser take text between delims err on new line:
input: "This is a _hyperlink: [foo](http://google.com).", start_delim: "![", end_delim: "]"
β¬’β¬’ parser error out for input: "This is a _hyperlink: [foo](http://google.com)."

β¬’β¬’ specialized parser error out with image:
input: "This is a _hyperlink: [foo](http://google.com).", delim: "!["

β– β–  specialized parser take text between delims err on new line:
input: "This is a _hyperlink: [foo](http://google.com).", start_delim: "[", end_delim: "]"
β¬’β¬’ parser error out for input: "This is a _hyperlink: [foo](http://google.com)."

β¬’β¬’ specialized parser error out with link:
input: "This is a _hyperlink: [foo](http://google.com).", delim: "["
β¬’β¬’ specialized parser for checkbox: Err(Error(Error { input: "This is a _hyperlink: [foo](http://google.com).", code: Tag }))

β–ˆβ–ˆ plain parser, input: "This is a _hyperlink: [foo](http://google.com)."
β–²β–² normal case :: Ok(("_hyperlink: [foo](http://google.com).", "This is a "))

β– β–  specialized parser _:
input: "_hyperlink: [foo](http://google.com).", delim: "_"
count: 1, starts_w: true, input=delim: false
β¬’β¬’ parser error out for input: "_hyperlink: [foo](http://google.com)."

β– β–  specialized parser *:
input: "_hyperlink: [foo](http://google.com).", delim: "*"
count: 0, starts_w: false, input=delim: false
β¬’β¬’ parser error out for input: "_hyperlink: [foo](http://google.com)."

β– β–  specialized parser `:
input: "_hyperlink: [foo](http://google.com).", delim: "`"
count: 0, starts_w: false, input=delim: false
β¬’β¬’ parser error out for input: "_hyperlink: [foo](http://google.com)."

β– β–  specialized parser take text between delims err on new line:
input: "_hyperlink: [foo](http://google.com).", start_delim: "![", end_delim: "]"
β¬’β¬’ parser error out for input: "_hyperlink: [foo](http://google.com)."

β¬’β¬’ specialized parser error out with image:
input: "_hyperlink: [foo](http://google.com).", delim: "!["

β– β–  specialized parser take text between delims err on new line:
input: "_hyperlink: [foo](http://google.com).", start_delim: "[", end_delim: "]"
β¬’β¬’ parser error out for input: "_hyperlink: [foo](http://google.com)."

β¬’β¬’ specialized parser error out with link:
input: "_hyperlink: [foo](http://google.com).", delim: "["
β¬’β¬’ specialized parser for checkbox: Err(Error(Error { input: "_hyperlink: [foo](http://google.com).", code: Tag }))

β–ˆβ–ˆ plain parser, input: "_hyperlink: [foo](http://google.com)."
β–²β–² edge case -> special case :: rem: "hyperlink: [foo](http://google.com).", output: "_"

β– β–  specialized parser _:
input: "hyperlink: [foo](http://google.com).", delim: "_"
count: 0, starts_w: false, input=delim: false
β¬’β¬’ parser error out for input: "hyperlink: [foo](http://google.com)."

β– β–  specialized parser *:
input: "hyperlink: [foo](http://google.com).", delim: "*"
count: 0, starts_w: false, input=delim: false
β¬’β¬’ parser error out for input: "hyperlink: [foo](http://google.com)."

β– β–  specialized parser `:
input: "hyperlink: [foo](http://google.com).", delim: "`"
count: 0, starts_w: false, input=delim: false
β¬’β¬’ parser error out for input: "hyperlink: [foo](http://google.com)."

β– β–  specialized parser take text between delims err on new line:
input: "hyperlink: [foo](http://google.com).", start_delim: "![", end_delim: "]"
β¬’β¬’ parser error out for input: "hyperlink: [foo](http://google.com)."

β¬’β¬’ specialized parser error out with image:
input: "hyperlink: [foo](http://google.com).", delim: "!["

β– β–  specialized parser take text between delims err on new line:
input: "hyperlink: [foo](http://google.com).", start_delim: "[", end_delim: "]"
β¬’β¬’ parser error out for input: "hyperlink: [foo](http://google.com)."

β¬’β¬’ specialized parser error out with link:
input: "hyperlink: [foo](http://google.com).", delim: "["
β¬’β¬’ specialized parser for checkbox: Err(Error(Error { input: "hyperlink: [foo](http://google.com).", code: Tag }))

β–ˆβ–ˆ plain parser, input: "hyperlink: [foo](http://google.com)."
β–²β–² normal case :: Ok(("[foo](http://google.com).", "hyperlink: "))

β– β–  specialized parser _:
input: "[foo](http://google.com).", delim: "_"
count: 0, starts_w: false, input=delim: false
β¬’β¬’ parser error out for input: "[foo](http://google.com)."

β– β–  specialized parser *:
input: "[foo](http://google.com).", delim: "*"
count: 0, starts_w: false, input=delim: false
β¬’β¬’ parser error out for input: "[foo](http://google.com)."

β– β–  specialized parser `:
input: "[foo](http://google.com).", delim: "`"
count: 0, starts_w: false, input=delim: false
β¬’β¬’ parser error out for input: "[foo](http://google.com)."

β– β–  specialized parser take text between delims err on new line:
input: "[foo](http://google.com).", start_delim: "![", end_delim: "]"
β¬’β¬’ parser error out for input: "[foo](http://google.com)."

β¬’β¬’ specialized parser error out with image:
input: "[foo](http://google.com).", delim: "!["

β– β–  specialized parser take text between delims err on new line:
input: "[foo](http://google.com).", start_delim: "[", end_delim: "]"

β– β–  specialized parser take text between delims err on new line:
input: "(http://google.com).", start_delim: "(", end_delim: ")"
β–²β–² specialized parser for link: Ok((".", HyperlinkData { text: "foo", url: "http://google.com" }))

β– β–  specialized parser _:
input: ".", delim: "_"
count: 0, starts_w: false, input=delim: false
β¬’β¬’ parser error out for input: "."

β– β–  specialized parser *:
input: ".", delim: "*"
count: 0, starts_w: false, input=delim: false
β¬’β¬’ parser error out for input: "."

β– β–  specialized parser `:
input: ".", delim: "`"
count: 0, starts_w: false, input=delim: false
β¬’β¬’ parser error out for input: "."

β– β–  specialized parser take text between delims err on new line:
input: ".", start_delim: "![", end_delim: "]"
β¬’β¬’ parser error out for input: "."

β¬’β¬’ specialized parser error out with image:
input: ".", delim: "!["

β– β–  specialized parser take text between delims err on new line:
input: ".", start_delim: "[", end_delim: "]"
β¬’β¬’ parser error out for input: "."

β¬’β¬’ specialized parser error out with link:
input: ".", delim: "["
β¬’β¬’ specialized parser for checkbox: Err(Error(Error { input: ".", code: Tag }))

β–ˆβ–ˆ plain parser, input: "."
β–²β–² normal case :: Ok(("", "."))

β– β–  specialized parser _:
input: "", delim: "_"
count: 0, starts_w: false, input=delim: false
β¬’β¬’ parser error out for input: ""

β– β–  specialized parser *:
input: "", delim: "*"
count: 0, starts_w: false, input=delim: false
β¬’β¬’ parser error out for input: ""

β– β–  specialized parser `:
input: "", delim: "`"
count: 0, starts_w: false, input=delim: false
β¬’β¬’ parser error out for input: ""

β– β–  specialized parser take text between delims err on new line:
input: "", start_delim: "![", end_delim: "]"
β¬’β¬’ parser error out for input: ""

β¬’β¬’ specialized parser error out with image:
input: "", delim: "!["

β– β–  specialized parser take text between delims err on new line:
input: "", start_delim: "[", end_delim: "]"
β¬’β¬’ parser error out for input: ""

β¬’β¬’ specialized parser error out with link:
input: "", delim: "["
β¬’β¬’ specialized parser for checkbox: Err(Error(Error { input: "", code: Tag }))

β–ˆβ–ˆ plain parser, input: ""
β¬’β¬’ normal case :: Err(Error(Error { input: "", code: Eof }))

See this in action in r3bl-cmdr #

If you want to use a TUI app that uses this Markdown Parser, run the following commands:

cargo install r3bl-cmdr
edi --help

This will install the r3bl-cmdr binary and run edi, which is a TUI Markdown editor that you can use on any OS (Mac, Windows, Linux).

References #

nom is a huge topic. This tutorial takes a hands on approach to learning nom. However, the resources listed below are very useful for learning nom. Think of them as a reference guide and deep dive into how the nom library works.

πŸ‘€ Watch Rust πŸ¦€ live coding videos on our YouTube Channel.



πŸ“¦ Install our useful Rust command line apps using cargo install r3bl-cmdr (they are from the r3bl-open-core project):
  • 🐱giti: run interactive git commands with confidence in your terminal
  • 🦜edi: edit Markdown with style in your terminal

giti in action

edi in action

Related Posts