Build with Naz : Markdown parser in Rust and nom from r3bl_tui
- Introduction
- nom crate review
- A real production grade Markdown parser example
- Related video
- Architecture and parsing order
- References
Introduction #
This tutorial, and video are a deep dive in a real Markdown parser written using nom in
Rust. This MD Parser is part of the r3bl_tui
crate,
which is part of the r3bl-open-core
repo. It goes over the architecture of thinking
about building complex parsers and the nitty gritty details the runtime nature and
behavior when combining nom parsers.
The r3bl_tui
crate is a Text User Interface (TUI)
crate that is used in the R3BL suite of products. It is a very
powerful and flexible TUI crate that is used to build a variety of different applications.
It comes with a full featured Markdown editor component, and the parser thatβs the focus
on this tutorial is used by that editor component to parse an input string slice into a
Markdown document model (AST representation in memory).
nom crate review #
nom
is a parser combinator library for Rust. You can write small
functions that parse a specific part of your input, and then combine them to build a parser that
parses the whole input. nom
is very efficient and fast, it does not allocate memory when parsing if
it doesnβt have to, and it makes it very easy for you to do the same. nom
uses streaming mode or
complete mode, and in this tutorial & code examples provided we will be using complete mode.
Roughly the way it works is that you tell nom
how to parse a bunch of bytes in a way that matches
some pattern that is valid for your data. It will try to parse as much as it can from the input, and
the rest of the input will be returned to you.
You express the pattern that youβre looking for by combining parsers. nom
has a whole bunch of these
that come out of the box. And a huge part of learning nom
is figuring out what these built in
parsers are and how to combine them to build a parser that does what you want.
Errors are a key part of it being able to apply a variety of different parsers to the same input. If
a parser fails, nom
will return an error, and the rest of the input will be returned to you. This
allows you to combine parsers in a way that you can try to parse a bunch of different things, and if
one of them fails, you can try the next one. This is very useful when you are trying to parse a
bunch of different things, and you donβt know which one you are going to get.
We have a video and article on developerlife where you can learn more about
nom
and how to use it.
A real production grade Markdown parser example #
The production
md_parser
module in the r3bl-open-core
repo contains a fully functional Markdown parser (that you
can use in your projects that need a Markdown parser). This parser supports standard
Markdown syntax as well as some extensions that are
added to make it work w/ R3BL products. It makes a great starting
point to study how a relatively complex parser is written. There are lots of tests that
you can follow along to understand what the code is doing.
π‘ You can get the source code for the production Markdown parser used in
r3bl_tui
from ther3bl-open-core
repo.π Please star this repo on github if you like it π.
The main entry point (function) for this Markdown parsing module is
parse_markdown()
.
- It takes a string slice.
- And returns a vector of
MdBlock
s.
Here are some entry points into the codebase.
- The main function
parse_markdown()
that does the parsing of a string slice into aMdDocument
. The tests are provided alongside the code itself. And you can follow along to see how other smaller parsers are used to build up this big one that parses the whole of the Markdown document. - The
types
module contain all the types that are used to represent the Markdown document model, such asMdDocument
,MdBlock
,MdLineFragment
and all the other intermediate types & enums required for parsing. - All the parsers related to parsing metadata specific for R3BL
applications which are not standard Markdown can be found in
parse_metadata_kv
andparse_metadata_kcsv
. - All the parsers that are related to parsing the main βblocksβ of Markdown, such as
order lists, unordered lists, code blocks, text blocks, heading blocks, can be
found
block
. - All the parsers that are related to parsing a single line of Markdown text, such as
links, bold, italic, etc. can be found
fragment
.
Related video #
If you like to consume content via video, then you can watch this video that covers the same content as this article, but in a live coding format.
π‘ You can get the source code for the production Markdown parser used in
r3bl_tui
from ther3bl-open-core
repo.
Architecture and parsing order #
This diagram showcases the order in which the parsers are called and how they are composed together to parse a Markdown document.
priority ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
high β parse_markdown() { map to the correct β
β β many0( ββββββββββββββββββββΊ MdBlock variant β
β β parse_title_value() Title β
β β parse_tags_list() Tags β
β β parse_authors_list() Authors β
β β parse_date_value() Date β
β β parse_block_heading_opt_eol() Heading β
β β parse_block_smart_list() SmartList β
β β parse_block_code() CodeBlock β
β β parse_block_m..n_text_with_or_without_new_line() Text β
β β ) β
βΌ β } β
priority ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
low
The parsing strategy in most cases is to parse the most specific thing first and then
parse the more general thing later. We often use the existence of \n
(or eol
) to
decide how far forwards we need to go into the input. And sometimes \n
doesnβt exist
and we simply use the entire input (or end of input or eoi
). You might see functions
that have these suffixes in their names. Another term you might see is
with_or_without_new_line
which makes the parsing strategy explicit in the name.
The nature of nom
parsers is to simply error out when they donβt match. And leave
the input
untouched, so that another parser have a go at it again. The nature of
these parsing functions is kind of recursive in nature. So itβs important identify
edge and exit cases up front before diving into the parsing logic. You will see this
used in parsers which look for something more specific, if its not found, they error
out, and allow less specific parsers to have a go at it, and so on.
The priority of parsers #
As we drill down into the implementation further, we see that the parsers are prioritized in the order of their specificity. The most specific parsers are called first and the least specific parsers are called last. This is done to ensure that the most specific parsers get a chance to parse the input first. And if they fail, then the less specific parsers get a chance to parse the input.
parse_block_markdown_text_with_or_without_new_line() {
many0(
parse_inline_fragments_until_eol_or_eoi()
) β
} β ββmap to the correctβββΊ
βββΊ alt( MdLineFragment variant
β² p..e_f..t_s..s_with_underscore_err_on_new_line() Italic
β p..e_f..t_s..s_with_star_err_on_new_line() Bold
specialized β p..e_f..t_s..s_with_backtick_err_on_new_line() InlineCode
parsers βββββΊβ p..e_f..t_s..s_with_left_image_err_on_new_line() Image
β p..e_f..t_s..s_with_left_link_err_on_new_line() Link
β p..e_f..t_s..s_with_checkbox_into_str() Plain
βΌ p..e_f..t_s..s_with_checkbox_checkbox_into_bool() Checkbox
catch allβββββΊ p..e_f..t_plain_text_no_new_line() Plain
parser
)
The last one on the list in the diagram above is
parse_block_markdown_text_with_or_without_new_line()
.
Letβs zoom into this function and see how it is composed.
The βcatch allβ parser, which is the most complicated, and the lowest priority #
The most complicated parser is the βcatch allβ parser or the βplain textβ parser. This
parser is the last one in the chain and it simply consumes the rest of the input and
turns it into a MdBlock::Text
. This parser is the most complicated because it has to
deal with all the edge cases and exit cases that other parsers have not dealt with.
Such as special characters like `
, *
, _
, etc. They are all listed here:
- If the input does not start with a special char in this
get_sp_char_set_2()
, then this is the βNormal caseβ. In this case the input is split at the first occurrence of a special char inget_sp_char_set_3()
. The βbeforeβ part isMdLineFragment::Plain
and the βafterβ part is parsed again by a more specific parser. - If the input starts with a special char in this
get_sp_char_set_2()
and it is not in theget_sp_char_set_1()
with only 1 occurrence, then the behavior is different βEdge case -> Normal caseβ. Otherwise the behavior is βEdge case -> Special caseβ.- βEdge case -> Normal caseβ takes all the characters until
\n
or end of input and turns it into aMdLineFragment::Plain
. - βEdge case -> Special caseβ splits the
input
before and after the special char. The βbeforeβ part is turned into aMdLineFragment::Plain
and the βafterβ part is parsed again by a more specific parser.
- βEdge case -> Normal caseβ takes all the characters until
The reason this parser gets called repeatedly is because it is the last one in the chain.
Its the lowest priority parser called by
parse_inline_fragments_until_eol_or_eoi()
,
which itself is called:
- Repeatedly in a loop by
parse_block_markdown_text_with_or_without_new_line()
. - And by
parse_block_markdown_text_with_checkbox_policy_with_or_without_new_line()
.
Visualize the parsers running on real input #
Letβs run some tests from the md_parser
module with the
DEBUG_MD_PARSER_STDOUT
flag set to true
.
This will allow us to see the output of the parsers as they run on real input. This is a great way to understand how the parsers are working and what they are doing. This helps build an intuition around what happens at runtime which might not match what you think is happening when you read the code.
- The test we will run are in this file:
tui/src/tui/md_parser/block/parse_block_markdown_text_until_eol_or_eoi.rs
. - The test suite itself is called
tests_parse_block_markdown_text_with_or_without_new_line
. - And the function under test is
parse_block_markdown_text_with_or_without_new_line()
.
For convenience, hereβs a copy of the test that we will run (in this file):
#[test]
fn test_parse_hyperlink_markdown_text_1() {
let input = "This is a _hyperlink: [foo](http://google.com).";
let it = parse_block_markdown_text_with_or_without_new_line(input);
assert_eq2!(
it,
Ok((
"",
list![
MdLineFragment::Plain("This is a ",),
MdLineFragment::Plain("_",),
MdLineFragment::Plain("hyperlink: ",),
MdLineFragment::Link(HyperlinkData {
text: "foo",
url: "http://google.com",
},),
MdLineFragment::Plain(".",),
],
))
);
}
You can see from the assert_eq2!()
statements that the input "This is a _hyperlink:
[foo](http://google.com)."
is turned into a abstract syntax tree (AST) which looks like
this:
[
MdLineFragment::Plain("This is a ",),
MdLineFragment::Plain("_",),
MdLineFragment::Plain("hyperlink: ",),
MdLineFragment::Link(HyperlinkData {
text: "foo",
url: "http://google.com",
},),
MdLineFragment::Plain(".",),
]
Note the βstrangeβ way in which "_"
is handled. Instead of what we might expect Plain("This is a _ hyperlink: ")
.
But we get 3 fragments instead of one. This is because of the lowest priority parser handles special characters
so that more specific parsers (higher priority) can have a go at it. So it doesnβt prematurely mark them as Plain
.
Here are the commands to run one of the tests (make sure to run this in the tui
subfolder):
cargo test -- --nocapture test_parse_hyperlink_markdown_text_1
Hereβs the output, which you can walk through to see the parsing algorithms in action:
β β specialized parser _: input: "This is a _hyperlink: [foo](http://google.com).", delim: "_" count: 1, starts_w: false, input=delim: false β¬’β¬’ parser error out for input: "This is a _hyperlink: [foo](http://google.com)." β β specialized parser *: input: "This is a _hyperlink: [foo](http://google.com).", delim: "*" count: 0, starts_w: false, input=delim: false β¬’β¬’ parser error out for input: "This is a _hyperlink: [foo](http://google.com)." β β specialized parser `: input: "This is a _hyperlink: [foo](http://google.com).", delim: "`" count: 0, starts_w: false, input=delim: false β¬’β¬’ parser error out for input: "This is a _hyperlink: [foo](http://google.com)." β β specialized parser take text between delims err on new line: input: "This is a _hyperlink: [foo](http://google.com).", start_delim: "![", end_delim: "]" β¬’β¬’ parser error out for input: "This is a _hyperlink: [foo](http://google.com)." β¬’β¬’ specialized parser error out with image: input: "This is a _hyperlink: [foo](http://google.com).", delim: "![" β β specialized parser take text between delims err on new line: input: "This is a _hyperlink: [foo](http://google.com).", start_delim: "[", end_delim: "]" β¬’β¬’ parser error out for input: "This is a _hyperlink: [foo](http://google.com)." β¬’β¬’ specialized parser error out with link: input: "This is a _hyperlink: [foo](http://google.com).", delim: "[" β¬’β¬’ specialized parser for checkbox: Err(Error(Error { input: "This is a _hyperlink: [foo](http://google.com).", code: Tag })) ββ plain parser, input: "This is a _hyperlink: [foo](http://google.com)." β²β² normal case :: Ok(("_hyperlink: [foo](http://google.com).", "This is a ")) β β specialized parser _: input: "_hyperlink: [foo](http://google.com).", delim: "_" count: 1, starts_w: true, input=delim: false β¬’β¬’ parser error out for input: "_hyperlink: [foo](http://google.com)." β β specialized parser *: input: "_hyperlink: [foo](http://google.com).", delim: "*" count: 0, starts_w: false, input=delim: false β¬’β¬’ parser error out for input: "_hyperlink: [foo](http://google.com)." β β specialized parser `: input: "_hyperlink: [foo](http://google.com).", delim: "`" count: 0, starts_w: false, input=delim: false β¬’β¬’ parser error out for input: "_hyperlink: [foo](http://google.com)." β β specialized parser take text between delims err on new line: input: "_hyperlink: [foo](http://google.com).", start_delim: "![", end_delim: "]" β¬’β¬’ parser error out for input: "_hyperlink: [foo](http://google.com)." β¬’β¬’ specialized parser error out with image: input: "_hyperlink: [foo](http://google.com).", delim: "![" β β specialized parser take text between delims err on new line: input: "_hyperlink: [foo](http://google.com).", start_delim: "[", end_delim: "]" β¬’β¬’ parser error out for input: "_hyperlink: [foo](http://google.com)." β¬’β¬’ specialized parser error out with link: input: "_hyperlink: [foo](http://google.com).", delim: "[" β¬’β¬’ specialized parser for checkbox: Err(Error(Error { input: "_hyperlink: [foo](http://google.com).", code: Tag })) ββ plain parser, input: "_hyperlink: [foo](http://google.com)." β²β² edge case -> special case :: rem: "hyperlink: [foo](http://google.com).", output: "_" β β specialized parser _: input: "hyperlink: [foo](http://google.com).", delim: "_" count: 0, starts_w: false, input=delim: false β¬’β¬’ parser error out for input: "hyperlink: [foo](http://google.com)." β β specialized parser *: input: "hyperlink: [foo](http://google.com).", delim: "*" count: 0, starts_w: false, input=delim: false β¬’β¬’ parser error out for input: "hyperlink: [foo](http://google.com)." β β specialized parser `: input: "hyperlink: [foo](http://google.com).", delim: "`" count: 0, starts_w: false, input=delim: false β¬’β¬’ parser error out for input: "hyperlink: [foo](http://google.com)." β β specialized parser take text between delims err on new line: input: "hyperlink: [foo](http://google.com).", start_delim: "![", end_delim: "]" β¬’β¬’ parser error out for input: "hyperlink: [foo](http://google.com)." β¬’β¬’ specialized parser error out with image: input: "hyperlink: [foo](http://google.com).", delim: "![" β β specialized parser take text between delims err on new line: input: "hyperlink: [foo](http://google.com).", start_delim: "[", end_delim: "]" β¬’β¬’ parser error out for input: "hyperlink: [foo](http://google.com)." β¬’β¬’ specialized parser error out with link: input: "hyperlink: [foo](http://google.com).", delim: "[" β¬’β¬’ specialized parser for checkbox: Err(Error(Error { input: "hyperlink: [foo](http://google.com).", code: Tag })) ββ plain parser, input: "hyperlink: [foo](http://google.com)." β²β² normal case :: Ok(("[foo](http://google.com).", "hyperlink: ")) β β specialized parser _: input: "[foo](http://google.com).", delim: "_" count: 0, starts_w: false, input=delim: false β¬’β¬’ parser error out for input: "[foo](http://google.com)." β β specialized parser *: input: "[foo](http://google.com).", delim: "*" count: 0, starts_w: false, input=delim: false β¬’β¬’ parser error out for input: "[foo](http://google.com)." β β specialized parser `: input: "[foo](http://google.com).", delim: "`" count: 0, starts_w: false, input=delim: false β¬’β¬’ parser error out for input: "[foo](http://google.com)." β β specialized parser take text between delims err on new line: input: "[foo](http://google.com).", start_delim: "![", end_delim: "]" β¬’β¬’ parser error out for input: "[foo](http://google.com)." β¬’β¬’ specialized parser error out with image: input: "[foo](http://google.com).", delim: "![" β β specialized parser take text between delims err on new line: input: "[foo](http://google.com).", start_delim: "[", end_delim: "]" β β specialized parser take text between delims err on new line: input: "(http://google.com).", start_delim: "(", end_delim: ")" β²β² specialized parser for link: Ok((".", HyperlinkData { text: "foo", url: "http://google.com" })) β β specialized parser _: input: ".", delim: "_" count: 0, starts_w: false, input=delim: false β¬’β¬’ parser error out for input: "." β β specialized parser *: input: ".", delim: "*" count: 0, starts_w: false, input=delim: false β¬’β¬’ parser error out for input: "." β β specialized parser `: input: ".", delim: "`" count: 0, starts_w: false, input=delim: false β¬’β¬’ parser error out for input: "." β β specialized parser take text between delims err on new line: input: ".", start_delim: "![", end_delim: "]" β¬’β¬’ parser error out for input: "." β¬’β¬’ specialized parser error out with image: input: ".", delim: "![" β β specialized parser take text between delims err on new line: input: ".", start_delim: "[", end_delim: "]" β¬’β¬’ parser error out for input: "." β¬’β¬’ specialized parser error out with link: input: ".", delim: "[" β¬’β¬’ specialized parser for checkbox: Err(Error(Error { input: ".", code: Tag })) ββ plain parser, input: "." β²β² normal case :: Ok(("", ".")) β β specialized parser _: input: "", delim: "_" count: 0, starts_w: false, input=delim: false β¬’β¬’ parser error out for input: "" β β specialized parser *: input: "", delim: "*" count: 0, starts_w: false, input=delim: false β¬’β¬’ parser error out for input: "" β β specialized parser `: input: "", delim: "`" count: 0, starts_w: false, input=delim: false β¬’β¬’ parser error out for input: "" β β specialized parser take text between delims err on new line: input: "", start_delim: "![", end_delim: "]" β¬’β¬’ parser error out for input: "" β¬’β¬’ specialized parser error out with image: input: "", delim: "![" β β specialized parser take text between delims err on new line: input: "", start_delim: "[", end_delim: "]" β¬’β¬’ parser error out for input: "" β¬’β¬’ specialized parser error out with link: input: "", delim: "[" β¬’β¬’ specialized parser for checkbox: Err(Error(Error { input: "", code: Tag })) ββ plain parser, input: "" β¬’β¬’ normal case :: Err(Error(Error { input: "", code: Eof }))
See this in action in r3bl-cmdr #
If you want to use a TUI app that uses this Markdown Parser, run the following commands:
cargo install r3bl-cmdr
edi --help
This will install the r3bl-cmdr
binary and run edi
, which is a TUI Markdown editor
that you can use on any OS (Mac, Windows, Linux).
References #
nom
is a huge topic. This tutorial takes a hands on approach to learning nom
. However, the resources
listed below are very useful for learning nom
. Think of them as a reference guide and deep dive into
how the nom
library works.
- Useful:
- Source code examples (fantastic way to learn
nom
): - Videos:
- Intro from the author 7yrs old
nom
7 deep dive videos:nom
6 videos (deep dive into how nom combinators themselves are constructed):
- Tutorials:
- Reference docs:
- Source code examples (fantastic way to learn
- Less useful:
Build with Naz video series on developerlife.com YouTube channel #
If you have comments and feedback on this content, or would like to request new content (articles & videos) on developerlife.com, please join our discord server.
You can watch a video series on building this crate with Naz on the developerlife.com YouTube channel.
- YT channel
- Playlists
π Watch Rust π¦ live coding videos on our YouTube Channel.
π¦ Install our useful Rust command line apps usingcargo install r3bl-cmdr
(they are from the r3bl-open-core project):
- π±
giti
: run interactive git commands with confidence in your terminal- π¦
edi
: edit Markdown with style in your terminalgiti in action
edi in action