Guide to parsing with nom
Introduction #
This tutorial is a guide to parsing with nom. It covers the basics of parsing and how to use nom to parse a string into a data structure. The examples range from parsing a simple CSS-like syntax to a full-blown Markdown parser.
This tutorial has 2 examples in it: parsing hex color codes, and building a Markdown parser.
For more information on general Rust type system design (functional approach rather than object oriented), please take a look at this paper by Will Crichton demonstrating Typed Design Patterns with Rust.
Documentation #
nom is a huge topic. This tutorial takes a hands-on approach to learning nom. However, the resources listed below are very useful for learning nom. Think of them as a reference guide and deep dive into how the nom library works.
- Useful:
- Videos:
- Intro from the author (7 years old)
- Nom 7 deep dive videos:
- Nom 6 videos (deep dive into how nom combinators themselves are constructed):
- Tutorials:
- Reference docs:
- Less useful:
Getting to know nom using a simple example #
nom is a parser combinator library for Rust. You can write small functions that parse a specific part of your input, and then combine them to build a parser that parses the whole input. nom is efficient and fast: it does not allocate memory when parsing if it doesn’t have to, and it makes it very easy for you to do the same. nom parsers come in streaming and complete variants, and the code examples in this tutorial use the complete variants.
Roughly the way it works is that you tell nom how to parse a bunch of bytes in a way that matches some pattern that is valid for your data. It will try to parse as much as it can from the input, and the rest of the input will be returned to you.
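To make the idea of "parse as much as you can and hand back the rest" concrete, here is a minimal sketch that uses only the Rust standard library, not nom. The function name `leading_digits` and the `String` error type are assumptions made for illustration. nom's own parsers return an `IResult` instead, but the success value has the same `(remaining_input, output)` shape.

```rust
/// Hypothetical std-only parser (not part of nom).
/// Consumes leading ASCII digits and returns `(remaining_input, matched_digits)`,
/// or an error message if the input does not start with a digit.
fn leading_digits(input: &str) -> Result<(&str, &str), String> {
    // Byte index of the first non-digit char, or the end of the input.
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    if end == 0 {
        return Err(format!("expected a digit at {:?}", input));
    }
    // ASCII digits are one byte each, so `end` is always a char boundary.
    let (matched, rest) = input.split_at(end);
    Ok((rest, matched))
}
```

Calling `leading_digits("123abc")` consumes `123` and hands back `abc` as the remaining input, which is what lets the next parser pick up where this one stopped.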
You express the pattern that you’re looking for by combining parsers. nom has a whole bunch of these that come out of the box. A huge part of learning nom is figuring out what these built-in parsers are and how to combine them to build a parser that does what you want.
Errors are a key part of how nom applies a variety of different parsers to the same input. If a parser fails, it returns an error and the input you passed in is left untouched. This allows you to combine parsers so that if one of them fails, you can try the next one on the same input. This is very useful when you are trying to parse a bunch of different things, and you don’t know which one you are going to get.
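The "try one parser, then fall back to the next" behavior described above can be sketched with the standard library alone. In nom this role is played by the `nom::branch::alt` combinator. The `either`, `hash`, `at`, `Token`, and `MiniResult` names below are hypothetical stand-ins, and the `String` error type is a simplification.

```rust
/// Hypothetical stand-in for nom's `IResult`: remaining input first, output second.
type MiniResult<'a, O> = Result<(&'a str, O), String>;

#[derive(Debug, PartialEq)]
enum Token {
    Hash,
    At,
}

/// Matches a leading `#`.
fn hash(input: &str) -> MiniResult<'_, Token> {
    match input.strip_prefix('#') {
        Some(rest) => Ok((rest, Token::Hash)),
        None => Err(format!("expected '#' at {:?}", input)),
    }
}

/// Matches a leading `@`.
fn at(input: &str) -> MiniResult<'_, Token> {
    match input.strip_prefix('@') {
        Some(rest) => Ok((rest, Token::At)),
        None => Err(format!("expected '@' at {:?}", input)),
    }
}

/// Tries `first`; if it fails, tries `second` on the same, untouched input.
fn either<'a, O>(
    mut first: impl FnMut(&'a str) -> MiniResult<'a, O>,
    mut second: impl FnMut(&'a str) -> MiniResult<'a, O>,
) -> impl FnMut(&'a str) -> MiniResult<'a, O> {
    move |input: &'a str| first(input).or_else(|_| second(input))
}
```

Because a failed parser only returns an error and never modifies the `&str` it was given, `either(hash, at)` can safely hand the original input to the second parser.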
Parsing hex color codes #
Let’s dive into nom using a simple example of parsing hex color codes.
//! This module contains a parser that parses a hex color string into a [Color] struct.
//! The hex color string can be in the following format `#RRGGBB`.
//! For example, `#FF0000` is red.
use std::num::ParseIntError;
use nom::{bytes::complete::*, combinator::*, error::*, sequence::*, IResult, Parser};
#[derive(Debug, PartialEq)]
pub struct Color {
    pub red: u8,
    pub green: u8,
    pub blue: u8,
}
impl Color {
    pub fn new(red: u8, green: u8, blue: u8) -> Self {
        Self { red, green, blue }
    }
}
/// Helper functions to match and parse hex digits. These are not [Parser]
/// implementations.
mod helper_fns {
    use super::*;
    /// This function is used by [map_res] and it returns a [Result], not [IResult].
    pub fn parse_str_to_hex_num(input: &str) -> Result<u8, std::num::ParseIntError> {
        u8::from_str_radix(input, 16)
    }
    /// This function is used by [take_while_m_n] and as long as it returns `true`
    /// items will be taken from the input.
    pub fn match_is_hex_digit(c: char) -> bool {
        c.is_ascii_hexdigit()
    }
    pub fn parse_hex_seg(input: &str) -> IResult<&str, u8> {
        map_res(
            take_while_m_n(2, 2, match_is_hex_digit),
            parse_str_to_hex_num,
        )(input)
    }
}
/// These are [Parser] implementations that are used by [hex_color_no_alpha].
mod intermediate_parsers {
    use super::*;
    /// Call this to return function that implements the [Parser] trait.
    pub fn gen_hex_seg_parser_fn<'input, E>() -> impl Parser<&'input str, u8, E>
    where
        E: FromExternalError<&'input str, ParseIntError> + ParseError<&'input str>,
    {
        map_res(
            take_while_m_n(2, 2, helper_fns::match_is_hex_digit),
            helper_fns::parse_str_to_hex_num,
        )
    }
}
/// This is the "main" function that is called by the tests.
fn hex_color_no_alpha(input: &str) -> IResult<&str, Color> {
    // This tuple contains 3 ways to do the same thing.
    let it = (
        helper_fns::parse_hex_seg,
        intermediate_parsers::gen_hex_seg_parser_fn(),
        map_res(
            take_while_m_n(2, 2, helper_fns::match_is_hex_digit),
            helper_fns::parse_str_to_hex_num,
        ),
    );
    let (input, _) = tag("#")(input)?;
    let (input, (red, green, blue)) = tuple(it)(input)?; // same as `it.parse(input)?`
    Ok((input, Color { red, green, blue }))
}
#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn parse_valid_color() {
        let mut input = String::new();
        input.push_str("#2F14DF");
        input.push('🔅');
        let result = dbg!(hex_color_no_alpha(&input));
        let Ok((remainder, color)) = result else { panic!(); };
        assert_eq!(remainder, "🔅");
        assert_eq!(color, Color::new(47, 20, 223));
    }
    #[test]
    fn parse_invalid_color() {
        let result = dbg!(hex_color_no_alpha("🔅#2F14DF"));
        assert!(result.is_err());
    }
}
What does this code do, how does it work? #
Please note that:
- This string can be parsed: `#2F14DF🔅`.
- However, this string can’t: `🔅#2F14DF`.
So what is going on in the source code above?
- The `hex_color_no_alpha()` function is the main function that orchestrates all the other functions to parse an `input: &str` and turn it into an `IResult<&str, Color>`. On success this holds a `(&str, Color)` tuple of the remaining input and the parsed color.
  - The `tag` combinator function is used to match the `#` character. This means that if the input doesn’t start with `#`, the parser will fail (which is why `🔅#2F14DF` fails). It returns the remaining input after `#`. And the output is `#`, which we throw away.
  - A tuple is created that holds 3 parsers, which all do exactly the same thing, but are written in 3 different ways just to demonstrate how these can be written.
    - The `helper_fns::parse_hex_seg()` function is added to the tuple.
    - The higher-order function `intermediate_parsers::gen_hex_seg_parser_fn()` is called, and the parser it returns is added to the tuple.
    - Finally, the `map_res` combinator is directly added to the tuple.
  - The tuple is passed to nom's `tuple` combinator, and the resulting parser is called with the `input` (thus far). This is equivalent to calling `parse()`, an extension method that nom provides on tuples of parsers. It runs the 3 parsers in sequence to parse the hex number.
    - It returns the remaining input after the hex number, which is why `#2F14DF🔅` returns `🔅` as the first item in the tuple.
    - The second item in the tuple is a `(red, green, blue)` tuple of `u8` values, which is then used to build the `Color` struct.
- Let’s look at `helper_fns::parse_hex_seg()` (the other 2 ways shown above do exactly the same thing). The signature of this function tells nom that you can call the function with an `input` argument and it will return an `IResult<Input, Output, Error>`. This signature is the pattern that we will end up using to figure out how to chain combinators together. Here’s how the `map_res` combinator is used by `parse_hex_seg()` to actually do the parsing:
  - `take_while_m_n`: This combinator takes a minimum and a maximum count (`2, 2`, so exactly 2 characters) and applies the function `match_is_hex_digit` to determine whether each `char` is a hex digit (using `is_ascii_hexdigit()` on the `char`). It returns a `&str` slice of the matched characters, which is then passed to the next function.
  - `parse_str_to_hex_num`: This plain function (not a parser) is applied to the string slice returned from above. It simply takes the string slice and turns it into a `Result<u8, std::num::ParseIntError>`. The error is important, since if the string slice is not a valid hex number, `map_res` turns this error into a nom parsing error.
- The key concept in nom is the `Parser` trait, which is implemented for any `FnMut` that accepts an input and returns an `IResult<Input, Output, Error>`.
  - If you write a simple function with the signature `fn(input: Input) -> IResult<Input, Output, Error>` then you are good to go! You just need to call `parse()` on the function, passing it the input, and this will kick off the parsing.
  - Alternatively, you can just call the nom `tuple` function directly via `nom::sequence::tuple(...)(input)?`. Or you can just call the `parse()` method on the tuple, since this is an extension method on tuples provided by nom.
  - `IResult` is a very important type alias. It encapsulates 3 key types that are related to parsing:
    - The `Input` type is the type of the input that is being parsed. For example, if you are parsing a string, then the `Input` type is `&str`.
    - The `Output` type is the type of the output that is returned by the parser. For example, if you are parsing a string and you want to return a `Color` struct, then the `Output` type is `Color`.
    - The `Error` type is the type of the error that is returned by the parser. If you don't specify it, it defaults to `nom::error::Error<Input>`, and a failed parse returns it wrapped in the `nom::Err` enum (for example `nom::Err::Error(...)`). This is very useful when you are developing your parser combinators and you run into errors and have to debug them.
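The idea that a trait can be implemented for any `FnMut` with the right signature can be illustrated with a small, std-only sketch. `MiniResult`, `MiniParser`, and `hash_sign` are hypothetical names invented for this example, and the `String` error type is a simplification. This is not nom's actual source, only the same blanket-impl technique in miniature.

```rust
/// Hypothetical stand-in for nom's `IResult`: on success, the remaining input
/// comes first and the parsed output second.
type MiniResult<'a, O> = Result<(&'a str, O), String>;

/// Hypothetical stand-in for nom's `Parser` trait.
trait MiniParser<'a, O> {
    fn parse(&mut self, input: &'a str) -> MiniResult<'a, O>;
}

/// Blanket impl: any `FnMut(&str) -> MiniResult<O>` is a `MiniParser`.
impl<'a, O, F> MiniParser<'a, O> for F
where
    F: FnMut(&'a str) -> MiniResult<'a, O>,
{
    fn parse(&mut self, input: &'a str) -> MiniResult<'a, O> {
        // Calling the parser is just calling the function or closure.
        self(input)
    }
}

/// A plain function with the right signature, so it gets `parse()` for free.
fn hash_sign(input: &str) -> MiniResult<'_, char> {
    match input.strip_prefix('#') {
        Some(rest) => Ok((rest, '#')),
        None => Err(format!("expected '#' at {:?}", input)),
    }
}
```

With the trait in scope, `hash_sign` can be called either directly as `hash_sign("#2F14DF")` or through the trait as `parse("#2F14DF")` on a mutable binding of the function. Both forms run the same code.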
Generalized workflow #
After the really complicated walkthrough above, we could have just written the entire thing concisely like so:
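use nom::{
    bytes::complete::{tag, take_while_m_n},
    combinator::map_res,
    sequence::tuple,
    IResult,
};
// `Color` is the struct defined in the earlier listing.
/// Parses exactly two hex digits into a `u8`.
pub fn parse_hex_seg(input: &str) -> IResult<&str, u8> {
    map_res(
        take_while_m_n(2, 2, |it: char| it.is_ascii_hexdigit()),
        |it: &str| u8::from_str_radix(it, 16),
    )(input)
}
/// Parses `#RRGGBB` into a `Color`, returning the remaining input alongside it.
fn hex_color_no_alpha(input: &str) -> IResult<&str, Color> {
    let (input, _) = tag("#")(input)?;
    let (input, (red, green, blue)) =
        tuple((parse_hex_seg, parse_hex_seg, parse_hex_seg))(input)?;
    Ok((input, Color { red, green, blue }))
}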
This is a very simple example, but it shows how you can combine parsers together to create more complex parsers. You start with the simplest one first, and then build up from there.
- In this case the simplest one is `parse_hex_seg()`, which is used to parse a single hex segment. Inside this function we call the parser built by `map_res()` with the supplied `input` and simply return the result. Wrapping calls to other parsers in functions like this, and then reusing them in other parsers, is a very common thing to do.
- Finally, the `hex_color_no_alpha()` function is used to parse a hex color without an alpha channel.
  - The `tag()` combinator is used to match the `#` character.
  - The `tuple()` combinator is used to match the 3 hex segments.
  - The `?` operator is used to return the error if there is one.
  - The `Ok()` is used to return the parsed `Color` struct and the remaining input.
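The same build-up-from-small-pieces workflow can be sketched without nom, using only the standard library. `MiniResult`, `Rgb`, `hex_seg`, and `hex_color` are hypothetical names, and the `String` error type is a simplification. This is not how nom implements `tag`, `take_while_m_n`, `map_res`, or `tuple`. It only mirrors the control flow, where each step receives the remaining input from the previous step and `?` propagates the first error.

```rust
/// Hypothetical stand-in for nom's `IResult`: remaining input first, output second.
type MiniResult<'a, O> = Result<(&'a str, O), String>;

#[derive(Debug, PartialEq)]
struct Rgb {
    red: u8,
    green: u8,
    blue: u8,
}

/// Smallest building block: takes exactly two ASCII hex digits and converts
/// them to a `u8`.
fn hex_seg(input: &str) -> MiniResult<'_, u8> {
    let bytes = input.as_bytes();
    if bytes.len() < 2 || !bytes[..2].iter().all(|b| b.is_ascii_hexdigit()) {
        return Err(format!("expected two hex digits at {:?}", input));
    }
    // Both bytes are ASCII, so byte index 2 is a valid char boundary.
    let (digits, rest) = input.split_at(2);
    let value = u8::from_str_radix(digits, 16).map_err(|e| e.to_string())?;
    Ok((rest, value))
}

/// Larger parser built from the small one: `#`, then three hex segments.
fn hex_color(input: &str) -> MiniResult<'_, Rgb> {
    let rest = input
        .strip_prefix('#')
        .ok_or_else(|| format!("expected '#' at {:?}", input))?;
    let (rest, red) = hex_seg(rest)?;
    let (rest, green) = hex_seg(rest)?;
    let (rest, blue) = hex_seg(rest)?;
    Ok((rest, Rgb { red, green, blue }))
}
```

As in the nom version, the emoji after the color is not an error: it is returned untouched as the remaining input for whatever parser runs next.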
Build a Markdown parser #
💡 You can get the source code for the Markdown parser shown in this article from the `r3bl_rs_utils` repo. 🌟 Please star this repo on GitHub if you like it 🙏.
The `md_parser` module in the `r3bl_rs_utils` repo contains a fully functional Markdown parser (and isn’t written as a test but a real module that you can use in your projects that need a Markdown parser). This parser supports standard Markdown syntax as well as some extensions that are added to make it work with R3BL products. It makes a great starting point to study how a relatively complex parser is written. There are lots of tests that you can follow along to understand what the code is doing.
Here are some entry points into the codebase.
- The main function `parse_markdown()` that does the parsing of a string slice into a `Document`. The tests are provided alongside the code itself. And you can follow along to see how other smaller parsers are used to build up this big one that parses the whole of the Markdown document.
  - All the parsers related to parsing metadata specific to R3BL applications, which are not standard Markdown, can be found in `parser_impl_metadata`.
  - All the parsers that are related to parsing the main “blocks” of Markdown, such as ordered lists, unordered lists, code blocks, text blocks, heading blocks, can be found in `parser_impl_block`.
  - All the parsers that are related to parsing a single line of Markdown text, such as links, bold, italic, etc. can be found in `parser_impl_element`.
- The types that are used to represent the Markdown document model (`Document`) and all the other intermediate types (`Fragment`, `Block`, etc.) and enums required for parsing.