๐Ÿ‘€ Watch Rust ๐Ÿฆ€ live coding videos on our YouTube Channel.

Introduction #

This tutorial is a guide to parsing with nom. It covers the basics of parsing and how to use nom to parse a string into a data structure. We will cover a variety of different examples ranging from parsing simple CSS like syntax to a full blown Markdown parser.

This tutorial has 2 examples in it:

  1. CSS style syntax
  2. Markdown parser
๐Ÿ‘€ Watch Rust ๐Ÿฆ€ live coding videos on our YouTube Channel.



๐Ÿ“ฆ Install our useful Rust command line apps using cargo install r3bl-cmdr (they are from the r3bl-open-core project):
  • ๐Ÿฑgiti: run interactive git commands with confidence in your terminal
  • ๐Ÿฆœedi: edit Markdown with style in your terminal

giti in action

edi in action

For more information on general Rust type system design (functional approach rather than object oriented), please take a look at this paper by Will Crichton demonstrating Typed Design Patterns with Rust.

Documentation #

nom is a huge topic. This tutorial takes a hands on approach to learning nom. However, the resources listed below are very useful for learning nom. Think of them as a reference guide and deep dive into how the nom library works.

Getting to know nom using a simple example #

nom is a parser combinator library for Rust. You can write small functions that parse a specific part of your input, and then combine them to build a parser that parses the whole input. nom is very efficient and fast, it does not allocate memory when parsing if it doesnโ€™t have to, and it makes it very easy for you to do the same. nom uses streaming mode or complete mode, and in this tutorial & code examples provided we will be using complete mode.

Roughly the way it works is that you tell nom how to parse a bunch of bytes in a way that matches some pattern that is valid for your data. It will try to parse as much as it can from the input, and the rest of the input will be returned to you.

You express the pattern that youโ€™re looking for by combining parsers. nom has a whole bunch of these that come out of the box. And a huge part of learning nom is figuring out what these built in parsers are and how to combine them to build a parser that does what you want.

Errors are a key part of it being able to apply a variety of different parsers to the same input. If a parser fails, nom will return an error, and the rest of the input will be returned to you. This allows you to combine parsers in a way that you can try to parse a bunch of different things, and if one of them fails, you can try the next one. This is very useful when you are trying to parse a bunch of different things, and you donโ€™t know which one you are going to get.

Parsing hex color codes #

Letโ€™s dive into nom using a simple example of parsing hex color codes.

//! This module contains a parser that parses a hex color string into a [Color] struct.
//! The hex color string can be in the following format `#RRGGBB`.
//! For example, `#FF0000` is red.

use std::num::ParseIntError;
use nom::{bytes::complete::*, combinator::*, error::*, sequence::*, IResult, Parser};

#[derive(Debug, PartialEq)]
pub struct Color {
    pub red: u8,
    pub green: u8,
    pub blue: u8,
}

impl Color {
    pub fn new(red: u8, green: u8, blue: u8) -> Self {
        Self { red, green, blue }
    }
}

/// Helper functions to match and parse hex digits. These are not [Parser]
/// implementations.
mod helper_fns {
    use super::*;

    /// This function is used by [map_res] and it returns a [Result], not [IResult].
    pub fn parse_str_to_hex_num(input: &str) -> Result<u8, std::num::ParseIntError> {
        u8::from_str_radix(input, 16)
    }

    /// This function is used by [take_while_m_n] and as long as it returns `true`
    /// items will be taken from the input.
    pub fn match_is_hex_digit(c: char) -> bool {
        c.is_ascii_hexdigit()
    }

    pub fn parse_hex_seg(input: &str) -> IResult<&str, u8> {
        map_res(
            take_while_m_n(2, 2, match_is_hex_digit),
            parse_str_to_hex_num,
        )(input)
    }
}

/// These are [Parser] implementations that are used by [hex_color_no_alpha].
mod intermediate_parsers {
    use super::*;

    /// Call this to return function that implements the [Parser] trait.
    pub fn gen_hex_seg_parser_fn<'input, E>() -> impl Parser<&'input str, u8, E>
    where
        E: FromExternalError<&'input str, ParseIntError> + ParseError<&'input str>,
    {
        map_res(
            take_while_m_n(2, 2, helper_fns::match_is_hex_digit),
            helper_fns::parse_str_to_hex_num,
        )
    }
}

/// This is the "main" function that is called by the tests.
fn hex_color_no_alpha(input: &str) -> IResult<&str, Color> {
    // This tuple contains 3 ways to do the same thing.
    let it = (
        helper_fns::parse_hex_seg,
        intermediate_parsers::gen_hex_seg_parser_fn(),
        map_res(
            take_while_m_n(2, 2, helper_fns::match_is_hex_digit),
            helper_fns::parse_str_to_hex_num,
        ),
    );
    let (input, _) = tag("#")(input)?;
    let (input, (red, green, blue)) = tuple(it)(input)?; // same as `it.parse(input)?`
    Ok((input, Color { red, green, blue }))
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn parse_valid_color() {
        let mut input = String::new();
        input.push_str("#2F14DF");
        input.push('๐Ÿ”…');

        let result = dbg!(hex_color_no_alpha(&input));

        let Ok((remainder, color)) = result else { panic!(); };
        assert_eq!(remainder, "๐Ÿ”…");
        assert_eq!(color, Color::new(47, 20, 223));
    }

    #[test]
    fn parse_invalid_color() {
        let result = dbg!(hex_color_no_alpha("๐Ÿ”…#2F14DF"));
        assert!(result.is_err());
    }
}

What does this code do, how does it work? #

Please note that:

  • This string can be parsed: #2F14DF๐Ÿ”….
  • However, this string canโ€™t ๐Ÿ”…#2F14DF.

So what is going on in the source code above?

  1. The intermediate_parsers::hex_color_no_alpha() function is the main function that orchestrates all the other functions to parse an input: &str and turn it into a (&str, Color).

    • The tag combinator function is used to match the # character. This means that if the input doesnโ€™t start with #, the parser will fail (which is why ๐Ÿ”…#2F14DF fails). It returns the remaining input after #. And the output is # which we throw away.
    • A tuple is created that takes 3 parsers, which all do the same exact thing, but are written in 3 different ways just to demonstrate how these can be written.
      1. The helper_fns::parse_hex_seg() function is added to a tuple.
      2. The higher order function intermediate_parsers::gen_hex_seg_parser_fn() is added to the tuple.
      3. Finally, the map_res combinator is directly added to the tuple.
    • An extension function on this tuple called parse() is called w/ the input (thus far). This is used to parse the input hex number.
      • It returns the remaining input after the hex number which is why #2F14DF๐Ÿ”… returns ๐Ÿ”… as the first item in the tuple.
      • The second item in the tuple is the parsed color string turned into a Color struct.
  2. Letโ€™s look at the helper_fns::parse_hex_seg (the other 2 ways shown above do the same exact thing). The signature of this function tells nom that you can call the function w/ input argument and it will return IResult<Input, Output, Error>. This signature is the pattern that we will end up using to figure out how to chain combinators together. Hereโ€™s how the map_res combinator is used by parse_hex_seg() to actually do the parsing:

    1. take_while_m_n: This combinator takes a range of characters (2, 2) and applies the function match_is_hex_digit to determine whether the char is a hex digit (using is_ascii_hexdigit() on the char). This is used to match a valid hex digit. It returns a &str slice of the matched characters. Which is then passed to the next combinator.
    2. parse_str_to_hex_num: This parser is used on the string slice returned from above. It simply takes string slice and turns it into a Result<u8>, std::num::ParseIntError>. The error is important, since if the string slice is not a valid hex digit, then we want to return this error.
  3. The key concept in nom is the Parser trait which is implemented for any FnMut that accepts an input and returns an IResult<Input, Output, Error>.

    • If you write a simple function w/ the signature fn(input: Input) -> IResult<Input, Output, Error> then you are good to go! You just need to call parse() on the Input type and this will kick off the parsing.
    • Alternatively, you can just call the nom tuple function directly via nom::sequence::tuple(...)(input)?. Or you can just call the parse() method on the tuple since this is an extension function on tuples provided by nom.
    • IResult is a very important type alias. It encapsulates 3 key types that are related to parsing:
      1. The Input type is the type of the input that is being parsed. For example, if you are parsing a string, then the Input type is &str.
      2. The Output type is the type of the output that is returned by the parser. For example, if you are parsing a string and you want to return a Color struct, then the Output type is Color.
      3. The Error type is the type of the error that is returned by the parser. For example, if you are parsing a string and you want to return a nom::Err::Error error, then the Error type is nom::Err::Error. This is very useful when you are developing your parser combinators and you run into errors and have to debug them.

Generalized workflow #

After the really complicated walk through above, we could have just written the entire thing concisely like so:

pub fn parse_hex_seg(input: &str) -> IResult<&str, u8> {
  map_res(
    take_while_m_n(2, 2, |it: char| it.is_ascii_hexdigit()),
    |it: &str| u8::from_str_radix(it, 16),
  )(input)
}

fn hex_color_no_alpha(input: &str) -> IResult<&str, Color> {
  let (input, _) = tag("#")(input)?;
  let (input, (red, green, blue)) = tuple((
    helper_fns::parse_hex_seg,
    helper_fns::parse_hex_seg,
    helper_fns::parse_hex_seg,
  ))(input)?;
  Ok((input, Color { red, green, blue }))
}

This is a very simple example, but it shows how you can combine parsers together to create more complex parsers. You start w/ the simplest one first, and then build up from there.

  • In this case the simplest one is parse_hex_seg() which is used to parse a single hex segment. Inside this function we call map_res() w/ the supplied input and simply return the result. This is also a very common thing to do, is to wrap calls to other parsers in functions and then re-use them in other parsers.
  • Finally, the hex_color_no_alpha() function is used to parse a hex color w/o an alpha channel.
    • The tag() combinator is used to match the # character.
    • The tuple() combinator is used to match the 3 hex segments.
    • The ? operator is used to return the error if there is one.
    • The Ok() is used to return the parsed Color struct and the remaining input.

Build a Markdown parser #

๐Ÿ’ก You can get the source code for the Markdown parser shown in this article from the r3bl-open-core repo.

๐ŸŒŸ Please star this repo on github if you like it ๐Ÿ™.

The md_parser module in the r3bl-open-core repo contains a fully functional Markdown parser (and isnโ€™t written as a test but a real module that you can use in your projects that need a Markdown parser). This parser supports standard Markdown syntax as well as some extensions that are added to make it work w/ R3BL products. It makes a great starting point to study how a relatively complex parser is written. There are lots of tests that you can follow along to understand what the code is doing.

Here are some entry points into the codebase.

  1. The main function parse_markdown() that does the parsing of a string slice into a Document. The tests are provided alongside the code itself. And you can follow along to see how other smaller parsers are used to build up this big one that parses the whole of the Markdown document.

    1. All the parsers related to parsing metadata specific for R3BL applications which are not standard Markdown can be found in parser_impl_metadata.
    2. All the parsers that are related to parsing the main โ€œblocksโ€ of Markdown, such as order lists, unordered lists, code blocks, text blocks, heading blocks, can be found parser_impl_block
    3. All the parsers that are related to parsing a single line of Markdown text, such as links, bold, italic, etc. can be found parser_impl_element
  2. The types that are used to represent the Markdown document model (Document) and all the other intermediate types (Fragment, Block, etc) & enums required for parsing.

๐Ÿ“ฆ Install our useful Rust command line apps using cargo install r3bl-cmdr (they are from the r3bl-open-core project):
  • ๐Ÿฑgiti: run interactive git commands with confidence in your terminal
  • ๐Ÿฆœedi: edit Markdown with style in your terminal

giti in action

edi in action

Related Posts