abstract class TopDown::Parser

Overview

TopDown::Parser is the main class to derive for building a parser.

Basis

A minimal parser could be:

class Parser < TopDown::Parser
  root :expression

  syntax :expression do
  end
end

which will parse anything but EOF.

root indicate the starting point of parsing, the rest is to do inside syntax.

Syntax are like functions, in which, any code can be added. Inside it, #parse & parse! can be used to define what to parse.

syntax :expression do
  a = parse('a')
  parse(/( )+/)
  foo = parse!("foo")

  {a, foo}
end
"a   foo" # ~> {'a', "foo"}
"a   bar" # ~> Unexpected character 'b', expected 'foo'
"ab"      # ~> Unexpected character 'b', expected expression

NOTE In this documentation, "foo" # ~> result is used as a shortcut for Parser.new("foo").parse # => result

In the above, if parse!("foo") fail to parse, it raises an error. But parse('a') don't raises, instead it use a break (break Fail.new) to stop the current sequence to be parsed. This failure is caught by the root :expression, that why it next raises an exception.

This difference is important because is on what is based TopDown, using #parse let a change to the surrounding context to handle the failure, whereas parse! permit the raise directly, leading to better errors.

For example, using parse! above:

syntax :expression do
  a = parse!('a')
  parse!(/( )+/)
  foo = parse!("foo")

  {a, foo}
end
"ab" # ~> Unexpected character 'b', expected pattern /( )+/

Gives a more precise error.

Repeat, union, maybe

repeat, union, maybe allow interesting things to be parsed. They take both a block, in which parse failure can occur.

syntax :expression do
  repeat do
    parse('a')
    parse!('b')
  end
  maybe { parse('c') }
  parse!('d')

  union do
    parse('e')
    parse('é')
    parse('è')
  end
  "ok"
end

This is equivalent to parse /(ab)*c?d[eéè]/, in a bit more verbose. However, doing so allow to insert code between and retrieve only needed result, storing in array, or to insert some logic.

For example:

syntax :expression do
  array = [] of {Char, Char}
  repeat do
    array << {parse('a'), parse!('b')}
  end
  have_c = maybe { parse('c') }
  parse!('d')

  if have_c
    e = union do
      parse('e')
      parse('é')
      parse('è')
    end
  end
  {array, e}
end

Here, we store each ab in an array and accept the union 'e'|'é'|'è' only if the optional 'c' have been parsed.

"ababd" # ~> {[{'a', 'b'}, {'a', 'b'}], nil}
"cde"   # ~> {[], 'e'}
"dé"    # ~> Unexpected character 'é', expected 'EOF'
"abac"  # # Unexpected character 'c', expected 'b'

NOTE When using repeat, union, and maybe, we always use #parse at first (without exclamation), because they break on failure, which is caught by the repeat/maybe/union. They know so they should continue parsing without a failure.

Infix

infix can be used inside a union. It permits to parse operators with precedence easily.

syntax :expression do
  union do
    parse(/\d+/).to_i
    infix 30 "**" { left() ** parse!(:expression) }
    infix 20 '*' { left() * parse!(:expression) }
    infix 20 '/' { left() / parse!(:expression) }
    infix 10 '+' { left() + parse!(:expression) }
    infix 10 '-' { left() - parse!(:expression) }
  end
end

Infix are treated specially by the union. They are parsed after any other member of the union, the left() result is updated each time. They are parsed in function of their precedence (first argument), higher precedence mean grouped first.

"3*6+6*4" # ~> 42

Precedence:

current_precedence() is initially zero, it changes when entering in infix. This is this value that guide the recursion for parsing operators. Its value can be changed at specific places, so parsing order can be entirely controlled.

For example, assuming we want to parse a ternary if _ ? _ : _ that have a higher precedence that classical operators (e.g. "1+1?2:3+3" gets parsed as (1 + (1?2:3)) + 3.

We can do the following:

syntax(:expression) do
  union do
    parse(/\d+/).to_i
    infix 10, :plus
    infix 90, :ternary_if
  end
end

syntax :plus, '+' do
  left() + parse!(:expression)
end

syntax :ternary_if, '?' do
  cond = left()
  then_arm = parse!(:expression)
  parse!(':')
  else_arm = parse!(:expression)

  cond != 0 ? then_arm : else_arm
end
"1+1?2:3+3" # ~> 6

However the following is not what we except:

"1+1?2+2:3+3" # => Unexpected character '+', expected ':'

This happens because the + inside the then_arm have lower precedence, it wants so let its left with higher precedence finish before parsing itself. So the then_arm finish but we except a : right after, so it fails (got '+', expected ':').

To fix that, we can set the current_precedence() for the then arm to 0:

then_arm = parse!(:expression, with_precedence: 0)
"1+1?2+2:3+3" # ~> 8

Defined in:

ast.cr
hooks.cr
parser/parselets.cr
parser/parser.cr
parser/syntax_error.cr
parser/tokens.cr

Macro Summary

Instance Method Summary

Instance methods inherited from class TopDown::CharReader

each_char(& : Char -> ) each_char, location : Location location, location=(location : Location) location=, next_char : Char next_char, peek_char peek_char, previous_assci_char? : Char | Nil previous_assci_char?, source : String source, source=(source : String) source=

Constructor methods inherited from class TopDown::CharReader

new(source : String) new

Macro Detail

macro ast(ast_class, *args, at = nil, **options) #

TODO docs


[View source]
macro begin_location #

Returns the Location at the beginning of the syntax.

The given location is after skipping characters.


[View source]
macro capture(&) #

Captures all characters parsed inside the block.

Returns a String

NOTE skip characters are still captured, use partial_capture instead.

capture do
  repeat(/ +/) do
    parse(/\w+/)
  end
end
"a   bc d e"  # ~> "a   bc d e"
"Hello World" # ~> "Hello World"

[View source]
macro current_precedence #

Returns the current precedence.

Initially zero. Changes inside an infix or when with_precedence is used.


[View source]
macro end_word #

Allows to match a word only if a non-alphanumeric follows

parse("foo") { end_word }

Equivalent to: (but faster)
parse(/foo\b/) { nil }
"foo bar" # ~> "foo"
"foobar"  # ~> Unexpected character 'b', expected 'EOF'

NOTE end_word returns nil.


[View source]
macro forward_fail(result) #

Returns the given result, or break the current sequence if result is a Fail.

This macro is the recommended ways to handle Fail.

result = Array.new(3) do
  parse("a")
end

typeof(result) # => Array(String) | TopDown::Parser::Fail

result = forward_fail(result)
typeof(result) # => Array(String)

[View source]
macro infix(precedence, parselet, associativity = "right", &block) #

TODO docs


[View source]
macro left #

In the context of an infix member, returns the left result of the current union.

Returns nil in a context other than inside an infix.

Each time a union member is successfully parsed, left value is updated.

As a union continue parsing infix until possible, left allows to retrieve the subsequent results.

syntax(:exp) do
  union do
    parse('a')
    infix(1, :comma)
  end
end
syntax(:comma) do
  l = left() # left returns either
  r = parse(:exp)
  "#{l}, #{r}"
end

[View source]
macro maybe(parselet = nil, &) #

Parses the sequence inside the block, returns nil if it fails.

x = parse('1').to_i
y =
  maybe do
    parse('+')
    parse!('1').to_i
  end
parse!(';')
x + (y || 0)
"1;"   # ~> 1
"1+1;" # ~> 2
"1+*;" # ~> "Unexpected character '*', expected '1'"
"1*;"  # ~> "Unexpected character '*', expected ';'"

[View source]
macro maybe_union(&block) #

Equivalent to maybe { union { ... } }


[View source]
macro no_skip(&) #

Prevents the skip rules to be triggered inside the given block.


[View source]
macro parse(parselet, with_precedence = nil, &block) #

Parses the given parselet (ParseletLiteral), and yield/return the according result.

failure:

If the given parselet fails to parse, it break the current sequence with a Fail. Failure is catch by the surrounding context.

  • inside an union, tell the union that member have fail. The union tries to parse the next member.
  • inside a maybe, the maybe will return nil.
  • inside a repeat, make the repeat to stop.
  • inside a syntax, the syntax is considered to fail, it will in turn break or raises.

block:

A block could be given to let return the value of the block.

precedence:

with_precedence (NumberLiteral) changes the current_precedence, the given parselet will be parsed as if the contained syntax have this precedence. Allow to handle multi-precedence for ternary-or-more operator.


[View source]
macro parse!(parselet, error = nil, at = nil, with_precedence = nil, &block) #

Similar to Parser.parse, but raises SyntaxError if the parsing fail.

error: allows to customize error message, it can be:

  • a StringLiteral: Following percent interpolations are available:
    • %{got}: The character or token causing the failure.
    • %{expected}: The expected Char, String, Regex, Token or syntax name (Symbol).
  • a ProcLiteral taking two arguments: 'got' and 'expected', with types explained above.

at: indicates where the error is raised, it can be a Location, Range(Location, Location) or Range(Location, Nil).

failure:

Because it raises, this shouldn't be used as first parselet inside a union, maybe, repeat and syntax. This would raises before the surrounding context could try an other solution.

However, this should generally be used everywhere else, to allow more localized errors.


[View source]
macro partial_capture(&block) #

Captures chosen characters parsed inside the block.

Yields a String::Builder in which characters to capture can be added.

Returns a String.

partial_capture do |io|
  repeat(/ +/) do
    io << parse(/\w+/)
  end
end
"a   bc d e"  # ~> "abcde"
"Hello World" # ~> "HelloWorld"

[View source]
macro peek(parselet, with_precedence = nil, &block) #

TODO docs parse parselet if succeed, don't move the cursor and return parselet result else, don't move the cursor and break a failure


[View source]
macro prefix(precedence, parselet, &block) #

[View source]
macro repeat(parselet = nil, &) #

Repeatedly parses the sequence inside the block until it fails.

Returns nil. Results should be either collected by capture or stored inside a variable or array.

x = parse('1').to_i
repeat do
  parse('+')
  x += parse!('1').to_i
end
parse!(';')
x
"1;"         # ~> 1
"1+1+1+1+1;" # ~> 5
"1+1+1+*;"   # ~> "Unexpected character '*', expected '1'"
"1*;"        # ~> "Unexpected character '*', expected ';'"

[View source]
macro repeat(parselet = nil, *, separator, &) #

Repeatedly parses the sequence inside the block until it fails, with a separator parselet between each iteration.

Returns nil. Results should be either collected by capture or stored inside a variable or array.

x = 0
parse('(')
repeat(',') do
  x += parse('1').to_i
end
parse!(')')
x
"()"          # ~> 0
"(1,1,1,1,1)" # ~> 5
"(1,1,)"      # ~> "Unexpected character ',', expected ')'"
"(11)"        # ~> "Unexpected character '1', expected ')'"

[View source]
macro repeat_count(*, separator = nil) #

TODO docs


[View source]
macro repeat_n(count, parselet = nil, &block) #

[View source]
macro repeat_to_a(type, *, separator = nil, &) #

TODO docs


[View source]
macro repeat_to_h(type, *, separator = nil, &) #

TODO docs


[View source]
macro repeat_to_s(*, separator = nil, &block) #

TODO docs


[View source]
macro repeat_union(&block) #

Equivalent to repeat { union { ... } }


[View source]
macro root(parselet) #

Defines the main syntax to parse.

parselet (ParseletLiteral): the syntax name or parselet to be the root.


[View source]
macro sequence(&) #

Allows to group a sequence of parsing (inside a union for example).

union do
  sequence do
    parse("a")
    parse!("b")
  end
  parse("2")

[View source]
macro skip(&members) #

Defines what is ignored during the parsing.

It is triggered every time prior a parselet is parsed.

It works similarly to the members of a Parser.union, in which each members are consumed while possible.

class Parser < TopDown::Parser
  skip do
    # Skips spaces:
    parse ' '
    parse '\n'
    parse '\t'
    parse '\r'

    # Line comments:
    parse "//" { repeat { parse not('\n') } }

    # Block comments:
    parse "/*" do
      repeat { parse not("*/") }
      parse "*/"
    end
  end
end

[View source]
macro syntax(syntax_name, *prefixs, end! = nil, &block) #

Defines a syntax that can be called with parse(SymbolLiteral).

syntax_name (SymbolLiteral): The identifier of the syntax.

*prefix (ParseletLiteral): A list of parselet, to parse at the beginning of the syntax, results are yielded in the given block.

block: The body of the syntax, work like a usual body function.

class Parser < TopDown::Parser
  syntax :expression do
    a = parse('a')
    parse(/( )+/)
    foo = parse!("foo")

    {a, foo}
  end
end

[View source]
macro token(token_name, parselet = nil, &block) #

Defines a token to parse. Can be used only inside Parser.tokens.

Each token have a token_name (StringLiteral), and is parsed according to parselet. The block indicate the value of the resulting Token.

  • If no parselet provided, it deduced from token_name.
  • If no block provided, token value is nil. (use &.itself to actually keep the parsed string)

Example:

class Parser < TopDown::Parser
  tokens do
    token("+")                            # Parses '+', produces Token[+]
    token("hey")                          # Parses "hey", produces Token[hey]
    token("new_line", '\n')               # Parses '\n', produces Token[new_line]
    token("int", /\d+/) { |v| v.to_i }    # Parses /\d+/, produces Token[int:<int_value>]
    token("string", :tk_string, &.itself) # Parses syntax :tk_string, produces Token[string:<string_value>]
  end
end

[View source]
macro tokens(&block) #

This macro allows to define how tokens are parsed.

Each line of the block correspond a token.

class Parser < TopDown::Parser
  tokens do
    token("+")
    token("-")
    token("**")
    token("*")
    token("/")
    token("new_line", '\n')
  end
end

Use Parser.token to define how a token is parsed

The order of token matter, because there are parsed in this order. Hence, if * is moved before **, the result will be two token * parsed instead of one **.

In addition, characters are never skipped while parsing tokens (and all inner syntax calls), see no_skip.


[View source]
macro union(&members) #

Tries to parse each member of the union, returns the result of the first that succeed.

union do
  parse('1')
  parse('2')
  sequence do
    parse('a')
    parse('b')
    parse!('c')
  end
  parse('a')
end
"1"   # ~> '1'
"abc" # ~> 'c'
"ab*" # ~> "Unexpected character '*', expected 'c'"
"a"   # ~> 'a'
"*"   # ~> "Could not parse syntax ':main'

members:

Members are delimited by each expression in the Expressions of the given block.

NOTE a block begin/end doesn't group a member since it is inlined by the crystal parser. Use sequence instead.

failure:

If all members of the union fail, the union is considered to fail, and will break the current sequence

infix:

infix members could be added to a union.

infix members are always treated after each normal members, in the order there are defined. A union act as follow:

  1. Tries to parse each normal member.
  2. When one succeed, store the result. left allows to retrieve this result.
  3. Tries to parse infix members whose precedence is greater than current precedence (initial is zero).
  4. Inner of infix is executed, it potentially triggers parsing of this union recursively, but current precedence is sets to the infix precedence.
  5. When one fully succeed, store the result. left is updated.
  6. Repeats step 3-5) until there no more infix.
  7. Returns last result.

This is mainly the top-down operator precedence algorithm, also known as precedence climbing.


[View source]
macro union!(error = nil, at = nil, &members) #

Similar to Parser.union, but raises SyntaxError if the parsing fail.


[View source]

Instance Method Detail

def each_token(eof = nil, &) : Nil #

Yields successively parsed tokens.

Stops when EOF is hit, or raises if a token fail to parse.

The token name eof can be given to stop at that name. Can be useful if a EOF token have been defined.


[View source]
def hook_could_not_parse_any_token(got : Char, expected : Nil) #

Override this method to modify the default error message when any token can be parsed.


[View source]
def hook_expected_any_character_but(got : Char, expected : Char) #

Override this method to modify the default error message when parse!(not('a')) fail.


[View source]
def hook_expected_any_character_but_range(got : Char, expected : Range(Char, Char)) #

Override this method to modify the default error message when parse!(not('a'..'z')) fail.


[View source]
def hook_expected_any_in_range(got : Char, expected : Range(Char, Char)) #

Override this method to modify the default error message when parse!('a'..'z') fail.


[View source]
def hook_expected_any_token_but(got : Token | Nil, expected : String) #

Override this method to modify the default error message when parse!(not(["TOKEN"])) fail.


[View source]
def hook_expected_character(got : Char, expected : Char) #

Override this method to modify the default error message when parse!('a') fail.


[View source]
def hook_expected_pattern(got : Char, expected : Regex) #

Override this method to modify the default error message when parse!(/regex/) fail.


[View source]
def hook_expected_syntax(got : Char, expected : Symbol) #

Override this method to modify the default error message when parse!(:syntax) fail.


[View source]
def hook_expected_token(got : Token | Nil, expected : String) #

Override this method to modify the default error message when parse!(["TOKEN"]) fail.


[View source]
def hook_expected_word(got : Char, expected : String) #

Override this method to modify the default error message when parse!("string") fail.


[View source]
def hook_union_failed(got : Char, expected : Nil) #

[View source]
def location=(location : Location) #

Move the cursor to the new location.

The location should be well formed, otherwise error display won't be right. It is recommended to always use a location got by self.location.

This methods is used to backtrack the parser. However, prefer using Parser.union, Parser.maybe, and Parser.repeat over manual backtracks.


[View source]
def parse #

Parses the source contained in this parser.

Returns the result of the root syntax. Expects eof after parsing root syntax. Raises SyntaxError if fail to parse.


[View source]
def raise_syntax_error(message : String, at location : Location = self.location, source : String = self.source) #

Raises a SyntaxError at current location and current source.

location could be a range between two Location. A endless range mean a location up to self.location.


[View source]
def raise_syntax_error(message : String, at location : Range(Location, Location), source : String = self.source) #

Raises a SyntaxError at current location and current source.

location could be a range between two Location. A endless range mean a location up to self.location.


[View source]
def raise_syntax_error(message : String, at location : Range(Location, Nil), source : String = self.source) #

Raises a SyntaxError at current location and current source.

location could be a range between two Location. A endless range mean a location up to self.location.


[View source]
def source=(source : String) #

Modifies the source and reset the cursor location to zero.


[View source]
def tokens(eof = nil) #

Returns the array of parsed tokens.

Stops when EOF is hit, or raises if a token fail to parse.

The token name eof can be given to stop at that name. Can be useful if a EOF token have been defined.


[View source]