abstract class TopDown::Parser
- TopDown::Parser
- TopDown::CharReader
- Reference
- Object
Overview
TopDown::Parser
is the main class to derive for building a parser.
Basis
A minimal parser could be:
class Parser < TopDown::Parser
root :expression
syntax :expression do
end
end
which will parse anything but EOF.
root
indicate the starting point of parsing, the rest is to do inside syntax
.
Syntax are like functions, in which, any code can be added.
Inside it, #parse
& parse!
can be used to define what to parse.
syntax :expression do
a = parse('a')
parse(/( )+/)
foo = parse!("foo")
{a, foo}
end
"a foo" # ~> {'a', "foo"}
"a bar" # ~> Unexpected character 'b', expected 'foo'
"ab" # ~> Unexpected character 'b', expected expression
NOTE
In this documentation, "foo" # ~> result
is used as a shortcut for
Parser.new("foo").parse # => result
In the above, if parse!("foo")
fail to parse, it raises an error.
But parse('a')
don't raises, instead it use a break
(break Fail.new
) to stop the current sequence to be parsed.
This failure is caught by the root :expression
, that why it next raises an exception.
This difference is important because is on what is based TopDown
, using #parse
let a change to the surrounding context to handle the failure, whereas
parse!
permit the raise directly, leading to better errors.
For example, using parse!
above:
syntax :expression do
a = parse!('a')
parse!(/( )+/)
foo = parse!("foo")
{a, foo}
end
"ab" # ~> Unexpected character 'b', expected pattern /( )+/
Gives a more precise error.
Repeat, union, maybe
repeat
, union
, maybe
allow interesting things to be parsed. They take both a block, in which parse failure can occur.
syntax :expression do
repeat do
parse('a')
parse!('b')
end
maybe { parse('c') }
parse!('d')
union do
parse('e')
parse('é')
parse('è')
end
"ok"
end
This is equivalent to parse /(ab)*c?d[eéè]/
, in a bit more verbose. However, doing so allow to insert code between and retrieve
only needed result, storing in array, or to insert some logic.
For example:
syntax :expression do
array = [] of {Char, Char}
repeat do
array << {parse('a'), parse!('b')}
end
have_c = maybe { parse('c') }
parse!('d')
if have_c
e = union do
parse('e')
parse('é')
parse('è')
end
end
{array, e}
end
Here, we store each ab
in an array and accept the union 'e'|'é'|'è'
only if the optional 'c'
have been parsed.
"ababd" # ~> {[{'a', 'b'}, {'a', 'b'}], nil}
"cde" # ~> {[], 'e'}
"dé" # ~> Unexpected character 'é', expected 'EOF'
"abac" # # Unexpected character 'c', expected 'b'
NOTE
When using repeat
, union
, and maybe
, we always use #parse
at first (without exclamation), because they break
on failure,
which is caught by the repeat
/maybe
/union
. They know so they should continue parsing without a failure.
Infix
infix
can be used inside a union. It permits to parse operators with precedence easily.
syntax :expression do
union do
parse(/\d+/).to_i
infix 30 "**" { left() ** parse!(:expression) }
infix 20 '*' { left() * parse!(:expression) }
infix 20 '/' { left() / parse!(:expression) }
infix 10 '+' { left() + parse!(:expression) }
infix 10 '-' { left() - parse!(:expression) }
end
end
Infix are treated specially by the union
. They are parsed after any other member of the union, the left()
result is updated each time.
They are parsed in function of their precedence (first argument), higher precedence mean grouped first.
"3*6+6*4" # ~> 42
Precedence:
current_precedence()
is initially zero, it changes when entering in infix
. This is this value that guide the recursion for parsing operators.
Its value can be changed at specific places, so parsing order can be entirely controlled.
For example, assuming we want to parse a ternary if _ ? _ : _
that have a higher precedence that classical operators (e.g. "1+1?2:3+3"
gets parsed as (1
+ (1?2:3
)) + 3
.
We can do the following:
syntax(:expression) do
union do
parse(/\d+/).to_i
infix 10, :plus
infix 90, :ternary_if
end
end
syntax :plus, '+' do
left() + parse!(:expression)
end
syntax :ternary_if, '?' do
cond = left()
then_arm = parse!(:expression)
parse!(':')
else_arm = parse!(:expression)
cond != 0 ? then_arm : else_arm
end
"1+1?2:3+3" # ~> 6
However the following is not what we except:
"1+1?2+2:3+3" # => Unexpected character '+', expected ':'
This happens because the +
inside the then_arm
have lower precedence, it wants so
let its left with higher precedence finish before parsing itself. So the then_arm
finish
but we except a :
right after, so it fails (got '+', expected ':').
To fix that, we can set the current_precedence()
for the then arm
to 0:
then_arm = parse!(:expression, with_precedence: 0)
"1+1?2+2:3+3" # ~> 8
Defined in:
ast.crhooks.cr
parser/parselets.cr
parser/parser.cr
parser/syntax_error.cr
parser/tokens.cr
Macro Summary
-
ast(ast_class, *args, at = nil, **options)
TODO docs
- begin_location
-
capture(&)
Captures all characters parsed inside the block.
-
current_precedence
Returns the current precedence.
-
end_word
Allows to match a word only if a non-alphanumeric follows
-
forward_fail(result)
Returns the given result, or
break
the current sequence if result is aFail
. -
infix(precedence, parselet, associativity = "right", &block)
TODO docs
- left
-
maybe(parselet = nil, &)
Parses the sequence inside the block, returns
nil
if it fails. -
maybe_union(&block)
Equivalent to `maybe { union { ...
-
no_skip(&)
Prevents the
skip
rules to be triggered inside the given block. -
parse(parselet, with_precedence = nil, &block)
Parses the given parselet (
ParseletLiteral
), and yield/return the according result. -
parse!(parselet, error = nil, at = nil, with_precedence = nil, &block)
Similar to
Parser.parse
, but raisesSyntaxError
if the parsing fail. -
partial_capture(&block)
Captures chosen characters parsed inside the block.
-
peek(parselet, with_precedence = nil, &block)
TODO docs parse parselet if succeed, don't move the cursor and return parselet result else, don't move the cursor and break a failure
-
prefix(precedence, parselet, &block)
Equivalent to
parse(parselet, with_precedence: precedence)
-
repeat(parselet = nil, &)
Repeatedly parses the sequence inside the block until it fails.
-
repeat(parselet = nil, *, separator, &)
Repeatedly parses the sequence inside the block until it fails, with a separator parselet between each iteration.
-
repeat_count(*, separator = nil)
TODO docs
- repeat_n(count, parselet = nil, &block)
-
repeat_to_a(type, *, separator = nil, &)
TODO docs
-
repeat_to_h(type, *, separator = nil, &)
TODO docs
-
repeat_to_s(*, separator = nil, &block)
TODO docs
-
repeat_union(&block)
Equivalent to `repeat { union { ...
-
root(parselet)
Defines the main syntax to parse.
-
sequence(&)
Allows to group a sequence of parsing (inside a union for example).
-
skip(&members)
Defines what is ignored during the parsing.
-
syntax(syntax_name, *prefixs, end! = nil, &block)
Defines a syntax that can be called with
parse(SymbolLiteral)
. -
token(token_name, parselet = nil, &block)
Defines a token to parse.
-
tokens(&block)
This macro allows to define how tokens are parsed.
-
union(&members)
Tries to parse each member of the union, returns the result of the first that succeed.
-
union!(error = nil, at = nil, &members)
Similar to
Parser.union
, but raisesSyntaxError
if the parsing fail.
Instance Method Summary
-
#each_token(eof = nil, &) : Nil
Yields successively parsed tokens.
-
#hook_could_not_parse_any_token(got : Char, expected : Nil)
Override this method to modify the default error message when any token can be parsed.
-
#hook_expected_any_character_but(got : Char, expected : Char)
Override this method to modify the default error message when
parse!(not('a'))
fail. -
#hook_expected_any_character_but_range(got : Char, expected : Range(Char, Char))
Override this method to modify the default error message when
parse!(not('a'..'z'))
fail. -
#hook_expected_any_in_range(got : Char, expected : Range(Char, Char))
Override this method to modify the default error message when
parse!('a'..'z')
fail. -
#hook_expected_any_token_but(got : Token | Nil, expected : String)
Override this method to modify the default error message when
parse!(not(["TOKEN"]))
fail. -
#hook_expected_character(got : Char, expected : Char)
Override this method to modify the default error message when
parse!('a')
fail. -
#hook_expected_pattern(got : Char, expected : Regex)
Override this method to modify the default error message when
parse!(/regex/)
fail. -
#hook_expected_syntax(got : Char, expected : Symbol)
Override this method to modify the default error message when
parse!(:syntax)
fail. -
#hook_expected_token(got : Token | Nil, expected : String)
Override this method to modify the default error message when
parse!(["TOKEN"])
fail. -
#hook_expected_word(got : Char, expected : String)
Override this method to modify the default error message when
parse!("string")
fail. - #hook_union_failed(got : Char, expected : Nil)
-
#location=(location : Location)
Move the cursor to the new location.
-
#parse
Parses the source contained in this parser.
-
#raise_syntax_error(message : String, at location : Location = self.location, source : String = self.source)
Raises a
SyntaxError
at currentlocation
and currentsource
. -
#raise_syntax_error(message : String, at location : Range(Location, Location), source : String = self.source)
Raises a
SyntaxError
at currentlocation
and currentsource
. -
#raise_syntax_error(message : String, at location : Range(Location, Nil), source : String = self.source)
Raises a
SyntaxError
at currentlocation
and currentsource
. -
#source=(source : String)
Modifies the source and reset the cursor
location
to zero. -
#tokens(eof = nil)
Returns the array of parsed tokens.
Instance methods inherited from class TopDown::CharReader
each_char(& : Char -> )
each_char,
location : Location
location,
location=(location : Location)
location=,
next_char : Char
next_char,
peek_char
peek_char,
previous_assci_char? : Char | Nil
previous_assci_char?,
source : String
source,
source=(source : String)
source=
Constructor methods inherited from class TopDown::CharReader
new(source : String)
new
Macro Detail
Returns the Location
at the beginning of the syntax
.
The given location is after skipping characters.
Captures all characters parsed inside the block.
Returns a String
NOTE skip
characters are still captured, use partial_capture
instead.
capture do
repeat(/ +/) do
parse(/\w+/)
end
end
"a bc d e" # ~> "a bc d e"
"Hello World" # ~> "Hello World"
Returns the current precedence.
Initially zero.
Changes inside an infix
or when with_precedence
is used.
Allows to match a word only if a non-alphanumeric follows
parse("foo") { end_word }
Equivalent to: (but faster)
parse(/foo\b/) { nil }
"foo bar" # ~> "foo"
"foobar" # ~> Unexpected character 'b', expected 'EOF'
NOTE end_word returns nil
.
Returns the given result, or break
the current sequence if result is a Fail
.
This macro is the recommended ways to handle Fail
.
result = Array.new(3) do
parse("a")
end
typeof(result) # => Array(String) | TopDown::Parser::Fail
result = forward_fail(result)
typeof(result) # => Array(String)
In the context of an infix
member, returns the left
result of the current union
.
Returns nil
in a context other than inside an infix
.
Each time a union
member is successfully parsed, left
value is updated.
As a union
continue parsing infix
until possible,
left
allows to retrieve the subsequent results.
syntax(:exp) do
union do
parse('a')
infix(1, :comma)
end
end
syntax(:comma) do
l = left() # left returns either
r = parse(:exp)
"#{l}, #{r}"
end
Parses the sequence inside the block, returns nil
if it fails.
x = parse('1').to_i
y =
maybe do
parse('+')
parse!('1').to_i
end
parse!(';')
x + (y || 0)
"1;" # ~> 1
"1+1;" # ~> 2
"1+*;" # ~> "Unexpected character '*', expected '1'"
"1*;" # ~> "Unexpected character '*', expected ';'"
Parses the given parselet (ParseletLiteral
), and yield/return the according result.
failure:
If the given parselet fails to parse, it break
the current sequence with a Fail
. Failure is catch by the surrounding context.
- inside an
union
, tell the union that member have fail. The union tries to parse the next member. - inside a
maybe
, the maybe will returnnil
. - inside a
repeat
, make the repeat to stop. - inside a
syntax
, the syntax is considered to fail, it will in turnbreak
or raises.
block:
A block could be given to let return the value of the block.
precedence:
with_precedence (NumberLiteral
) changes the current_precedence
, the given parselet will be parsed as if the contained syntax have this precedence.
Allow to handle multi-precedence for ternary-or-more operator.
Similar to Parser.parse
, but raises SyntaxError
if the parsing fail.
error: allows to customize error message, it can be:
- a
StringLiteral
: Following percent interpolations are available:%{got}
: The character or token causing the failure.%{expected}
: The expectedChar
,String
,Regex
,Token
or syntax name (Symbol
).
- a
ProcLiteral
taking two arguments: 'got' and 'expected', with types explained above.
at: indicates where the error is raised, it can be a Location
, Range(Location, Location)
or Range(Location, Nil)
.
failure:
Because it raises, this shouldn't be used as first parselet inside a union
, maybe
, repeat
and syntax
.
This would raises before the surrounding context could try an other solution.
However, this should generally be used everywhere else, to allow more localized errors.
Captures chosen characters parsed inside the block.
Yields a String::Builder
in which characters to capture can be added.
Returns a String
.
partial_capture do |io|
repeat(/ +/) do
io << parse(/\w+/)
end
end
"a bc d e" # ~> "abcde"
"Hello World" # ~> "HelloWorld"
TODO docs parse parselet if succeed, don't move the cursor and return parselet result else, don't move the cursor and break a failure
Equivalent to parse(parselet, with_precedence: precedence)
Repeatedly parses the sequence inside the block until it fails.
Returns nil
. Results should be either collected by capture
or stored inside a variable or array.
x = parse('1').to_i
repeat do
parse('+')
x += parse!('1').to_i
end
parse!(';')
x
"1;" # ~> 1
"1+1+1+1+1;" # ~> 5
"1+1+1+*;" # ~> "Unexpected character '*', expected '1'"
"1*;" # ~> "Unexpected character '*', expected ';'"
Repeatedly parses the sequence inside the block until it fails, with a separator parselet between each iteration.
Returns nil
. Results should be either collected by capture
or stored inside a variable or array.
x = 0
parse('(')
repeat(',') do
x += parse('1').to_i
end
parse!(')')
x
"()" # ~> 0
"(1,1,1,1,1)" # ~> 5
"(1,1,)" # ~> "Unexpected character ',', expected ')'"
"(11)" # ~> "Unexpected character '1', expected ')'"
Defines the main syntax to parse.
parselet (ParseletLiteral
): the syntax name or parselet to be the root.
Allows to group a sequence of parsing (inside a union for example).
union do
sequence do
parse("a")
parse!("b")
end
parse("2")
Defines what is ignored during the parsing.
It is triggered every time prior a parselet is parsed.
It works similarly to the members of a Parser.union
, in which each members are consumed while possible.
class Parser < TopDown::Parser
skip do
# Skips spaces:
parse ' '
parse '\n'
parse '\t'
parse '\r'
# Line comments:
parse "//" { repeat { parse not('\n') } }
# Block comments:
parse "/*" do
repeat { parse not("*/") }
parse "*/"
end
end
end
Defines a syntax that can be called with parse(SymbolLiteral)
.
syntax_name (SymbolLiteral
): The identifier of the syntax.
*prefix (ParseletLiteral
): A list of parselet, to parse at the beginning of the syntax, results are yielded in the given block.
block: The body of the syntax, work like a usual body function.
class Parser < TopDown::Parser
syntax :expression do
a = parse('a')
parse(/( )+/)
foo = parse!("foo")
{a, foo}
end
end
Defines a token to parse. Can be used only inside Parser.tokens
.
Each token have a token_name (StringLiteral
), and is parsed according to parselet.
The block indicate the value of the resulting Token
.
- If no parselet provided, it deduced from token_name.
- If no block provided, token value is nil. (use
&.itself
to actually keep the parsed string)
Example:
class Parser < TopDown::Parser
tokens do
token("+") # Parses '+', produces Token[+]
token("hey") # Parses "hey", produces Token[hey]
token("new_line", '\n') # Parses '\n', produces Token[new_line]
token("int", /\d+/) { |v| v.to_i } # Parses /\d+/, produces Token[int:<int_value>]
token("string", :tk_string, &.itself) # Parses syntax :tk_string, produces Token[string:<string_value>]
end
end
This macro allows to define how tokens are parsed.
Each line of the block correspond a token.
class Parser < TopDown::Parser
tokens do
token("+")
token("-")
token("**")
token("*")
token("/")
token("new_line", '\n')
end
end
Use Parser.token
to define how a token is parsed
The order of token matter, because there are parsed in this order.
Hence, if *
is moved before **
, the result will be two token *
parsed instead of one **
.
In addition, characters are never skipped while parsing tokens (and all inner syntax calls), see no_skip
.
Tries to parse each member of the union, returns the result of the first that succeed.
union do
parse('1')
parse('2')
sequence do
parse('a')
parse('b')
parse!('c')
end
parse('a')
end
"1" # ~> '1'
"abc" # ~> 'c'
"ab*" # ~> "Unexpected character '*', expected 'c'"
"a" # ~> 'a'
"*" # ~> "Could not parse syntax ':main'
members:
Members are delimited by each expression in the Expressions
of the given block.
NOTE a block begin
/end
doesn't group a member since it is inlined by the crystal parser. Use sequence
instead.
failure:
If all members of the union fail, the union is considered to fail, and will break
the current sequence
infix:
infix
members could be added to a union.
infix
members are always treated after each normal members, in the order there are defined.
A union act as follow:
- Tries to parse each normal member.
- When one succeed, store the result.
left
allows to retrieve this result. - Tries to parse
infix
members whose precedence is greater than current precedence (initial is zero). - Inner of
infix
is executed, it potentially triggers parsing of this union recursively, but current precedence is sets to theinfix
precedence. - When one fully succeed, store the result.
left
is updated. - Repeats step 3-5) until there no more infix.
- Returns last result.
This is mainly the top-down operator precedence algorithm, also known as precedence climbing.
Similar to Parser.union
, but raises SyntaxError
if the parsing fail.
Instance Method Detail
Yields successively parsed tokens.
Stops when EOF is hit, or raises if a token fail to parse.
The token name eof can be given to stop at that name. Can be useful if a EOF token have been defined.
Override this method to modify the default error message when any token can be parsed.
Override this method to modify the default error message when parse!(not('a'))
fail.
Override this method to modify the default error message when parse!(not('a'..'z'))
fail.
Override this method to modify the default error message when parse!('a'..'z')
fail.
Override this method to modify the default error message when parse!(not(["TOKEN"]))
fail.
Override this method to modify the default error message when parse!('a')
fail.
Override this method to modify the default error message when parse!(/regex/)
fail.
Override this method to modify the default error message when parse!(:syntax)
fail.
Override this method to modify the default error message when parse!(["TOKEN"])
fail.
Override this method to modify the default error message when parse!("string")
fail.
Move the cursor to the new location.
The location should be well formed, otherwise error display won't be right.
It is recommended to always use a location got by self.location
.
This methods is used to backtrack the parser.
However, prefer using Parser.union
, Parser.maybe
, and Parser.repeat
over manual backtracks.
Parses the source contained in this parser.
Returns the result of the root syntax.
Expects eof
after parsing root syntax.
Raises SyntaxError
if fail to parse.
Raises a SyntaxError
at current location
and current source
.
location could be a range between two Location
. A endless range mean a location up to self.location
.
Raises a SyntaxError
at current location
and current source
.
location could be a range between two Location
. A endless range mean a location up to self.location
.
Raises a SyntaxError
at current location
and current source
.
location could be a range between two Location
. A endless range mean a location up to self.location
.
Returns the array of parsed tokens.
Stops when EOF is hit, or raises if a token fail to parse.
The token name eof can be given to stop at that name. Can be useful if a EOF token have been defined.