S           ::= script*
script      ::= '{' pattern '::' (action ';')* '}'
pattern     ::= ID [condition]
              | '(' pattern ')'
condition   ::= '-->.' pattern
              | '.<--' pattern
              | '.-->' pattern
              | '<--.' pattern
              | '->.' pattern
              | '.<-' pattern
              | '.->' pattern
              | '<-.' pattern
              | '>' pattern
              | '>>' pattern
              | '<' pattern
              | '<<' pattern
              | '$++' pattern
              | '$--' pattern
              | '$+' pattern
              | '$-' pattern
              | attr string_cond
              | 'is_top'
              | 'is_leaf'
              | 'can_head' ID
              | 'can_be_headed_by' ID
              | '==' ID
              | '(' condition ')'
              | 'not' condition
              | condition 'and' condition
              | condition 'or' condition
string_cond ::= "STRING"
              | 'STRING'
              | /REGEX/i
              | /REGEX/g
              | /REGEX/ig
              | /REGEX/gi
action      ::= ('copy' | 'move') ('node' | 'group') ID ('before' | 'after') ('node' | 'group') ID
              | 'delete' ('node' | 'group') ID
              | 'set' attr ID STR
              | ('set_head' | 'try_set_head') ID ('headed_by' | 'heads') ID
              | 'group' ID ID
attr        ::= 'form'
              | 'lemma'
              | 'cpostag'
              | 'postag'
              | 'feats'
              | 'deprel'
Scripts consist of a sequence of patterns, each pattern paired with a list of actions.
# 1. Delete all "cat"s.
{
x form /cat/i
::
delete node x;
}
# 2. Copy all "dog"s to the beginning.
{
x form /dog/i $-- (start not $- w)
::
copy node x before node start;
}
Patterns and actions are separated by ::.
Steps of the script are applied sequentially: first #1 wherever it matches, then #2 wherever it matches, etc.
On each step, the script is applied once to every node of the tree it can match, but never to nodes created by that script itself.
An example:
+---------+
| +--+ | +--+
| v | v | v
ROOT cat and dog
# 1: pattern
x form /cat/i
+---------+
| +--+ | +--+
| v | v | v
ROOT cat and dog
{x}
# 1: actions
delete node x
+---------+
| | +--+
| v | v
ROOT and dog
# 1: doesn't match
# 2: pattern
x form /dog/i $-- (start not $- w)
+---------+
| | +--+
| v | v
ROOT and dog
{start}{x}
# 2: actions
copy node x before node start
+---------+
| +--+ | +--+
| v | v | v
ROOT dog and dog
(new) (old)
# 2: doesn't match
# - Node "dog" (new) was created by script #2, and scripts are not applied
# to nodes created by themselves.
# - Node "dog" (old) was already matched by script #2.
# Done.
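The walkthrough above can be mimicked with a small Python sketch. It is not the tool's implementation: the Node class, run, delete_node and copy_to_front are hypothetical names, the sentence is modelled as a flat list (no dependency arcs), and the patterns are reduced to simple checks on the form. It only illustrates the rule that each script sees the nodes that existed before it started, once each.

```python
class Node:
    """A word in the sentence; identity matters, not just the form."""
    def __init__(self, form):
        self.form = form

def run(scripts, nodes):
    """Apply each (pattern, action) pair in order, one step per script."""
    for matches, action in scripts:
        snapshot = list(nodes)  # nodes that exist before this step starts
        for node in snapshot:
            still_there = any(n is node for n in nodes)
            if still_there and matches(node):
                action(node, nodes)  # nodes created here are not in snapshot
    return [n.form for n in nodes]

def delete_node(node, nodes):
    # Stand-in for "delete node x".
    nodes[:] = [n for n in nodes if n is not node]

def copy_to_front(node, nodes):
    # Stand-in for "copy node x before node start".
    nodes.insert(0, Node(node.form))

scripts = [
    (lambda n: n.form.lower() == "cat", delete_node),    # script 1
    (lambda n: n.form.lower() == "dog", copy_to_front),  # script 2
]
result = run(scripts, [Node("cat"), Node("and"), Node("dog")])
```

Here result holds the forms "dog", "and", "dog": the copied "dog" is not visited again by script 2, because it was not in the snapshot taken at the start of that step.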
ATTR STR_COND | Attribute matches string condition. Available attributes: form, lemma, cpostag, postag, feats, deprel. |
is_top | Node’s parent is the root |
is_leaf | Node has no children |
can_head ID | The tree stays valid (connected and acyclic) if the backreferenced node is attached to this node. |
can_be_headed_by ID | The reverse of can_head: X can_be_headed_by Y matches whenever Y can_head X does. |
== ID | Node matches a backreference |
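As a rough illustration of the validity check behind can_head, here is a Python sketch. It assumes the tree is stored as a mapping from each node to its head, with ROOT absent from the mapping; the function name and data layout are hypothetical and say nothing about how the tool stores trees. Under that assumption, re-attaching a node keeps the tree connected, so only the cycle check remains.

```python
def can_head(heads, head, child):
    """Return True if making `head` the head of `child` creates no cycle.

    heads maps every node to its current head; ROOT has no entry.
    """
    node = head
    while node is not None:
        if node == child:
            # `head` is `child` itself or one of its descendants.
            return False
        node = heads.get(node)  # climb one level; None once past ROOT
    return True

# Assumed toy tree: ROOT -> cat -> and -> dog
heads = {"cat": "ROOT", "and": "cat", "dog": "and"}
```

With this toy tree, can_head(heads, "cat", "dog") holds, while can_head(heads, "dog", "cat") does not, because "dog" sits below "cat".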
Backreference matches can only be made in subconditions of the pattern where the reference was set. Like this:
vvvv------ backreference match
a <--. (c .<-- (b not == a))
^ ^^^^^^^^^^^^^^^^^^^----- subcondition of 'a'
+------------------------------- reference setup of 'a'
This is wrong:
vvvv--- BAD backreference match
c .<-- (a) and .<-- (b not == a)
^^------------------------ 'a' has no subconditions
|
+------------------------- reference setup of 'a'
Warning
If the backreference match is not in a subcondition, the system might not raise an error. Be careful.
Attribute conditions such as form or deprel match the attribute's value either exactly, against a string, or against a regular expression.
n1 form 'cat'
n1 form "dog"
n1 form /dog|cat/
Strings can be enclosed either in single ' or double " quotes.
Regular expressions use extended PCRE syntax.
Regular expressions are matched against the whole string. If you want a substring match, e.g. to match a word with a “ni” inside, write /ni/g.
Regular expressions are case-sensitive. Use /.../i for case-insensitive matching.
Strings support no escaping. E.g. you can’t write a single-quoted string with a single quote inside.
In a similar fashion, regular expressions support no escaping of /: you can’t make a regular expression with / inside.
Conditions on the FEATS field work like this:
Feats are printed as a string.
Noun|Pnon|Nom|A3sg
A string condition is applied.
w1 feats /Noun/g
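The whole-string and substring rules can be sketched in Python. This is only an analogy: the tool uses PCRE, while the sketch uses Python's re module as a stand-in, and regex_cond is a hypothetical helper. It treats a pattern without g as a full match, g as a search anywhere in the string, and i as case-insensitive.

```python
import re

def regex_cond(value, pattern, flags=""):
    """Mimic /pattern/flags applied to an attribute value."""
    re_flags = re.IGNORECASE if "i" in flags else 0
    # No 'g': the whole value must match. With 'g': any substring may match.
    matcher = re.search if "g" in flags else re.fullmatch
    return matcher(pattern, value, re_flags) is not None

whole = regex_cond("hotdog", "dog|cat")            # like: n1 form /dog|cat/
substring = regex_cond("hotdog", "dog|cat", "g")   # like: n1 form /dog|cat/g
feats = regex_cond("Noun|Pnon|Nom|A3sg", "Noun", "g")  # like: w1 feats /Noun/g
```

In this sketch whole is False, while substring and feats are True, which is why the FEATS example above needs the g flag.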
-->. | Has a child to the right |
.<-- | Has a child to the left |
.--> | Has a head to the left |
<--. | Has a head to the right |
->. | Has a child immediately to the right |
.<- | Has a child immediately to the left |
.-> | Has a head immediately to the left |
<-. | Has a head immediately to the right |
> | Node has a child. |
< | Node has a parent. |
>> | Node has a descendant. |
<< | Node has an ancestor. |
$++ | Has a neighbor to the right |
$-- | Has a neighbor to the left |
$+ | Has a neighbor immediately to the right |
$- | Has a neighbor immediately to the left |
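One way to read these operators is in terms of surface positions and heads. The Python sketch below makes two assumptions that the tables do not state outright: a "neighbor" is any other word in surface order (which is how example #2 reads), and "immediately" means the adjacent position. The toy sentence, its heads, and the helper names are all hypothetical.

```python
# Assumed toy sentence: surface positions 0..2, heads given by position.
# None stands for the ROOT node, which patterns never match.
forms = ["cat", "and", "dog"]
heads = [1, None, 1]  # "and" heads both "cat" and "dog"

def left_neighbors(x, immediate=False):
    """Positions y for which `x $-- y` (or `x $- y` if immediate) would hold."""
    return [y for y in range(len(forms))
            if (y == x - 1 if immediate else y < x)]

def right_children(x, immediate=False):
    """Positions y for which `x -->. y` (or `x ->. y` if immediate) would hold."""
    return [y for y in range(len(forms))
            if heads[y] == x and (y == x + 1 if immediate else y > x)]

# `x $-- (start not $- w)` with x = "dog": a left neighbor of x that has
# no word immediately to its left, i.e. the first word of the sentence.
start = [y for y in left_neighbors(2)
         if not left_neighbors(y, immediate=True)]
```

Under these assumptions start contains only position 0, the first word, which matches the role the start node plays in example #2.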
(move|copy) (node|group) ID (after|before) (node|group) ID | Move or copy a node (or its whole group) to the given position |
delete (node|group) ID | Delete a node (or the whole group) |
set ATTR ID STR | Set node’s attribute. Available attributes: form, lemma, cpostag, postag, feats, deprel |
set_head IDa (headed_by|heads) IDb | Set node’s head (IDb becomes the head of IDa if IDa headed_by IDb, otherwise vice versa). Fail if tree becomes cyclic or disconnected |
try_set_head IDa (headed_by|heads) IDb | Set node’s head, as set_head does. Do not fail if tree becomes cyclic or disconnected |
group IDa IDb | Put IDa and IDb into the same group |
There is a special node in the tree that binds it together: the ROOT node.
+-----------+
| +--+ | +--+
| v | v | v
(ROOT) cat and dog
It keeps the tree connected even when the tree encodes more than one sentence.
+---------------------+
|+----+ |
|| |+----+ +---+|
|| v| v v |v
(ROOT) cat . And dog
\_____/ \______/<--- Sentence 2
^------------------ Sentence 1
The root node is never matched by any pattern.
1 (highest) | not |
2 | and |
3 (lowest) | or |
Also, 'and' and 'or' attach conditions to the innermost node, e.g.
a <--. b <--. c and .<-- d
Is equivalent to
a <--. (b <--. c and .<-- d)
\____/ \____/ <----- Condition 2 on "b"
^-------------------- Condition 1 on "b"
NOT to
a <--. (b <--. c) and (.<-- d)
\_____________/ \______/ <-- Condition 2 on "a"
^------------------------- Condition 1 on "a"
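The precedence order (not, then and, then or) can be made visible with a small Python sketch that fully parenthesizes a flat condition. It is a simplified illustration, not the tool's parser: parenthesize is a hypothetical helper, each condition is reduced to a single token, and the innermost-node attachment shown above is not modelled.

```python
def parenthesize(tokens):
    """Fully parenthesize single-token conditions joined by not/and/or."""
    pos = 0

    def parse_or():  # lowest precedence
        nonlocal pos
        left = parse_and()
        while pos < len(tokens) and tokens[pos] == "or":
            pos += 1
            left = f"({left} or {parse_and()})"
        return left

    def parse_and():
        nonlocal pos
        left = parse_not()
        while pos < len(tokens) and tokens[pos] == "and":
            pos += 1
            left = f"({left} and {parse_not()})"
        return left

    def parse_not():  # highest precedence
        nonlocal pos
        if tokens[pos] == "not":
            pos += 1
            return f"(not {parse_not()})"
        atom = tokens[pos]
        pos += 1
        return atom

    return parse_or()

grouped = parenthesize("not is_top and is_leaf or X".split())
```

Here grouped is the string "(((not is_top) and is_leaf) or X)": not binds to is_top alone, and then groups before or. X is a placeholder for any further condition.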