How to use the LiRI Corpus Platform (LCP)
The LCP lets you query corpora and visualize results in your browser. On the query page, you select a corpus on the left and write queries on the right, as illustrated in the image. This page primarily describes the query language that the LCP uses: DQD.
The DQD language
A DQD script is composed of two parts: the query part and the results part. The query part specifies the matches to look for in the corpus, and the results part specifies how to format the output matches.
Queries
Queries consist of blocks, in which statements are structured by indentation. A block typically starts with either a layer name (e.g. Segment or Token) or a keyword (sequence, set, group, (¬|NOT|!) EXISTS, AND, OR). In the absence of a coordinator (AND, OR), successive blocks form conjunctions. Each corpus may define its own set of layers, so the list of layer names that one can reference may vary from corpus to corpus. Here we will assume the existence of a layer named Segment that contains units from another layer named Token.
The first line of a block will typically end with the name of a variable, which can later be used to refer back to the entity matched by that block. For example, a one-line block Segment s will match any segment in the corpus, which can then be referred back to as s.
One way to use those names is to suffix the layer name of a following block with @<name> to match layers that are contained in the referenced one. For example, following up the one-line block Segment s with Token@s t1 will have the effect of matching tokens that are part of the matched segments, and naming those t1. One can specify further constraints on the tokens to match by indenting new lines in that block; for example, inserting a new indented line upos = "VERB" will make sure that the matched tokens are verbs.
To summarise, the simple query below will match any segment and any token in the corpus such that the latter is a verb and is contained in the former. Applied to a realistic corpus, this is likely to return a lot of matches, since the only constraint is for the token to be a verb.

Segment s

Token@s t1
    upos = "VERB"
Note that there is no specific constraint on the segment, which makes the first block superfluous for now. However, we can insert yet another token block referring to s to match additional tokens that should be part of the same segment: Token@s tx. In the absence of a reference to any segment, inserting a new token block would look for all possible pairs of tokens, including tokens that belong to different segments, which makes for an unrealistically large query.
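To make the size difference concrete, here is a toy Python model of the two situations. It is only an illustration, not how the LCP evaluates queries: the corpus, the function names, and the choice to exclude a token being paired with itself are all assumptions.

```python
from itertools import permutations

# Hypothetical toy corpus: segment name -> token forms
corpus = {
    "s1": ["Moisha", "takes", "coffee"],
    "s2": ["You", "take", "tea"],
}

def pairs_within_segments(corpus):
    """Pairs of distinct tokens that share a segment (Token@s t1, Token@s tx)."""
    return [
        (seg, t1, tx)
        for seg, tokens in corpus.items()
        for t1, tx in permutations(tokens, 2)
    ]

def pairs_across_corpus(corpus):
    """Pairs of distinct tokens drawn from anywhere in the corpus."""
    all_tokens = [tok for tokens in corpus.values() for tok in tokens]
    return list(permutations(all_tokens, 2))

within = pairs_within_segments(corpus)
across = pairs_across_corpus(corpus)
```

With two three-token segments, the segment-bound version yields 12 pairs against 30 for the unbound one, and the unbound count grows with the square of the total number of tokens.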
Constraints
Simple constraints usually use the format left operator right. left refers to an attribute of the entity; for example, it is standard for corpora to define a upos attribute on tokens to represent their universal part-of-speech. operator is a standard operator, typically = or !=. right accepts four distinct formats:
- If enclosed in single (') or double (") quotes, it represents a literal string, and the constraint states that the value of the attribute should exactly match that string (once stripped of the enclosing quote characters). Example: upos = "VERB"
- If enclosed in forward slashes (/), it represents a regular expression, and the constraint states that the value of the attribute should match the regular expression. Example: upos = /VERB|AD[JV]/
- If it is an arithmetic expression (i.e. it is exclusively numeric or it contains +, -, etc.), the constraint states that the value of the attribute should correspond to the result of evaluating that arithmetic expression. Example: length < 5
- Otherwise, right should be an alphanumeric expression, possibly containing a dot (.), and it then refers to an entity or to an attribute (if the attribute is prefixed by a variable name followed by a dot). The constraint then states that the value of the attribute left should correspond to the value of right. Example: lemma = t1.form
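The four formats can be told apart from the shape of right alone. The Python sketch below is a toy classifier written for illustration, not the LCP parser: the function name classify_right is hypothetical, and the exact set of symbols treated as arithmetic operators is an assumption, since the description above leaves it open ("etc.").

```python
import re

def classify_right(right: str) -> str:
    """Guess which of the four formats a right-hand side uses (toy rules)."""
    right = right.strip()
    # Matching single or double quotes at both ends: literal string
    if len(right) >= 2 and right[0] == right[-1] and right[0] in "'\"":
        return "string"
    # Forward slashes at both ends: regular expression
    if len(right) >= 2 and right[0] == "/" and right[-1] == "/":
        return "regex"
    # Exclusively numeric, or containing an arithmetic symbol (assumed: + - *)
    if re.fullmatch(r"\d+(\.\d+)?", right) or re.search(r"[+\-*]", right):
        return "arithmetic"
    # Alphanumeric name, optionally prefixed by a variable and a dot
    if re.fullmatch(r"\w+(\.\w+)?", right):
        return "reference"
    return "unknown"
```

For instance, classify_right('"VERB"') gives "string" and classify_right("t1.form") gives "reference". The order of the checks matters: quotes and slashes are tested first so that a quoted string or a regular expression containing + or - is not mistaken for arithmetic.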
DepRel
For corpora that model universal dependency relations, one can further add constraints using the DepRel keyword (a reserved name for a layer that models dependency relations). For example, the query below (repeated from the screenshot on this page) will look for all pairs of tokens that belong to the same segment, where one is a form of the verb take and the other is the object of that verb (in this example the dependency-relation layer uses the label "dobj" to tag direct objects, but that may vary from corpus to corpus).
Segment s

Token@s t1
    upos = "VERB"
    lemma = "take"

Token@s tx
    DepRel
        head = t1
        dep = tx
        label = "dobj"
sequence
One can use the keyword sequence to look for consecutive entities. The query above, for example, can be modified to look for verbs that are immediately followed by their object. While the query above would have matched both Moisha takes coffee with milk and Moisha takes not only coffee but also tea with milk, the query below would only match the former, where the object (coffee) immediately follows the verb (takes). It would not match the latter, where the objects (coffee and tea) are separated from the verb by intervening words.
Segment s

sequence seq
    Token@s t1
        upos = "VERB"
        lemma = "take"
    Token@s tx
        DepRel
            head = t1
            dep = tx
            label = "dobj"
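The effect of sequence on the two example sentences can be simulated with a small Python model. This is a sketch under stated assumptions, not the LCP's implementation: the dependency annotations are made up for illustration (both coffee and tea are tagged "dobj" with takes as their head), and the helper names are hypothetical.

```python
def parse(rows):
    """Build token dicts from (form, lemma, upos, head index, label) tuples."""
    keys = ("form", "lemma", "upos", "head", "label")
    return [dict(zip(keys, row)) for row in rows]

# Hypothetical annotations; head is the index of the governing token (None for the root)
short_seg = parse([
    ("Moisha", "Moisha", "PROPN", 1, "nsubj"),
    ("takes", "take", "VERB", None, "root"),
    ("coffee", "coffee", "NOUN", 1, "dobj"),
    ("with", "with", "ADP", 4, "case"),
    ("milk", "milk", "NOUN", 2, "nmod"),
])
long_seg = parse([
    ("Moisha", "Moisha", "PROPN", 1, "nsubj"),
    ("takes", "take", "VERB", None, "root"),
    ("not", "not", "PART", 4, "cc"),
    ("only", "only", "ADV", 4, "advmod"),
    ("coffee", "coffee", "NOUN", 1, "dobj"),
    ("but", "but", "CCONJ", 7, "cc"),
    ("also", "also", "ADV", 7, "advmod"),
    ("tea", "tea", "NOUN", 1, "dobj"),
    ("with", "with", "ADP", 9, "case"),
    ("milk", "milk", "NOUN", 7, "nmod"),
])

def verb_object_pairs(segment, adjacent_only=False):
    """Return (verb form, object form) pairs for the lemma 'take'."""
    pairs = []
    for i, t1 in enumerate(segment):
        if t1["upos"] != "VERB" or t1["lemma"] != "take":
            continue
        for j, tx in enumerate(segment):
            if tx["head"] == i and tx["label"] == "dobj":
                # Mimics sequence: tx must directly follow t1
                if adjacent_only and j != i + 1:
                    continue
                pairs.append((t1["form"], tx["form"]))
    return pairs
```

Calling verb_object_pairs(long_seg) finds both objects, whereas verb_object_pairs(long_seg, adjacent_only=True) finds none, because neither object sits right after the verb.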
Note that entity references declared inside a sequence block are accessible from outside the block. For example, one can additionally look for dependencies of the object of the verb, without writing additional code directly under sequence and therefore without requiring that they follow the object, by writing a Token block outside the sequence block:
Segment s

sequence seq
    Token@s t1
        upos = "VERB"
        lemma = "take"
    Token@s tx
        DepRel
            head = t1
            dep = tx
            label = "dobj"

Token@s tdo
    DepRel
        head = tx
        dep = tdo
Assuming that milk is coded as a dependency of coffee in Moisha takes coffee with milk, this query would match tdo = <milk>, even though milk does not immediately follow coffee (the two are separated by the word with), because the last Token block was declared outside the sequence block.
set
By default, each block matches one corresponding occurrence at a time. This means that the query below will match each possible pair of a verb with any of its (possibly many) dependencies as a separate result.
Segment s

Token@s tv
    upos = "VERB"

Token@s to
    DepRel
        head = tv
        dep = to
Say your corpus contains a segment with this sequence of tokens: Moisha gave you something. Assuming standard dependency relations, the query will match s = <Moisha gave you something>, tv = <gave>, to = <you>, but it will also match s = <Moisha gave you something>, tv = <gave>, to = <something> as a separate result. Indeed, the latter triplet is different from the former and, as such, it constitutes a distinct match, even though the segment and the first token are the same in both matches.
If you'd rather capture all possible dependencies of the verb as part of the same, single match, you can declare the corresponding Token block inside a set block:
Segment s

Token@s tv
    upos = "VERB"

set tos
    Token@s to
        DepRel
            head = tv
            dep = to
This query will now match Moisha gave you something with s = <Moisha gave you something>, tv = <gave>, tos = [<you>,<something>], and that's it!
One way to understand this behavior is that blocks that start with the name of a layer, such as a Token block, are existentially bound and implicitly conjoined. Having two successive Token blocks roughly means “any token such that {this} and any token such that {that}”. The keyword set, however, can be seen as carrying a universal force, so that when a Token block is embedded inside a set block, it becomes bound by the universal quantification. The set block above would then roughly translate as “all the entities that correspond to [any token that depends on the verb]”.
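The difference between the two result shapes can be pictured with a few lines of Python. This is a toy model only, not the LCP's implementation: the relation list is hypothetical and contains just the two dependencies discussed above, and the function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (head, dependent) relations for "Moisha gave you something";
# only the two dependencies discussed in the text are listed.
relations = [("gave", "you"), ("gave", "something")]

def pair_matches(relations):
    """Without set: one result per (verb, dependent) pair."""
    return [{"tv": head, "to": dep} for head, dep in relations]

def set_matches(relations):
    """With set: one result per verb, its dependents collected in a list."""
    grouped = defaultdict(list)
    for head, dep in relations:
        grouped[head].append(dep)
    return [{"tv": head, "tos": deps} for head, deps in grouped.items()]
```

Here pair_matches(relations) returns two results for gave, while set_matches(relations) returns a single result whose tos entry lists both dependents.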
Results
There is no formal separation between the query part and the results part of a DQD script, other than the convention of writing the results part after the query part. Results blocks always start with a variable name, a fat arrow (⇒, typed as => in the examples on this page) and one of the three keywords plain, analysis or collocation.
1. plain
The plain keyword will give you back matching entities in the context in which they occur. It is formed of two sub-blocks defined by the keywords context and entities. entities should reference variable names of the matching entities you are interested in, and context should reference a variable name of a matching entity that contains the ones in entities.
The example below repeats the query part from the DepRel section and adds a simple plain results part, which asks to show each possible pair of the (possibly inflected) verb take with one of its objects, shown in the context of the segment that contains them.
Segment s

Token@s t1
    upos = "VERB"
    lemma = "take"

Token@s tx
    DepRel
        head = t1
        dep = tx
        label = "dobj"

myKWIC1 => plain
    context
        s
    entities
        t1
        tx
The LCP will display the results in a tab named myKWIC1, as specified by the variable name before the fat arrow. That tab will show one segment per row, in which the instances of take and its object will be highlighted.
2. analysis
The analysis keyword will give you back a statistical transformation of attributes, optionally filtered. It is formed of two (optionally three) sub-blocks defined by the keywords attributes and functions (and optionally filter). attributes should reference the attributes of entities, using the format entity_variable.attribute, that the statistical transformations will be applied to. functions should reference one or more function names that apply a statistical transformation: frequency, minimum, maximum, average or stddev. Finally, the optional filter block lets you exclude some lines from the results; for example, specifying frequency > 5 in the filter block below has the effect of excluding from the myStat1 table the lemmas that appear fewer than 6 times.
Example:
myStat1 => analysis
    attributes
        tv.lemma
        to.lemma
    functions
        frequency
    filter
        frequency > 5
The LCP will display the results in a tab named myStat1, as specified by the variable name before the fat arrow. That tab will show one lemma per row, along with how many times that lemma occurs in the queried corpus (as long as it occurs at least 6 times).
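The frequency function combined with the filter can be pictured as a count followed by a threshold. The Python sketch below is a toy model, not the LCP's implementation: it assumes that frequency counts each distinct combination of the listed attribute values, and the rows and the function name frequency_table are hypothetical.

```python
from collections import Counter

def frequency_table(rows, threshold=5):
    """Count identical attribute tuples and keep those with frequency > threshold."""
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > threshold}

# Hypothetical (tv.lemma, to.lemma) values, one tuple per match
rows = (
    [("give", "you")] * 6
    + [("give", "something")] * 5
    + [("take", "coffee")] * 7
)
table = frequency_table(rows)
```

In this toy data the combination (give, something) occurs exactly 5 times, so the condition frequency > 5 drops it, while the combinations occurring 6 and 7 times are kept.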
3. collocation
The collocation
keyword will give you back a table listing how often entities appear near the referenced entity/entities. It comes in two different formats:
- One option is to provide a center sub-block and a window sub-block. center should reference an entity variable (e.g. t1) and window should specify how many entities ahead of and behind that reference entity the collocation measure should be performed (e.g. -2..2).
- Another option is to provide a space sub-block, which should reference a set variable, in which case the collocation measure will be performed between the first and the last entity in the set.
Example:
myColl1 => collocation
    center
        t1
    window
        -2..+2
    attribute
        lemma
The LCP will display the results in a tab named myColl1, as specified by the variable name before the fat arrow. That tab will show one lemma per row, along with how many times that lemma occurs within 2 tokens before or after the t1 matches.
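The center and window format amounts to counting attribute values at fixed offsets around each match. The Python sketch below illustrates that idea on a single segment. It is not the LCP's implementation: the function name collocates is hypothetical, and leaving the center token itself out of the counts is an assumption.

```python
from collections import Counter

def collocates(lemmas, center_positions, window=(-2, 2)):
    """Count lemmas found within the window around each center position."""
    counts = Counter()
    lo, hi = window
    for center in center_positions:
        for offset in range(lo, hi + 1):
            pos = center + offset
            # Skip the center itself and positions outside the segment
            if offset == 0 or not 0 <= pos < len(lemmas):
                continue
            counts[lemmas[pos]] += 1
    return counts

# Hypothetical lemmas for "Moisha takes coffee with milk"; t1 matches position 1
lemmas = ["Moisha", "take", "coffee", "with", "milk"]
result = collocates(lemmas, center_positions=[1])
```

With the center on take, the window -2..+2 covers Moisha, coffee and with. The lemma milk sits three tokens away and is not counted.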