LiRI Wiki

Linguistic Research Infrastructure - University of Zurich


How to use the LiRI Corpus Platform (LCP)

The LCP lets you query corpora and visualize the results in your browser. On the query page, you can select a corpus on the left and write queries on the right, as illustrated in the screenshot. This page primarily describes the query language that the LCP uses: DQD.

The DQD language

A DQD script is composed of two parts: the query part and the results part. The query part specifies the matches to look for in the corpus, and the results part specifies how to format the output matches.

Queries

Queries consist of blocks, in which statements are structured by indentation. A block typically starts with either a layer name (e.g. Segment or Token) or a keyword (sequence, set, group, (¬|NOT|!) EXISTS, AND, OR). In the absence of a coordinator (AND, OR), successive blocks form conjunctions. Each corpus may define its own set of layers, so the list of layer names that one can reference may vary from corpus to corpus. Here we will assume the existence of a layer named Segment that contains units from another layer named Token.

The first line of a block will typically end with the name of a variable, which can later be used to refer back to the entity matched by that block. For example, the one-line block Segment s will match any segment in the corpus, which can then be referred back to as s.
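As a minimal sketch, a query can consist of just that one-line block; it matches every segment in the corpus and binds each match to the variable s:

Segment s

On its own, such a query imposes no constraints; the variable s only becomes useful once later blocks refer back to it.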

One way to use those names is to suffix the layer name of a following block with @<name> to match entities that are contained in the referenced one. For example, following up the one-line block Segment s with Token@s t1 will have the effect of matching tokens that are part of the matched segments, and name each such token t1. One can specify further constraints on the tokens to match by indenting new lines in that block; for example, inserting a new indented line upos = "VERB" will make sure that the matched tokens are verbs.

To summarise, the simple query below will match any segment and any token in the corpus such that the latter is a verb and is contained in the former. Applied to a realistic corpus, this is likely to return a lot of matches, since the only constraint is that the token be a verb.

Segment s

Token@s t1
    upos = "VERB"

Note that there is no specific constraint on the segment, which makes the first block superfluous for now. However, we can insert yet another token block referring to s to match additional tokens that must be part of the same segment: Token@s tx. In the absence of a reference to any segment, a new token block would look for all possible pairs of tokens, including tokens that belong to different segments, which makes for an unrealistically large query to perform.
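For instance, the sketch below extends the earlier verb query with such an extra token block; t1 must be a verb, while tx can be any token of the same segment:

Segment s

Token@s t1
    upos = "VERB"

Token@s tx

Because both token blocks reference s, each result pairs a verb only with tokens from the verb's own segment.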

Constraints

Simple constraints usually use the format left operator right. left refers to an attribute of the entity; for example, it is standard for corpora to define a upos attribute on tokens to represent their universal part-of-speech. operator is a standard operator, typically = or !=. right accepts four distinct formats:

  1. If enclosed in single (') or double (") quotes, it represents a literal string, and the constraint states that the value of the attribute should exactly match that string (once the enclosing quote characters are stripped). Example: upos = "VERB"
  2. If enclosed in forward slashes (/), it represents a regular expression, and the constraint states that the value of the attribute should match the regular expression. Example: upos = /VERB|AD[JV]/
  3. If it is an arithmetic expression (i.e. it is exclusively numeric or it contains +, -, etc.), then the constraint states that the value of the attribute should correspond to the result of evaluating that arithmetic expression. Example: length < 5
  4. Otherwise, right should be an alphanumeric expression, possibly containing a dot (.); it then refers to an entity, or to an attribute if it is prefixed by a variable name followed by a dot. The constraint then states that the value of the attribute left should correspond to the value of right. Example: lemma = t1.form
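To illustrate all four formats at once, the hypothetical block below constrains a token t2 with one line per format (the attribute names upos, form, length and lemma are typical but may not exist in every corpus):

Token@s t2
    upos = "VERB"
    form = /tak.+/
    length < 5
    lemma = t1.form

The four lines use, in order, a literal string, a regular expression, an arithmetic comparison and a reference to the form attribute of a previously matched token t1.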

DepRel

For corpora that model universal dependency relations, one can further add constraints using the DepRel keyword (a reserved name for a layer that models dependency relations). For example, the query below (repeated from the screenshot on this page) will look for all pairs of tokens that belong to the same segment, where one is a form of the verb take and the other is the object of that verb (in this example the dependency-relation layer uses the label "dobj" to tag direct objects, but that label may vary from corpus to corpus).

Segment s

Token@s t1
    upos = "VERB"
    lemma = "take"
      
Token@s tx
    DepRel
        head = t1
        dep = tx
        label = "dobj"

sequence

One can use the keyword sequence to look for consecutive entities. The query above, for example, can be modified to look for verbs that are immediately followed by their object. While the query above would have matched both Moisha takes coffee with milk and Moisha takes not only coffee but also tea with milk, the query below would only match the former, where the object (coffee) immediately follows the verb (takes) and it would not match the latter, where the objects (coffee and tea) are separated from the verb by intervening words.

Segment s

sequence seq
    Token@s t1
        upos = "VERB"
        lemma = "take" 
    Token@s tx
        DepRel
            head = t1
            dep = tx
            label = "dobj"

Note that entity references declared inside a sequence block are accessible from outside the block. For example, one can additionally look for dependencies of the object of the verb, without writing additional code directly under sequence and therefore without requiring that they follow the object, by writing a Token block outside the sequence block:

Segment s

sequence seq
    Token@s t1
        upos = "VERB"
        lemma = "take"
    Token@s tx
        DepRel
            head = t1
            dep = tx
            label = "dobj"

Token@s tdo
    DepRel
        head = tx
        dep = tdo

Assuming that milk is coded as a dependency of coffee in Moisha takes coffee with milk, this query would match tdo = <milk>, even though milk does not immediately follow coffee (the two are separated by the word with), because the last Token block was declared outside the sequence block.

set

By default, each block matches one corresponding occurrence at a time. This means that the query below will match each possible pair of a verb with any of its (possibly many) dependencies as a separate result.

Segment s

Token@s tv
    upos = "VERB"

Token@s to
    DepRel
        head = tv
        dep = to

Say your corpus contains a segment with this sequence of tokens: Moisha gave you something. Assuming standard dependency relations, the query will match s = <Moisha gave you something>, tv = <gave>, to = <you>, but it will also match s = <Moisha gave you something>, tv = <gave>, to = <something> as a separate result; indeed, the latter triplet is different from the former and, as such, it constitutes a distinct match, even though the segment and the first token are the same in both matches.

If you'd rather capture all possible dependencies of the verb as part of the same, single match, you can declare the corresponding Token block inside a set block:

Segment s

Token@s tv
    upos = "VERB"

set tos
    Token@s to
        DepRel
            head = tv
            dep = to

This query will now match Moisha gave you something with s = <Moisha gave you something>, tv = <gave>, tos = [<you>,<something>], and that's it!

One way to understand this behavior is that blocks that start with the name of a layer, such as a Token block, are existentially bound and implicitly conjoined. Having two successive Token blocks roughly means “any token such that {this} and any token such that {that}”. The keyword set, however, can be seen as carrying a universal force, so that when a Token block is embedded inside a set block, it becomes bound by the universal quantification. The set block above would then roughly translate as “all the entities that correspond to [any token that depends on the verb]”.

Results

There is no formal separation between the query part and the results part of a DQD script, other than the convention of writing the results part after the query part. Results blocks always start with a variable name, a fat arrow (=>) and one of three keywords: plain, analysis or collocation.

1. plain

The plain keyword will give you back matching entities in the context in which they occur. It is formed of two sub-blocks defined by the keywords context and entities. entities should reference variable names of the matching entities you are interested in, and context should reference a variable name of a matching entity that contains the ones in entities.

The example below repeats the query part from the DepRel section and adds a simple plain results part, which asks to show each possible pair of the (possibly inflected) verb take with one of its objects, shown in the context of the segment that contains them.

Segment s

Token@s t1
    upos = "VERB"
    lemma = "take"
      
Token@s tx
    DepRel
        head = t1
        dep = tx
        label = "dobj"


myKWIC1 => plain
    context
        s
    entities
        t1
        tx

The LCP will display the results in a tab named myKWIC1, as specified by the variable name before the fat arrow. That tab will show one segment per row, in which the instances of take and its object will be highlighted.

2. analysis

The analysis keyword will give you back a statistical transformation of attributes, optionally filtered. It is formed of two (optionally three) sub-blocks defined by the keywords attributes and functions (and optionally filter). attributes should reference attributes of entities, using the format entity_variable.attribute, to which the statistical transformations will be applied. functions should reference one or more function names that apply a statistical transformation: frequency, minimum, maximum, average or stddev. Finally, the optional filter block lets you exclude some lines from the results; for example, specifying frequency > 5 in the filter block below excludes lemmas that occur fewer than 6 times from the myStat1 table.

Example:

myStat1 => analysis
    attributes
        tv.lemma
        to.lemma
    functions
        frequency
    filter
        frequency > 5

The LCP will display the results in a tab named myStat1, as specified by the variable name before the fat arrow. That tab will show one lemma per row, along with how many times that lemma occurs in the queried corpus (as long as it occurs at least 6 times).

3. collocation

The collocation keyword will give you back a table listing how often entities appear near the referenced entity/entities. It comes in two different formats:

  1. One option is to provide a center sub-block and a window sub-block. center should reference an entity variable (e.g. t1) and window should specify how many entities ahead of and behind that reference entity the collocation measure should be performed (e.g. -2..+2).
  2. Another option is to provide a space sub-block, which should reference a set variable; in that case, the collocation measure will be performed between the first and the last entity in the set.

Example:

myColl1 => collocation
    center
        t1
    window
        -2..+2
    attribute
        lemma

The LCP will display the results in a tab named myColl1, as specified by the variable name before the fat arrow. That tab will show one lemma per row, along with how many times that lemma co-occurs within 2 tokens ahead and 2 tokens behind the t1 matches.
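The second, space-based format can be sketched as below, assuming that a set variable tos has been declared in the query part (as in the set section above); whether the attribute sub-block combines with space exactly as shown may depend on the corpus:

myColl2 => collocation
    space
        tos
    attribute
        lemma

The results tab myColl2 would then list lemmas co-occurring within the span delimited by the first and last entity of each matched set.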

langtech/lcp/tmp/start.txt · Last modified: 2024/03/07 15:47 by Johannes Graën
