corpus linguistics and computational lexicography
April 3, 2017
Corpus Query Language
Use CQL query type
■ Query - pattern matching a set of single tokens or token sequences
■ Each token consists of attributes (depending on corpus configuration):
word, lemma, tag, lempos, lc
■ Use [attribute="value"] for each token sub-pattern.
Very simple queries
[word="The"] [word="dream"]
[word="the"] [lemma="dream"]
[tag="AJ0"] [lempos="dream-n"]
Regular Expression in Attributes
The value in an [attribute="value"] expression is a regular expression.
[word="[0-9]*"] [lc="dreams"]
[tag="NN."] [lempos="dream-v"]
[word="[0-9]{5,}"] [word="\."]
[word="\("] [word="0[0-9]{3}"] [word="\)"]
[word==")"] [word=="."]
[word="[A-Z][0-9A-Z]{2,3}"] [word="[0-9][0-9A-Z]{2}"]
Regular Expressions
The PCRE library is used to evaluate regular expressions.
Several useful special sequences:
■ \d - any decimal digit
■ \D - any character that is not a decimal digit
■ \w - any "word" character
■ \W - any "non-word" character
■ (?i) - ignore case
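The special sequences above can be tried outside the corpus tool. The following is a minimal sketch in Python, whose `re` module is not PCRE but treats these particular sequences the same way. It assumes that an attribute regular expression has to match the whole attribute value, which is why `re.fullmatch` is used; the helper name `matches` is made up for this illustration.

```python
import re


def matches(value_regex: str, token: str) -> bool:
    """Return True if the whole token matches the attribute regex.

    Assumption: the regex must cover the entire attribute value,
    so re.fullmatch is used rather than re.search.
    """
    return re.fullmatch(value_regex, token) is not None


# One check per special sequence from the list above.
print(matches(r"\d+", "2017"))         # decimal digits only
print(matches(r"\D+", "dream"))        # no decimal digit anywhere
print(matches(r"\w+", "dream_1"))      # "word" characters
print(matches(r"\W", "."))             # a single "non-word" character
print(matches(r"(?i)the", "THE"))      # case-insensitive match
print(matches(r"[0-9]{5,}", "1234"))   # too short: needs at least 5 digits
```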
Logical combinations of attributes
Boolean combinations (AND, OR and NOT) of [attribute="value"] expressions. Use: &, |, !=, ()
[word="dream" & tag="NN1"]
[lemma="dream" & tag="VV."]
[word="dream" | word="Dream"]
[word="the" | tag="DPS"] [lempos="dream-n" & tag="NN2"]
[word="the" | (tag="DPS" & lemma!="my")] [lemma="dream"]
Regular expressions of tokens
Regular expressions on token level:
■ ? - optional token
■ * - any number of repetitions
■ + - at least one repetition
■ {N} - exact number of repetitions
■ {M,N} - from M to N repetitions
■ [] - any token
[tag="DPS"] [] [lemma="dream"]
[tag="DPS"] [tag="AJ0"]? [lemma="dream"]
[tag="AJ0"]{2} [lemma="dream"]
[word="the"] []{0,3} [lempos="dream-n"]
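How a token-level repetition such as []{0,3} behaves can be sketched in Python. This is an illustrative toy matcher, not how the query engine works: tokens are plain dictionaries, the function name `find_with_gap` and the sample sentence (including its lempos values) are made up for the example.

```python
def find_with_gap(tokens, first, last, min_gap, max_gap):
    """Return (start, end) index pairs where tokens[start] satisfies `first`,
    tokens[end] satisfies `last`, and between min_gap and max_gap arbitrary
    tokens lie in between (the role of []{min_gap,max_gap})."""
    hits = []
    for start, tok in enumerate(tokens):
        if not first(tok):
            continue
        for gap in range(min_gap, max_gap + 1):
            end = start + gap + 1
            if end < len(tokens) and last(tokens[end]):
                hits.append((start, end))
    return hits


# Made-up sample sentence; attribute values are for illustration only.
sentence = [
    {"word": "the", "lempos": "the-x"},
    {"word": "wild", "lempos": "wild-j"},
    {"word": "dreams", "lempos": "dream-n"},
    {"word": "of", "lempos": "of-i"},
    {"word": "the", "lempos": "the-x"},
    {"word": "dream", "lempos": "dream-n"},
]

# Analogue of: [word="the"] []{0,3} [lempos="dream-n"]
hits = find_with_gap(
    sentence,
    first=lambda t: t["word"] == "the",
    last=lambda t: t["lempos"] == "dream-n",
    min_gap=0,
    max_gap=3,
)
print(hits)
```

The pair (0, 5) is not reported because four tokens would lie between the two matched tokens, one more than the allowed maximum.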
within keyword at the end of a query
■ within <s/> restricts the result to one sentence
■ within a structure selected by its attribute values restricts the result to a subcorpus
[lemma="dream"] within
[word="the"] []{3,5} [lemma="dream"]
[word="the"] []{3,5} [lemma="dream"] within
More within combinations: Boolean combinations of regular expressions
[lemma="dream"] within
[lemma="dream"] within
[word="the"] []{3,5} [lemma="dream"]
within within
[word="the"] []{3,5} [lemma="dream"] within
within can be inverted with !
[word="THE"] within
[word="THE"] within !
Structure boundaries
Structure boundaries: start of a structure, end of a structure, whole structure (for a sentence structure s: <s>, </s>, <s/>)
[lemma="dream"] [word=="?"]
Global conditions
Global condition
■ numeric labels of tokens
■ testing agreement or disagreement of attribute values
[tag="NN."] [word="and"] [tag="NN."]
1:[tag!="NN."] [word="and"] 2:[tag!="NN."] & 1.tag = 2.tag
1:[] [word="and"] 2:[] & 1.k=2.k & 1.c=2.c
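The idea of a global condition, labelling two tokens and then testing whether their attribute values agree, can be sketched in Python. This is only an illustration under assumptions: the condition is taken to be an agreement test on the tag attribute of the two labelled tokens around "and", and the function name and sample tokens are made up.

```python
def coordinated_with_same_tag(tokens):
    """Return start indices of three-token windows X "and" Y
    where the labelled tokens X (1:) and Y (2:) carry the same tag."""
    hits = []
    for i in range(len(tokens) - 2):
        first, middle, second = tokens[i:i + 3]
        if middle["word"] == "and" and first["tag"] == second["tag"]:
            hits.append(i)
    return hits


# Made-up sample tokens with illustrative tags.
tokens = [
    {"word": "black", "tag": "AJ0"},
    {"word": "and", "tag": "CJC"},
    {"word": "white", "tag": "AJ0"},
    {"word": "and", "tag": "CJC"},
    {"word": "dreams", "tag": "NN2"},
]

print(coordinated_with_same_tag(tokens))
```

Only "black and white" is reported: "white and dreams" has the right shape, but the tags of the two labelled tokens disagree.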
Parallel corpora
Parallel corpora: a separate corpus for each language, with 1-to-1 alignment using a tag.
A query can limit the search to segments whose aligned parts contain hits of a subquery.
[lemma="hrad"] within kacen: [word="castle"]
[lemma="hrad"] within ! kacen: [word="castle"]
Meet/Union queries
■ combining and nesting simple (one-token) queries
■ not a sequence of tokens
■ meet and union operators
Union operator:
■ union Q1 Q2
(union [word="dream"] [word="dreams"])
[word="dream" | word="dreams"]
Meet operator:
■ meet Q1 Q2 W-BEG W-END
■ find Q1 with Q2 in window from W-BEG to W-END
■ W-BEG and W-END default to 1
(meet [word="my"] [word="dream"])
[word="my"] [word="dream"]
(meet [word="my"] [word="dream"] 1 3)
[word="my"] []{0,2} [word="dream"]
(meet [word="black"] [word="white"] -3 3)
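The window semantics of meet can be sketched in Python. This is a toy model under assumptions, not the engine's implementation: the result is taken to be the positions of Q1, the window offsets are relative to the Q1 token, and the function name `meet` with its predicate arguments is made up for the illustration.

```python
def meet(tokens, q1, q2, w_beg=1, w_end=1):
    """Return positions of tokens satisfying q1 that have a token
    satisfying q2 at a relative offset between w_beg and w_end."""
    hits = []
    for i, tok in enumerate(tokens):
        if not q1(tok):
            continue
        lo = max(0, i + w_beg)                # clip the window to the text
        hi = min(len(tokens) - 1, i + w_end)
        if any(q2(tokens[j]) for j in range(lo, hi + 1)):
            hits.append(i)
    return hits


def word(w):
    """Predicate standing in for a one-token query [word="..."]."""
    return lambda tok: tok == w


# Made-up sample text.
words = "my wild dream in black and white".split()

print(meet(words, word("my"), word("dream")))             # default window 1..1
print(meet(words, word("my"), word("dream"), 1, 3))       # window 1..3
print(meet(words, word("black"), word("white"), -3, 3))   # window -3..3
```

With the default window, "my" is not found because "dream" is two positions away; widening the window to 1..3 finds it, mirroring the []{0,2} gap in the equivalent sequence query.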
Meet/union combination
use a meet/union operator in place of a simple query
(meet [word="and"] (meet [word="black"] [word="white"] -3 3) -2 2)
Within keyword
within works with any subquery, not only a structure
[lemma="dream"] within ([word="my"] [lemma="dream"])
(meet [lemma="dream"] [word="my"] -1 -1)
[word="the"] []{0,3} [lemma="dream"]
within ([tag="AT."] [tag="AJ."]{0,4} [tag="NN."])
containing keyword
■ inverts within keyword
■ matches results of the first subquery which contains matches of the second subquery
containing [lemma="dream"]
(meet [lemma="dream"] [word="my"] -1 -1)
[word="the"] []{1,3} [lemma="dream"]
containing [lemma="wild"]
Combinations of containing/within
Both keywords form a query that can be used as a subquery, so they can be nested.
[lemma="break"] within ( containing [lemma="rule"])
[lemma="student"] within
( containing [lemma="break"] containing [lemma="rule"])
[lemma="break"] within ([]{5} containing [lemma="rule"])
Sketch Engine API
■ web API using same addresses as web interface
■ add format=json to the URL
■ result as json
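Adding the format parameter to an existing web-interface address can be sketched with the Python standard library. The helper name `as_json_url` and the example address and its `corpus` parameter are placeholders invented for this illustration, not real API endpoints; only the format=json parameter comes from the list above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def as_json_url(url: str) -> str:
    """Return the same URL with format=json set in its query string."""
    parts = urlsplit(url)
    # Keep all existing parameters except a previous "format" value.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "format"
    ]
    params.append(("format", "json"))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Placeholder address; substitute the address shown by the web interface.
print(as_json_url("https://example.org/search?corpus=demo"))
# prints https://example.org/search?corpus=demo&format=json
```

The returned address can then be fetched with any HTTP client and the response parsed as JSON.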