Simple queries - regular expressions Using structures - within part Meet/Union queries
Pavel Rychlý
pary@fi.muni.cz
24 March 2014
Corpus Query Language
Test it at http://ske.fi.muni.cz/
Use the CQL query type
■ Query - pattern matching a set of single tokens or token sequences
■ Each token consists of attributes (depending on corpus configuration):
word, lemma, tag, lempos, lc
■ Use [attribute="value"] for each token sub-pattern.
Very simple queries
[word="dream"]
[word="Dream"]
[lc="dream"]
[lemma="dream"]
[lempos="dream-n"]
[word="The"] [word="dream"]
[word="the"] [lemma="dream"]
[tag="AJ0"] [lempos="dream-n"]
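To make the token model concrete, here is a minimal Python sketch of how such queries select positions in a corpus. The mini-corpus, the dictionary layout and the helper name `find` are assumptions made for illustration only; they are not how the corpus manager is implemented, and values are compared by plain equality here.

```python
# Hypothetical mini-corpus: one dict per token, one key per attribute.
corpus = [
    {"word": "The", "lc": "the", "lemma": "the"},
    {"word": "dream", "lc": "dream", "lemma": "dream"},
    {"word": "of", "lc": "of", "lemma": "of"},
    {"word": "the", "lc": "the", "lemma": "the"},
    {"word": "dreams", "lc": "dreams", "lemma": "dream"},
]


def find(corpus, pattern):
    """Return start positions where consecutive tokens equal the
    (attribute, value) pairs of `pattern`, one pair per token."""
    hits = []
    for start in range(len(corpus) - len(pattern) + 1):
        window = corpus[start:start + len(pattern)]
        if all(tok.get(attr) == val for tok, (attr, val) in zip(window, pattern)):
            hits.append(start)
    return hits


word_hits = find(corpus, [("word", "dream")])                    # [word="dream"]
lemma_hits = find(corpus, [("lemma", "dream")])                  # [lemma="dream"]
seq_hits = find(corpus, [("word", "the"), ("lemma", "dream")])   # [word="the"] [lemma="dream"]
print(word_hits, lemma_hits, seq_hits)
```

The word attribute is case-sensitive in this sketch, so "The" at position 0 does not match word="the", while the lemma query finds both "dream" and "dreams".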
Pavel Rychlý IB047
Regular Expression in Attributes
The value in [attribute="value"] is a regular expression
[word="dream.*"]
[word="[dD]ream"]
[word="[0-9]*"] [lc="dreams"]
[tag="NN."] [lempos="dream-v"]
[word="[0-9]{5,}"] [word="\."]
[word="\("] [word="0[0-9]{3}"] [word="\)"]
[word="\)"] [word="\."]
[word="[A-Z][0-9A-Z]{2,3}"] [word="[0-9][0-9A-Z]{2}"]
Regular Expressions
The PCRE library is used to evaluate regular expressions.
Several useful special sequences:
■ \d - any decimal digit
■ \D - any character that is not a decimal digit
■ \w - any "word" character
■ \W - any "non-word" character
■ (?i) - ignore case
[word="\d\d\W"]
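The behaviour of regular expressions in attribute values can be tried out with a short Python sketch. It assumes that a pattern has to cover the whole attribute value, which is why `[word="dream"]` does not return "dreams", and it uses Python's `re` module as a stand-in for PCRE; the two engines agree on the constructs shown here, but this is not the engine the corpus manager uses. The helper name `value_matches` is hypothetical.

```python
import re


def value_matches(pattern, value):
    """True if `pattern` matches the whole attribute value (not just a part)."""
    return re.fullmatch(pattern, value) is not None


# (pattern, attribute value) pairs modelled on the queries above.
examples = [
    (r"dream.*", "dreams"),    # prefix plus anything
    (r"dream", "dreams"),      # no match: the pattern must cover the whole value
    (r"[dD]ream", "Dream"),    # character class
    (r"(?i)dream", "DREAM"),   # case-insensitive flag
    (r"\d\d\W", "42."),        # two digits followed by a non-word character
    (r"[0-9]{5,}", "1234"),    # no match: fewer than five digits
]

results = [value_matches(p, v) for p, v in examples]
for (p, v), ok in zip(examples, results):
    print(f"{p!r} vs {v!r}: {ok}")
```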
Logical combinations of attributes
Boolean combinations (AND, OR and NOT) of [attribute="value"] expressions. Use: &, |, !=, ()
[word="dream" & tag="NN1"]
[lemma="dream" & tag="VV."]
[word="dream" | word="Dream"]
[word="the" | tag="DPS"] [lempos="dream-n" & tag="NN2"]
[word="the" | (tag="DPS" & lemma!="my")] [lemma="dream"]
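One way to think about Boolean combinations is as predicates over a single token that are composed with and/or/not. The following Python sketch illustrates this reading; the helpers `attr`, `AND` and `OR` and the sample tokens are hypothetical, and Python's `re` stands in for the real regular expression engine.

```python
import re


def attr(name, pattern, negate=False):
    """Test for name="pattern"; with negate=True it models name!="pattern"."""
    def test(token):
        hit = re.fullmatch(pattern, token.get(name, "")) is not None
        return hit != negate
    return test


def AND(*tests):
    return lambda token: all(t(token) for t in tests)


def OR(*tests):
    return lambda token: any(t(token) for t in tests)


# word="dream" | word="Dream"
dream_any_case = OR(attr("word", "dream"), attr("word", "Dream"))
# tag="DPS" & lemma!="my"
possessive_not_my = AND(attr("tag", "DPS"), attr("lemma", "my", negate=True))

your = {"word": "your", "lemma": "your", "tag": "DPS"}
my = {"word": "my", "lemma": "my", "tag": "DPS"}
print(possessive_not_my(your), possessive_not_my(my))
```

All conditions inside one pair of square brackets apply to the same token, which is why the composed predicate takes a single token as its argument.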
Regular expressions of tokens
Regular expressions on token level:
■ ? - optional token
■ * - any number of repetitions
■ + - at least one repetition
■ {N} - exact number of repetitions
■ {M,N} - from M to N repetitions
■ [] - any token
[tag="DPS"] [] [lemma="dream"]
[tag="DPS"] [tag="AJ0"]? [lemma="dream"]
[tag="AJ0"]{2} [lemma="dream"]
[word="the"] []{0,3} [lempos="dream-n"]
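The repetition operators can be illustrated with a small backtracking matcher in Python. This is only a sketch of the idea under simplifying assumptions: each pattern element is a hypothetical triple (test, minimum, maximum), the function enumerates every possible end position, and nothing is claimed about which of several overlapping matches a real corpus manager reports.

```python
def match_at(tokens, pos, pattern):
    """Return the set of end positions (exclusive) of all matches of
    `pattern` that start at `pos`. A maximum of None means unbounded."""
    if not pattern:
        return {pos}
    (test, lo, hi), rest = pattern[0], pattern[1:]
    ends = set()
    count = 0
    p = pos
    while True:
        if count >= lo:
            # Enough repetitions consumed: try to match the rest from here.
            ends |= match_at(tokens, p, rest)
        if count == hi or p >= len(tokens) or not test(tokens[p]):
            break
        p += 1
        count += 1
    return ends


def any_token(token):
    return True


# [word="the"] []{0,3} [lemma="dream"]
pattern = [
    (lambda t: t["word"] == "the", 1, 1),
    (any_token, 0, 3),
    (lambda t: t["lemma"] == "dream", 1, 1),
]

pairs = [("the", "the"), ("old", "old"), ("strange", "strange"),
         ("dreams", "dream"), ("fade", "fade")]
tokens = [{"word": w, "lemma": l} for w, l in pairs]

hits = [(s, e) for s in range(len(tokens))
        for e in sorted(match_at(tokens, s, pattern))]
print(hits)
```

In this encoding `?` is (test, 0, 1), `*` is (test, 0, None), `+` is (test, 1, None) and `{N}` is (test, N, N).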
within keyword at the end of a query
■ within restricts result to one sentence
■ within restricts result to a subcorpus
[lemma="dream"] within
[word="the"] []{3,5} [lemma="dream"]
[word="the"] []{3,5} [lemma="dream"] within
More within combinations: Boolean combinations of regular expressions
[lemma="dream"] within
[lemma="dream"] within
[word="the"] []{3,5} [lemma="dream"]
within within
[word="the"] []{3,5} [lemma="dream"] within
Within
within can be inverted with !
[word="THE"] within
[word="THE"] within !
Structure boundaries
Structure boundaries: start/end of a structure, whole structure
[lemma="dream"]
[word="\?"]
within
Global conditions
Global condition
■ numeric labels of tokens
■ testing agreement or disagreement of attribute values
[tag="NN."] [word="and"] [tag="NN."]
1:[tag!="NN."] [word="and"] 2:[tag!="NN."] & 1.tag = 2.tag
1:[] [word="and"] 2:[] & 1.k=2.k & 1.c=2.c
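A global condition compares attribute values of two labelled tokens after the token pattern itself has matched. The Python sketch below illustrates the agreement test for a pattern of the shape 1:[] [word="and"] 2:[]. The function name `agree_around_and` is hypothetical, the attribute names k and c are taken from the query above, and the attribute values in the sample data are invented.

```python
def agree_around_and(tokens, attrs):
    """Start positions of 'X and Y' where X (label 1) and Y (label 2)
    have equal values for every attribute named in `attrs`."""
    hits = []
    for i in range(len(tokens) - 2):
        first, middle, second = tokens[i:i + 3]
        if middle["word"] != "and":
            continue
        # Global condition: 1.attr = 2.attr for each listed attribute.
        if all(first.get(a) == second.get(a) for a in attrs):
            hits.append(i)
    return hits


# Invented attribute values, for illustration only.
tokens = [
    {"word": "cats", "k": "1", "c": "1"},
    {"word": "and", "k": "8", "c": ""},
    {"word": "dogs", "k": "1", "c": "1"},
    {"word": "and", "k": "8", "c": ""},
    {"word": "quickly", "k": "6", "c": ""},
]

hits = agree_around_and(tokens, ["k", "c"])
print(hits)
```

Testing disagreement instead only requires replacing the equality check with an inequality for the attribute in question.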
Parallel corpora
Parallel corpora - separate corpus for each language, 1-to-1 alignment using tag.
A query can limit the search to segments whose aligned parts contain hits of a subquery.
[lemma="hrad"] within kacen: [word="castle"]
[lemma="hrad"] within ! kacen: [word="castle"]
Meet/Union queries
■ combining and nesting simple (one-token) queries
■ not a sequence of tokens
■ meet and union operators
Union
Union operator:
■ union Q1 Q2
(union [word="dream"] [word="dreams"])
[word="dream" | word="dreams"]
Meet
Meet operator:
■ meet Q1 Q2 W-BEG W-END
■ find Q1 with Q2 in window from W-BEG to W-END
■ W-BEG, W-END defaults to 1
(meet [word="my"] [word="dream"])
[word="my"] [word="dream"]
(meet [word="my"] [word="dream"] 1 3)
[word="my"] []{0,2} [word="dream"]
(meet [word="black"] [word="white"] -3 3)
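The window semantics of meet can be sketched in a few lines of Python. The sketch assumes that the result consists of the positions of tokens matching Q1 and that the window is a range of offsets relative to that token, with both bounds defaulting to 1 as stated above. The function name `meet`, the helper `word` and the sample sentence are hypothetical.

```python
def meet(tokens, q1, q2, w_beg=1, w_end=1):
    """Positions of tokens matching q1 that have a token matching q2
    at some offset between w_beg and w_end (inclusive)."""
    hits = []
    for i, token in enumerate(tokens):
        if not q1(token):
            continue
        for offset in range(w_beg, w_end + 1):
            j = i + offset
            if 0 <= j < len(tokens) and q2(tokens[j]):
                hits.append(i)
                break
    return hits


def word(value):
    """Build a one-token test for word="value" (plain equality)."""
    return lambda token: token == value


tokens = "my old dream in black and white".split()

adjacent = meet(tokens, word("my"), word("dream"))         # default window 1 1
wider = meet(tokens, word("my"), word("dream"), 1, 3)      # window 1 3
both_sides = meet(tokens, word("black"), word("white"), -3, 3)
print(adjacent, wider, both_sides)
```

A negative W-BEG lets the second query match to the left of the first one, which a plain token sequence cannot express in a single pattern.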
Meet/union combination
use a meet/union operator in place of a simple query
(meet [word="and"] (meet [word="black"] [word="white"] -3 3) -2 2)
Within keyword
within works with any subquery, not only a structure
[lemma="dream"] within ([word="my"] [lemma="dream"])
(meet [lemma="dream"] [word="my"] -1 -1)
[word="the"] []{0,3} [lemma="dream"]
within ([tag="AT."] [tag="AJ."]{0,4} [tag="NN."])
containing keyword
■ inverts within keyword
■ matches results of the first subquery which contains matches of the second subquery
containing [lemma="dream"]
(meet [lemma="dream"] [word="my"] -1 -1)
[word="the"] []{1,3} [lemma="dream"]
containing [lemma="wild"]
Combinations of containing/within
Both keywords form a query that can be used as a subquery, so they can be nested.
[lemma="break"] within ( containing [lemma="rule"])
[lemma="student"] within
( containing [lemma="break"] containing [lemma="rule"])
[lemma="break"] within ([]{5} containing [lemma="rule"])
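The nesting of within and containing can be pictured as filtering lists of spans. The Python sketch below treats every query result as a list of half-open (start, end) position ranges; the function names, the span representation and the sample positions are all assumptions made for illustration, not a description of the real evaluation.

```python
def within(hits, spans):
    """Keep the hits that lie completely inside at least one span."""
    return [h for h in hits
            if any(s <= h[0] and h[1] <= e for s, e in spans)]


def containing(hits, spans):
    """Keep the hits that completely enclose at least one span."""
    return [h for h in hits
            if any(h[0] <= s and e <= h[1] for s, e in spans)]


# Invented positions: two sentences and the hits of two one-token queries.
sentences = [(0, 6), (6, 12)]
break_hits = [(1, 2), (7, 8)]   # [lemma="break"]
rule_hits = [(4, 5)]            # [lemma="rule"]

# Inner subquery: sentences containing a hit of [lemma="rule"].
sentences_with_rule = containing(sentences, rule_hits)
# Outer query: hits of [lemma="break"] within those sentences.
result = within(break_hits, sentences_with_rule)
print(sentences_with_rule, result)
```

Because both functions take and return lists of spans, the output of one can be passed directly as an argument to the other, which mirrors how the two keywords nest in a query.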