Simple queries - regular expressions Using structures - within part
Pavel Rychlý
pary@fi.muni.cz
24 March 2014
Pavel Rychlý IB047
Corpus Query Language
Test it from http://ske.fi.muni.cz/
Use CQL query type
■ Query - pattern matching a set of single tokens or token sequences
■ Each token consists of attributes (depending on corpus configuration):
word, lemma, tag, lempos, lc
■ Use [attribute="value"] for each token sub-pattern.
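The token model above can be pictured with a small Python sketch. It is only an illustration: the token values, the `matches` helper and the use of Python's `re` module are assumptions, not the way the corpus manager evaluates queries.

```python
import re

# Hypothetical tokens; the attribute values are made up for illustration.
tokens = [
    {"word": "The", "lc": "the", "lemma": "the", "tag": "AT0"},
    {"word": "dreams", "lc": "dreams", "lemma": "dream", "tag": "NN2"},
]

def matches(token, attribute, value):
    """Return True if the whole attribute value matches the pattern."""
    return re.fullmatch(value, token[attribute]) is not None

# [lemma="dream"] selects the second token; [word="dream"] selects nothing,
# because the word form of that token is "dreams".
lemma_hits = [i for i, t in enumerate(tokens) if matches(t, "lemma", "dream")]
word_hits = [i for i, t in enumerate(tokens) if matches(t, "word", "dream")]
```

Each query token is one such test on one corpus position; a query with several tokens applies the tests to consecutive positions.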
Very simple queries
[word="dream"]
[word="Dream"]
[lc="dream"]
[lemma="dream"]
[lempos="dream-n"]
[word="The"] [word="dream"]
[word="the"] [lemma="dream"]
[tag="AJ0"] [lempos="dream-n"]
Regular Expression in Attributes
The value in an [attribute="value"] expression is a regular expression.
[word="dream.*"]
[word="[dD]ream"]
[word="[0-9]*"] [lc="dreams"]
[tag="NN."] [lempos="dream-v"]
[word="[0-9]{5,}"] [word="\."]
[word="\("] [word="0[0-9]{3}"] [word="\)"]
[word=")"] [word="."]
[word="[A-Z][0-9A-Z]{2,3}"] [word="[0-9][0-9A-Z]{2}"]
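A rough way to try such values outside the corpus is Python's `re.fullmatch`. The sketch assumes that the pattern has to cover the whole attribute value; the word list and the `attr_match` helper are made up for illustration.

```python
import re

def attr_match(pattern, value):
    # Assumption: the pattern must match the entire value, not a substring.
    return re.fullmatch(pattern, value) is not None

words = ["dream", "Dream", "dreams", "daydream", "12345", "(", "."]

prefix = [w for w in words if attr_match(r"dream.*", w)]         # [word="dream.*"]
either_case = [w for w in words if attr_match(r"[dD]ream", w)]   # [word="[dD]ream"]
long_number = [w for w in words if attr_match(r"[0-9]{5,}", w)]  # [word="[0-9]{5,}"]
literal_dot = [w for w in words if attr_match(r"\.", w)]         # [word="\."]
any_char = [w for w in words if attr_match(r".", w)]             # [word="."]
```

The last two lines show why the dot needs a backslash: unescaped, it stands for any single character.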
Regular Expressions
PCRE library used for evaluation of REs
Several useful special sequences:
■ \d - any decimal digit
■ \D - any character that is not a decimal digit
■ \w - any "word" character
■ \W - any "non-word" character
■ (?i) - ignore case
[word="\d\d\W"]
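The same sequences exist in Python's `re` module, so they can be tried there. This is only an approximation: Python does not use PCRE, and the test strings below are made up, but for plain ASCII input like this the sequences behave alike.

```python
import re

# [word="\d\d\W"]: two digits followed by one non-word character.
two_digits_symbol = re.fullmatch(r"\d\d\W", "42%") is not None
# A third digit is a word character, so \W rejects it.
three_digits = re.fullmatch(r"\d\d\W", "421") is not None
# \D is the complement of \d.
no_digits = re.fullmatch(r"\D+", "dream") is not None
# (?i) at the start of the pattern switches off case sensitivity.
ignore_case = re.fullmatch(r"(?i)dream", "DREAM") is not None
```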
Logical combinations of attributes
Boolean combinations (AND, OR and NOT) of [attribute="value"] expressions.
Use: &, |, !=, ()
[word="dream" & tag="NN1"]
[lemma="dream" & tag="VV."]
[word="dream" | word="Dream"]
[word="the" | tag="DPS"] [lempos="dream-n" & tag="NN2"]
[word="the" | (tag="DPS" & lemma!="my")] [lemma="dream"]
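Inside one pair of square brackets, the operators act like ordinary Boolean logic on a single token. A minimal Python sketch, with hypothetical tokens and a hypothetical `attr` helper:

```python
import re

def attr(token, name, pattern):
    """True if the whole attribute value matches the pattern."""
    return re.fullmatch(pattern, token[name]) is not None

# Made-up tokens with made-up annotations.
tokens = [
    {"word": "my", "lemma": "my", "tag": "DPS"},
    {"word": "dream", "lemma": "dream", "tag": "NN1"},
    {"word": "her", "lemma": "her", "tag": "DPS"},
    {"word": "dream", "lemma": "dream", "tag": "VVB"},
]

# [word="dream" & tag="NN1"]: both conditions must hold for the same token.
noun_dream = [i for i, t in enumerate(tokens)
              if attr(t, "word", "dream") and attr(t, "tag", "NN1")]

# [word="the" | (tag="DPS" & lemma!="my")]: != negates one condition.
the_or_other_possessive = [
    i for i, t in enumerate(tokens)
    if attr(t, "word", "the")
    or (attr(t, "tag", "DPS") and not attr(t, "lemma", "my"))
]
```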
Regular expressions of tokens
Regular expressions on token level:
■ ? optional token
■ * any number of repetitions
■ + at least one repetition
■ {N} exactly N repetitions
■ {M,N} from M to N repetitions
■ [] any token
[tag="DPS"] [] [lemma="dream"]
[tag="DPS"] [tag="AJ0"]? [lemma="dream"]
[tag="AJ0"]{2} [lemma="dream"]
[word="the"] []{0,3} [lempos="dream-n"]
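Token-level repetition can be emulated with a small backtracking matcher. The sketch below is an illustration only: `cond`, `ANY`, `match_at` and the sample sentence are hypothetical, and a real corpus manager evaluates such queries over indexes rather than by scanning.

```python
import re

def cond(attribute, pattern):
    """Build a predicate for one [attribute="pattern"] token condition."""
    return lambda tok: re.fullmatch(pattern, tok[attribute]) is not None

ANY = lambda tok: True  # the empty condition []

def match_at(tokens, start, pattern):
    """Return the end positions of all matches that begin at `start`.

    `pattern` is a list of (predicate, min_repeat, max_repeat) triples.
    """
    ends = set()

    def walk(pos, idx):
        if idx == len(pattern):
            ends.add(pos)
            return
        pred, lo, hi = pattern[idx]
        count = 0
        while True:
            if count >= lo:
                walk(pos + count, idx + 1)  # enough repetitions, try the rest
            if (count == hi or pos + count >= len(tokens)
                    or not pred(tokens[pos + count])):
                break
            count += 1

    walk(start, 0)
    return sorted(ends)

tokens = [{"word": w, "lemma": w} for w in "I saw the very strange dream".split()]

# [word="the"] []{0,3} [lemma="dream"]
query = [(cond("word", "the"), 1, 1), (ANY, 0, 3), (cond("lemma", "dream"), 1, 1)]
hits = [(s, e) for s in range(len(tokens)) for e in match_at(tokens, s, query)]
```

`hits` holds (start, end) pairs with the end position exclusive; `?` corresponds to the triple (predicate, 0, 1) and `{2}` to (predicate, 2, 2).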
Within
within keyword at the end of a query
■ within <s/> restricts the result to one sentence
■ within restricts result to a subcorpus
[lemma="dream"] within
[word="the"] []{3,5} [lemma="dream"]
[word="the"] []{3,5} [lemma="dream"] within
Within
More within combinations: Boolean combinations of regular expressions
[lemma="dream"] within
[lemma="dream"] within
[word="the"] []{3,5} [lemma="dream"]
within within
[word="the"] []{3,5} [lemma="dream"] within
Within
within can be inverted with !
[word="THE"] within
[word="THE"] within !
Structure boundaries
Structure boundaries: start/end of a structure, whole structure
[lemma="dream"] [word=="?"]
within tag.
A query can limit the search to segments whose aligned parts contain hits of a subquery.
[lemma="hrad"] within kacen: [word="castle"]
[lemma="hrad"] within ! kacen: [word="castle"]
Meet/Union queries
■ combining and nesting simpl
■ not a sequence of tokens
■ meet and union operators
Union
Union operator:
■ union Q1 Q2
(union [word="dream"] [word="dreams"])
[word="dream" | word="dreams"]
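For single-token queries, union can be pictured as the set union of the hit positions, which is why the two queries above return the same result. A minimal sketch with a made-up sentence and a hypothetical `hits` helper:

```python
import re

words = "I dream of dreams and dream again".split()

def hits(pattern):
    """Positions whose word form matches the pattern as a whole."""
    return {i for i, w in enumerate(words) if re.fullmatch(pattern, w)}

# (union [word="dream"] [word="dreams"])
union_hits = sorted(hits("dream") | hits("dreams"))

# [word="dream" | word="dreams"]
or_hits = sorted(
    i for i, w in enumerate(words)
    if re.fullmatch("dream", w) or re.fullmatch("dreams", w)
)
```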
Meet
Meet operator:
■ meet Q1 Q2 W-BEG W-END
■ find Q1 with Q2 in window from W-BEG to W-END
■ W-BEG and W-END default to 1
(meet [word="my"] [word="dream"])
[word="my"] [word="dream"]
(meet [word="my"] [word="dream"] 1 3)
[word="my"] []{0,2} [word="dream"]
(meet [word="black"] [word="white"] -3 3)
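The window semantics can be sketched in Python. The code assumes that the result is reported at the Q1 token and that the window is inclusive on both sides; `positions`, `meet` and the sample sentences are hypothetical.

```python
import re

def positions(words, pattern):
    """Positions whose word form matches the pattern as a whole."""
    return [i for i, w in enumerate(words) if re.fullmatch(pattern, w)]

def meet(q1, q2, w_beg=1, w_end=1):
    """Keep Q1 positions that have a Q2 hit at an offset in [w_beg, w_end]."""
    q2 = set(q2)
    return [p for p in q1
            if any(p + off in q2 for off in range(w_beg, w_end + 1))]

words = "my dream was my only real dream".split()
my, dream = positions(words, "my"), positions(words, "dream")

adjacent = meet(my, dream)      # (meet [word="my"] [word="dream"])
nearby = meet(my, dream, 1, 3)  # (meet [word="my"] [word="dream"] 1 3)

colours = "black and white or white and black".split()
# (meet [word="black"] [word="white"] -3 3): a negative W-BEG looks to the left.
both_sides = meet(positions(colours, "black"), positions(colours, "white"), -3, 3)
```

The second "my" has "dream" three tokens to its right, so it is found only with the wider window.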
Meet/union combination
use a meet/union operator in place of a simple query
(meet [word="and"] (meet [word="black"] [word="white"] -3 3) -2 2)
Within keyword
within works with any subquery, not only a structure
[lemma="dream"] within ([word="my"] [lemma="dream"])
(meet [lemma="dream"] [word="my"] -1 -1)
[word="the"] []{0,3} [lemma="dream"] within ([tag="AT."] [tag="AJ."]{0,4} [tag="NN."])
containing keyword
containing keyword
■ inverts within keyword
■ matches results of the first subquery which contains matches of the second subquery
containing [lemma="dream"]
(meet [lemma="dream"] [word="my"] -1 -1)
[word="the"] []{1,3} [lemma="dream"] containing [lemma="wild"]
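The relation between within and containing can be sketched on token spans. The sketch assumes that every query result is a (start, end) span with an exclusive end; the two functions and the sample spans are hypothetical.

```python
def within(inner, outer):
    """Keep inner spans that lie inside at least one outer span."""
    return [(s, e) for (s, e) in inner
            if any(o_start <= s and e <= o_end for (o_start, o_end) in outer)]

def containing(outer, inner):
    """Keep outer spans that contain at least one inner span."""
    return [(o_start, o_end) for (o_start, o_end) in outer
            if any(o_start <= s and e <= o_end for (s, e) in inner)]

# Made-up results: two sentences and one hit of [lemma="dream"].
sentences = [(0, 5), (5, 9)]
dream_hits = [(6, 7)]

hit_in_sentence = within(dream_hits, sentences)          # returns the hit itself
sentence_with_hit = containing(sentences, dream_hits)    # returns the sentence
```

Both operators use the same containment test; they differ only in which side of the test is returned.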
Combinations of containing/within
Both keywords form a query that can be used as a subquery, and they can be nested.
[lemma="break"] within ( containing [lemma="rul
[lemma="student"] within
( containing [lemma="break"] containing [lemma="rule"])
[lemma="break"] within ([]{5} containing [lemma="ru