CQP Language Guide for Researchers
May 2019
Contents
1 Introduction
1.1 The IMS Open Corpus Workbench (CWB)
6.3 Subqueries
8 Undocumented CQP
8.1 Zero-width assertions
8.4 MU queries
A Appendix
A.1 Summary of regular expression syntax
1 Introduction
1.1 The IMS Open Corpus Workbench (CWB)
• Tool development
1999–2003: Implementation of YAC chunk parser for German (PhD Kermes)
site no longer online
Technical aspects
assumes Latin-1 encoding, but compatible with other 8-bit ASCII extensions
(Unicode text in UTF-8 encoding can be processed with some caveats)
Corpus data format is platform-independent and compatible with all releases since 2001
• global registry holds information about corpora (name, attributes, data path)
The following steps illustrate the transformation of textual data with some XML markup into the
CWB data format.
2. Text with XML markup (at the level of texts, words or characters)
<text id=42 lang="English"> <s>An easy example.</s><s> Another <i>very</i> easy
example.</s> <s><b>O</b>nly the <b>ea</b>siest ex<b>a</b>mples!</s> </text>
Each (token-level) annotation layer corresponds to a column in the table, called a positional
attribute or p-attribute (note that the original word forms are also treated as an attribute with
the special name word). Annotations are always interpreted as character strings, which are
collected in a separate lexicon for each positional attribute. The CWB data format uses lexicon
IDs for compact storage and fast access.
Matching pairs of XML start and end tags are encoded as token regions, identified by the corpus
positions of the first token (immediately following the start tag) and the last token (immediately
preceding the end tag) of the region. (Note how the corpus position of an XML tag in Figure 1
is identical to that of the following or preceding token, respectively.) Elements of the same name
(e.g. <s>...</s> or <text>...</text>) are collected and referred to as a structural attribute
or s-attribute. The corresponding regions must be non-overlapping and non-recursive. Different
s-attributes are completely independent in the CWB: a hierarchical nesting of the XML elements
is neither required nor can it be guaranteed.
Key-value pairs in XML start tags can be stored as an annotation of the corresponding s-attribute
region. All key-value pairs are treated as a single character string, which has to be parsed by a
CQP query that needs access to individual values. In the recommended encoding procedure, an
additional s-attribute (named element_key) is automatically created for each key and is directly
annotated with the corresponding value (cf. <text_id> and <text_lang> in Figure 1).
is not allowed in a CWB corpus (the embedded <np> region will automatically be dropped).[2] In
the recommended encoding procedure, embedded regions (up to a pre-defined level of embedding)
are automatically renamed by adding digits to the element name:

[2] Recall that only the nesting of a <np> region within a larger <np> region constitutes recursion in the CWB data model.
The nesting of <pp> within <np> (and vice versa) is unproblematic, since these regions are encoded in two independent
s-attributes (named pp and np).
Pre-encoded versions of these corpora are distributed free of charge together with the IMS Corpus
Workbench. Perl scripts for encoding the British National Corpus (World Edition) can be provided
on request.
See Appendix A.3 for a detailed description of the token-level annotations and structural markup of
the tutorial corpora (positional and structural attributes).
$ cqp -e
in a shell window (the $ indicates a shell prompt)
• when command-line editing is activated, CQP will automatically add a semicolon at the end
of each input line if necessary; explicit ; characters are only necessary to separate multiple
commands on a single line in this mode
• change the registry directory (where CQP will look for available corpora)
• activate corpus for subsequent queries (use TAB key for name completion)
• search single word form (single or double quotes are required: '...' or "...")
> "interesting";
→ shows all occurrences of interesting
> "interest(s|(ed|ing)(ly)?)?";
→ interest, interests, interested, interesting, interestedly, interestingly
[3] The -e mode is not enabled by default for reasons of backward compatibility. When command-line editing is active,
multi-line commands are not allowed, even when the input is read from a pipe.
• the regular expression flavour used by CQP is Perl-Compatible Regular Expressions, usually
known as PCRE; lots of documentation and examples can be found on the WWW
• note that special characters have to be escaped with backslash (\)
NB: this feature is deprecated; it works only for the Latin-1 encoding and cannot be deactivated
version 3.0.3 introduces two-digit hex escapes for inserting arbitrary byte values:
• CWB 3.5: full support for PCRE regular expressions, including two- and four-digit hex escapes
if the string does not contain both single and double quotes, simply pick an appropriate
quote character: "'em" vs. '12"-screen'
otherwise, double every occurrence of the quote character inside the string; our two examples
could also be matched with the queries '''em' and "12""-screen"
• if query results do not fit on screen, they will be displayed one page at a time
• press SPC (space bar) to see next page, RET (return) for next line, and q to return to CQP
• some pagers support b or the backspace key to go to the previous page, as well as the use of the
cursor keys, PgUp, and PgDn
• at the command prompt, use cursor keys to edit input (← and →, Del, backspace key) and repeat
previous commands (↑ and ↓)
• show/hide annotations
• create .cqprc file in your home directory with your favourite settings
(contains arbitrary CQP commands that will be read and executed during startup)
• see Appendix A.2 for a list of useful part-of-speech tags and regular expressions
• or find out with the /codist[] macro (more on macros in Sections 6.4 and 6.5):
• operators: & (and), | (or), ! (not), -> (implication, cf. Section 4.1)
• repetition operators:
? (0 or 1), * (0 or more), + (1 or more), {n} (exactly n), {n,m} (n . . . m)
• grouping with parentheses: (...)
• disjunction operator: | (separates alternatives)
DICKENS>
[pos = "IN"] after
[pos = "DT"]? a
(
[pos = "RB"]? pretty
[pos = "JJ.*"] long
) *
[pos = "N.*"]+ ; pause
GERMAN-LAW>
(
[pos = "APPR"] [pos = "ART"] nach dem
|
[pos = "APPRART"] zum
)
(
[pos = "ADJD|ADV"] ? wirklich
[pos = "ADJA"] ersten
)*
[pos ="NN"]; Mal
> sort;
without an attribute name to restore the natural ordering by corpus position
• query results can also be sorted in random order (to avoid looking only at matches from the first
part of a corpus when paging through query results):
• select descending order with desc(ending), or sort matches by suffix with reverse;
note the ordering when the two options are combined:
> "interesting";
> sort by word %cd on matchend[1] .. matchend[42]; (right context)
> sort by word %cd on match[-1] .. match[-42]; (left context, by words)
> sort by word %cd on match[-42] .. match[-1] reverse; (same by characters)
• see Sections 3.2 and 3.3 for an explanation of the syntax used in these examples and more
information about the sort and count commands
• store query result in memory under specified name (should begin with capital letter)
• result of last query is implicitly named Last; commands such as cat, sort, and count operate on
Last by default; note that Last is always temporary and will be overwritten when a new query
is executed (or a subset command, cf. Section 3.5)
• Due to a long-standing bug in CQP, this feature should not be used with cat or any other com-
mand that generates KWIC output (such as sort). Doing so will corrupt the context descriptor,
which holds information about all available attributes, those selected for printing, and the KWIC
context size.
> GERMAN-LAW;
> show cd;
> cat DICKENS:Time;
> show cd;
• The context descriptor can only be repaired by temporarily activating a different corpus and then
re-activating the desired corpus.
> DICKENS;
> GERMAN-LAW;
• write KWIC output to text file (use TAB key for filename completion)
• if the filename ends in .gz or .bz2, the file will automatically be compressed (provided that the
respective command-line utilities gzip and bzip2 are available)
• append to an existing file with >>; this also works for compressed files
• you can also write to a pipe (this example saves only matches that occur in questions, i.e. sentences
ending in ? )
• the result of a (complex) query is a list of token sequences of variable length (⇒ matches)
• only a single token can be marked as target; if multiple @ markers are used (or if the marker
is in the scope of a repetition operator such as +), only the rightmost matching token[4] will be
marked
• anchor points allow a flexible specification of sort keys with the general form
both start point and end point are specified as an anchor, plus an optional offset in square
brackets; for instance, match[-1] refers to the token before the start of the match, matchend to
the last token of the match, matchend[1] to the first token after the match, and target[-2] to
a position two tokens left from the target anchor
NB: the target anchor should only be used in the sort key when it is always defined
1019887 1019888 -1 -1
1924977 1924979 1924978 -1
1986623 1986624 -1 -1
2086708 2086710 2086709 -1
2087618 2087619 -1 -1
2122565 2122566 -1 -1
note that a previous sort or count command affects the ordering of the rows (so that the n-th
row corresponds to the n-th line in a KWIC display obtained with cat)
• the output of a dump command can be written (>) or appended (>>) to a file; if the first character
of the filename is |, the output is sent to the pipe consisting of the following command(s); use
the following trick to display the distribution of match lengths in the query result A:
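One possible incantation (a sketch, assuming a Unix shell with awk available; the field computation
$2 - $1 + 1 gives the number of tokens per match):

> dump A > "| awk '{print $2 - $1 + 1}' | sort -n | uniq -c";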
• see Section 7.2 for the inverse of the dump command, which may be useful for certain tasks such
as locating a specific corpus position
• you can write the output of the group command to a text file (or pipe)
• named queries can be copied, especially before destructive modication (see below)
> B = A;
> C = Last;
• compute subset of named query result by constraint on one of the anchor points
> A = "time";
> size A;
it is often desirable to look at a random selection to get a quick overview (rather than just seeing
matches from the first part of the corpus); one possibility is to do a sort randomize and then
go through the first few pages of random matches:
• as an alternative to randomized ordering, the reduce command randomly selects a given number
or proportion of matches, deleting all other matches from the named query; since this operation
is destructive, it may be necessary to make a copy of the original query results first (see above)
• set random number generator seed before reduce for reproducible selection
• a second method for obtaining a random subset of a named query result is to sort the matches
in random order and then take the first n matches from the sorted query; the example below has
the same effect as reduce A to 100; (though it will not select exactly the same matches)
reproducible subsets can be obtained with a suitable randomize command before the sort; the
main difference from the reduce command is that cut cannot be used to select a percentage of
matches (i.e., you have to determine the number of matches in the desired subset yourself)
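The second method can be sketched as follows (B is a scratch copy, so that the original result A
remains intact; the names are illustrative):

> B = A;
> sort B randomize;
> cut B 100;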
• the most important advantage of the second method is that it can produce stable and incremental
random samples
• for a stable random ordering, specify a positive seed value directly in the sort command:
• additional keyword anchor can be set after query execution by searching for a token that matches
a given search pattern (see Figure 3)
• search starts from the given anchor point (excluding the anchored token itself), or from the left
and right boundaries of the match if match is specified
• with inclusive, search includes the anchored token, or the entire match, respectively
• the match and matchend anchors can also be set, modifying the actual matches[6]

[6] The keyword and target anchors are set to undefined (-1) when no match is found for the search pattern, while the
match and matchend anchors retain their previous values. In this way, a set match or set matchend command may only
modify some of the matches in a named query result.
• label references are usually evaluated within the global constraint introduced by ::
> adj:[pos = "ADJ."] :: adj < 500;
→ adjectives among the first 500 tokens
• labels are not part of the query result and must be used within the query expression (otherwise,
CQP will abort with an error message)
• to avoid error messages, test whether a label is defined before accessing attributes
• labels are used to specify additional constraints that are beyond the scope of ordinary regular
expressions
• however, a label cannot be used within the pattern it refers to; use the special this label,
represented by a single underscore (_), instead to refer to the current corpus position
• the standard anchor points (match, matchend, and target) are also available as labels (with the
same names)
• XML tags match start/end of s-attribute region (shown as XML tags in Figure 1)
• pairs of start/end tags enclose single region (if StrictRegions option is enabled)
• the name of a structural attribute (e.g. np) used within a pattern evaluates to true iff the
corresponding token is contained in a region of this attribute (here, a <np> region)
• new in CQP v3.4.13: Built-in functions lbound_of() and rbound_of() return the corpus
positions of the start/end of a region. Because of technical limitations, the anchor position has
to be specified explicitly as a second argument, which will often be the this label:
• The lbound_of() and rbound_of() functions are mainly used in connection with distance()
or distabs(). For example, to find occurrences of the word end within the first 40 tokens of a
chapter:
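A possible formulation of such a query (a sketch, assuming chapter boundaries are encoded in an
s-attribute named chapter):

> [word = "end" & chapter & distabs(_, lbound_of(chapter, _)) < 40];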
• most linguistic queries should include the restriction within s to avoid crossing sentence
boundaries; note that only a single within clause may be specified
• XML markup of NPs and PPs in the DICKENS corpus (cf. Appendix A.3)
<s len=9>
<np h="it" len=1> It </np>
is
<np h="story" len=6> the story
<pp h="of" len=4> of
<np h="man" len=3> an old man </np>
</pp>
</np>
.
</s>
• key-value pairs within XML start tags are accessible in CQP as additional s-attributes with
annotated values (marked [A] in the show cd; listing): s_len, np_h, np_len, pp_h, pp_len (cf.
Section 1.2)
• <np> and <pp> tags are usually shown without XML attribute values;
they can be displayed explicitly as <np_h>, <np_len>, . . . tags:
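For instance, the annotated versions of the tags can be switched on and off like any other attribute
(a sketch):

> show +np_h +np_len;
> show -np_h -np_len;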
• use this label for direct access to s-attribute values within pattern
• regions representing the attributes in XML start tags are renamed as well:
⇒ <np_h1>, <np_h2>, ..., <pp_len1>, <pp_len2>, ...
• observe how results depend on matching strategy (see Section 6.1 for details)
• when the query expression shown above is embedded in a longer query, the matching strategy
usually has no influence
• annotations of a region at an arbitrary embedding level can only be accessed through constraints
on key-value pairs in the start tags:
• CWB can encode information about sentence-level alignment between parallel corpora in its
index. For each pair of source and target corpus, only a single alignment may be defined; the
name of the corresponding alignment attribute (a-attribute) is a lowercase version of the CWB
name of the target corpus.
> EUROPARL-EN;
is aligned to the French (EUROPARL-FR) and German (EUROPARL-DE) components inter alia.
• The available alignment attributes are listed as Aligned Corpora: in the output of show cd;.
• One or more alignments can be displayed in the KWIC output produced by cat. For example,
in order to find out how the idiom take the biscuit can be expressed in French and German, we
activate the corresponding a-attributes:
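A minimal sketch of such a session (assuming the a-attributes are named europarl-fr and
europarl-de, and that the idiom can be located with a simple lemma query):

> EUROPARL-EN;
> show +europarl-fr +europarl-de;
> [lemma = "take"] []{0,2} [lemma = "biscuit"];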
• It is recommended to set the KWIC context for the source language to a full sentence
> "price-tag";
is translated 1:1 into French, but combined with the previous sentence in the German translation
(resulting in a 2:1 bead).
• An unofficial feature allows setting the KWIC context to an a-attribute, ensuring that a complete
alignment bead (for the selected target corpus) is displayed.
• You will nd that some sentences have no translation into the target language, e.g.
• In order to exclude matches outside alignment beads (i.e. without a translation), you can add a
trivial alignment constraint to the query (cf. Sec. 5.2). The example below shows that out of 49
occurrences of cats, only 46 have a translation into French:
• This section explains how alignment information can be used as a filter in CQP queries. As a
first example, let us consider the phrase nuclear power, which can be translated into German as
Kernkraft, Kernenergie, Atomkraft or Atomenergie.
• The alignment constraint is always specified after the main query (including the within clause),
but before a cut statement (which applies to the filtered query results). Multiple alignment
constraints can be chained and must all be satisfied.
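The first example might thus be written as follows (a sketch; the German pattern is an assumption
based on the four translations listed above):

> A = [lemma = "nuclear"] [lemma = "power"] within s :EUROPARL-DE [lemma = "(Kern|Atom)(kraft|energie)"];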
• Alignment constraints can be negated by placing ! immediately after the marker. In this case,
only those matches are kept for which the alignment constraint is not satisfied.
• By chaining negated constraints, we can identify cases where the French translation is also
different from the expected nucléaire:
• A named query result can be translated to an aligned corpus, which allows more flexible display
of the aligned regions, access to metadata, etc. (new in CQP v3.4.9).
> EUROPARL-DE;
> set Context 1 s;
> Zeit = [lemma = "Zeit"];
• The NQR Zeit now contains all occurrences of the German word for time in the German part of
EuroParl. The following command translates the NQR to the English part of EuroParl, i.e. it
replaces each match by the complete aligned region in the target corpus (as would be displayed
with show +europarl-en;).
> Time = from Zeit to EUROPARL-EN;
• This creates a new NQR EUROPARL-EN:Time containing the aligned regions. You can now e.g.
tabulate or count metadata:
while it looks similar to a corpus query or set operation, the assignment to a new NQR is
mandatory (otherwise the parser won't accept the syntax)
note that the new NQR must be specified as a short name; the name of the target corpus is
implied and added automatically with the assignment
matching ranges that are not aligned to the target corpus are silently discarded; you cannot
expect the new NQR to contain the same number of hits as the original NQR
if there are multiple matches in the same alignment bead, they will not be collapsed in the
target corpus; i.e. the new NQR will contain several identical ranges
in order to collate source matches with the aligned regions, make sure to discard unaligned
hits from the original NQR first:
> Zeit;
> ZeitAligned = <match> [] :EUROPARL-EN [] !;
• Do not cat the translated query directly (cat EUROPARL-EN:Time;) without first activating the
target corpus, as this would corrupt the context descriptor (cf. Sec. 3.1). The correct procedure
is
> EUROPARL-EN;
> cat Time;
You can now customize the KWIC display as desired.
• It is safe to apply dump, tabulate, group, count and similar operations. Only commands that
auto-print the NQR (including a bare sort or a set operation) will trigger the bug.
• The problem is mentioned in this section because users are most likely to be tempted to do this
when working with a set of aligned corpora.
> EUROPARL-DE;
> Other = from EUROPARL-EN:Other to EUROPARL-DE;
• We can now run a subquery on the aligned regions in the German part of EuroParl in order to
search for possible translations other than Kern- and Atom-. One possibility is that nuclear power
plant has been translated into the acronym AKW (for Atomkraftwerk).
> Other;
> [lemma = "AKW"];
• Further translation candidates can be found by computing a frequency breakdown of all nouns
in the aligned sentences:
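One way to obtain such a breakdown (a sketch; NN is the STTS tag for common nouns in the
German tagset):

> Other;
> Nouns = [pos = "NN"];
> count Nouns by lemma;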
• in shortest mode, ?, * and + operators match the smallest number of tokens possible (refers to
regular expressions at token level)
⇒ finds shortest sequence matching query,
⇒ optional elements at the start or end of the query will never be included
• in the default standard mode, CQP uses an early match strategy: optional elements at the
start of the query are included, while those at the end are not
• the somewhat inconsistent matching strategy of earlier CQP versions is currently still available
in the traditional mode, and can sometimes be useful (e.g. to extract cooccurrences between
multiple adjectives in a noun phrase and the head noun)
• new in CQP v3.4.12: The matching strategy can be set temporarily with an embedded modifier
at the start of a CQP query, e.g.
search pattern:
DET? ADJ* NN (PREP DET? ADJ* NN)*
input:
the old book on the table in the room
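With the embedded modifier, the strategies can be compared directly on a simplified version of this
pattern (a sketch):

> (?longest) [pos = "DT"]? [pos = "JJ.*"]* [pos = "N.*"];
> (?shortest) [pos = "DT"]? [pos = "JJ.*"]* [pos = "N.*"];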
6.3 Subqueries
• activate named query instead of system corpus (here: sentences containing interest)
• the matches of the named query First now define a virtual structural attribute on the corpus
DICKENS with the special name match
• all following queries are evaluated with an implicit within match clause
(an additional explicit within clause may be specied as well)
DICKENS:First[624]> DICKENS;
DICKENS>
• XML tag notation can also be used for the temporary match regions
• if target/keyword anchors are set in the activated query result, the corresponding XML tags
(<target>, <keyword>, . . . ) can be used, too
• appending the keep operator ! to a subquery returns full matches from the activated query result
(equivalent to an implicit expand to match)
• complex queries (or parts of queries) can be stored as macros and re-used
MACRO np(0)
(
[pos = "DT"]
([pos = "RB.*"]? [pos = "JJ.*"])*
[pos = "NNS?"]
)
;
NB: The start (MACRO ...) and end (;) markers must be on separate lines in a macro definition
file.
• macros accept up to 10 arguments; in the macro definition, the number of arguments must be
specified in parentheses after the macro name
• in the macro body, each occurrence of $0, $1, . . . is replaced by the corresponding argument value
(escapes such as \$1 will not be recognised)
• e.g. a simple PP macro with 2 arguments: the initial preposition and the number of adjectives
in the embedded noun phrase
MACRO pp(2)
[(pos = "IN") & (word="$0")]
[pos = "DT"]
[pos = "JJ.*"]{$1}
[pos = "NNS?"]
;
• invoking macros with arguments
• the quotes are not part of the argument value and hence will not be interpolated into the macro
body; nested macro invocations will have to specify additional quotes
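For example, the pp macro defined above might be invoked as (a sketch):

> /pp["in", 2];

which is expanded to [(pos = "IN") & (word="in")] [pos = "DT"] [pos = "JJ.*"]{2} [pos = "NNS?"]
before the query is evaluated.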
• argument names are not used during macro definition and evaluation
MACRO adjp()
[pos = "RB.*"]?
[pos = "JJ.*"]
;
MACRO np($0=N_Adj)
[pos = "DT"]
( /adjp[] ){$0}
[pos = "NNS?"]
;
• Macro definition files can import other macro definition files using statements of the form
IMPORT other_macros.txt
Each import statement must be written on a separate line. It is recommended (but not required)
to collect all IMPORTs at the top of the file. Note that files are searched relative to the CQP
working directory, not the location of the current macro file.
• note that string arguments need to be quoted when they are passed to nested macros (since
quotes from the original invocation are stripped before interpolating an argument)
• single or double quote characters in macro arguments should be avoided whenever possible; while
the string 's can be enclosed in double quotes ("'s") in the macro invocation, the macro body
may interpolate the value between single quotes, leading to a parse error
• in macro definitions, use double quotes, which are less likely to occur in argument values
MACRO np_start()
(<np>|<np1>|<np2>)
;
MACRO np_end()
(</np2>|</np1>|</np>)
;
MACRO np()
( /np_start[] []* /np_end[] )
;
• then use /np_start[] and /np_end[] instead of <np> and </np> tags in CQP queries, as well as
/np[] instead of /region[np]
MACRO anyregion($0=Tag)
(<$0>|<$01>|<$02>)
[]*
(</$02>|</$01>|</$0>)
;
• extend /codist[] macro to two constraints:
• feature set attributes use special notation, separating set members by | characters
• nominal agreement features of determiners, adjectives and nouns are stored in the agr attribute,
using the pattern shown in Figure 7 (see Figure 8 for an example)
der |Dat:F:Sg:Def|Gen:F:Pl:Def|Gen:F:Sg:Def
|Gen:M:Pl:Def|Gen:N:Pl:Def|Nom:M:Sg:Def|
Stoffe |Akk:M:Pl:Def|Dat:M:Sg:Def|Gen:M:Pl:Def|Nom:M:Pl:Def
|Akk:M:Pl:Ind|Dat:M:Sg:Ind|Gen:M:Pl:Ind|Nom:M:Pl:Ind
|Akk:M:Pl:Nil|Dat:M:Sg:Nil|Gen:M:Pl:Nil|Nom:M:Pl:Nil|
• both contains and matches use regular expressions and accept the %c and %d flags
• in the GERMAN-LAW corpus, NPs and other phrases are annotated with partially disambiguated
agreement information; these feature sets can also be tested with the contains and matches
operators, either indirectly through label references or directly in XML start tags
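For instance, a query might test the agr sets directly with contains (a sketch, using one of the
feature values shown in Figure 8):

> [agr contains "Nom:M:Sg:Def"];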
• however, the /unify[] macro cannot be used unless the features within each set are in canonical
sorted order. The members of a set are sorted at indexing time only when a feature set is explicitly
declared.
• feature sets should not be used to encode ordered lists of values; if you need to distinguish between
a first, second, . . . alternative, you might add this information explicitly as a feature component,
e.g.
|1:Zeuge|2:Zeug|3:Zeugen|
• CQP is a useful tool for interactive work, but many tasks become tedious when they have to
be carried out by hand; macros can be used as templates, providing some relief; however, full
scripting is still desirable (and in some cases essential)
• similarly, the output of CQP requires post-processing at times: better formatting of KWIC lines
(especially for HTML output), different sort options for frequency tables, frequency counts on
normalised word forms (or other transformations of the values)
• for both purposes, an external scripting tool or programming language is required, which has to
interact dynamically with CQP (which acts as a query engine)
• CQP provides some support for such interfaces: when invoked with the -c flag, it switches to
child mode (which could also be called slave mode):
-::-EOL-::-
as a marker into CQP's output
when the ProgressBar option is activated, progress messages are not echoed in a single
screen line (using carriage returns) on stderr, but rather printed in separate lines on stdout;
these lines have the standardized format
• the CWB/Perl interface makes use of all these features to provide an efficient and robust interface
between a Perl script and the CQP backend
• the output of many CQP commands is neatly formatted for human readers; this pretty-printing
feature can be switched off with the command
show named; lists all named query results on separate lines in the format
flags TAB query name TAB no. of matches
show; concatenates the output of show corpora; and show named; without any separator;
it is recommended to invoke the two commands separately when using CQP as a backend
show active; prints the name of the currently active corpus on a line on its own (this is
in fact available when using CQP interactively, albeit useless because the active corpus is
displayed in the CQP command prompt!)
show cd; lists all attributes that are defined for the currently active corpus; each attribute
is printed on a separate line with the format
• running CQP as a backend can be a security risk, e.g. when queries submitted to a Web server
are passed through to the CQP process unaltered; this may allow malicious users to execute
arbitrary shell commands on the Web server; as a safeguard against such attacks, CQP provides
a query lock mode, which allows only queries to be executed, while all other commands (including
cat, sort, group, etc.) are blocked
> unlock n;
using the same number n
• An important aspect of interfacing CQP with other software is to exchange the corpus positions of
query matches (as well as target and keyword anchors). This is a prerequisite for the extraction
of further information about the matches by direct corpus access, and it is the most efficient
way of relating query matches to external data structures (e.g. in a SQL database or spreadsheet
application).
• The dump command (Section 3.3) prints the required information in a tabular ASCII format that
can easily be parsed by other tools or read into a SQL database.[8] Each row of the resulting table
corresponds to one match of the query, and the four columns give the corpus positions of the
match, matchend, target and keyword anchors, respectively. The example below is reproduced
from Section 3.3.
[8] Since this command dumps the matches of a named query in their current sort order, the natural order should first
be restored by calling sort without a by clause. One exception is a CGI interface that uses the dumped corpus positions
for a KWIC display of the query results in their sorted order.
1019887 1019888 -1 -1
1924977 1924979 1924978 -1
1986623 1986624 -1 -1
2086708 2086710 2086709 -1
2087618 2087619 -1 -1
2122565 2122566 -1 -1
Undefined target anchors are represented by -1 in the third column. Even though no keywords
were set for the query, the fourth column is included in the dump table, but all values are set to
-1.
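In a scripting language, such a dump table is straightforward to parse; the sketch below (in Python;
parse_dump is a hypothetical helper, not part of CWB) turns each row into a record and maps -1
to an explicit undefined value:

```python
# Parse the four-column output of CQP's "dump" command.
# Columns: match, matchend, target, keyword (corpus positions; -1 = undefined).
def parse_dump(text):
    rows = []
    for line in text.strip().splitlines():
        match, matchend, target, keyword = (int(field) for field in line.split())
        rows.append({
            "match": match,
            "matchend": matchend,
            "target": target if target >= 0 else None,    # -1 -> undefined
            "keyword": keyword if keyword >= 0 else None,
        })
    return rows

sample = "1019887 1019888 -1 -1\n1924977 1924979 1924978 -1\n"
rows = parse_dump(sample)
print(rows[0]["target"])  # None (no target anchor set for this match)
print(rows[1]["target"])  # 1924978
```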
• The table created by the dump command is printed on stdout by default, where it can be captured
by a program running CQP as a backend (e.g. the CWB/Perl interface, cf. Sec. 7.1). The dump
table can also be redirected to a le:
• Alternatively, the output can also be redirected to a pipe, e.g. to create a dump file without the
superfluous keyword column
• The format of the file mydump.tbl is almost identical to the output of dump, but it contains only
two columns for the match and matchend positions (in the default setting). The example below
shows a valid dump file for the DICKENS corpus, which can be read with undump to create a query
result containing 5 matches:
20681 20687
379735 379741
1915978 1915983
2591586 2591591
2591593 2591598
Save these lines to a text file named dickens.tbl, then enter the following commands:
> DICKENS;
> undump Twas < "dickens.tbl";
> cat Twas;
• Further columns for the target and keyword anchors (in this order) can optionally be added. In
this case, you must append the modifier with target or with target keyword to the undump
command:
• Dump files can also be read from a pipe or from standard input. In the latter case the table of
corpus positions has to be preceded by a header line that specifies the total number of matches:
5
20681 20687
379735 379741
1915978 1915983
2591586 2591591
2591593 2591598
CQP uses this information to pre-allocate internal storage for the query result, as well as to
validate the file format. This format can also be used as a more efficient alternative if the dump
is read from a regular file. CQP automatically detects which of the two formats is used.
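A script generating the header-prefixed format can be sketched as follows (Python; format_undump
is a hypothetical helper, not part of CWB):

```python
# Produce an undump table in the header-prefixed format: the first line
# gives the total number of matches, followed by one (match, matchend)
# pair of corpus positions per line.
def format_undump(pairs):
    lines = [str(len(pairs))]
    lines += [f"{match} {matchend}" for match, matchend in pairs]
    return "\n".join(lines) + "\n"

table = format_undump([(20681, 20687), (379735, 379741)])
print(table, end="")
# 2
# 20681 20687
# 379735 379741
```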
• Pipes can e.g. be used to read a dump table generated by another program. They are indicated
by a pipe symbol (|) at the start of the filename (new in CQP v3.4.11) or at the end of the
filename (earlier versions). Before CQP v3.4.11, pipes were also needed to read a dump table
from a compressed file:
second format is allowed, i.e. you have to enter the total number of matches first. Try entering
the example table above after typing
> undump B;
• If the rows of the undump table are not sorted in their natural order (i.e. by corpus position),
they have to be re-ordered internally so that CQP can work with them. However, the original
sort order is recorded automatically and will be used by the cat and dump commands (until it
is reset by a new sort command). If you sort a query result A, save it with dump to a text le,
and then read this le back in as named query B, then A and B will be sorted in exactly the same
order.
• In many cases, overlapping or unsorted matches are not intentional, but rather errors in an
automatically generated dump table. In order to catch such errors, the additional keyword
ascending (or asc) can be specified before the < character:
• A typical use case for dump and undump is to link CQP queries to corpus metadata stored in
an external SQL database. Assume that a corpus consists of a large collection of transcribed
dialogues, which are marked as <dialogue> regions. A rich amount of metadata (about the
speakers, setting, topic, etc.) is available in an SQL database. The database entries can be
linked directly to the <dialogue> regions by recording their start and end corpus positions in the
database. The following commands generate a dump table with the required information, which
can easily be loaded into the database (ignoring the third and fourth columns of the table):
9 For this reason, CWB/Perl and similar interfaces cannot use the direct input option and have to create a temporary
file with the dump information.
10 Of course, it is also possible to establish an indirect link through document IDs, which are annotated as <dialogue
id=XXXX> .. </dialogue>. If the corpus contains a very large number of dialogues, the direct link approach is usually
much more efficient, though.
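The linking idea can be sketched with SQLite; the table name, column names and corpus positions below are invented for illustration:

```python
# Load a two-column dump table (start/end corpus positions of <dialogue>
# regions) into an SQL database, so metadata rows can be joined on position.
# Table name "dialogue" and the column names are illustrative assumptions.
import sqlite3

dump_rows = [(0, 153), (154, 410), (411, 520)]  # illustrative positions

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dialogue (cpos_start INTEGER, cpos_end INTEGER)")
conn.executemany("INSERT INTO dialogue VALUES (?, ?)", dump_rows)

# Find the dialogue region containing a given corpus position:
row = conn.execute(
    "SELECT cpos_start, cpos_end FROM dialogue "
    "WHERE cpos_start <= ? AND cpos_end >= ?", (200, 200)).fetchone()
```

Metadata columns (speaker, setting, topic, etc.) would simply be further columns of the same table, queried by corpus position range.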
• For many applications it is important to compute frequency tables for the matching strings,
tokens in the immediate context, attribute values at different anchor points, different attributes
for the same anchor, or various combinations thereof.
• frequency tables for the matching strings, optionally normalised to lowercase and extended or
reduced by an offset, can easily be computed with the count command (cf. Sections 2.9 and 3.3);
when pretty-printing is deactivated (cf. Section 7.1), its output has the form
the instances (tokens) for a given string type can easily be identified, since the underlying
query result is automatically sorted by the count command, so that these instances appear
as a block starting at the match number given in the first line
• an alternative solution is the group command (cf. Section 3.4), which computes frequency
distributions over single tokens (i.e. attribute values at a given anchor position) or pairs of tokens
(recall the counter-intuitive command syntax for this case); when pretty-printing is deactivated,
its output has the form
• the advantages of these two commands are for the most part complementary (e.g., it is not possible
to normalise the values of s-attributes, or to compute joint frequencies of two non-adjacent
multi-token strings); in addition, they have some common weaknesses, such as relatively slow execution,
no options for filtering and pooling data, and limitations on the types of frequency distributions
that can be computed (only simple joint frequencies, no nested groupings)
• new in CQP v3.4.9: The group command has been re-implemented with a hash-based algorithm.
It is very fast now, even for large frequency tables. The other limitations still apply, though.
• therefore, it is often necessary (and usually more efficient) to generate frequency tables with
external programs such as dedicated software for statistical computing or a relational database;
these tools need a data table as input, which lists the relevant feature values (at specified anchor
positions) and/or multi-token strings for each match in the query result; such tables can often
be created from the output of cat (using suitable PrintOptions, Context and show settings)
• this procedure involves a considerable amount of re-formatting (e.g. with Unix command-line
tools or Perl scripts) and can easily break when there are unusual attribute values in the data;
both the cat output and the re-formatting operations are expensive, making this solution inefficient
when there is a large number of matches
• in most situations, the tabulate command provides a more convenient, more robust and faster
solution; the general form is
• just as with dump and cat, the table can be restricted to a contiguous range of matches, and the
output can be redirected to a file or pipe
• if an anchor point is undefined or falls outside the corpus (because of an offset), tabulate prints
an empty string or the corpus position -1 (correct behaviour implemented in v3.4.10)
• a range between two anchor points prints the values of the selected attribute for all tokens in the
specified range; usually, this only makes sense for positional attributes; the following example
prints the lemma values of 5 tokens to the left and right of each match, which can be used to
identify collocates of the matching string(s)
• any items in the range that fall outside the bounds of the corpus are printed as empty strings or
corpus positions -1; if either the start or end of the range is an undefined anchor, a single empty
string or cpos -1 is printed for the entire range (correct behaviour implemented in v3.4.10)
• the end position of a range must not be smaller than its start position, so take care to order items
properly and specify sensible offsets; in particular, a range specification such as match .. target
must not be used if the target anchor might be to the left of the match; the behaviour of CQP
in such cases is unspecified
• attribute values can be normalised with the flags %c (to lowercase) and %d (remove diacritics);
the command below uses Unix shell commands to compute the same frequency distribution as
count A by word %c; in a much more efficient manner
> tabulate A match .. matchend word %c > "| sort | uniq -c | sort -nr";
• note that, in contrast to sort and count, a range is considered empty when the end point lies
before the start point, and will always be printed as an empty string
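The shell pipeline above can equally be replicated outside CQP. A Python sketch of the same lowercased frequency count, with illustrative input lines standing in for tabulate output:

```python
# Count lowercased matching strings, as `count A by word %c` would,
# given one matching string per line (as produced by tabulate).
from collections import Counter

tabulate_output = ["In due course", "in due course", "of course", "of course"]

freqs = Counter(line.lower() for line in tabulate_output)
# sort by descending frequency, then alphabetically (like `sort | uniq -c | sort -nr`)
table = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
```

The same pattern extends to any tabulate output with one feature value per line.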
8 Undocumented CQP
8.1 Zero-width assertions
• constraints involving labels have to be tested either in the global constraint or in one of the token
patterns; this means that macros cannot easily specify constraints on the labels they define: such
a macro would have to be interpolated in two separate places (in the sequence of token patterns
as well as in the global constraint)
• zero-width assertions allow constraints to be tested during query evaluation, i.e. at a specific
point in the sequence of token patterns; an assertion uses the same Boolean expression syntax as
a pattern, but is delimited by [: ... :] rather than simple square brackets ([...]); unlike an
ordinary pattern, an assertion does not consume a token when it is matched; it can be thought
of as a part of the global constraint that is tested in between two tokens
• with the help of assertions, NPs with agreement checks can be encapsulated in a macro
• when the this label (_) is used in an assertion, it refers to the corpus position of the following
token; the same holds for direct references to attributes
• in this way, assertions can be used as look-ahead constraints, e.g. to match maximal sequences
of tokens without activating the longest-match strategy
• returning to the np_agr macro from Section 8.1, we note a problem with this query:
when the second NP does not contain any adjectives but the first does, the b label will still point
to an adjective in the first NP; consequently, the agreement check may fail even if both NPs are
really valid
• in order to solve this problem, the two NPs should use different labels; for this purpose, every
macro has an implicit $$ argument, which is set to a unique value for each interpolation of the
macro; in this way, we can construct unique labels for each NP:
The CQP query language offers a number of built-in functions that can be applied to attribute values
within query constraints (but not anywhere else, e.g. in group or tabulate commands). While some
of these functions have been available for a long time and are documented in this tutorial, others have
been added more recently and may be unsupported or experimental. The list below shows all built-in
functions that are currently available. Those marked as experimental are not guaranteed to function
correctly and may be changed or disabled in future releases of CQP.
f(att ): frequency of the current value of p-attribute att (cannot be used with s-attributes or literal
values); e.g. [word = ".*able" & f(word) < 10]
dist(a, b ), distance(a, b ): signed distance between two tokens referenced by labels a and b ; explicit
numeric corpus positions may be specified instead of labels; computes the difference b − a; e.g.
... :: dist(matchend, match) >= 10;
distabs(a, b ): unsigned distance between two tokens; e.g. [distabs(_, 1000) <= 10] as an inefficient
way to match 10 tokens to the left and right of corpus position 1000
int(str ): cast str to a signed integer number so numeric comparisons can be made; raises an error if
str is not a number string; e.g. ... :: int(match.text_year) <= 1900;
lbound(att ), rbound(att ): evaluates to true if the current corpus position is the first or last token in a
region of s-attribute att, respectively
lbound_of(att, a ), rbound_of(att, a ): returns the corpus position of the start or end of the region
of s-attribute att containing the token referenced by label a, suitable for use with dist(); if
a is not within a region of att, an undefined value is returned, which evaluates to false in most
contexts [new in v3.4.13; experimental]
unify(fs1 , fs2 ): compute the intersection of two sorted feature sets specified as strings fs1 and fs2 ,
corresponding to a unification of feature bundles; if the first argument is an undefined value, fs2
is returned; see Sec. 6.6 for details
ambiguity(fs ): compute the size of a feature set specified as string fs, i.e. the number of elements; if
fs is an undefined value, a size of 0 is returned (same as for |); see Sec. 6.6 for details
add(x, y ), sub(x, y ), mul(x, y ): simple arithmetic on integer values x and y, which can also be
corpus positions specified as labels; in order to make computations on corpus annotations, they
have to be typecast with int() first [experimental]
prefix(str1 , str2 ): returns the longest common prefix of strings str1 and str2 ; warning: this function
operates on bytes and may return an incomplete UTF-8 character [experimental]
is_prex(str1 , str2 ): returns true if string str1 is a prex of str2 ; e.g. [is_prefix(lemma, word)]
[experimental]
minus(str1 , str2 ): removes the longest common prefix of str1 and str2 from the string str1 and returns
the remaining suffix; warning: this function operates on bytes and may return an incomplete
UTF-8 character [experimental]
ignore(a ): ignore the label a and always return true; for internal use by the /undef[] macro, see
Sec. 8.2 for details
normalize(str, flags ): apply case-folding and/or diacritic folding to the string str and return the
normalized value; flags must be a literal string "c", "d" or "cd" (with an optional %, e.g. "%cd");
e.g. [normalize(word, "cd") != normalize(lemma, "cd")] to find non-trivial differences
between word form and lemma [new in v3.4.11; experimental]
11 The second argument is necessitated by technical limitations of built-in functions. To locate the start of a sentence
containing the current token, use the this label: lbound_of(s, _).
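The effect of the %c and %d flags can be approximated with Unicode normalisation. A Python sketch, with the caveat that CQP's own folding tables may differ in detail:

```python
# Approximate CQP's normalize(): "c" folds case, "d" strips diacritics
# (combining marks after canonical decomposition). A leading "%" in the
# flag string is accepted and ignored, as described above.
import unicodedata

def normalize(s, flags):
    flags = flags.lstrip("%")
    if "d" in flags:
        s = "".join(ch for ch in unicodedata.normalize("NFD", s)
                    if not unicodedata.combining(ch))
    if "c" in flags:
        s = s.lower()
    return s
```

For example, normalize("déjà", "cd") yields the folded form with case and diacritics removed.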
8.4 MU queries
• CQP offers search-engine-like Boolean queries in a special meet-union (MU) notation. This
feature goes back to the original developer of CWB and is not supported officially. In particular,
there is no precise specification of the semantics of MU queries, and the original implementation
does not produce consistent results.
• new in v3.4.12: Recently, MU queries have found more widespread use as proximity queries in the
CEQL simple query syntax of BNCweb and CQPweb, giving them a semi-official status. For this
reason, the implementation was modified to ensure consistent and well-defined behaviour,
although it may not always correspond to what is desired intuitively. The new MU implementation
is documented here.
• Warning: both the syntax and the semantics of MU queries are subject to fundamental revisions
in the next major release of CWB (version 4.0).
• A meet-union query consists of nested meet and union operations forming a binary-branching
tree that is written in LISP-like prefix notation. MU queries always start with the keyword MU
and are completely separate from standard CQP syntax.
• The simplest form of a MU query specifies a single token pattern, which may also be given
in shorthand notation for the default p-attribute. These queries are fully equivalent to the
corresponding standard queries.
• In order to match only prenominal adjectives, we change the window to three tokens preceding
the noun (i.e. offsets -3 . . . -1):
• Alternatively, we can search for co-occurrence within sentences or other s-attribute regions.
Again, the ordering of the token constraints determines whether we focus on tea or cakes.
• Keep in mind that the final result includes only the corpus positions of the leftmost token pattern.
If you want to find instances of course in this multiword expression, rewrite the query as
> MU(meet [pos="NNS?"] (union [pos="JJS"] (meet [pos="JJ"] "most" -1 -1)) -2 4);
• Like standard queries, MU queries can be used as subquery filters (followed by !) or combined
with a cut and/or expand clause. However, other elements of standard queries are not
supported: labels, target markers (@), zero-width assertions (obviously), global constraints (after
::), alignment constraints and within clauses.
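The core meet operation can be read as a proximity filter over position lists. A Python sketch of that reading, with made-up position lists (real MU evaluation works on compiled corpus indices):

```python
# (meet A B k1 k2): keep each corpus position of pattern A that has at
# least one position of pattern B within the offset window k1..k2.
# The result contains only positions of the leftmost pattern, as noted above.
def meet(a_positions, b_positions, k1, k2):
    b = set(b_positions)
    return [p for p in a_positions
            if any(p + d in b for d in range(k1, k2 + 1))]

tea = [10, 50, 90]    # illustrative corpus positions of "tea"
cakes = [12, 300]     # illustrative corpus positions of "cakes"
hits = meet(tea, cakes, -5, 5)   # "tea" with "cakes" within 5 tokens
```

Swapping the argument order returns the positions of "cakes" instead, mirroring how the ordering of token constraints determines the focus of the query.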
• new in v3.4.12: tabular (TAB) queries are an obscure undocumented feature of CQP. They
were dysfunctional for a long time, but have now been resurrected. The implementation is still
considered experimental and might be changed or retired without notice.
• A TAB query starts with the keyword TAB and matches a sequence of one or more token patterns
with optional flexible gaps. In its simplest form, it corresponds to a standard query matching a
fixed sequence of tokens, but is often executed faster. Compare
• The most substantial performance gains are achieved for sequences that start with very frequent
items and end in a selective token pattern, e.g.
• The main purpose of tabular queries is to match sequences with flexible gaps. The following
two-word TAB query finds cats followed by dogs with a gap of up to two intervening tokens:
All gap specifications behave as if the repetition operator had been applied to a matchall ([]) in
a standard query.
• TAB queries can additionally be restricted by a within clause. For example, the query
12 Namely, cats followed by an arbitrary token, followed by a gap of up to two tokens, followed by dogs. Entering this
command will print an error message because matchall patterns are not allowed in TAB queries.
• TAB queries always return the full range of tokens containing the specified items. Individual items
cannot be marked in any way (i.e. neither as target pattern nor with labels), due to limitations
of the current CQP implementation.
• For more complex TAB queries, it is important to understand how the greedy matching algorithm
works, since its results may be different from the corresponding standard CQP queries:
– for every possible start position, i.e. each match of the first token pattern,
– scan for a match of the second token pattern within the specified range,
– starting from this position, scan for a match of the third token pattern,
– etc.
If a complete match is found, CQP continues with the next possible start position, so there can
be at most one match for each start position. In addition, nested matches are discarded as in
standard CQP queries (hence old train above is actually matched by the algorithm, but then
discarded as a nested match).
• Always keep in mind that CQP does not perform an expensive combinatorial search to consider
other matches that might also fall within the specified ranges!
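The greedy strategy above can be sketched as follows (positions are illustrative; real TAB evaluation works on compiled patterns, and nested matches are discarded afterwards):

```python
# Greedy TAB-style matching: for each match of the first pattern, take the
# leftmost match of every later pattern within its gap range; there is no
# backtracking over alternative gap fillers, and at most one match per
# start position.
def tab_match(pos_lists, gaps):
    """pos_lists: one sorted position list per token pattern;
       gaps: (min, max) intervening tokens allowed before each later pattern."""
    results = []
    for start in pos_lists[0]:
        cur, ok = start, True
        for positions, (lo, hi) in zip(pos_lists[1:], gaps):
            nxt = next((p for p in positions
                        if cur + 1 + lo <= p <= cur + 1 + hi), None)
            if nxt is None:
                ok = False
                break
            cur = nxt
        if ok:
            results.append((start, cur))
    return results
```

For example, with first-pattern matches at positions 0 and 10, second-pattern matches at 2 and 13, and a gap of 0 to 2 tokens, the algorithm yields the match ranges (0, 2) and (10, 13).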
• There are two special cases in which TAB queries are guaranteed to find every early match that
satisfies the gap specifications:
1. All gaps have a fixed size ({n }), which can be different for each gap. This includes in
particular the case where token patterns are directly adjacent.
> TAB "Mr" {1} "Mrs" [pos = "N.*"];
> TAB "in" "due" "course";
2. All gaps are specified as * and the search range is only restricted by a within clause. Note
that * and fixed-size gaps (even direct adjacency) must not be mixed in this case.
• new in v3.4.16: This section describes an experimental feature introduced with CQP v3.4.16,
which is still work in progress and may be modified and extended throughout this minor release.
Users and wrapper scripts relying on the documentation here should always make sure to upgrade
both their CWB installation and the CQP tutorial to the latest SVN version.
• In addition to the implicit match and matchend anchors, CQP queries allow a single additional
token pattern to be marked with an @ sign, setting the target anchor to the matching corpus
position. If multiple target markers are specified, the one encountered last during query evaluation
wins.
• Users would quite often like to mark multiple positions, however. Consider the query below,
which has three additional tokens of interest (adverb, first adjective, second adjective); only one
of them can be marked with @.
> [pos="DT"] [pos="RB"] [pos="JJ"] [pos="JJ"] [pos="N.*"];
• It is now possible to mark up to 10 potential targets with numbered markers @0 . . . @9. Only
two of them are active at a given time, controlled by the user options AnchorNumberTarget (ant)
and AnchorNumberKeyword (ank).
> [pos="DT"] @0[pos="RB"] @1[pos="JJ"] @2[pos="JJ"] [pos="N.*"];
• The query above will mark the adverb position as target and the first adjective as keyword,
because @0 and @1 are active by default. Re-run the query after changing AnchorNumberTarget
in order to mark the second adjective instead of the adverb.
• The main purpose of the new feature is to enable wrapper scripts to simulate up to 10 target
anchor positions in a way that is fully compatible with CQP macros and does not require any
custom extensions, so queries can be tested in an interactive CQP session or in CQPweb.
• Since the wrapper cannot know which numbered target markers are used in a query (especially
with nested macros), every query has to be run 5 times, collecting the target and keyword
positions from each run and combining them into a single table at the end. The extra runs can
be executed as anchored subqueries to reduce the overhead for complex search patterns in large
corpora.
13 Cautious programmers might want to verify that the matching ranges of each dump Temp; correspond to those of the
main query Result before discarding the first two columns of the dump. Alternatively, tabulate Temp target, keyword;
can be used to avoid redundant information.
• NB: The wrapper script should not forget to re-activate the main corpus and to reset
AnchorNumberTarget and AnchorNumberKeyword to their previous or default settings.
• For backward compatibility, the plain @ marker unconditionally sets the target anchor, regardless
of the value of AnchorNumberTarget. Queries should never mix @ with the numbered potential
target markers.
• All target markers can be followed by an optional colon : similar to the notation used for labels
(including @: for the unconditional target).
• CQP macros that will be embedded in more complex queries should always use parameterized
target markers. Consider the following definition of a macro matching simple noun phrases:
• As an example, let us search for the pattern NP Prep NP and extract the head noun and adjective
(if present) of the rst NP as well as the preposition and head noun of the PP (e.g. this festive
season of the year).
• Note that pairs of markers can be used to mark the start and end of a sub-pattern of flexible
length. It is often convenient to enclose the sub-pattern in parentheses (if necessary) and use
zero-width assertions to set the markers. In order to extract multi-word noun compounds, we
could change our NP pattern as follows:
• starting with version 3.0 of the Corpus Workbench, CQP comes with a built-in regular expression
optimiser; this optimiser detects simple regular expressions commonly used for prefix, suffix or
infix searches such as
> "under.+";
> ".+ment";
> ".+time.+";
and replaces the normal regexp evaluation with a highly efficient Boyer-Moore search algorithm
• the optimiser will also recognise some slightly more complex regular expressions; if you want to
test whether a given expression can be optimised or not, switch on debugging output with
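For the curious, the flavour of algorithm involved can be illustrated with a simple Boyer-Moore-Horspool search for a literal substring; this is a textbook sketch, not CQP's actual implementation:

```python
# Boyer-Moore-Horspool: precompute per-character shifts from the pattern,
# then, whenever a window fails to match, skip ahead by the shift of the
# text character aligned with the pattern's last position.
def horspool(text, pat):
    m = len(pat)
    if m == 0:
        return 0
    # shift for each pattern character except the last one
    shift = {c: m - 1 - i for i, c in enumerate(pat[:-1])}
    i = 0
    while i + m <= len(text):
        if text[i:i + m] == pat:
            return i
        i += shift.get(text[i + m - 1], m)
    return -1
```

The large skips over non-matching text are what make literal prefix/suffix/infix searches so much faster than general regexp evaluation.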
• the official LTS releases v3.0 and v3.5 of CQP have no hidden features
A Appendix
A.1 Summary of regular expression syntax
At the character level, CQP supports regular expressions using one of two regex libraries:
CWB 3.0: Uses POSIX 1003.2 regular expressions (as provided by the system libraries). A full
description of the regular expression syntax can be found on the regex(7) manpage.
CWB 3.5: Uses PCRE (Perl Compatible Regular Expressions). A full description of the regular
expression syntax can be found on the pcrepattern(3) manpage; see also http://www.pcre.org/.
Various books such as Mastering Regular Expressions give a gentle introduction to writing regular
expressions and provide a lot of additional information. There are also many tutorials to be found
online using Your Favourite Web Search Engine™.
• A regular expression is a concise description of a set of character strings (which are called words
in formal language theory). Note that only certain sets of words with a relatively simple structure
can be represented in such a way. Regular expressions are said to match the words they describe.
The following examples use the notation:
letters and digits are matched literally (including all non-ASCII characters)
word → word ; C3PO → C3PO ; déjà → déjà
• Backslash (\) escapes special characters, i.e. forces them to match literally
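The two rules above can be checked with Python's re module, whose syntax is close to the PCRE library used by CWB 3.5:

```python
# Letters and digits (including non-ASCII letters) match literally;
# a backslash forces a special character such as + to match literally.
import re

assert re.fullmatch("déjà", "déjà") is not None
assert re.fullmatch("C3PO", "C3PO") is not None
assert re.fullmatch(r"1\+1", "1+1") is not None
assert re.fullmatch(r"1\+1", "111") is None   # escaped + is not a repetition
```

Remember that in CQP these expressions additionally sit inside double-quoted strings, so backslashes may need further escaping at the CQP level.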
chapter chapters
chapter_num number of the chapter
chapter_title optional title of the chapter
p paragraphs
p_len length of the paragraph (in words)
s sentences
s_len length of the sentence (in words)
np noun phrases
np_h head lemma of the noun phrase
np_len length of the noun phrase (in words)
pp prepositional phrases
pp_h functional head of the PP (preposition)
pp_len length of the PP (in words)
Each agreement feature has the form ccc :g :nn :ddd with
<s> sentences
<pp> prepositional phrases
<np> noun phrases
<ap> adjectival phrases
<advp> adverbial phrases
<vc> verbal complexes
<cl> subclauses
<s len="..">
<pp f=".." h=".." agr=".." len="..">
<np f=".." h=".." agr=".." len="..">
<ap f=".." h=".." agr=".." len="..">
<advp f=".." len="..">
<vc f=".." len="..">
<cl f=".." h=".." vlem=".." len="..">
len = length of region (in tokens)
f = properties (feature set, see next page)
h = lexical head of phrase (<pp_h>: prep :noun )
agr = nominal agreement features (feature set, partially disambiguated)
vlem = lemma of main verb
• Properties of syntactic structures (f key in start tags)
Reserved words cannot be used as identifiers (i.e. corpus handles, attribute names, query names or
labels) in CQP queries and interactive commands.
• new in CQP v3.4.13: Reserved words can now be quoted between backticks.
> left: [pos = "NN"] "after" right: [pos = "NN"] :: left.lemma = right.lemma;
• The usual rules for identifiers still apply, so e.g. size `007`; will not be accepted.
a: asc ascending
b: by
c: cat cd collocate contains cut
d: def define delete desc descending diff difference discard dump
e: exclusive exit expand
f: farthest foreach
g: group
h: host
i: inclusive info inter intersect intersection
j: join
k: keyword
l: left leftmost
m: macro maximal match matchend matches meet MU
n: nearest no not NULL
o: off on
r: randomize reduce RE reverse right rightmost
s: save set show size sleep sort source subset
t: TAB tabulate target target[0-9] to
u: undump union unlock user
w: where with within without
y: yes
This appendix lists all the CQP options that can be changed using the set command during a CQP
session. There are many more configurable settings, but they cannot be set by the user during a session.
Instead, they must be set when CQP is invoked (see cqp -h for more).
Boolean (true/false) options are set as on or off, or alternatively as yes or no. Their present value is
always displayed as yes or no.
Abbr. Option Summary
AutoSave Automatically save subcorpora/query results to disk
as AutoShow Automatically display query results
sub AutoSubquery Automatically enter subquery mode by activating
new subcorpus/query result on creation
col Colour Enable colour highlighting
es ExternalSort Use external helper program to sort queries
h Highlighting Highlight hits (and target/keyword anchors) within
KWIC output
o Optimize Enable experimental optimisations
p Paging Use external pager program to display KWIC
pp PrettyPrint Format output neatly for human readers
pb ProgressBar Show the progress of query execution
SaveOnExit Save all unsaved subcorpora/query results when CQP exits
sta ShowTagAttributes Display key-value pairs in XML tags (in KWIC)
st ShowTargets Print identifier numbers for target (0) and keyword (1) in KWIC;
same effect as show +targets
sr StrictRegions Make XML start/end tags within query match a single region
Timing Print time taken to execute queries
wh WriteHistory Write all commands entered to a history file
Integer options are set to a numeric value (a whole number). The valid range is usually restricted and
will be checked when setting the option. Keep in mind that numeric values must not be enclosed in
quotation marks.
String options contain a line of data that has some effect on or role in CQP's operation. When setting
a string option, it must be enclosed in quote marks - which will not form part of the actual option
value.
Enumerated options - a sub-type of string - can only be set to one of a fixed list of values. In the case
of PrintOptions, each item on the list sets one of a set of Boolean output formatting options either
on or off, and some items are synonyms.
Context options - a sub-type of string - set the width of the left or right context in the KWIC display
(or both) in units of characters, words, or s-attribute regions.