= Methods Documentation =

This page explains how to work with the Sketch Engine API and is addressed mainly to programmers. It describes the available Sketch Engine methods, the attributes that can be used with them, and the various ways of using the interface.

Note: The reference version of the Sketch Engine on this page is '''beta'''. However, the list of available methods and attributes differs only minimally from the stable version.

== General notes ==

Communication with the Sketch Engine works through CGI queries and proceeds as follows:
 1. create an authenticated connection to the Sketch Engine
 1. create the query that you want to work with
 1. process the output that is sent as a response

The Examples section at the bottom of this page contains simple examples (in Java and Python) demonstrating how to connect to the server, send a prepared query and process the output. The construction of custom queries and more detailed output parsing are described in the following sections.

== Creating a query ==

A Sketch Engine query has the following structure:
{{{
<path>/<method>?<attributes>
}}}
where
 * '''<path>''' is the path to the main CGI script, "run.cgi", e.g. {{{ http://beta.sketchengine.co.uk/auth/corpora/run.cgi }}} for beta.
 * '''<method>''' is the particular method name, e.g. "wsketch" for word sketches. For the list and documentation of all methods, see the following section.
 * '''<attributes>''' is the list of attributes and values in the CGI notation, that is {{{ attribute_1=value_1;attribute_2=value_2; ... ;attribute_n=value_n }}}. For a more detailed description of all attributes, see the following section.
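The structure described above can be sketched as a small helper. This is only an illustration in Python 3 using the standard library; the names `build_query` and `BASE_URL` are hypothetical and not part of the Sketch Engine API, and leaving '/' unescaped in values is an assumption based on the corpus names shown on this page.

```python
from urllib.parse import quote

# Path to the main CGI script of the beta server, as quoted on this page.
BASE_URL = "http://beta.sketchengine.co.uk/auth/corpora/run.cgi"


def build_query(method, attrs, base_url=BASE_URL):
    """Join the path, the method name and the attributes into one query URL."""
    pairs = []
    for name, value in attrs:
        # Percent-encode the value; '/' is left readable (assumption, see above).
        pairs.append("%s=%s" % (name, quote(str(value), safe="/")))
    # Attributes are joined with ';' (the CGI notation used on this page).
    return "%s/%s?%s" % (base_url, method, ";".join(pairs))


url = build_query("wsketch", [("corpname", "preloaded/bnc"),
                              ("lemma", "test"),
                              ("lpos", "-n")])
print(url)
```

Passing the attributes as a list of pairs rather than a dictionary keeps their order and allows the same attribute name to appear more than once, which is needed for attributes that take a list of values.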
An example of a Sketch Engine query:
{{{
http://beta.sketchengine.co.uk/auth/corpora/run.cgi/wsketch?corpname=preloaded/bnc;lemma=test;lpos=-n
}}}
This query returns the word sketch HTML page for the corpus "preloaded/bnc" and the lemma "test" as a noun ("lpos=-n").

== Methods and specific attributes ==

This section lists all methods and describes the attributes specific to each of them. The "universal attributes" (those that can be used with all methods) are described in the next section.

Note that some characters (e.g. the space) that may occur in attribute values must be escaped. For more information, see e.g. http://en.wikipedia.org/wiki/Percent-encoding

=== wordlist ===

This method provides the functionality of the "word list" and "keywords" functions that are normally available under the "Word List" link in the web interface.

Attributes:
 * '''keywords''' - if empty, the "wordlist" function is used; otherwise the "keywords" function is used
 * '''wlattr''' - corpus attribute that you want to work with. This attribute is '''__required__'''.
 * '''wlminfreq''' - minimum frequency in the corpus (default 5)
 * '''wlmaxitems''' - maximum number of displayed lines (default 100)
 * '''wlpat''' - regular expression that specifies the word list pattern (default '.*', i.e. all words). Relevant only in combination with the "wordlist" function.
 * '''wlicase''' - "ignore case" mark. Values '1', '0' (default). Relevant only in combination with the "wordlist" function.
 * '''wlsort''' - if 'f', the resulting word list is sorted by frequency; otherwise it is sorted alphabetically by the attribute (default). Relevant only in combination with the "wordlist" function.
 * '''ref_corpname''' - name of the reference corpus (in the short form, e.g. 'bnc'). Relevant only in combination with the "keywords" function, in which case it is '''__required__'''.
 * '''ref_usesubcorp''' - reference subcorpus name. Relevant only in combination with the "keywords" function.
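The rules in the attribute list above (which attributes are required, and when) can be checked on the client side before a query is sent. The following Python 3 sketch is only an illustration; `wordlist_attrs` and `to_query_string` are hypothetical helper names, and the server remains the authority on what it accepts.

```python
from urllib.parse import quote


def wordlist_attrs(corpname, wlattr, keywords=False, ref_corpname=None, **optional):
    """Collect attributes for the 'wordlist' method and check the required ones."""
    if not corpname or not wlattr:
        raise ValueError("corpname and wlattr are required")
    attrs = {"corpname": corpname, "wlattr": wlattr}
    if keywords:
        # The "keywords" function needs a reference corpus.
        if not ref_corpname:
            raise ValueError("ref_corpname is required with keywords")
        attrs["keywords"] = "1"
        attrs["ref_corpname"] = ref_corpname
    # Optional attributes such as wlminfreq, wlmaxitems, wlpat or wlsort.
    attrs.update(optional)
    return attrs


def to_query_string(attrs):
    """Encode attributes as 'name=value;name=value' with percent-encoded values."""
    return ";".join("%s=%s" % (name, quote(str(value), safe="/"))
                    for name, value in attrs.items())


attrs = wordlist_attrs("preloaded/bnc", "word",
                       wlpat="test.*", wlsort="f", wlmaxitems=2)
print(to_query_string(attrs))
```

Note that the pattern 'test.*' is percent-encoded by `quote`, which is the escaping the section above asks for.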
Note: the associated '''wordlist_form''' method returns the wordlist input form.

=== wsketch ===

This method returns the word sketch tables.

Attributes:
 * '''lemma''' - lemma. This attribute is '''__required__'''.
 * '''lpos''' - part of speech in the notation '-n', '-v', ... (the particular notation depends on the corpus). If the corpus contains the "lempos" attribute, it is '''__required__'''; otherwise it has no effect.
 * '''sort_gramrels''' - "sort grammatical relations" mark. Values '0', '1' (default).
 * '''minfreq''' - minimum frequency in the corpus. The default is 'auto', which is a function of the corpus size. Other possible values are natural numbers.
 * '''minscore''' - minimum salience. Default 0.0.
 * '''maxitems''' - maximum number of items in a grammatical relation. Default 25.
 * '''clustercolls''' - "cluster collocations" mark. Values '0', '1' (default).
 * '''minsim''' - minimum similarity between cluster items. Default 0.15. Relevant only when "clustercolls" is set to '1'.

Note: the associated '''wsketch_form''' method returns the word sketch input form.

=== thes ===

This method returns the thesaurus list.

Attributes:
 * '''lemma''' - lemma. This attribute is '''__required__'''.
 * '''lpos''' - the same attribute as for the "wsketch" method, i.e. '''__required__''' in some cases.
 * '''maxthesitems''' - maximum number of items. Default 60.
 * '''clusteritems''' - "cluster items" mark. Values '1', '0' (default).
 * '''minsim''' - minimum similarity between cluster items. Default 0.15. Relevant only when "clusteritems" is set to '1'.

Note: the associated '''thes_form''' method returns the thesaurus input form.

=== wsdiff ===

This method provides the "Sketch-Diff" tables.

Attributes:
 * '''lemma''' - first lemma. This attribute is '''__required__'''.
 * '''lemma2''' - second lemma. This attribute is '''__required__'''.
 * '''lpos''' - part of speech in the notation '-n', '-v', ... (the particular notation depends on the corpus).
   If the corpus contains the "lempos" attribute, it is '''__required__'''; otherwise it has no effect.
 * '''sort_gramrels''' - "sort grammatical relations" mark. Values '0', '1' (default).
 * '''separate_blocks''' - "separate blocks" mark. '1' (default) = "common/exclusive blocks", '0' = "all in 1 block".
 * '''minfreq''' - minimum frequency in the corpus. The default is 'auto', which is a function of the corpus size. Other possible values are natural numbers.
 * '''maxcommon''' - maximum number of items in a grammatical relation of the common block (default 25)
 * '''maxexclusive''' - maximum number of items in a grammatical relation of the exclusive block

Note: the associated '''wsdiff_form''' method returns the Sketch-Diff input form.

=== view ===

This method provides access to concordance lines and to all options for sorting, sampling and filtering them.

The basic attribute is the '''q''' attribute, which contains a list of search queries that are processed incrementally. A list of queries can be passed through the CGI interface as 'q=item1;q=item2...'; another possibility is to use the JSON interchange format (see the following sections). The first query specifies the basic search query; the following ones specify sorting and filtering options.

Constructing a query is not trivial, so it is described here in more detail. The content of the '''q''' attribute is a string with the following structure:
{{{
<query_mark><rest_of_query>
}}}
where '''<query_mark>''' specifies the query type and is one character from the set {'q', 'a', 'r', 's', 'n', 'p', 'w'} ('q', 'a' and 'w' queries can be used as the basic search query, the others behave as filters). The rest of the query depends on the query mark, as follows.

Basic search queries:
 * '''q''' - is followed by an ordinary CQL query with all its possibilities. Examples:
   {{{
   q[lemma="drug"]
   q[lemma="drug"][lemma="test"] within
   q[lemma="drug"][lemma="test"]within
   }}}
 * '''a''' - the same as '''q''', but the default attribute can be specified.
   Syntax and example:
   {{{
   a<default_attr>,<query>
   ---------------------------------
   alemma,"drug" [tag="N.*"]
   }}}
 * '''w''' - query from Word Sketch. This is used in links from word sketch tables to concordances. The 'w' character is followed by a numeric ID that specifies the lines matching a particular word sketch relation. The ID can be taken from the 'seek' field in the Word Sketch JSON output (see the following sections). Several comma-delimited IDs can be specified; in this case, the result is their union. Example:
   {{{
   w4816743
   w,4816743,4816826
   }}}

Sorting and filtering options:
 * '''r''' - selects a random sample from the concordance. The 'r' character is followed by a natural number or a percentage that specifies the size of the sample (number of lines). Examples:
   {{{
   r250
   r20%
   }}}
 * '''s''' - sorts the concordance. Syntax:
   {{{
   s<attr>/<arg><sp><pos>
   s<attr>/<arg><sp><pos><sp><attr>/<arg><sp><pos>
   s<attr>/<arg><sp><pos><sp><attr>/<arg><sp><pos><sp><attr>/<arg><sp><pos>
   or
   s*<n>
   }}}
   The first three patterns correspond to the sorting options available under the "Sort" menu in the web interface. As patterns 2 and 3 show, multilevel sorting is also available. The last pattern indicates sorting according to the GDEX (good dictionary examples) selection; '''<n>''' stands for a natural number meaning "number of lines to be sorted".[[BR]] Legend to the first three patterns:
   * '''<attr>''' is the particular corpus attribute used. It can also be a structure attribute, e.g. 'doc.id' for sorting according to the document IDs.
   * '''<arg>''' can be 'i', 'r', 'ir' or empty (!''), which means "ignore case", "reverse order", both of them or none of them
   * '''<sp>''' is the space character (' ')
   * '''<pos>''' is either a position or a range. [[BR]]
   * Positions can be referenced as follows:
     * '''integer number''' - where 0 is the first token in KWIC, -1 the rightmost token in the left context etc.
     * '''1:x''' - where x is one of the corpus structures (e.g. "doc" or "s" if the corpus has the particular markup). It means the first token in the structure, except when it is the right boundary of a range; in that case it is the last token in the structure.
       Other numbers can also be used, e.g. -2:x, 3:x, etc. (-1 is the same as 1, meaning "structure containing KWIC")
     * '''a<0''' - where 'a' stands for a position reference as described in the first two points, meaning "'a' positions before/after the first KWIC position" (so this is equivalent to 'a')
     * '''a>0''' - where 'a' stands for the same position reference, meaning "'a' positions before/after the last KWIC position"
     * in the previous two points, if '0' is replaced with a natural number 'k', it means "before/after the 'k'-th collocation" instead of "before/after KWIC". Collocations are special token groups in the context that can be added using positive filters (see below)[[BR]]
   Ranges can be referenced as '''a~b''' where 'a', 'b' stand for token identifiers as above. Examples of positions and ranges:
   * '''-1<0''' - rightmost token in the left context
   * '''3>0''' - third token in the right context
   * '''0>0''' - last token in KWIC
   * '''0<0''' - first token in KWIC
   * '''0<0~0>0''' - range of KWIC
   * '''-1<0~1>0''' - range of KWIC with one token from the left context and one from the right context
   * '''1:s''' - first token in the sentence containing KWIC (or its first token)
   * '''1:s>0''' - first token in the sentence containing KWIC (or its last token)
   * '''0<1''' - first token of the first-added collocation
   Examples:
   {{{
   s*100
   sword/ 1>0~3>0
   slemma/ 0<0~0>0
   sword/i -1
   sword/ 0 word/ir -1<0 tag/r -2<0
   }}}
 * '''n''' - negative filter. Syntax:
   {{{
   n<pos><sp><pos><sp><sel><sp><query>
   }}}
   where:
   * '''<pos>''' stands for a position reference as explained in the "'''s'''" section
   * '''<sp>''' is the space character
   * '''<sel>''' stands for "selected token". Values '-1' = last, '1' = first
   * '''<query>''' stands for a query that, if found between the two specified positions, filters out the particular line of the concordance
   Examples:
   {{{
   n-5 -1 -1 [lemma="drug"]
   n-5 -1 -1 [lc="drug" & tag="J.*"]
   }}}
 * '''p''' - positive filter; similar to the negative filter above.
   Syntax and example:
   {{{
   p<pos><sp><pos><sp><sel><sp><query>
   -----------------------
   p-1 -1 -1 [word="drug"]
   }}}

Other attributes of the "view" method:
 * '''pagesize''' - size (number of lines) of the resulting concordance. Default 20
 * '''fromp''' - number of the page that is returned. Default 1
 * '''kwicleftctx''' - size of the left context in the KWIC view. It can be expressed as:
   * '''<number>''' - number of tokens
   * '''<number>#''' - number of characters (note that the '#' character must be escaped in URLs), e.g. '40#' (default value)
   * '''<number>:<struct>''' - structural context, e.g. '-1:s' stands for a left context of the whole sentence. In the left context, <number> should be negative
 * '''kwicrightctx''' - size of the right context, expressed in the same way. <number> should be positive in the case of the structural notation
 * '''viewmode''' - "KWIC" / "sentence" view mode. Values: 'kwic' (default), 'sen'
 * '''attrs''' - comma-delimited list of attributes that are returned for KWIC tokens. Examples of values: 'word' (default), 'word,lemma', 'lemma,tag,word' etc.
 * '''ctxattrs''' - comma-delimited list of attributes that are returned for context tokens. Examples of values: 'word' (default), 'word,lemma', 'lemma,tag,word' etc.
 * '''structs''' - comma-delimited list of structure tags that are returned/applied. Default: 'p,g'
 * '''refs''' - comma-delimited list of items returned in the "references" field. The default is '#', which stands for the token number, or the value of the SHORTREF option defined in the corpus configuration file. Other possible values are:
   {{{
   <struct.attr>
   =<struct.attr>
   }}}
   where <struct.attr> is an attribute of one of the corpus structures, e.g. {{{ doc.id }}}, {{{ s.n }}} ... The first notation displays the information in the {{{ name=value }}} format, the second one returns only the value.

Note: '''first, reduce, filter, viewattrsx, mlsortx, sortx''' are methods that return the same output as the "view" method, using attributes and values taken from the forms provided by the methods '''first_form, reduce_form, viewattrs, sort'''.
For example, the '''first''' method can take the '''lemma''' attribute and does not need the '''q''' attribute. However, these methods exist mainly to make work with the graphical interface more comfortable and are not universal. For this reason they are not described here.

=== freqs ===

This method provides access to the frequency statistics.

Attributes:
 * '''q''' - query list, the same as for the "view" method. This attribute is '''__required__'''.
 * '''fcrit''' - object of the frequency query, i.e. "the frequency of what are you looking for?" (This attribute is '''__required__'''.) The syntax of its values is very similar to the sorting queries of the "view" method:
   {{{
   <attr>/<arg><sp><pos>
   <attr>/<arg><sp><pos><sp><attr>/<arg><sp><pos>
   <attr>/<arg><sp><pos><sp><attr>/<arg><sp><pos><sp><attr>/<arg><sp><pos>
   }}}
   where everything has the same meaning as in the sorting options, except that <arg> can only be 'i' or empty (!''). Examples of possible values with explanation:
   * '''tag 0~0>0''' - frequency of tags of all KWIC tokens
   * '''tag 0''' - frequency of tags of the first KWIC tokens
   * '''word/ 0 lemma/i -1<0''' - (multilevel) frequency of the first word in KWIC and the last lemma in the left context (ignoring case)[[BR]]
   '''fcrit''' can also be a list; in that case, the output contains several blocks.
 * '''flimit''' - frequency limit. Default 0
 * '''freq_sort''' - identifier of the column by which the output is sorted: the column number (counted from 0), 'freq' (default) for sorting by frequency, or 'rel' for sorting by the "Rel[%]" column (if displayed)
 * '''ml''' - specifies whether the "Rel[%]" column is displayed. '0' (default) stands for yes, '1' for no. ("ml" stands for "Multi-Level style")

Note: '''freqml, freqtt''' are methods that return the same output as the "freqs" method, using attributes and values taken from the forms provided by the '''freq''' method. The situation is similar to that of the "view" method, so these methods are only mentioned here and not described in detail.

=== collx ===

This method computes collocation candidates.
Attributes:
 * '''q''' - query list, the same as for the "view" method. This attribute is '''__required__'''.
 * '''cattr''' - corpus attribute over which the computation is performed. Default is 'word'
 * '''cfromw''' - search range, "from", as a token index (only integers allowed). Default -5
 * '''ctow''' - search range, "to", expressed in the same way. Default 5
 * '''cminfreq''' - minimum frequency in the corpus. Default 5
 * '''cminbgr''' - minimum frequency in the given range. Default 3
 * '''cmaxitems''' - maximum number of displayed lines. Default 50
 * '''cbgrfns''' - list of displayed functions in the form: cbgrfns=f1;cbgrfns=f2;... Default ['t', 'm']
 * '''csortfn''' - function by which the result is sorted. Default 'f'.

Notation of the functions:
 * '''t''' - T-score
 * '''m''' - MI
 * '''3''' - MI3
 * '''l''' - log likelihood
 * '''s''' - min. sensitivity
 * '''c''' - salience
 * '''f''' - frequency

Note: the associated '''coll''' method returns the collocation candidates input form.

=== save* methods ===

This group of methods includes: '''savecoll, saveconc, savefreq, savethes, savewl, savews'''. These methods provide plain text or XML output of the system, i.e. of the methods '''collx, view, freqs, thes, wordlist, wsketch'''. Each save* method takes the same attributes as its parent method. The attributes common to all save* methods are:
 * '''saveformat''' - specifies the format of the output. Values: 'text' (default), 'xml'
 * '''heading''' - specifies whether a simple heading (corpus name, query etc.) is included. Values: '1', '0' (default)

The '''saveconc''' method accepts a few more attributes:
 * '''pages''' - indicates whether the whole concordance is saved (value '0', default) or a particular page only (value '1')
 * '''numbering''' - indicates whether the concordance lines are numbered. Values '1', '0' (default)
 * '''align_kwic''' - indicates whether a simple alignment of KWIC tokens is used. Values '1', '0' (default).
   Relevant only in combination with text output
 * '''maxsavelines''' - maximum number of saved lines. Default 1000

Note: the associated '''savecoll_form, saveconc_form, savefreq_form, savethes_form, savewl_form, savews_form''' methods return the particular forms.

=== subcorp ===

This method creates and deletes subcorpora.

Attributes:
 * '''subcname''' - name of the new subcorpus (or of the subcorpus being deleted). Default None (no operation on subcorpora).
 * '''delete''' - if not empty (the default is empty), the subcorpus is deleted instead of created
 * corpus structural attributes and their values can be used here as attributes and values of the method. The selected values define the span of the subcorpus.

Note: the associated '''subcorp_form''' method returns the subcorpus input form.

== Universal attributes ==

A few attributes can be used with any method:
 * '''corpname''' - corpus name (in the short form, e.g. 'bnc'). This attribute specifies the corpus that will be processed and is '''__required__''' in all methods
 * '''usesubcorp''' - name of the subcorpus that will be processed. The default is empty (!''), which means working with the entire corpus
 * '''format''' - format of the output. The default is empty, which is interpreted as HTML. The only other possible value (so far) is 'json', which means output in the JSON format (see below). The 'json' option does not currently work well with the "wsdiff" method.
 * '''json''' - all input attributes encoded as a string in JSON syntax (see below)

== Using JSON ==

JSON (!JavaScript Object Notation, http://www.json.org/) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. The Sketch Engine allows the JSON format to be used as the input and/or output format.

== JSON input ==

Input in the JSON format can be passed to the Sketch Engine through the universal '''json''' attribute.
All attribute names and values (including numbers and comma-delimited lists) should be encoded as JSON strings (note that quotation marks in CQL queries must be escaped). Lists of values (e.g. for the '''q''' attribute in the '''view''' method) should be encoded as JSON arrays.

Example of a complete query using JSON:
{{{
http://beta.sketchengine.co.uk/auth/corpora/run.cgi/view?json={"corpname":"preloaded/bnc", "q":["q[lemma=\"test\"]", "r250"]}
}}}

== JSON output ==

This section describes the output of the system when the '''format''' attribute is set to '''json'''. The resulting JSON object has a fairly intuitive structure, so it is described only briefly. The description is also not complete: some data are used only internally and describing them might be confusing. For this reason, the examples contain some fields that are not described in the output structure and that might change over time.

The output of all methods listed above is described below. Note that all structure names (JSON objects, arrays) begin with a capital letter, while atom names (strings, numbers) are always lowercase.

=== wordlist ===

Structure of the 'wordlist' query result:
 * '''Items''' - list of items in the word list. One item contains:
   * '''str''' - string expression of the item (e.g. word)
   * '''freq''' - frequency of the item

Structure of the 'keywords' query result:
 * '''Keywords''' - list of selected keyword items. One item contains:
   * '''arf''' - the ARF value
   * '''cfreq''' - frequency in the reference (sub)corpus
   * '''score''' - item score
   * '''sfreq''' - frequency in the selected (sub)corpus
   * '''str''' - string expression of the item (e.g. word)

Example (query and result) - wordlist:
{{{
http://beta.sketchengine.co.uk/auth/corpora/run.cgi/wordlist?corpname=preloaded/bnc;wlattr=word;wlpat=test.*;wlsort=f;wlmaxitems=2;format=json

{
  "Items": [
    { "freq": 11040, "str": "test" },
    { "freq": 4472, "str": "tests" }
  ]
}
}}}

Example (query and result) - keywords:
{{{
http://beta.sketchengine.co.uk/auth/corpora/run.cgi/wordlist?corpname=preloaded/bnc;wlattr=word;keywords=1;usesubcorp=wri-to-be-spoken;wlsort=f;wlmaxitems=2;ref_corpname=preloaded/bnc;format=json

{
  "Keywords": [
    { "arf": 5.9, "cfreq": 402, "score": 679.1, "sfreq": 402, "str": "Video-Tape" },
    { "arf": 47.2, "cfreq": 3765, "score": 679.1, "sfreq": 3765, "str": "Video-Taped" }
  ]
}
}}}

=== wsketch ===

Structure:
 * '''Gramrels''' - list of grammatical relations including all relevant collocates. Contains:
   * '''count''' - overall frequency of the gramrel
   * '''name''' - name of the gramrel
   * '''score''' - overall score of the gramrel
   * '''seek''' - pointer to the concordance (can be used in a '''w'''-type query in the '''view''' method)
   * '''Words''' - list of collocates in the gramrel. Each collocate contains:
     * '''count''' - frequency of the collocate in the gramrel
     * '''score''' - collocate score
     * '''seek''' - collocate pointer to the concordance (can be used in a '''w'''-type query in the '''view''' method)
     * '''word''' - string expression of the collocate[[BR]][[BR]] If 'clustered collocations' are requested, each collocate can contain information about the collocate cluster:
     * '''totalcount''' - overall frequency of the cluster (0 if the cluster is empty)
     * '''totalseek''' - cluster pointer to the concordance (can be used in a '''w'''-type query in the '''view''' method, but must be preceded by a comma (',')) (!'' if the cluster is empty)
     * '''Clust''' - list of words in the cluster; each word has the attributes '''count, score, seek, word''' as described above.
       If the cluster is empty, this attribute is '''__not included__'''.

Example (query and result):
{{{
http://beta.sketchengine.co.uk/auth/corpora/run.cgi/wsketch?corpname=preloaded/bnc;lemma=test;lpos=-n;format=json

{
  "Gramrels": [
    {
      "Words": [
        {
          "Clust": [
            { "count": 32, "id": 848, "score": 12.63, "seek": 4816731, "word": "run" },
            ...
          ],
          "count": 294,
          "id": 1029,
          "score": 43.96,
          "seek": 4816743,
          "totalcount": 384,
          "totalseek": "4816743,4816731,4816760,4816700,4816806,4816675",
          "word": "pass"
        },
        ...
      ],
      "count": 3406,
      "name": "object_of",
      "score": 2.1,
      "seek": 79181
    },
    ...
}}}

=== thes ===

Structure:
 * '''Words''' - list of similar words. Each word contains:
   * '''score''' - word score
   * '''word''' - string expression of the word[[BR]][[BR]] If 'clustered items' are requested, each word can contain information about the word cluster:
   * '''Clust''' - list of words in the cluster; each word has the attributes '''score, word''' as described above. If the cluster is empty, this attribute is '''__not included__'''
 * '''freq''' - frequency of the selected lemma in the corpus

Example (query and result):
{{{
http://beta.sketchengine.co.uk/auth/corpora/run.cgi/thes?corpname=preloaded/bnc;lemma=test;lpos=-n;maxthesitems=6;clusteritems=1;format=json

{
  "Words": [
    {
      "Clust": [
        { "id": 4226, "score": 0.223, "word": "examination" }
      ],
      "id": 941,
      "score": 0.243,
      "totalcount": 0,
      "totalseek": "",
      "word": "assessment"
    },
    ...
  ],
  "commonurl": "corpname=preloaded\/bnc;lemma=test;lpos=-n",
  "freq": 15789,
  "lemma": "test",
  "lpos": "-n"
}
}}}

=== wsdiff ===

This method does not currently provide JSON output.

=== view ===

Structure:
 * '''Lines''' - list of concordance lines. Each line contains:
   * '''Kwic''' - list of KWIC segments (a segment stands for one or more tokens). Each segment contains:
     * '''class''' - class name of the segment (e.g. 'attr' = attribute, 'coll' = collocation etc.)
     * '''str''' - string expression of the segment (attributes are preceded by '\/' for correct display on the HTML page)
   * '''Left''' - list of left context segments (same structure as '''Kwic''')
   * '''Right''' - list of right context segments (same structure as '''Kwic''')
   * '''ref''' - line reference (content of the 'reference' field)
   * '''toknum''' - token number (of the first token in KWIC)
 * '''concsize''' - number of lines in the concordance (or number of hits)
 * '''numofpages''' - number of pages in the concordance

Example (query and result):
{{{
http://beta.sketchengine.co.uk/auth/corpora/run.cgi/view?corpname=preloaded/bnc;q=q[lemma="drug"][lemma="test"];pagesize=2;ctxattrs=word,tag;format=json

{
  "Lines": [
    {
      "Align": [],
      "Kwic": [ { "class": "col0 coll", "str": " drug test" } ],
      "Left": [ { "class": "attr", "str": "\/VM0" }, { "class": "", "str": " be" }, ... ],
      "Right": [ { "class": "", "str": " at" }, ... ],
      "hitlen": ";hitlen=2",
      "leftspace": "",
      "linegroup": "_",
      "ref": "A0M",
      "toknum": 654026
    },
    ...
  ],
  "concsize": 70,
  "fromp": 1,
  "lastlink": "fromp=35",
  "nextlink": "fromp=2",
  "numofpages": 35
}
}}}

=== freqs ===

Structure:
 * '''Blocks''' - list of frequency blocks (tables). Each table contains:
   * '''Head''' - list of the table headings. Each heading contains:
     * '''n''' - string representation of the heading (name of the column)
     * '''s''' - ID of the column; can be used as a value of the '''freq_sort''' attribute
   * '''Items''' - list of lines in the table. Each line contains:
     * '''Word''' - list of items in the left part of the table (i.e. all columns except the "Freq" and "Rel[%]" columns). Each item contains:
       * '''n''' - string representation of the item
     * '''freq''' - frequency (content of the "Freq" column)
     * '''rel''' - content of the "Rel[%]" column.
       If the column is not present, this attribute is '''__not included__'''.

Example (query and result):
{{{
http://beta.sketchengine.co.uk/auth/corpora/run.cgi/freqs?q=q[lemma="test"];corpname=preloaded/bnc;fcrit=word/+0+lemma/+0+tag/+0;flimit=3000;ml=1;format=json

{
  "Blocks": [
    {
      "Head": [
        { "n": "word", "s": 0 },
        { "n": "lemma", "s": 1 },
        { "n": "tag", "s": 2 },
        { "n": "Freq", "s": "freq" }
      ],
      "Items": [
        {
          "Word": [ { "n": "test" }, { "n": "test" }, { "n": "NN1" } ],
          "fbar": 301,
          "freq": 8609,
          "norel": 1
        },
        ...
}}}

=== collx ===

Structure:
 * '''Head''' - list of table headings. Each heading contains:
   * '''n''' - name of the column. Can be empty.
   * '''s''' - column ID. Can be used as a value of the '''csortfn''' attribute. If '''n''' is empty, this is '''__not included__'''
 * '''Items''' - list of table lines. Each line contains:
   * '''Stats''' - list of the statistics in the line (in the same order as in the heading). Each statistic contains:
     * '''n''' - the value itself (content of the column)
   * '''freq''' - collocation frequency
   * '''str''' - string expression of the collocate

Example (query and result):
{{{
http://beta.sketchengine.co.uk/auth/corpora/run.cgi/collx?q=q[lemma="test"];corpname=preloaded/bnc;csortfn=m;format=json

{
  "Head": [
    { "n": "" },
    { "n": "Freq", "s": "f" },
    { "n": "T-score", "s": "t" },
    { "n": "MI", "s": "m" }
  ],
  "Items": [
    {
      "Stats": [ { "s": "2.828" }, { "s": "12.938" } ],
      "freq": 8,
      "nfilter": "q=n-5 5 1 [word=\"Belvin\"]",
      "pfilter": "q=p-5 5 1 [word=\"Belvin\"]",
      "str": "Belvin"
    },
    ...
}}}

=== save* methods ===

These methods return the same output as their parent methods (see above); using them for JSON output is discouraged.

=== subcorp ===

Structure:
 * '''!SubcorpList''' - list of available subcorpora.
   Each subcorpus contains:
   * '''n''' - name of the subcorpus[[BR]]
 Fields available only if a new subcorpus is created:
 * '''corpsize''' - size of the parent corpus (number of tokens)
 * '''subcsize''' - size of the created subcorpus (number of tokens)

Example (query and result):
{{{
http://beta.sketchengine.co.uk/auth/corpora/run.cgi/subcorp?corpname=preloaded/bnc;format=json

{
  "SubcorpList": [
    { "n": "book" },
    { "n": "wri-to-be-spoken" }
  ]
}
}}}

== Examples ==

This section shows several examples of how the Sketch Engine can be accessed automatically from a program, mainly using the JSON format. The set of examples is expected to grow over time.

=== Example 1 ===

This example shows how to connect to the Sketch Engine server, send a query (in this case a simple word list query) and parse the result as JSON. It is available for Java and Python. Note that many modules for JSON parsing are available; you do not have to use the one from the examples.

Example 1 - Java source:
{{{
package jsonexample;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.Authenticator;
import java.net.PasswordAuthentication;
import java.net.URL;
import org.json.*;

public class Main {

    public Main() {
    }

    public static void main(String[] args) {
        String data;
        // URL with the query
        String url_string = "http://beta.sketchengine.co.uk/auth/corpora/run.cgi/wordlist?corpname=preloaded/bnc;wlattr=word;wlminfreq=5;wlmaxitems=100;wlpat=test.*;format=json";
        final String usr = "";
        final String passwd = "";

        // authentication
        Authenticator auth = new Authenticator() {
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication(usr, passwd.toCharArray());
            }
        };
        Authenticator.setDefault(auth);

        try {
            // connecting to the Sketch Engine server
            URL url = new URL(url_string);
            InputStream stream = url.openStream();
            InputStreamReader isr = new InputStreamReader(stream);
            BufferedReader reader = new BufferedReader(isr);

            // receiving the JSON data
            data = reader.readLine(); // the JSON data are on the first line

            // now the 'data' variable holds a JSON string
            // that can be parsed as JSON
            JSONObject json = new JSONObject(data);
            System.out.println(json.toString(3));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
}}}

Example 1 - Python source:
{{{
#!/usr/bin/python

import urllib2, base64
import simplejson

url = 'http://beta.sketchengine.co.uk/auth/corpora/run.cgi/wordlist?corpname=preloaded/bnc;wlattr=word;wlminfreq=5;wlmaxitems=100;wlpat=test.*;format=json'
usr = ''
passwd = ''
request = urllib2.Request(url)

# authentication
base64string = base64.encodestring('%s:%s' % (usr, passwd))[:-1]
request.add_header("Authorization", "Basic %s" % base64string)

# receiving the JSON data
file = urllib2.urlopen(request)
data = file.read()
file.close()

# now the 'data' variable holds a JSON string that can be parsed
# as JSON (e.g. by simplejson)
json_obj = simplejson.loads(data)
print simplejson.dumps(json_obj, sort_keys=True, indent=3)
}}}

=== Example 2 ===

This example shows an easy way to convert common structures (dictionaries in Python, Maps in Java) to JSON objects and to use the resulting JSON objects as a query to the Sketch Engine. It is available for Java and Python.

Example 2 - Java sample: [http://trac.sketchengine.co.uk/attachment/wiki/SkE/Methods/example2.java (view full source)]
{{{
String data, url_string;
String base_url = "http://beta.sketchengine.co.uk/auth/corpora/run.cgi/";
String method = "wordlist";
Map attrs;
JSONObject json_query;
...
// creating the query string
attrs = new HashMap();
attrs.put("corpname", "preloaded/bnc");
attrs.put("wlattr", "word");
attrs.put("wlpat", "test.*");
attrs.put("format", "json");
json_query = new JSONObject(attrs);
url_string = base_url + method + "?json=" + json_query.toString();
}}}

Example 2 - Python sample: [http://trac.sketchengine.co.uk/attachment/wiki/SkE/Methods/example2.py (view full source)]
{{{
import urllib, urllib2, base64
import simplejson
...
base_url = 'http://beta.sketchengine.co.uk/auth/corpora/run.cgi/'
method = 'wordlist'

# creating the query string
attrs = dict(corpname='preloaded/bnc', wlattr='word', wlpat='test.*', format='json')
encoded_attrs = urllib.quote(simplejson.JSONEncoder().encode(attrs))
url = base_url + method + '?json=%s' % encoded_attrs
}}}
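The same idea can be applied to the "view" method, where the '''q''' attribute is a list of queries. The following Python 3 sketch uses only the standard library and needs no network access: it builds a JSON-encoded query and then reads the KWIC text from a response. The helper names `json_query_url` and `kwic_strings` are hypothetical, and the sample response is a hand-trimmed stand-in based on the "view" example in the JSON output section, not real server output.

```python
import json
from urllib.parse import quote, unquote

BASE_URL = "http://beta.sketchengine.co.uk/auth/corpora/run.cgi/"


def json_query_url(method, attrs, base_url=BASE_URL):
    """Pass all attributes through the universal 'json' attribute."""
    # json.dumps escapes the quotation marks of the CQL query.
    encoded = quote(json.dumps(attrs), safe="")
    return "%s%s?json=%s" % (base_url, method, encoded)


def kwic_strings(result):
    """Return the KWIC text of every concordance line in a 'view' result."""
    texts = []
    for line in result.get("Lines", []):
        segments = line.get("Kwic", [])
        texts.append("".join(seg["str"] for seg in segments).strip())
    return texts


attrs = {
    "corpname": "preloaded/bnc",
    # basic search query first, then a random sample of 250 lines
    "q": ['q[lemma="drug"][lemma="test"]', "r250"],
    "pagesize": "2",
    "format": "json",
}
url = json_query_url("view", attrs)

# Decoding the 'json' attribute again gives the original attributes back.
decoded = json.loads(unquote(url.split("?json=", 1)[1]))

# Hand-trimmed stand-in for a server response (assumption, see above).
sample = json.loads(
    '{"Lines": [{"Kwic": [{"class": "col0 coll", "str": " drug test"}],'
    ' "ref": "A0M", "toknum": 654026}], "concsize": 70, "numofpages": 35}'
)
print(url)
print(sample["concsize"], kwic_strings(sample))
```

In a real program, the `sample` object would be replaced by the parsed response of an authenticated request, as in Example 1.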