WordSmith Tools
Version 5.0
by Mike Scott
© 2007 Mike Scott
All rights reserved. No parts of this work may be reproduced in any form or by any means - graphic, electronic,
or mechanical, including photocopying, recording, taping, or information storage and retrieval systems - without
the written permission of the publisher.
Products that are referred to in this document may be either trademarks and/or registered trademarks of the
respective owners. The publisher and the author make no claim to these trademarks.
While every precaution has been taken in the preparation of this document, the publisher and the author assume
no responsibility for errors or omissions, or for damages resulting from the use of information contained in this
document or from the use of programs and source code that may accompany it. In no event shall the publisher
and the author be liable for any loss of profit or any other commercial damage caused or alleged to have been
caused directly or indirectly by this document.
Printed: September 2007
Publisher: Lexical Analysis Software
Special thanks to:
All the people who contributed to this document by testing
WordSmith Tools in its various incarnations, especially those who
reported problems and sent me suggestions.
Table of Contents
Foreword
Part I WordSmith Tools
Part II Overview
  1 What's new in version 5
  2 Controller
  3 Concord
  4 KeyWords
  5 WordList
  6 Utilities
    Choose Languages
    File Utilities
    File Viewer
    Minimal Pairs
    Splitter
    Text Converter
    Version Checker
    Viewer
    WebGetter
Part III Getting Started
  1 getting started with Concord
  2 getting started with KeyWords
  3 getting started with WordList
Part IV Installation and Updating
  1 installing WordSmith Tools
  2 network defaults
  3 version checking
Part V Controller
  1 accents
  2 add notes
  3 adjust settings
  4 advanced settings
  5 batch folders
  6 batch processing
  7 choose favourite texts
  8 choose language
  9 choose texts
  10 choosing files from standard dialogue box
  11 class or session instructions
  12 colours
  13 column totals
  14 compute new column of data
  15 copy your results
  16 count data frequencies
  17 custom processing
  18 custom settings
  19 editing a list of data
  20 editing column headings
  21 find relevant files
  22 fonts
  23 general settings
  24 layout & format
  25 match words in list
  26 never used WordSmith before
  27 previous lists
  28 print and print preview
  29 quit WordSmith
  30 reduce data to n entries
  31 save as text
  32 save defaults
  33 save results
  34 search & replace
  35 search by typing
  36 search for word or part of word
  37 see filenames
  38 stop lists
  39 suspend processing
  40 text and languages
  41 window management
  42 zap unwanted lines
Part VI Tags and Markup
  1 overview
  2 tag-types
  3 handling tags
  4 multimedia tags
  5 tags as selectors
  6 only if containing...
  7 selecting within texts
  8 making a tag file
  9 start and end of text segments
  10 modify source texts
Part VII Concord
  1 purpose
  2 index
  3 what is a concordance
  4 blanking
  5 categories
  6 collocate horizons
  7 collocate settings
  8 collocate highlighting in concordance
  9 collocates display
  10 collocation relationship
  11 collocation
  12 Concord: clusters
  13 Concord: dispersion
  14 Concord: saving and printing
  15 Concord: viewing options
  16 Concord: handling sounds & video
  17 Concord: what you see and do
  18 concordance settings
  19 concordancing on tags
  20 context word
  21 editing concordances
  22 file-based search-words
  23 follow-up
  24 nearest tag
  25 patterns
  26 remove duplicates
  27 re-sorting
  28 re-sorting: collocates
  29 re-sorting: dispersion plot
  30 text segments in Concord
  31 search word syntax
  32 WordSmith controller: Concord: settings
Part VIII KeyWords
  1 purpose
  2 index
  3 Two word-list analysis
  4 associate definition
  5 associates
  6 choosing files
  7 clumps
  8 KeyWords clusters
  9 concordance
  10 creating a database
  11 example of key words
  12 key key-word definition
  13 key-ness definition
  14 KeyWords database
  15 KeyWords: advice
  16 KeyWords: calculation
  17 KeyWords: links
  18 make a word list from keywords data
  19 p value
  20 plot calculation
  21 plot display
  22 regrouping clumps
  23 re-sorting: KeyWords
  24 the key words screen
  25 WordSmith controller: KeyWords settings
Part IX WordList
  1 purpose
  2 index
  3 auto-joining lemmas
  4 choosing lemma file
  5 comparing wordlists
  6 merging wordlists
  7 comparison display
  8 consistency analysis (detailed)
  9 consistency analysis (simple)
  10 lemmas
  11 index lists: uses
  12 index lists: viewing
  13 making a WordList Index
  14 index clusters
  15 menu search
  16 mutual information scores
  17 mutual information: computing
  18 mutual information display
  19 re-sorting: consistency lists
  20 statistics
  21 import words from text list
  22 type/token ratios
  23 case sensitivity
  24 minimum & maximum settings
  25 sort order
  26 WordList and tags
  27 WordList display
  28 WordSmith controller: WordList settings
Part X Utility Programs
  1 Convert Data from Previous Versions
    Convert Data from Previous Versions
  2 WebGetter
    overview
    settings
    display
    limitations
  3 Languages Chooser
    Overview
    Language
    Font
    Sort Order
    Other Languages
    saving your choices
  4 Minimal Pairs
    aim
    requirements
    choosing your files
    output
    rules and settings
    running the program
  5 File Viewer
    Using File Viewer
  6 File Utilities
    index
    Splitter
      Splitter: index
      aim of Splitter
      Splitter: filenames
      Splitter: wildcards
    join text files
    compare two files
    file chunker
    find duplicates
    rename
  7 Text Converter
    purpose
    Text Converter: index
    Text Converter: extracting from files
    Text Converter: settings
    Text Converter: syntax
    Convert Text File Format
    Text Converter: move if
    Text Converter: copy to
    Text Converter conversion file
......................................................................................................................................................... 198Text Converter: sample conversion file
Part XI Viewer and Aligner
    1 purpose
    2 index
    3 aligning with Viewer
    4 aligning and moving
    5 editing
    6 languages
    7 numbering sentences & paragraphs
    8 options
    9 reading in a plain text
    10 sentence joining and splitting
    11 settings
    12 technical aspects
    13 translation mis-matches
    14 troubleshooting
    15 unusual sentences
Part XII Reference
    1 32-bit version
    2 acknowledgements
    3 API
    4 bibliography
    5 bugs
    6 Character Sets
        overview
        accents & symbols
        ansi and ascii
        DOS
        Windows
        Unicode
        UTF8
    7 clipboard
    8 contact addresses
    9 date format
    10 Definitions
        definitions
        word separators
    11 demonstration version
    12 edit v. type-in mode
    13 file types
    14 finding source texts
    15 folders\directories
    16 formulae
    17 HistoryList
    18 HTML, SGML and XML
    19 hyphens
    20 international versions
    21 limitations
    22 tool-specific limitations
    23 links between tools
    24 keyboard shortcuts
    25 long file names
    26 machine requirements
    27 manual for WordSmith Tools
    28 menu and button options
    29 numbers
    30 plot dispersion value
    31 RAM availability
    32 reference corpus
    33 restore last file
    34 selecting multiple entries
    35 single words v. clusters
    36 speed
    37 status bar
    38 tools for pattern-spotting
    39 version information
    40 zip files
Part XIII Troubleshooting
    1 list of FAQs
    2 apostrophes not found
    3 column spacing
    4 Concord tags problem
    5 Concord/WordList mismatch
    6 crashed
    7 demo limit
    8 funny symbols
    9 illegible colours
    10 keys don't respond
    11 pineapple-slicing
    12 printer didn't print
    13 too slow
    14 won't start
    15 word list out of order
Part XIV Error Messages
    1 list of error messages
    2 .ini file not found
    3 base list error
    4 can only save words as ASCII
    5 can't call other tool
    6 can't make folder as that's an existing filename
    7 can't compute key words as languages differ
    8 can't merge list with itself!
    9 can't read file
    10 character set reset to and and
, type and , type those in here. Again, whatever you type is case sensitive.
start & end of paragraph
For the Tools to recognise paragraphs, they need to know what constitutes a paragraph start
and/or end, e.g. a sequence of two
, etc.), or * to represent any number of
characters (e.g. will pick up , , etc.). Otherwise,
prepare your tag list file in the same way as for Stop Lists.
Use Notepad or any other plain text editor to create a new .tag file. Write one entry on each line.
Any number of pre-defined tags can be stored, but the more you use, the more work WordSmith
has to do, and the more time and memory it will take.
Mark-up to EXclude
A tag file for stretches of mark-up like this or might represent the beginning of a paragraph and might represent the beginning of a sentence and the end. If you leave
the choice as auto, ends of sentences are determined by full stops or question marks or
exclamation marks followed by a capital letter.
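The "auto" rule for sentence ends can be illustrated with a short sketch. It assumes that whitespace separates the end mark from the capital letter, which the text above does not spell out, and the function name split_sentences is hypothetical rather than part of WordSmith.

```python
import re

# Assumed reading of the "auto" rule: a full stop, question mark or
# exclamation mark, then whitespace, then a capital letter.
SENTENCE_END = re.compile(r"(?<=[.?!])\s+(?=[A-Z])")

def split_sentences(text):
    """Split text wherever an end mark is followed by a capital letter."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]

sample = "It rained. we stayed in. Was it cold? Yes! Very cold."
for sentence in split_sentences(sample):
    print(sentence)
```

Note that "It rained. we stayed in." stays in one piece because no capital follows the first full stop. A rule this simple will also split after abbreviations such as "Dr. Smith".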
Paragraphs
For example, ,
, etc. will be cut out of the concordance unless you specify them in a tag file. If so, specify the tag file and run the concordance again.
You can also display tags in colour, or even hide the tags -- yet still colour the tagged word. Here is a concordance of this in the BNC World Edition text with the tags in colour:
and here is a view showing the same data, with View | Hide Tags selected.
The tags themselves are no longer visible, and only 6 types of tag have been chosen to be viewed in colour.
See also: Guide to handling the BNC, Overview of Tags, Handling Tags, Making a Tag File, Tagged Texts, Types of Tag, Viewing the Tags, Using Tags as Text Selectors
7.25 patterns
When you have a collocation window open, one of the tab windows shows "Patterns". This will show the collocates (words adjacent to the search word), organised in terms of frequency within each column. That is, the top word in each column is the word most frequently found in that position. The second word is the second most frequent.
In R1 position (one word to the right of the search-word love) there seem to be both intimate (thee) and formal (you) pronouns associated with love in Shakespeare. And looking at L1 position it seems that speakers talk more of their love for another than of another's love for them.
The minimum frequency for a word to be shown at all is the minimum frequency set for collocates.
The point of it...
The effect is to make the most frequent items in the neighbourhood of the search word "float up" to the top. Like collocation, this helps you to see lexical patterns in the concordance.
You can also highlight any given pattern collocate in your concordance display.
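The Patterns display can be approximated in a few lines. This is a minimal sketch, not Concord's own code: the function patterns, its parameters and the sample lines are all invented for illustration, and it assumes the minimum frequency is applied within each column.

```python
from collections import Counter

def patterns(lines, horizon=2, min_freq=1):
    """lines: list of (tokens, node_index) pairs.
    Returns {position label: [(word, freq), ...]}, most frequent word first."""
    columns = {}
    offsets = list(range(-horizon, 0)) + list(range(1, horizon + 1))
    for offset in offsets:
        counts = Counter()
        for tokens, node in lines:
            i = node + offset
            if 0 <= i < len(tokens):
                counts[tokens[i].lower()] += 1
        label = f"L{-offset}" if offset < 0 else f"R{offset}"
        # Assumption: the threshold applies per column; ties sort alphabetically.
        columns[label] = sorted(
            ((w, f) for w, f in counts.items() if f >= min_freq),
            key=lambda wf: (-wf[1], wf[0]),
        )
    return columns

# Invented concordance lines for the search word "love".
sample_lines = [
    ("i do love thee".split(), 2),
    ("my love for you".split(), 1),
    ("i love you well".split(), 1),
    ("my love for thee".split(), 1),
]

for position, words in patterns(sample_lines, horizon=1).items():
    print(position, words)
```

Each column is sorted independently, so a row of the display does not represent a phrase. It only shows which words are most frequent at each position.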
7.26 remove duplicates
The problem
Sometimes one finds that text files contain duplicate sections, either because the corpus has become corrupted through being copied numerous times onto different file-stores or because they were not edited effectively, e.g. a newspaper has several different editions in the same file. The result can sometimes be that you get a number of repeated concordance lines.
Solution
If you choose Edit | Remove Duplicates, Concord goes through your concordance lines and if it finds any two where the stored concordance lines are identical, regardless of the filename, date etc., it will mark one of these for deletion. That is, it checks all the "characters to save" to see whether the two lines are identical. If you set this to 150 or so it is highly unlikely that false duplicates will be identified, since every single character, comma, space etc. would have to match.
Check before you zap...
At the end it will sort all the lines so you can see which ones match each other before you decide finally to zap the ones you really don't want.
7.27 re-sorting
When a concordance is generated, its lines appear in the order in which they occur in the source text file(s): file order.
How to do it...
Sorting can be done simply by pressing the top row of any list. Or by pressing F6 / Ctrl/F6. Or by choosing the menu option.
The point of it...
The point of re-sorting is to find characteristic lexical patterns. It can be hard to see overall trends in your concordance lines, especially if there are lots of them. By sorting them you can separate out multiple search words and examine the immediate context to left and right. For example you may find that most of the entries have "in the" or "in a" or "in my" just before the search word -- sorting by the second word to the left of the search word will make this much clearer.
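Sorting by position in this way, with optional tie-breakers, can be sketched as follows. The names word_at and sort_lines, the offset convention (-1 for L1, 1 for R1) and the sample lines are assumptions made for this example, not Concord's internals.

```python
def word_at(tokens, node, offset):
    """Lower-cased word at node + offset; empty string if outside the line."""
    i = node + offset
    return tokens[i].lower() if 0 <= i < len(tokens) else ""

def sort_lines(lines, main, sort2=None, sort3=None):
    """Sort (tokens, node_index) pairs on up to three positions.
    Later positions only matter where the earlier ones tie."""
    offsets = [o for o in (main, sort2, sort3) if o is not None]
    return sorted(
        lines,
        key=lambda line: tuple(word_at(line[0], line[1], o) for o in offsets),
    )

# Invented concordance lines for the search word "park".
sample_lines = [
    ("walk in the park today".split(), 3),
    ("sat in a park bench".split(), 3),
    ("near my park".split(), 2),
    ("out of the park gates".split(), 3),
]

for tokens, _ in sort_lines(sample_lines, main=-1, sort2=1):
    print(" ".join(tokens))
```

With main=-1 alone, the two lines with "the" at L1 keep their original relative order. Adding sort2=1 breaks that tie on the word at R1, so "gates" comes before "today".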
Sorting is by a given number of words to the left or right (L1 [=1 word to the left of the search word], L2, L3, L4, L5, R1 [=1 to the right], R2, R3, R4, R5), on the search word itself, the context word (if one was specified), the nearest tag, the distance to the nearest tag, a set category of your own choice, or original file order (file).
Main Sort
The listing can be sorted by three criteria at once. A Main Sort on Left 1 (L1) will sort the entries according to the alphabetical order of the word immediately to the left of the search word. A second sort (Sort 2) on R2 acts as a tie-breaker: only where the L1 words (immediately to the left of the search word) match exactly does it re-order the lines, placing them in alphabetical order of the word 2 to the right of the search word. For very large concordances you may find the third sort (Sort 3) useful: this is an extra tie-breaker in cases where the second sort matches. For many purposes tie-breaking is unnecessary, and will be ignored if the first and second sorts are the same (e.g. Left 1 and Left 1) or if the "activated" box is not checked.
sorting by set (user-defined categories)
You can also sort by set, if you have chosen to classify the concordance lines according to your own scheme, using letters from A to Z or a to z or longer strings. The sort will put the classified lines first, in category order, followed by any unclassified lines (which will appear in a light grey colour).
See Nearest Tag for details of sorting by tags.
The colour of the search word will change according to the sort system used.
other sorts
As the screenshot below shows, you can also sort by a number of other criteria, most of these accessible simply by clicking on their column header.
The "contextual frequency" sort means sorting on the average ranking frequency of all the words in each concordance line which don't begin with a capital letter.
For this you will be asked to specify your reference corpus wordlist. The result will be to sort those lines which contain "easy" (highly frequent) words at the top of the list.
All
By default you sort all the lines; you may however type in for example 5-49 to sort those lines only.
Ascending
If this box is checked, sort order is from A to Z, otherwise it's from Z to A.
See also: WordList sort, KeyWords sort, Choosing Language
7.28 re-sorting: collocates
The frequency-ordered collocation display can be re-sorted by total frequency overall (the default), by the left or right frequency total, or by any individual frequency position, from 25 words to the left of the search word to 25 words to the right. Just press the header of a column to sort it. Press again to toggle the sort between ascending and descending.
The point of it...
is to find patterns of collocation, so as to more fully understand the company your search-word keeps.
The choices depend on the collocation horizons.
See also: Collocation, Collocation Display
7.29 re-sorting: dispersion plot
This automatically re-sorts the dispersion plot, rotating through these options:
    alphabetically (by file-name)
    in frequency order (in terms of hits per 1,000 words of running text)
    by first occurrence in the source text(s): text order
    by range: the gap between first and last occurrence in the source text.
See also: Dispersion Plot
7.30 text segments in Concord
A concordance line brings with it information about which segment of the text it was found in.
In the screenshot below, a concordance on year was carried out; the listing has been sorted by Heading Position -- in the top 2 lines, year is found as the 3rd word of a heading.
The advantage of this is that it is possible to identify search-words occurring near sentence starts, near the beginning of sections, of headings, of paragraphs.
See also: Start and end of text segments.
7.31 search word syntax
By default, Concord does a whole-word non-case-sensitive search.
Examples
    search word      finds
    book             Book or book or BoOk
    book*            book, books, booking, booked
    *book            textbook (but not textbooks)
    bo* in           book in, books in, booking in (but not book into)
    book * hotel     book a hotel, book the hotel, book my hotel
    bo* in*          book in, books in, booking in, book into
    book?            book, books, book; book.
    book^            book, books
    b^^k             book, back, bank, etc.
    ==book==         book (but not BOOK or Book)
    book/paperback   book or paperback

    symbol   meaning                                                                    examples
    *        disregard the end of the word, disregard a whole word                      tele*   *ness   *happi*   book * hotel
    ?        any single character (including punctuation) will match here               Engl???   ?50.00
    #        any single digit, 0 to 9                                                   $###   ##.00
    ^        any single letter of the alphabet will match here                          Fr^nc^
    ==       case sensitive                                                             ==French==   ==Fr*==
    :\       means use a file for lots of search-words (see file-based search_words)    c:\text\frd.txt
    /        separates alternative search-words, within an 80-character overall limit   may/can/will
    <>       beginning & end of tags
If you want to use *, ? , == , #, ^ , :\, >, < or / as a character in your search word, put it in double quotes.
Examples:
    "*"   Why"?"   and"/"or   ":\"   "<"
Don't forget that question-marks come at the end of words (in English anyway) so you might need *"?"
Tags
You can also specify tags in your search-word if your text is tagged.
Examples:
    symbol   meaning                                  examples
    *        single common noun (BNC)                 book, chair, elephant
    *        singular or plural common noun           book, chairs
    t*       any single noun beginning with T or t    table, teacher
    * *      two single common nouns in sequence      campaign manager
See also: Tag Concordancing, Context Word, Modify source texts
7.32 WordSmith controller: Concord: settings
These are found in the main Controller under Adjust Settings | Concord. This is because some of the choices -- e.g.
collocation horizons -- may affect other Tools.
WHAT YOU GET and WHAT YOU SEE
There are 2 tabs for settings affecting What you get in the concordance and What you see in the display. There is a screenshot below showing the options under What you see.
WHAT YOU GET
Entries Wanted
The maximum is more than 2 billion lines. This feature is useful if you're doing a number of searches and want, say, 100 examples of each. The 100 entries will be the first 100 found in the texts you have selected. If you search for more than 1 search-word (e.g. book/paperback), you will get 100 of book and 100 of paperback.
"at random" is a feature which allows you to randomise the search. Here Concord goes through the text files and gets the 100 entries by giving each hit a random one-in-three chance of being selected. To get 100 entries Concord will have found around 250-350 hits. You can set the randomiser anywhere from 1 in 2 to 1 in 1,000.
Characters to save
Here is where you set how many characters in a concordance line will be stored as text as the concordance is generated. The default is 80 (minimum 20 and maximum 8,000). The reason for this is that you will probably want a fixed number of characters so that when using a non-proportional font, such as Courier or Lucida Console, the search-words line up nicely. This number of characters will be saved when you save your results, so even if you subsequently delete the source text file you can still see some context. If you grow the lines more text will be read in (and stored) as needed.
In this section you can also specify markers for your search-word and context-word.
Collocates
By default, Concord will compute collocates as well as the concordance, but you can set it not to if you like (Minimal processing). For further details, see Collocate Horizons or Collocation.
Collocates relation statistic
Choose between Specific Mutual Information, MI3, Z Score, Log Likelihood.
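To give a feel for how three of these statistics behave, here is a sketch using common textbook formulations. These are assumptions and may not match WordSmith's own calculations: the expected joint frequency is taken as f(node) x f(collocate) / corpus size, ignoring the span, and all function names are hypothetical. Log Likelihood is left out of this sketch.

```python
import math

def expected(f_node, f_coll, n_tokens):
    """Expected joint frequency if node and collocate were independent."""
    return f_node * f_coll / n_tokens

def mutual_information(joint, f_node, f_coll, n_tokens):
    """log2 of observed over expected joint frequency (joint must be > 0)."""
    return math.log2(joint / expected(f_node, f_coll, n_tokens))

def mi3(joint, f_node, f_coll, n_tokens):
    """Like MI, but the observed frequency is cubed, favouring frequent pairs."""
    return math.log2(joint ** 3 / expected(f_node, f_coll, n_tokens))

def z_score(joint, f_node, f_coll, n_tokens):
    """Distance of observed from expected, in approximate standard deviations."""
    e = expected(f_node, f_coll, n_tokens)
    return (joint - e) / math.sqrt(e)

# Invented figures: node occurs 10 times, collocate 20 times,
# together 5 times, in a 1,000-word corpus.
args = (5, 10, 20, 1000)
print(mutual_information(*args), mi3(*args), z_score(*args))
```

In these formulations MI tends to reward rare, exclusive pairings, while MI3 gives more weight to pairs that co-occur often, which is one reason rankings can differ from one statistic to another.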
See Mutual Information Display for examples of how these can differ.
WHAT YOU SEE
Sort preferences
By default, Concord will sort a new concordance in original file order, but you can set this to different values if you like. For further details, see Sorting a Concordance.
Concordance view
You can choose different ways of seeing the data, and a whole set of choices as to what columns you want to display for each new concordance. You can re-instate any later if you wish by changing the Layout.
    hide search-word = blank it out, e.g. to make a guess-the-word exercise
    hide undefined tags = hide those not defined in your tag file
    hide tag file tags = hide all tags including undefined ones
    hide words = show only the tags
    cut spaces = remove any double spaces
    sentence only = show the context only up to its left and right sentence boundaries
    raw numbers = show the raw data instead of percentages e.g. for sentence position
See also: Concord Saving and Printing, Concord Help Contents, Collocation Settings.
8 KeyWords
8.1 purpose
This is a program for identifying the "key" words in one or more texts. Key words are those whose frequency is unusually high in comparison with some norm.
The point of it...
Key-words provide a useful way to characterise a text or a genre. Potential applications include: language teaching, forensic linguistics, stylistics, content analysis, text retrieval.
The program compares two pre-existing word-lists, which must have been created using the WordList tool. One of these is assumed to be a large word-list which will act as a reference file. The other is the word-list based on one text which you want to study.
The aim is to find out which words characterise the text you're most interested in, which is automatically assumed to be the smaller of the two texts chosen. The larger will provide background data for reference comparison.
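The comparison of a small word-list against a large reference word-list can be sketched as below. This is an illustration under stated assumptions, not the KeyWords procedure itself (see How Key Words are Calculated for that): it uses a widely used two-term log-likelihood formula, the threshold 6.63 is the chi-square critical value for p < 0.01 with one degree of freedom, and the function names and word lists are invented.

```python
import math

def log_likelihood(freq_text, size_text, freq_ref, size_ref):
    """Two-term log-likelihood for one word across two word lists."""
    total = freq_text + freq_ref
    e_text = size_text * total / (size_text + size_ref)
    e_ref = size_ref * total / (size_text + size_ref)
    ll = 0.0
    if freq_text:
        ll += freq_text * math.log(freq_text / e_text)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / e_ref)
    return 2 * ll

def key_words(text_list, ref_list, min_ll=6.63):
    """Words relatively more frequent in text_list than in ref_list,
    ordered by keyness. Both arguments map word -> frequency."""
    size_text = sum(text_list.values())
    size_ref = sum(ref_list.values())
    scored = []
    for word, freq in text_list.items():
        freq_ref = ref_list.get(word, 0)
        if freq / size_text > freq_ref / size_ref:  # positive keyness only
            ll = log_likelihood(freq, size_text, freq_ref, size_ref)
            if ll >= min_ll:
                scored.append((word, ll))
    return sorted(scored, key=lambda item: -item[1])

# Invented word lists: a 100-word text and a 1,000-word reference corpus.
text_list = {"wine": 30, "the": 60, "red": 10}
ref_list = {"wine": 5, "the": 600, "red": 20, "of": 375}

for word, score in key_words(text_list, ref_list):
    print(word, round(score, 2))
```

Here "the" is frequent in the text but equally frequent, proportionally, in the reference list, so it is not key. High frequency alone does not make a word key.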
Key-words and links between them can be plotted, made into a database, and grouped according to their associates.
8.2 index
Explanations
    What is the Keywords program and what's it for?, How Key Words are Calculated, 2-Wordlist Analysis, Key words display, Key words plot, Key words plot display, Plot-Links, Batch Analyses, Database of Key Key-Words, Associates, Clumps, Limitations
Settings and Procedures
    Calling up a Concordance, Choose Word Lists, Colours, Database, Folders, Fonts, Keyboard Shortcuts, Printing, Re-sorting, Exiting
Tips
    KeyWords Advice, Window Management
Definitions
    General Definitions, Key-ness, Key key-word, Associate
See also: WordSmith Main Index
8.3 Two word-list analysis
The usual kind of KeyWords analysis. It compares the one text file (or corpus) you're chiefly interested in with a reference corpus based on a lot of text.
Choose Word Lists
In the dialogue box you will choose 2 files: the text file in the box above and the reference corpus file in the box below.
See also: How Key Words are Calculated, KeyWords Settings
8.4 associate definition
An "associate" of key-word X is another key-word (Y) which co-occurs with X in a number of texts. It may or may not co-occur in proximity to key-word X. (A collocate would have to occur within a given distance of it, whereas an associate is "associated" by being key in the same text.)
For example, in a key-word database of Guardian newspaper text, wine was found to be a key word in 25 out of 299 stories from the Saturday "tabloid" page, thus a key key word in this section. The top associates of wine were: wines, Tim, Atkin, dry, le, bottle, de, fruit, region, chardonnay, red, producers, beaujolais.
It is strikingly close to the early notion of "collocate".
Association operates in various ways. It can be strong or weak, and it can be one-way or two-way. For example, the association between to and fro is one-way (to is nearly always found near fro but it is rare to find fro near to).
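The definition of an associate (key in the same texts as X, regardless of distance) can be sketched as a simple count over a key-word database. The data structure, the function name associates and the min_shared parameter are all hypothetical, and the database contents are invented.

```python
from collections import Counter

def associates(key_words_by_text, target, min_shared=1):
    """key_words_by_text maps a text name to the set of its key words.
    Returns (word, number of texts in which it is key alongside target),
    most widely shared first."""
    counts = Counter()
    for key_words in key_words_by_text.values():
        if target in key_words:
            counts.update(word for word in key_words if word != target)
    return sorted(
        ((word, n) for word, n in counts.items() if n >= min_shared),
        key=lambda item: (-item[1], item[0]),
    )

# Invented key-word database of three texts.
database = {
    "story1": {"wine", "bottle", "chardonnay"},
    "story2": {"wine", "bottle", "sauce"},
    "story3": {"football", "goal"},
}

print(associates(database, "wine"))
```

The count here is a plain co-keyness tally. A strength measure, such as one of the mutual information statistics, could be computed from the same counts.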
See also: Definition of Key Word, Associates, Definitions, Mutual Information
8.5 associates
"Associates" is the name given to key-words associated with a key key-word.
The point of it...
The idea is to identify words which are commonly associated with a key key-word, because they are key words in the same texts as the key key-word is. An example will help. Suppose the word wine is a key key-word in a set of texts, such as the weekend sections of newspaper articles. Some of these articles discuss different wines and their flavours, others concern cooking and refer to using wine in stews or sauces, others discuss the prices of wine in a context of agriculture and diseases affecting vineyards. In this case, the associates of wine would be items like Chardonnay, Chile, sauce, fruit, infected, soil, etc.
The listing shows associates in order of frequency. A menu option allows you to re-sort them.
Settings
You can set a minimum number of text files for the association procedure, in the database settings:
Minimum texts
The screenshot settings will only process those key-key-words which appear in at least 3 text files.
Statistic
Choose the mutual information statistic you prefer, apart from Z score which uses a span (here we're using the whole text).
Minimum strength
This will only show associates which reach at least the strength set here, e.g. 3.000.
See also: definition of associate.
8.6 choosing files
Current Text Wordlist
In the upper box, choose a word list file. To choose more than 1 word list file, press Control as you click to select non-adjacent lists, or Shift to select a range. This box determines which wordlist(s) you're going to find the key words of.
Reference Corpus Wordlist
In the box below, you choose your Reference Corpus List. (This can be set permanently in the main Controller Settings.)
No word-lists visible
If you can't see any word lists in the displays, either change folders until you can, or go back to the WordList tool and make up at least 2 word lists: this procedure requires at least two before it can make a comparison.
8.7 clumps
"Clumps" is the name given to groups of key-words associated with a key key-word.
The point of it (1)...
The idea here is to refine associates by grouping together words which are found as key in the same sub-sets of text files.
The example used to explain associates will help. Suppose the word wine is a key key-word in a set of texts, such as the weekend sections of newspaper articles. Some of these articles discuss different wines and their flavours, others concern cooking and refer to using wine in stews or sauces, others discuss the prices of wine in a context of agriculture and diseases affecting vineyards. In this case, the associates of wine would be items like Chardonnay, Chile, sauce, fruit, infected, soil, etc.
The associates procedure shows all such items unsorted. The clumping procedure, on the other hand, attempts to sort them out according to these different uses. The reasoning is that the key words of each text file give a condensed picture of its "aboutness", and that "aboutnesses" of different texts can be grouped by matching the key word lists. Thus sets of key words can be clumped together according to the degree of overlap in the key word lexis of each text file.
Two stages
The initial clumping process does no grouping: you will simply see each set of key-words for each text file separately. To group clumps, you may simply join those you think belong together (by dragging), or have KeyWords suggest matches (see regrouping clumps).
The listing shows clumps sorted in alphabetical order. You can re-sort by frequency (the number of times each key word in the clump appeared in all the files which comprise the clump).
See also: definition of associate, regrouping clumps
8.8 KeyWords clusters
A KeyWords cluster, like a WordList cluster, represents two or more words which are found repeatedly near each other. However, a KeyWords cluster only uses key words.
A screenshot will help make things clearer. These are clusters computed using the Bible as source text. Each of the words here is "key" by comparison to a reference corpus; the clusters show cases where these KWs occur within the current collocation horizons. The [...] brackets represent cases where the KWs are not found together, e.g. in come [.] pass there is one dot because the repeated occurrences are come to pass.
See also: Plot calculation.
8.9 concordance
With a key word list or a word list on your screen, you can choose Compute | Concordance to call up a concordance of the currently selected word(s). The concordance will search for the same word in the original text file that your key word list came from.
The point of it...
is to see these same key-words in their original contexts.
8.10 creating a database
To build a key words database, you will need a set of key word lists. For a decent sized database, it is preferable to build it like this:
1. Make a batch of word lists.
2. Use this to make a batch of keyword lists. Set "faster minimal processing" on as in this shot, so as to not waste time computing plots etc.
3. Now, in KeyWords, choose New | Database. This enables you to choose the whole set of key word files.
Note that making a database means that only positive key words will be retained.
In the Controller KeyWords settings you can make other choices:
minimum frequency for database
If you set this to 2 you will only use for the database any KWs which appear in 2 or more texts.
min. KWs per text
If this is set to 10, any KW results files which ended up with very few KWs will be ignored.
See also: associates.
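The database-building steps in 8.10 can be sketched in a few lines of Python. This is only an illustration, not WordSmith's own code: the function name build_kw_database, the dictionary input and the reading of "min. KWs per text" as "skip any text with fewer key words than this" are all assumptions made for the example.

```python
from collections import Counter

def build_kw_database(kw_lists, min_texts=1, min_kws_per_text=10):
    """Count in how many texts each word is (positively) key.

    kw_lists: dict mapping a text name to the collection of its key words.
    min_texts: hypothetical stand-in for "minimum frequency for database".
    min_kws_per_text: hypothetical stand-in for "min. KWs per text".
    """
    counts = Counter()
    for name, kws in kw_lists.items():
        if len(set(kws)) < min_kws_per_text:
            continue  # assumption: texts with too few key words are ignored
        counts.update(set(kws))  # each text contributes at most 1 per word
    # keep only the words that are key in enough texts
    return {word: n for word, n in counts.items() if n >= min_texts}
```

For instance, with three small key word lists in which wine is key in all three, raising min_texts to 2 keeps only the words that are key in at least two of the texts that passed the min_kws_per_text check.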
8.11 example of key words
You have a collection of assorted newspaper articles. You make a word list based on these articles, and see that the most frequent word is the. Among the rather infrequent words in the list come examples like hopping, modem, squatter, grateful, etc.
You then take from it a 1,000 word article and make a word list of that. Again, you notice that the most frequent word is the. So far, not much difference.
You then get KeyWords to analyse the two word lists. KeyWords reports that the most "key" words are: squatter, police, breakage, council, sued, Timson, resisted, community.
These "key" words are not the most frequent words (which are those like the) but the words which are most unusually frequent in the 1,000 word article.
Key words usually give a reasonably good clue to what the text is about. Here is an example from the play Othello.
8.12 key key-word definition
A "key key-word" is one which is "key" in more than one of a number of related texts. The more texts it is "key" in, the more "key key" it is. This will depend a lot on the topic homogeneity of the corpus being investigated. In a corpus of City news texts, items like bank, profit, companies are key key-words, while computer will not be, though computer might be a key word in a few City news stories about IBM or Microsoft share dealings.
See also: How Key Words are Calculated, Definition of Key Word, Creating a Database, Definitions
8.13 key-ness definition
The term "key word", though it is in common use, is not defined in Linguistics. This program identifies key words on a mechanical basis by comparing patterns of frequency. (A human being, on the other hand, may choose a phrase or a superordinate as a key word.)
A word is said to be "key" if
a) it occurs in the text at least as many times as the user has specified as a Minimum Frequency
b) its frequency in the text when compared with its frequency in a reference corpus is such that the statistical probability as computed by an appropriate procedure is smaller than or equal to a p value specified by the user.
positive and negative keyness
A word which is positively key occurs more often than would be expected by chance in comparison with the reference corpus. A word which is negatively key occurs less often than would be expected by chance in comparison with the reference corpus.
typical key words
KeyWords will usually throw up 3 kinds of words as "key".
First, there will be proper nouns. Proper nouns are often key in texts, though a text about racing could wrongly identify as key, names of horses which are quite incidental to the story. This can be avoided by specifying a higher Minimum Frequency.
Second, there are key words that human beings would recognise. The program is quite good at finding these, and they give a good indication of the text's "aboutness". (All the same, the program does not group synonyms, and a word which only occurs once in a text may sometimes be "key" for a human being. And KeyWords will not identify key phrases unless you are comparing wordlists based on word clusters.)
Third, there are high-frequency words like because or shall or already. These would not usually be identified by the reader as key. They may be key indicators more of style than of "aboutness". But the fact that KeyWords identifies such words should prompt you to go back to the text, perhaps with Concord (just choose Compute | Concordance), to investigate why such words have cropped up with unusual frequencies.
See also: How Key Words are Calculated, Definition of Key Key-Word, Definitions, KeyWords Settings
8.14 KeyWords database
(default file extension .KDB)
The point of it...
The point of this database is that it will allow you to see the "key-key-words" in your set of files. That is, the key-words which are most frequent over a number of files.
For example, if you have 500 business reports, each one will have its own key words. These will probably be of two main kinds. There will be key-words which are key in one text but are not generally key (names of the firms and words relating to what they individually produce); and other, more general words (like consultant, profit, employee) which are typical of business documentation generally.
By making up a database, you can sort these out. The ones at the top of the list, when you view them, will be those which are most typical of the genre.
The list is ordered in terms of "key key-ness" but can be toggled into alphabetical order and back again. You can set a minimum number of files that each word must have been found to be key in, using Settings | KeyWords | Database.
When viewing a database you will be able to investigate the associates of the key key-words. Under Statistics, you will also be able to see details of the key words files which comprise the database (file name and number of key words per file), together with overall statistics on the number of different types and the tokens (the total of all the key-words in the whole database including repeats).
See also: Creating a database, Definition of key key-word
8.15 KeyWords: advice
1. Don't call up a plot of the key words based on more than one text file. It doesn't make sense! Anyway the plot will only show the words in the first text file. If you want to see a plot of a certain word or phrase in various different files, use Concord dispersion.
2. There can be no guarantee that the "key" words are "key" in the sense which you may attach to "key". An "important" word might occur once only in a text. They are merely the words which are outstandingly frequent or infrequent in comparison with the reference corpus.
3. Compare apples with pears, or, better still, Coxes with Granny Smiths. So choose your reference corpus in some principled way. The computer is not intelligent and will try to do whatever comparisons you ask it to, so it's up to you to use human intelligence and avoid comparing apples with phone boxes!
8.16 KeyWords: calculation
The "key words" are calculated by comparing the frequency of each word in the wordlist of the text you're interested in with the frequency of the same word in the reference wordlist. All words which appear in the smaller list are considered, unless they are in a stop list.
If the occurs, say, 5% of the time in the small wordlist and 6% of the time in the reference corpus, it will not turn out to be "key", though it may well be the most frequent word. If the text concerns the anatomy of spiders, it may well turn out that the names of the researchers, and the items spider, leg, eight, etc. may be more frequent than they would otherwise be in your reference corpus (unless your reference corpus only concerns spiders!)
To compute the "key-ness" of an item, the program therefore computes
its frequency in the small wordlist
the number of running words in the small wordlist
its frequency in the reference corpus
the number of running words in the reference corpus
and cross-tabulates these.
Statistical tests include:
the classic chi-square test of significance with Yates correction for a 2 X 2 table
Ted Dunning's Log Likelihood test, which gives a better estimate of keyness, especially when contrasting long texts or a whole genre against your reference corpus.
See UCREL's log likelihood site for more on these.
A word will get into the listing here if it is unusually frequent (or unusually infrequent) in comparison with what one would expect on the basis of the larger wordlist.
Unusually infrequent key-words are called "negative key-words" and appear at the very end of your listing, in a different colour.
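The cross-tabulation just described can be tried out with a short Python sketch. It is a hedged illustration only: the function names are invented here, the log likelihood uses the common two-term form (observed against expected frequency in each wordlist), and the p value is read off a chi-square distribution with one degree of freedom. None of this is claimed to match WordSmith's internal arithmetic exactly.

```python
import math

def log_likelihood(a, b, c, d):
    """Two-term log likelihood (assumed form, not necessarily WordSmith's).

    a: frequency in the small wordlist, c: running words in the small wordlist
    b: frequency in the reference corpus, d: running words in the reference corpus
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the small wordlist
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2.0 * ll

def chi_square_yates(a, b, c, d):
    """Chi-square with Yates correction for the 2 X 2 table
    [[a, c - a], [b, d - b]]."""
    n = c + d
    diff = abs(a * (d - b) - (c - a) * b)
    corrected = max(0.0, diff - n / 2.0)
    denom = c * d * (a + b) * (n - a - b)  # product of row and column totals
    return n * corrected ** 2 / denom if denom else 0.0

def p_value_1df(stat):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

def is_positive(a, b, c, d):
    """True if the word is relatively more frequent in the small wordlist."""
    return a / c >= b / d
```

With made-up counts, a word at 5% in a 1,000-word text against 6% in a 10,000-word reference corpus (log_likelihood(50, 600, 1000, 10000)) gives a p value far above a threshold such as 0.000001, so it would not be listed as key, whereas a word occurring 30 times in 1,000 words against 5 times in a million would be.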
Note that negative key-words will be omitted automatically from a keywords database and a plot.
Words which do not occur at all in the reference corpus are treated as if they occurred 5.0e-324 times (0.0000000 and loads more zeroes before a 5) in such a case. This number is so small as not to affect the calculation materially while not crashing the computer's processor.
8.17 KeyWords: links
The point of it...
is to find out which key-words are most closely related to a given key-word.
A plot will show where each key word occurs in the original file. It also shows how many links there are between key-words.
What are links?
Links are "co-occurrences of key-words within a collocational span". An example is much easier to understand, though:
Suppose the word elephant is key in a text about Africa, and that water is also a key word in the same text. If elephant and water occur within a span of 5 words of each other, they are said to be "linked". The number of times they are linked like this in the text will be shown in the Links window.
What you see
This Links window shows the number of links followed by a column headed "in" and a percentage. This percentage represents the number of links divided by the total number of occurrences of the word in question (the "in" column number). Thus if you choose to see the links of elephant, and elephant crops up 10 times in your original text, and all 10 of those times it's found near the word water (even though water occurs 40 times altogether), you'll see 100%. If you choose to see the links of water, the percentage next to elephant will be 25%.
The collocation horizons are those set in Concord, and go up to 25 words to left and right. The default is 5,5.
Double-click on any word in the plot listing to call up a window (up to a maximum of 20 windows) which shows the linked key-words.
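The elephant and water percentages from 8.17 can be reproduced with a small Python sketch. It assumes that a "link" is any pair of positions of the two words lying within the span of each other, which is one plausible reading of the description above; the function names and the token-list input are hypothetical.

```python
def positions(tokens, word):
    """Indexes at which word occurs in the token list."""
    return [i for i, t in enumerate(tokens) if t == word]

def count_links(tokens, a, b, span=5):
    """Number of (a, b) position pairs within `span` words of each other.

    Assumption: every such pair counts as one link, so the count is the
    same whichever of the two words you start from.
    """
    pos_b = positions(tokens, b)
    return sum(1 for i in positions(tokens, a)
               for j in pos_b if 0 < abs(i - j) <= span)

def link_percentage(tokens, word, other, span=5):
    """Links between word and other, as a % of the occurrences of word."""
    freq = len(positions(tokens, word))
    if freq == 0:
        return 0.0
    return 100.0 * count_links(tokens, word, other, span) / freq
```

On a toy text in which elephant occurs 10 times, always next to water, and water occurs 40 times in all, link_percentage(tokens, "elephant", "water") gives 100.0 and link_percentage(tokens, "water", "elephant") gives 25.0, matching the figures in the example.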
See also: Plot calculation, KeyWords clusters
8.18 make a word list from keywords data
With a key word list on your screen, you can save your data as a word list (for later comparison, etc. using WordList functions).
8.19 p value
(Default=0.000001)
The p value is that used in standard chi-square and other statistical tests. This value ranges from 0 to 1. A value of .01 suggests a 1% danger of being wrong in claiming a relationship, .05 would give a 5% danger of error. In the social sciences a 5% risk is usually considered acceptable.
In the case of key word analyses, where the notion of risk is less important than that of selectivity, you may often wish to set a comparatively low p value threshold such as 0.000001 (one in 1 million) (1E-6 in scientific notation) so as to obtain fewer key words. Or you can set a low "maximum wanted" number in the main Controller, under Adjust Settings | KeyWords.
If the chi-square procedure is used, the computed p value will only be shown if all appropriate statistical requirements are met (all expected values >= 5).
See also: Definitions
8.20 plot calculation
The point of it...
is to see where the key words are distributed within the text. Do they cluster around the middle or near the beginning of the text?
How it's done
This will calculate the inter-relationships between all the key words identified so far, excluding any which you have deleted or zapped.
1. it does a concordance on the text finding all occurrences of each key word;
2. it then works out which of each of the other key words appear within the collocation horizons (set in Settings). It uses the larger of the two horizons.
3. it then plots all the words showing where each occurrence comes in the original file (with a "ruler" showing how many words there are in each part of the file).
4. it computes how many other key-words co-occurred with it, within the current collocational span.
5. it computes a plot dispersion value.
Note: this process depends on KeyWords being able to find the source texts which your original wordlist was based on.
You may find it useful to export your plot and make other graphs, as explained under Save As.
See also: Plot Links, Key words plot display
8.21 plot display
The plot will give you useful visual insights into how often and where the different key words crop up in the text. The plot is initially sorted to show which crop up more at the beginning (e.g. in the introduction) and then those from further in the text.
The following screenshot shows KWs of the Bible, revealing where each term occurs. The name Jehoshaphat, for example, occurs mainly about one third of the way through the text.
re-sorting
You can re-sort the listing. Re-sorting rotates through the following types:
first mention of each key word in the text
dispersion within the text
the original plot order (which is based on key-ness)
alphabetical order
total number of links with other key-words
links
This shows the total number of links between the key-word and other key-words in the same text, within the current collocation span (default = 5,5). That is, how many times was each key-word found within 5 words to the left or right of any of the other key-words in your plot.
hits
This column is here to remind you of how many occurrences there were of each key-word.
When you have obtained a plot, you can then see the way certain words relate to others. To do this, look at the Links window in the tabs at the bottom, showing which other key words are most linked to the word you clicked on. That is, which other words occur most often within the collocation horizons you've set. The Links window should help you gain insights into the lexical relations here.
Each plot window is dependent on the key words listing from which it was derived. If you close that down, it will disappear. You can Print it.
There's no Save option because the plot comes from a key words listing which you should Save, or Save As. There's no save as text option because the plot has graphics, which cannot adequately be represented as text symbols, but you can Copy to the clipboard (Ctrl-Ins) and then paste it into a word processor as a graphic. Alternatively, use the Output | Data as Text File option, which saves your plot data (each word is followed by the total number of words in the file, then the word number position of each occurrence).
The ruler in the menu allows you to see the plot divided into 8 equal segments if based on one text, or the text-file divisions if there is more than one.
See also: Key words plot, plot dispersion value
8.22 regrouping clumps
How to do it
You can simply join by dragging, where you think any two clumps belong together because of semantic similarity between their key-words.
Or you can ask KeyWords to suggest a match: it will inform you which two clumps match best. You'll see a list of the words found only in one, a list of the words found only in the other, and (in the middle) a list of the words which match. It's up to you to judge whether the match is good enough to form a merged clump.
If you aren't sure, press Cancel. If you do want to join them, press Join. If you're sure you don't want to join them and don't want KeyWords to suggest this pair again, press Skip. You can tell KeyWords to skip up to 50 pairs. To clear the memory of the items to be skipped, press Clear Skip.
The point of it (2)...
Scott (1997) shows how clumping reveals the different perceived roles of women in a set of Guardian features articles.
See also: clumps
8.23 re-sorting: KeyWords
How to do it...
Sorting can be done simply by pressing the top row of any list. Or by pressing F6 / Ctrl/F6. Or by choosing the menu option. Press again to toggle between ascending & descending sorts.
A key words list offers a choice between sorting by
key-ness (the keyest words appear at the top)
alphabetical order (from A to Z)
frequency in the smaller list (the most frequent words come first)
frequency in the reference list (the most frequent words come first)
A key words plot rotates between sorting by
key-ness (the keyest words appear at the top)
alphabetical order (from A to Z)
frequency (words which appear oftenest come first)
number of links (the most linked words come first)
first mention of each key word in the text
range (words used in smallest sections of text come first)
A key key words database toggles between sorting by
frequency (the most key key words appear at the top)
alphabetical order (from A to Z)
An Associates list toggles between sorting by
frequency (association between title-word and item)
alphabetical order (from A to Z)
frequency (association between item and title-word)
8.24 the key words screen
The display shows
1. each key word
2. its frequency in the source text(s) which these key words are key in. (Freq. column below)
3. the % that frequency represents.
4. its frequency in the reference corpus (RC. Freq. column)
5. the reference corpus frequency as a %
6. keyness (chi-square or log likelihood statistic) (Keyness column)
7. p value.
The calculation of how unusual the frequency is, is based on the statistical procedure used. The statistic appears to the right of the display. If the procedure is log likelihood, or if chi-square is used and the usual conditions for chi-square obtain (expected value >= 5 in all four cells) the probability (p) will be displayed to the right of the chi-square value.
The criterion for what counts as "outstanding" is based on the minimum probability value selected before the key words were calculated. The smaller the number, the fewer key words in the display. Usually you'll not want more than about 40 key words to handle.
The words appear sorted according to how outstanding their frequencies of occurrence are. Those near the top are outstandingly frequent. At the end of the listing you'll find any which are outstandingly infrequent (negative keywords), in a different colour.
There is no upper limit to the keyness column of a set of key words. It is not necessarily sensible to assume that the word with the highest keyness value must be the most outstanding, since keyness is computed merely statistically; there will be cases where several items are obviously equally key (to the human reader) but the one which is found least often in the reference corpus and most often in the text itself will be at the top of the list.
8.25 WordSmith controller: KeyWords settings
These are found in the main Controller under Adjust Settings | KeyWords.
This is because some of the choices may affect other Tools. KeyWords and WordList both use similar routines: KeyWords to calculate the key words of a text file, and WordList when comparing word-lists.
Procedure
Chi-square or Log Likelihood. The default is Log Likelihood. See procedure for further details.
Max. p value
The default level of significance. See p value for more details.
Max. wanted (500) and Min. frequency (3)
You may want to restrict the number of key words (KWs) identified so as to find for example the ten most "key" for each text. The program will identify all the key words, sort them by key-ness, and then throw away any excess. It will thus favour positive key words over negative ones.
The minimum frequency is a setting which will help to eliminate any words or clusters which are unusual but infrequent. For example, a proper noun such as the name of a village will usually be extremely infrequent in your reference corpus, and if mentioned only once in the text you're analysing, it is likely not to be "key".
The default setting of 3 mentions as a minimum helps reduce spurious hits here. In the case of short texts, fewer than 600 words long, a minimum of 2 will automatically be used.
Exclude negative KWs
If this is checked, KeyWords will not compute negative key words (ones which occur significantly infrequently).
Minimal processing
If this is checked, KeyWords will not compute plots, links or KW clusters as it computes the key words (they can always be computed later assuming you do not move or delete the original text files). This is useful if computing a lot of KW files in a batch, e.g. to make a database.
Full lemma processing
If this is checked (the default), KeyWords will compute the full frequency in the case of lemmatised items. For example if GO represents WENT, GOES etc. and GO alone had a frequency of 10 but the whole set GO, WENT, GONE etc. totalled 100, then its frequency will be counted as 100. If unchecked GO would count only 10.
Max. link frequency
To compute a plot is hard work as all the KWs have to be concordanced so as to work out where they crop up. To compute links between each KW is much harder work again and can take time, especially if your KWs include some which occur hundreds or thousands of times in the text. To keep this process more manageable, you can set a default. Here 2000 means that any KW which occurs more than 2000 times in the text will not be used for computing links. (It will still appear in the plots and list of KWs, of course.)
Database: minimum frequency
The default is 1. See database.
Database: associate minimum texts
The default is 5. See associates.
See also: KeyWords Help Contents, KeyWords calculation.
WordList
Section IX
9 WordList
9.1 purpose
This program generates word lists based on one or more ASCII or ANSI text files. The word lists are automatically generated in both alphabetical and frequency order, and optionally you can generate a word index list too.
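As a rough picture of what a word list in both orders looks like, here is a minimal Python sketch. The tokenisation rule (letters and apostrophes, upper-cased) and the function name make_word_list are assumptions made for the example; WordList itself has many more settings.

```python
import re
from collections import Counter

def make_word_list(text):
    """Return (frequency-ordered, alphabetical) lists of (word, count) pairs.

    Assumption: a word is a run of letters or apostrophes, and case is ignored.
    """
    words = re.findall(r"[A-Za-z']+", text.upper())
    freq = Counter(words)
    # most frequent first; ties broken alphabetically
    by_frequency = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    alphabetical = sorted(freq.items())
    return by_frequency, alphabetical
```

Called on "the cat and the dog", the frequency list starts with THE (2 occurrences) and the alphabetical list starts with AND.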
The point of it...
These can be used
1 simply in order to study the type of vocabulary used;
2 to identify common word clusters;
3 to compare the frequency of a word in different text files or across genres;
4 to compare the frequencies of cognate words or translation equivalents between different languages;
5 to get a concordance of one or more of the words in your list.
Within WordList you can compare two lists, or carry out consistency analysis (simple or detailed) for stylistic comparison purposes.
These word-lists may also be used as input to the KeyWords program, which analyses the words in a given text and compares frequencies with a reference corpus, in order to generate lists of "key-words" and "key-key-words".
See also: WordList display
9.2 index
Explanations
What is Wordlist and What Does It Do?
Comparing Word-lists
Comparison Display
Consistency Analysis (Simple)
Consistency Analysis (Detailed)
Definitions
Detailed Statistics
Lemmas
Limitations
Summary Statistics
Match List
Mutual Information
Sort Order
Stop Lists
Type/token Ratios
Procedures
Auto-Join
Batch Processing
Calling up a Concordance
Choosing Texts
Colours
Computing a new variable
Folders
Editing Entries
Editing Filenames
Keyboard Shortcuts
Exiting
Fonts
Minimum & Maximum Settings
Mutual Information Score Computing
Printing
Re-sorting a Word List
Saving Results
Searching for an Entry by Typing
Searching for Entry-types using Menu
Single Words or Clusters
Text Characteristics
Word Index
Zapping entries
See also: WordSmith Main Index, WordList display
9.3 auto-joining lemmas
The menu option Auto-Join can be used to specify a string such as S or S;ED;ING and will then go through the whole word list, lemmatising all entries where one word only differs from the next by having S or ED or ING on the end of it. (Use ; to separate multiple suffixes.)
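The suffix case of auto-joining can be sketched as follows. This is an illustrative approximation, not the program's routine: the function name auto_join and the dictionary word list are hypothetical, and a form is joined whenever stripping one of the suffixes leaves a word that is itself in the list.

```python
def auto_join(freqs, spec="S;ED;ING"):
    """Group suffixed forms under their base form.

    freqs: dict mapping each word in the list to its frequency.
    spec: suffixes separated by ';', as typed into Auto-Join.
    Returns a dict mapping each base form to the list of forms joined to it.
    """
    suffixes = [s for s in spec.upper().split(";") if s]
    lemmas = {}
    for word in sorted(freqs):
        for suf in suffixes:
            base = word[: -len(suf)]
            # join only if the stripped form is itself an entry in the list
            if word.endswith(suf) and base in freqs:
                lemmas.setdefault(base, []).append(word)
                break
    return lemmas
```

Given BOOK, BOOKS, BOOKED, BOOKING and BUS, this joins the three suffixed forms to BOOK and leaves BUS alone, because BU is not in the list. A list containing both A and AS would still be joined, which is the kind of oddity that makes checking the results worthwhile.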
Prefix / Suffix / Infix
By default all strings typed in are assumed to be suffixes; to join prefixes put an asterisk (*) at the right end of the prefix. If you want to search for infixes (e.g. bloody in absobloodylutely [languages like Swahili use infixes a lot]) put an asterisk at each end.
Examples
S;ED;ING will join books to book, booked to book and booking to book
*S;*ED;*ING will join books to book, booked to book and booking to book
UN*;ED;ING will join undo to do, booked to book and booking to book
*BLOODY* will join absobloodylutely to absolutely
The process can be left to run quickly and automatically, or you can have it confirm with you before joining each one. Automatic lemmatisation, like search-and-replace spell-checking, can produce oddities if just left to run!
To stop in the middle of auto-joining, press Escape.
Tip
With a previously saved list, try auto-joining without confirming the changes (or choose Yes to All during it). Then choose the Alphabetical (as opposed to Frequency) version of the list and sort on Lemmas (by pressing the Lemmas heading). You will see all the joined entries at the top of the list. It may be easier to Unjoin (Ctrl + F4) any mistakes than to confirm each one... Finally, sort on the Word and save.
See also: Lemmatisation
9.4 choosing lemma file
The point of it...
You may choose to lemmatise all items in the current word list using a standard text file which groups words which belong together (be -> was, is, were, etc.). While it is time-consuming producing the text file the first time, it will be very useful if you want to lemmatise lots of word lists, and is much less "hit-and-miss" than auto-joining.
There is an English-language lemma list from Yasumasa Someya at http://www.lexically.net/downloads/e_lemma.zip.
How to do it In the main Controller, Settings | Adjust Settings | Lemma,Match,Stop lists, you will see a screen like this: Choose the appropriate button (for Concord, KeyWords or WordList) and type the file name or browse for it. The file should contain a plain text list of lemmas with items like this: BE -> AM, ARE, WAS, WERE, IS GO -> GOES, GOING, GONE, WENT WordSmith then reads the file and displays them (or a sample if the list is long). The format 138WordList 2007 Mike Scott allows any alphabetic or numerical characters in the language the list is for, plus the single apostrophe, space, underscore. In other words, if you mistakenly put GO = GOES that line won't be included because of the = symbol. The actual processing of the list only takes place when you choose the menu option Match Lemmas ( ) in WordList, Concord or KeyWords. See Match List for a more detailed explanation, with screenshots. What if my text files don't contain BE? Suppose you are matching AM, ARE etc with BE as in the list above, but your texts don't actually contain the word BE. WordList won't find it to link to.... The best way around this is to make a new word-list on the basis of a plain text file (in which you include BE and any other base forms wanted), save it, and then merge it with your existing wordlist. Now WordList should find the form BE to add to it AM, ARE, WAS etc. See also: Lemmatisation, Match List, Stop List 9.5 comparing wordlists The idea is to help stylistic comparisons. Suppose you're studying several versions of a story, or different translations of it. If one version uses kill and another has assassinate, you can use this function. The procedure compares all the words in both lists and will report on all those which appear significantly more often in one than the other, including those which appear more than a minimum number of times in one even if they do not appear at all in the other. How 1. Open a wordlist. 2. In the menu, choose File | Compare 2 wordlists. 3. 
Choose a wordlist to compare with.
You will see the results in one of the tabs at the bottom of the screen.
The minimum frequency (which you can alter in the Controller, Adjust Settings, KeyWords tab) can be set to 1. If it is raised to say 3, the comparison will ignore words which do not appear at least 3 times in at least one of the two lists.
Choose the significance value (all, or a p value from 0.1 to 0.000001 or what you will). The smaller the p value, the more selective the comparison. In other words, a p setting of 0.1 will show more words than a p setting of 0.0001 will.
The display format is similar to that used in KeyWords.

See also: Consistency Analysis, Match List

9.6 merging wordlists

The point of it
You might want to merge 2 word lists (or concordances, mutual information lists etc.) with each other if making each one takes ages or if you are gradually building up a master word list or concordance based on a number of separate genres or text-types.

How to do it
With one wordlist (or concordance) opened, choose File | Merge with and select another.

Be aware that...
Making a merged word list implies that each set of source texts was different. If you choose to merge 2 word lists both of which contained information about the same text file, WordSmith will do as you ask even though the information about the number of occurrences and of texts in which each word-type was found is (presumably) inaccurate.
Merging a list in English with another in Spanish: if you start with the one in Spanish, the one in English will be merged in and henceforth treated as if it were Spanish, e.g. in sort order. Presumably if you try to merge one in English with one in Arabic (I've never tried) you should see all the forms but you would get different results merging the Arabic one into the English one (all the Arabic words would be treated as if they were English).
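To make the "Be aware that..." point concrete, here is a minimal Python sketch of what merging amounts to. It is not WordSmith code: the representation of a word list as a dictionary of (frequency, texts) pairs and the function name merge_wordlists are assumptions made purely for illustration.

```python
def merge_wordlists(first, second):
    """Merge two word lists held as {word: (frequency, texts)} dictionaries.

    Hypothetical representation: WordSmith's own file format is not shown here.
    Frequencies and text counts are simply added, which is only accurate
    if the two lists were made from different source texts.
    """
    merged = dict(first)
    for word, (freq, texts) in second.items():
        old_freq, old_texts = merged.get(word, (0, 0))
        merged[word] = (old_freq + freq, old_texts + texts)
    return merged


list_a = {"KILL": (12, 3), "KING": (40, 5)}
list_b = {"KILL": (2, 1), "ASSASSINATE": (7, 2)}

merged = merge_wordlists(list_a, list_b)

# Merging a list with itself doubles every count: the "same text file" problem.
doubled = merge_wordlists(list_a, list_a)
```

Because the counts are just summed, nothing in the merge itself can tell whether the same text was counted twice; that check is left to you.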
9.7 comparison display
Here is a comparison window, where we have compared Shakespeare's King Lear with Romeo and Juliet.
The display shows
frequency in the text you started with, here King Lear, (with % if > 0.01%) -- then, to the right
frequency in the other text, here Romeo & Juliet, (with % if > 0.01%) -- then, to the right
chi-square or log likelihood, and p value.
The criterion for what counts as "outstanding" is based on the minimum probability value entered before the lists were compared. The smaller this probability value the fewer words in the display. The words appear sorted according to how outstanding their frequencies of occurrence are. Those near the top are outstandingly frequent in your main wordlist. At the end of the listing you'll find those which are outstandingly infrequent in the first text chosen: in other words, key in the second text.
This comparison is similar to the analysis of "key words" in the KeyWords program. The KeyWords analysis is slightly quicker and allows for batch processing.
The word Lear is the most key of all: it scores 304 in the keyness column. (It looks like 04.56 because the column hasn't been pulled any wider.)
The words above, in black, are key to Lear. Below, we see the middle of the listing --- the words in red are those which are key to Romeo. The word most is the last key word of Lear, and death the least key in Romeo; both have a keyness value of around 25 (positive or negative).
Here at the bottom we see the words which are most key to the play Romeo and Juliet.
The word which is most outstanding (key) here is Romeo, with a keyness score of 394 (the column needs to be pulled wider).

9.8 consistency analysis (detailed)
This function does exactly the same thing as simple consistency, but provides much more detail.

The point of it...
The idea is to help stylistic comparisons.
Suppose you're studying several versions of a story, or different translations of it. This function enables you to see all the words which are used in the wordlists which you have called up.
The display will order the words, so that the first group contains all those which occur in all versions, then those which come in all versions but one, and so on down to those which occur in only one version. Within each set the words are ordered alphabetically.
The Freq. column shows how many instances of each word occurred overall; Texts shows how many text-files it came in. Then there are two columns (No. of Lemmas, and Set, which behaves as in a word-list) and then a column for each text. In this case, the word about occurred in all 7 texts, it occurred 77 times in all, and it was most frequent in 1e.txt at 20 occurrences.
Statistics and filenames can be seen for the set of 7 texts used here by clicking on the tabs at the bottom. Notes can be edited and saved along with the detailed consistency list. Note that the filename is test.dcl (detailed consistency list).
There is no limit except the limit of available memory as to how many text files you can process in this procedure.

How to do it...
In the window you see when you press New... you will be offered a tab showing detailed consistency.
Choose your word-lists and press compute Detailed Consistency now.
Each column can be sorted by clicking on its header (Word, Freq. etc.). To get the words which occurred in all 7 texts to the top, I clicked Texts.

See also: Consistency Analysis (Simple), Comparison Display, Comparing Word-lists, Match List, Column Totals

9.9 consistency analysis (simple)
This function (termed "range" by Paul Nation) comes automatically with any word-list. In any word-list you will see a column headed "Texts". This shows the number of texts each word occurred in (the maximum here being the total number of text-files used for the word-list).

The point of it...
The idea is to find out which words recur consistently in lots of texts of a given genre. For example, the word consolidate was found to occur in many of a set of business Annual Reports. It did not occur very often in each of them, but did occur much more consistently in the business reports than in a mixed set of texts.
Naturally, words like the are consistent across nearly all texts in English. (While working on a set of word lists to compare with business reports, I found one text without the. I also discovered that one of my texts was in Italian: but this wasn't the one without the! The culprit was an election results list, which contained lots of instances of Cons., Lab. and place names, but no instances of the.) To analyse common grammar words like the, a consistency list may be very useful.
Even so, you're likely to find some common lexical items recur surprisingly consistently.
To eliminate the commonly consistent words and find only those which seem to characterise your genre or sub-genre, you need to find out which are significantly consistent. Save your word list, then use it for comparison with others in WordList, or using KeyWords. This way you can determine which are the significantly consistent words in your genre or sub-genre.

See also: Consistency Analysis (Detailed), Comparing Word-lists, Match List

9.10 lemmas
You may want to store several entries together: e.g. want; wants; wanting; wanted as members of the same lemma.

Manual joining
You can simply do this by dragging one entry to another. Suppose your word list has
WANT
WANTED
WANTING
you can simply grab wanting or wanted with your mouse and place it on want. (See choosing lemma file if you want to join these to a word which isn't in the list.)
Both the alphabetical and the frequency lists will be correctly updated, though the frequency list may not reflect the true order until after the file has been re-ordered by zapping entries.
A lemmatised head entry has a red mark in the left margin beside it. The others you marked will be coloured as if deleted. The linked entries which have been joined to the head can be seen at the right.
Here we see a word list based on 3-word clusters where originally a good deal had a frequency of 5, but has been joined to a great deal and thereby gained 10.
If you cannot see all the items you want to join in one screen, you can do the same thing using function keys.
1. Use F5 to mark an entry for joining to another. The first one you mark will be the "head". For the moment, while you're still deciding which other entries belong with it, the edge of that row will be marked green. Any entries which you then decide to link with the head (by again pressing F5) will show they're marked too, in white. (If you change your mind you can press F5 again and the marking will disappear.)
2. Use F4 to join all the entries which you've marked. The program will then add the joint frequencies of all the words you've marked to the frequency of the one you marked first (the head).

To Un-join
If you select an item which has lemmas visible at the right and press Control/F4, this will unjoin the entries.

File-based joining
Alternatively you can join up lemmas using a text file which automates the matching & joining process. The actual processing of the list takes place when you choose the menu option Match Lemmas in WordList, Concord or KeyWords. Every entry in your lemma list will be checked to see whether it matches one of the entries in your word list. In the example, if, say, am, was, and were are found, they will be stored as lemmas of be. If go and went are found, then went will be joined to go.

Auto-joining
To speed up this lemmatisation process, you can auto-join any of the entries in your current word list which meet your criteria.
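As a rough illustration of the file-based joining described above, here is a small Python sketch. It is not how WordSmith is implemented: the function names, the dictionary representation of a word list, and the exact rule for rejecting a line are assumptions based on the lemma file format (HEAD -> FORM, FORM; only letters, digits, apostrophe, space and underscore allowed).

```python
import re

# Assumed line format: "HEAD -> FORM, FORM, ...". Any other character
# (such as "=") makes the line fail to match, so it is skipped.
ENTRY = re.compile(r"^\s*([\w' ]+?)\s*->\s*([\w' ,]+?)\s*$")


def parse_lemma_file(text):
    """Return {head: [forms]} for every well-formed line of a lemma file."""
    lemmas = {}
    for line in text.splitlines():
        match = ENTRY.match(line)
        if not match:
            continue  # e.g. "GO = GOES" is left out
        head, forms = match.groups()
        lemmas[head.upper()] = [f.strip().upper() for f in forms.split(",") if f.strip()]
    return lemmas


def join_lemmas(wordlist, lemmas):
    """Add each matching form's frequency to its head; return (freqs, joined)."""
    freqs = dict(wordlist)
    joined = {}
    for head, forms in lemmas.items():
        if head not in freqs:
            continue  # the head must be in the word list to be linked to
        for form in forms:
            if form != head and form in freqs:
                freqs[head] += freqs.pop(form)
                joined.setdefault(head, []).append(form)
    return freqs, joined


lemma_text = "BE -> AM, ARE, WAS, WERE, IS\nGO -> GOES, GOING, GONE, WENT\nGO = GOES\n"
wordlist = {"BE": 10, "AM": 3, "WAS": 7, "WERE": 2, "GO": 5, "WENT": 4, "THE": 100}

freqs, joined = join_lemmas(wordlist, parse_lemma_file(lemma_text))
```

In this toy list BE ends up with the combined frequency of BE, AM, WAS and WERE, and WENT is joined to GO; forms that are in the lemma file but not in the word list (ARE, IS, GOES...) are simply ignored.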
Can't read all the lemma forms
Double-click on the Lemmas column as in the shot below, and a window of Lemma Forms will open up, showing the various components.

See also: Auto-Join, Using a text file to lemmatise, selecting multiple entries

9.11 index lists: uses

The point of it
1. One of the uses for an Index is to record the positions of all the words in your text file, so that you can subsequently see which word came in which part of each text. Another is to speed up access to these words, for example in concordancing. If you select one or more words in the index and press the concordance button, you get a speedy concordance.
2. Another is to compute "Mutual Information" scores which relate word types to each other.
3. Or you can use an index to see word clusters.

See also Making an Index List, Viewing Index Lists, WordList Help Contents.

9.12 index lists: viewing
In WordList, open an index as you would any other kind of word-list file -- using File | Open. Or, easier in my opinion, in the Controller | Previous lists, choose any index you've made and double-click it.
You will see the index as if it were a large word-list.
The picture above shows the top 10 words in the BNC World Corpus. Number 5 (#) represents numbers or words which contain numbers such as 50.00. These very frequent words are also very consistent -- they appear in at least 99% of the 4,054 texts of BNC World.
In the view below, you see words sorted by the number of Texts: all these words appeared in 10 texts of the corpus but their frequencies vary.
You can highlight one or more words, or mark them, and then press the concordance button to get a speedy concordance.

See also Making an Index List, WordList clusters, WordList Help Contents.

9.13 making a WordList Index

Index files
Two files are created for each index:
.tokens file: a large file containing information about the position of every word token in your text files.
.types file: knows the individual word types.
To create an index, first use the main Controller and choose Adjust Settings | Index. You will need to specify a basic filename for the index because WordSmith needs to know the filename before it can do the work (unlike a concordance where you only save the results after it has done the work of computing the concordance).
In the screenshot below, the basic filename is new_one: WordSmith will add .tokens and .types to this basic filename as it works.
If you choose an existing basic filename which you have already used, WordList will check whether you want to add to it or start it afresh.
Next, select your text files in the usual way. WordList will go through your selected texts and store information about the position of every instance of every word-type using the .tokens and .types files. An index permits the computation of word clusters and Mutual Information scores for each word type.
The screenshot below shows the progress bars for an index of the BNC World corpus; on a desktop PC with 1GB of RAM it has taken nearly one hour to do 96% of the work: a rate of about 1.8 million words per minute.
The resulting BNC Words.tokens file was 1.6GB in size and the BNC Words.types file was 26 MB. On a basic laptop with 512MB of RAM it took about 3 hours 15 minutes.

Adding to an index
To add to an existing index, just choose some more texts and choose File | New | Index. If the existing filename is already in use for an index, you will be asked whether to add more ('Yes') or start it afresh ('No').

See also Using Index Lists, Viewing Index Lists, WordList Help Contents.

9.14 index clusters

WordList clusters
A word list doesn't need to be of single words. You can ask for a word list consisting of two, three, up to eight words on each line. To do cluster processing in WordList, first make an index.

How to see clusters...
Open the index. Now choose Compute | Clusters.
Words to make clusters from
"all": all the clusters involving all words above a certain frequency (this will be s-l-o-w for a big corpus like the BNC World), or
"selection": clusters only for words you've selected (e.g. you have highlighted BOOK and BOOKS and you want clusters like book a table, in my book).
To choose words which aren't next to each other, press Control and click in the number at the left -- keep Control held down and click elsewhere. The first one clicked will go green and the others white. In the picture below, using an index of the BNC World corpus, I selected world and then life by clicking numbers 164 and 167.
The process will take time. In the case of BNC World, the index knows the positions of all of the 100 million words. To find 3-word clusters, in the case above, it took about a minute to process all the 115,000 cases of world and life and find 5,719 clusters like the world bank and of real life. Chris Tribble tells me it took his PC 36 hours to compute all 3-word clusters on the whole BNC ... he was able to use the PC in the meantime but that's not a job you're going to want to do often.

What you see
The "cluster size" must be between 2 and 8 words.
The "min. frequency" is the minimum number of each that you want to see.
Here the user has chosen to see any 3-word clusters that appear 5 or more times.

Working constraints
The "max. frequency %" setting is to speed the process up. It means the maximum frequency percentage which the calculation of clusters for a given word will process. This is because there are lots and lots of the very high frequency items and you may well not be interested in clusters which begin with them. For example, the item the is likely to be about 6% of any word-list (about 6 million of them in the BNC therefore), and you might not want clusters starting the... -- if so, you might set the max.
percent to 0.5% or 0.1% (which for the BNC World corpus will cut out the top 102 frequency words). You will still get clusters which include very high frequency items in the middle or end, like the a in book a table, but would not get in my book, which begins with the very high frequency word in. The more words you include, the longer the process will take....
Max. seconds per word is another way of controlling how long the process will take. The default (0) means no limit. But if you set this e.g. to 30 then as WordList processes the words in order, as soon as one has taken 30 seconds no further clusters will be collected starting with that word.
Stop at, like Concord clusters, offers a number of constraints, such as sentence and other punctuation-marked breaks. The idea is that a 5-word cluster which starts in one sentence and continues in the next is not likely to make much sense.

What they look like
Here is a small set of 3-word clusters involving rabies from the BNC World corpus. Some of them are plausible multi-word units.
All clusters which appear at least 5 times are shown: to alter that setting, choose Adjust Settings | Index in the Controller and set the "show if frequency.." number.

See also: clusters in Concord

9.15 menu search
Using the menu you can search for a sub-string within an entry -- e.g. all words containing "fore" (by entering *fore* -- the asterisk means that the item can be found in the middle of a word, so *fore will find before but not beforehand, while *fore* will find them both). These searches can be repeated. This function enables you to find parts of words so that you can edit your wordlist, e.g. by joining two words as one.
You can search for ends or middles of words by using the * wildcard. Thus *TH* will find other, something, etc. *TH will find booth, sooth, etc. You can then use F8 to repeat your last search.
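The behaviour of the * wildcard in the examples above can be imitated in a few lines of Python. This is only a sketch using Python's standard fnmatch module, not WordSmith's own search routine; the helper name search is hypothetical, and fnmatch also gives special meaning to ? and [ ], which WordSmith may not.

```python
from fnmatch import fnmatchcase


def search(words, pattern):
    """Return the words matching a pattern in which * stands for any run of letters."""
    pattern = pattern.upper()
    return [w for w in words if fnmatchcase(w.upper(), pattern)]


words = ["before", "beforehand", "other", "something", "booth", "sooth"]

ends_in_fore = search(words, "*fore")    # before only
contains_fore = search(words, "*fore*")  # before and beforehand
ends_in_th = search(words, "*TH")        # booth and sooth
contains_th = search(words, "*TH*")      # other, something, booth, sooth
```

An asterisk at the left only lets the pattern sit at the end of a word; you need one on each side to find it in the middle.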
The search hot keys are:
F8 repeat last search (use in conjunction with F10 or F11)
F10 search forwards from the current line
F11 search backwards from the current line
F12 search starting from the beginning
This function is handy for lemmatization (joining words which belong under one entry, such as seem/ seems/ seemed/ seeming etc.)

See also: searching for an entry by typing

9.16 mutual information scores

The point of it
A Mutual Information (MI) score relates one word to another. For example, if problem is often found with solve, they may have a high mutual information score. Usually, the will be found much more often near problem than solve, so the procedure for calculating Mutual Information takes into account not just the most frequent words found near the word in question, but also whether each word is often found elsewhere, well away from the word in question. Since the is found very often indeed far away from problem, it will not tend to be related, that is, it will get a low MI score.
This relationship is bilateral: in the case of kith and kin, it doesn't distinguish between the virtual certainty of finding kin near kith, and the much lower likelihood of finding kith near kin.
There are various different formulae for computing the strength of collocational relationships. The MI in WordSmith ("specific mutual information") is computed using a formula derived from Gaussier, Lange and Meunier described in Oakes, p. 174; here the probability is based on total corpus size in tokens. Other measures of collocational relation are computed too, which you will see explained under Mutual Information Display.

Settings
The Mutual Information settings are found in the Controller under Adjust Settings | Indexing or in a menu option in WordList.
stop at: you can choose where you want collocational breaks to be assumed. With the setting above, "I wrote the letter.
Then I posted it" would not consider posted as a possible collocate of letter because there's a sentence break between them.
max. percent: ignores any tokens which are more frequent than the percentage indicated. (The point of this is to avoid computing mutual information for words like the and of, which are likely to have a frequency greater than say 1.0%.)
span: the number of intervening words between collocate and node. With a span of 5, the node wrote would consider the, letter, then, I and posted as possible collocates if stop at were set at no limits.
min. mutual info: the minimum number which the MI must come up with to be reported. A useful limit is 3.0. Below this, the linkage between node and collocate is likely to be rather tenuous.
min. frequency: the minimum frequency for any item to be considered for the mutual information calculation (default = 5). (If an item occurs only once or twice, the mutual information is unlikely to be informative.)

See also: Mutual Information Display, Computing Mutual Information, Making an Index List, Viewing Index Lists, WordList Help Contents.
See Oakes for further information about Mutual Information.

9.17 mutual information: computing

In WordList or in Concord

In Concord
MI is not computed by default for a collocate list. To compute MI, you need a word list to supply the relevant data.
Suppose you have made a concordance using all the files in c:\wsmith5\text\shakespeare and have done a concordance on love. You get collocates such as Romeo, hate, the, Juliet, Nurse etc. All these show a "Relation" (MI) score of "??" because they haven't yet been computed.
If you haven't done so yet, use WordList to make a word list of the same text files (or if you prefer, use some other reference corpus). Make sure the reference corpus file is what you prefer.
Now choose the relevant menu item, and Concord will use the reference corpus filename.
It will look up each of your collocates in the word list and compute MI using the information in the reference corpus word list.

In WordList
To compute Mutual Information (MI) you need a WordList Index. Call up the alphabetical view of the list.
When you start the computation, you can choose whether to compute MI for selected (highlighted) entries, for all entries, or for those between two initial characters, e.g. between A and D. If you wish to select only a few items for MI calculation, you can mark them first. You can always do part of the list (e.g. A to D) and later merge your mutual-information list with another (E to H).
What you see: set the minimum frequency to suit; e.g. 5 means that no word of frequency 4 or less in the index will be visible in the MI results. Omit # means no numbers will be considered, and omit if word1=word2 is there because you might find that GOOD is related to GOOD if there are lots of cases where these 2 are found near each other.
Working constraints: this is to set things so that the process doesn't take forever, as explained below.
Max. frequency = ignore high frequency words which would occur say at 0.5% frequency. (Above 0.5% in the case of the BNC would mean ignoring about 20 of the top frequency words, such as WITH, HE, YOU. Above 0.1% would cut about 100 words including GET, BACK, BECAUSE.)
Stop at has to do with whether breaks such as punctuation or sentence breaks determine that one word cannot be related to another.
Span is how far left and right to look for the MI relation.
From A to A is where you choose a range of words starting with those characters.
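To show how the span, stop at and frequency settings feed into a score, here is a minimal Python sketch. Please treat it as an assumption-laden illustration: it uses the standard pointwise mutual information formula, log2 of (joint frequency x corpus size) / (freq. of word 1 x freq. of word 2), which may differ in detail from the formula WordSmith derives from Oakes p. 174. The function names and the list-of-sentences representation are hypothetical.

```python
import math


def mutual_information(joint, freq1, freq2, total_tokens):
    """Standard pointwise MI (an assumption; not necessarily WordSmith's exact formula)."""
    return math.log2(joint * total_tokens / (freq1 * freq2))


def collocates(sentences, node, span=5, stop_at_sentence=True):
    """Collect words found within `span` words left or right of the node."""
    if stop_at_sentence:
        units = sentences  # a sentence break ends the search
    else:
        units = [[w for s in sentences for w in s]]  # "no limits"
    found = set()
    for unit in units:
        for i, word in enumerate(unit):
            if word == node:
                found.update(unit[max(0, i - span):i] + unit[i + 1:i + 1 + span])
    return found


sentences = [["i", "wrote", "the", "letter"], ["then", "i", "posted", "it"]]

with_breaks = collocates(sentences, "letter")
no_limits = collocates(sentences, "letter", stop_at_sentence=False)

# Two words that co-occur exactly as often as chance predicts score 0.
chance_score = mutual_information(joint=1, freq1=10, freq2=10, total_tokens=100)
```

With sentence breaks respected, posted is not offered as a collocate of letter; with no limits it is. The score itself rises when the pair is found together more often than the two separate frequencies would lead you to expect.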
Computing the MI score for each and every entry in an index takes a long time: it took over an hour to compute MI for all words beginning with B in the case of the BNC World edition (written, 90 million words) in the screenshot below, using the settings visible above. It might take 24 hours to process the whole BNC, 100 million words, even on a modern powerful PC.
Don't forget to save your results afterwards!

See also Collocates, Mutual Information Settings, Mutual Information Display, Making an Index List, Viewing Index Lists, WordList Help Contents.

9.18 mutual information display
The "Mutual Information" procedure contains a number of columns and uses various formulae:
Word 1: the word to the left, followed by Freq. (its frequency in the whole index).
Word 2: the word to the right, followed by Freq. (its frequency in the whole index).
Texts: the number of texts this pair was found in (there were 56 in the whole index).
Gap: the most common distance between Word 1 and Word 2.
Joint: their joint frequency.
In line 2 of this display, PURSE occurs 6 times in the whole index, and STRINGS 5 times. They occur together 5 times -- in other words in this little corpus, strings is always part of the phrase purse strings. The gap is 1 because strings comes 1 word after purse. The pair purse strings comes in 3 texts.
As usual, the data can be sorted by clicking on the headers. Above, it was sorted by clicking on "MI" first and "Word 1" second. You get a double sort, main and secondary, because sometimes you will want to see how MI or Z score or other sorting affects the whole list and sometimes you will want to keep the words sorted alphabetically and only sort by MI or Z score within each word-type. Press Swap to switch the primary & secondary sorts.
Compare this with the display sorted by Z Score (Oakes p. 163).
TED HEATH (a UK Prime Minister of the 1970s) is still top and SPEAKERS ...
VOUCH still visible, but some other items have moved in.
Here is the display sorted by MI3 Score (Oakes p. 172):
Much more frequent items have jumped to the top. Finally, by Log Likelihood (Dunning, 1993):
Here the Word 2 items are very high frequency ones and we get at colligation (grammatical collocation).

See also: Formulae, Mutual Information, Computing Mutual Information, Making an Index List, Viewing Index Lists, WordList Help Contents.
See Oakes for further information about Mutual Information.

9.19 re-sorting: consistency lists
The frequency-ordered consistency display can be re-sorted by
alphabetical order (Word)
total frequencies overall (Total, the default)
the frequencies in any given file (you see the file names).
Click on Word, Total or a filename to choose. The sort can be either ascending or descending, the default being descending.

See also: Sorting word-lists

9.20 statistics
These include:
number of files involved in the word-list
file size (in bytes, i.e. characters)
running words in the text (tokens)
no. of different words (types)
type/token ratios
no. of sentences in the text
mean sentence length (in words)
standard deviation of sentence length (in words)
no. of paragraphs in the text
mean paragraph length (in words)
standard deviation of paragraph length (in words)
no. of headings in the text
mean heading length (in words)
no. of sections in the text
mean section length (in words)
standard deviation of heading length (in words)
the number of 1-letter words
...
the number of n-letter words (to see these scroll the list box down)
(14 is the default maximum word length. But you can set it to any length up to 50 letters in Word List Settings, in the Settings menu.) Longer words are cut short but this is indicated with a + at the end of the word.
The number of types (different words) is computed separately for each text.
Therefore if you have done a single word-list involving more than one text, summing the number of types for each text will not give the same total as the number of types over the whole collection.

See also: WordList display (with a screenshot), Summary Statistics, Starts and Ends of Text Segments.

9.21 import words from text list

The point of it
You might want a word list based on some data you have obtained in the form of a list, but whose original texts you do not have access to.

Requirements
Your text file can be in any language (select this before you make the list), and can be in Unicode or ASCII. But it must follow a similar format as a stop list expects, except that following each word there must be a character and the frequency as a plain number (decimal points will be ignored). Do not use commas as a thousands delimiter as otherwise they'll be interpreted as different words. The words do not need to be in frequency or alphabetical order.

Example
; My word list for test purposes.
THIS 67543
IT 33218
WILL 2978
BE 5679
COMPLETE 45
AND 99345
UTTER 54
RUBBISH 99
THE 578965
IS 55678
You should get results like these.
Statistics are calculated in the simplest possible way: the word-lengths (plus mean and standard deviation), and the number of types and tokens. Most procedures need to know the total number of running words (tokens) and the number of different word types so you should manage to use the word-list in KeyWords etc.

How to do it
When you choose the New menu option in WordList you get a window offering three tabs: a Main tab for most usual purposes, one for Detailed Consistency, and another (Advanced) for creating a word list using a plain text file.
Choose your .txt file and press create word list now.

9.22 type/token ratios
If a text is 1,000 words long, it is said to have 1,000 "tokens".
But a lot of these words will be repeated, and there may be only say 400 different words in the text. "Types", therefore, are the different words.
The ratio between types and tokens in this example would be 40%.
But this type/token ratio (TTR) varies very widely in accordance with the length of the text -- or corpus of texts -- which is being studied. A 1,000 word article might have a TTR of 40%; a shorter one might reach 70%; 4 million words will probably give a type/token ratio of about 2%, and so on. Such type/token information is rather meaningless in most cases, though it is supplied in a WordList statistics display. The conventional TTR is informative, of course, if you're dealing with a corpus comprising lots of equal-sized text segments (e.g. the LOB and Brown corpora). But in the real world, especially if your research focus is the text as opposed to the language, you will probably be dealing with texts of different lengths and the conventional TTR will not help you much.
WordList uses a different strategy for computing this, therefore. The standardised type/token ratio (STTR) is computed every n words as WordList goes through each text file. By default, n = 1,000. In other words the ratio is calculated for the first 1,000 running words, then calculated afresh for the next 1,000, and so on to the end of your text or corpus. A running average is computed, which means that you get an average type/token ratio based on consecutive 1,000-word chunks of text. (Texts with fewer than 1,000 words (or whatever n is set to) will get a standardised type/token ratio of 0.)

Setting the N boundary
Adjust the n number in Minimum & Maximum Settings to any number between 100 and 20,000.
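The difference between the conventional and the standardised ratio can be seen in a short Python sketch. It follows the description above (ratio per consecutive n-word chunk, then the average; 0 for texts shorter than n), but it is not WordSmith's code: the function names are hypothetical, and dropping a final incomplete chunk is my assumption about how the remainder is handled.

```python
def ttr(tokens):
    """Conventional type/token ratio, as a percentage."""
    return 100.0 * len(set(tokens)) / len(tokens) if tokens else 0.0


def sttr(tokens, n=1000):
    """Standardised type/token ratio: mean TTR over consecutive n-token chunks.

    Assumption: a final chunk shorter than n is left out of the average.
    """
    ratios = []
    for start in range(0, len(tokens) - n + 1, n):
        chunk = tokens[start:start + n]
        ratios.append(100.0 * len(set(chunk)) / n)
    return sum(ratios) / len(ratios) if ratios else 0.0


# A toy text with n = 4 so that the arithmetic is easy to follow.
tokens = ["a", "b", "c", "d", "a", "a", "b", "b", "x"]

standardised = sttr(tokens, n=4)  # chunks score 100% and 50%, so the mean is 75.0
too_short = sttr(["a", "b"], n=4)  # fewer than n tokens, so 0.0
```

The first chunk has 4 types in 4 tokens and the second has 2 types in 4, giving 75%; the conventional TTR for the same 9 tokens is 5 types out of 9, about 56%.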
What STTR actually counts
Note: The ratio is computed
a) counting every different form as a word (so say and says are two types)
b) using only the words which are not in a stop-list
c) those which are within the length you have specified,
d) taking your preferences about numbers and hyphens into account.
The number shown is a percentage of new types for every n tokens. That way you can compare type/token ratios across texts of differing lengths. This method contrasts with that of Tuldava (1995:131-50) who relies on a notion of 3 stages of accumulation. The WordSmith method of computing STTR was my own invention but parallels one of the methods devised by the mathematician David Malvern working with Brian Richards (University of Reading).

Further discussion
TTR and STTR are both pretty crude measures even if they are often assumed to imply something about "lexical density". Suppose you had a text which spent 1,000 words discussing ELEPHANT, LION, TIGER etc., and then 1,000 discussing MADONNA, ELVIS, etc., then 1,000 discussing CLOUD, RAIN, SUNSHINE. If you set the STTR boundary at 1,000 and happened to get say 48% or so for each section, the statistic in itself would not tell you there was a change involving Africa, Music, Weather. Suppose the boundary between Africa & Music came at word 650 instead of at word 1,000: I guess there'd be little or no difference in the statistic.
But what would make a difference? A text which discussed clouds and was written by a person who distinguished a lot between types of cloud might also use MIST, FOG, CUMULUS, CUMULO-NIMBUS. This would be higher in STTR than one written by a child who kept referring to CLOUD but used adjectives like HIGH, LOW, HEAVY, DARK, THIN, VERY THIN to describe the clouds... and who repeated DARK, THIN, etc. a lot in describing them.
(NB. Shakespeare is well known to have used a rather limited vocabulary in terms of measures like these!)

9.23 case sensitivity
Normally, you'll make a case-insensitive word list, especially as in most languages capital letters are used not only to distinguish proper nouns but also to signal beginnings of sentences, headings, etc.
If, however, you wish to make a word list which distinguishes between major, Major and MAJOR, activate case sensitivity (Adjust Settings | WordList | Case Sensitivity in the Controller).
When you first see your case-sensitive list, it is likely to appear all in UPPER CASE. Press Ctrl/L or choose the Layout menu option to change this.

9.24 minimum & maximum settings
These include:

minimum word length
Default: 1 letter. When making a word-list, you can specify a minimum word length, e.g. so as to cut out all words of less than 3 letters.

maximum word length
Default: 49 letters. You can allow for words of up to 50 characters in length. If a word exceeds the limit and Abbreviate with + is checked, WordList will append a + symbol at the end of it to show that it was cut short. (If Abbreviate with + is not checked, the long word will be omitted from your word list.) You might wish to use this to set both minimum and maximum to, say, 4, and leave Abbreviate with + un-checked: that way you'll get a word list with only the 4-letter words in it.

minimum frequency
Default: 1. By default, all words will be stored, even those which occur once only. If you want only the more frequent words, set this to any number up to 32,000.

maximum frequency
Default maximum is 2,147,483,647 (2 Gigabytes). You'd have to analyse a lot of text to get a word which occurred as frequently as that! You might set this to, say, 500, and the minimum to 50: that way your word-list would hold only the moderately common words.

type/token mean number (default 1,000)
Enables a smoothed calculation of type/token ratio for word lists. Choose a number between 10 and 20,000. For a more complete explanation, see WordList Type/Token Information.
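To see how the length and frequency limits interact, here is a minimal Python sketch. The function name apply_limits and its parameters are hypothetical, and cutting an over-long word at the maximum length before adding the + is an assumption about what "cut short" means; only the general behaviour comes from the settings described above.

```python
from collections import Counter

def apply_limits(tokens, min_len=1, max_len=49, abbreviate=True,
                 min_freq=1, max_freq=2_147_483_647):
    """Count tokens, applying word-length and frequency limits."""
    counts = Counter()
    for tok in tokens:
        if len(tok) < min_len:
            continue  # too short: left out
        if len(tok) > max_len:
            if not abbreviate:
                continue  # too long and not abbreviated: left out
            # Assumption: the word is truncated at max_len and marked with +.
            tok = tok[:max_len] + "+"
        counts[tok] += 1
    # Keep only words whose frequency falls within the limits.
    return {w: f for w, f in counts.items() if min_freq <= f <= max_freq}

tokens = ["a", "to", "tree", "tree", "house", "elephant"]
# Minimum and maximum both 4, no abbreviation: only the 4-letter words survive.
print(apply_limits(tokens, min_len=4, max_len=4, abbreviate=False))
```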
See also: Text Characteristics, Stop Lists, Setting Defaults

9.25 sort order
How to do it...
Sorting can be done simply by pressing the top row of any list. Press again to toggle between ascending & descending sorts.
With a word-list on your screen, the main Frequency window doesn't sort, but you can re-sort the Alphabetical window (look at the tabs at the bottom of WordList to choose the tab) in a number of different ways.
To choose one of the special sorts specified below, press F6 or Ctrl/F6 or Shift/Ctrl/F6. Or choose the appropriate menu option.

Alphabetical Word Sort
Many languages have their own special sorting order, so prior to sorting or re-sorting, check that you have selected the right language for the words being sorted. Spanish, for example, uses this order: A,B,C,CH,D,E,F,G,H,I,J,K,L,LL,M,N,Ñ,O,P,Q,R,S,T,U,V,W,X,Y,Z.

Reverse Word Sort
This is so that you can sort words by suffix. The order is determined by word endings, not word beginnings. You will therefore find all the -ing forms together.

Word Length Sort
This is so that you can sort words by their length (1-letter, 2-letter, etc. up to 50-letter words). Within a set of equal-length words, there's a second, alphabetical sort.

Consistency Sort
Press the "Texts" header to re-sort the words according to their consistency.

See also: Concord sort, KeyWords sort, Editing entries; Accented characters; Choosing Language

9.26 WordList and tags
If you have defined a tag file and made the appropriate settings, you can get a word-list which treats tags and words separately as in this example, where the tag is viewed as if it were a prefix.
In its Alphabetical view, the list can be sorted on the tag or the word.
To colour these as in the example, in the main Controller I chose colour 40 for the foreground for tags. Then in WordList, I chose View | Layout as in this screenshot.

9.27 WordList display
Each WordList display shows
the word
its frequency
its frequency as a percent of the running words in the text(s) the word list was made from
the number of texts each word appeared in
that number as a percentage of the whole corpus of texts

The Frequency display might look like this:
Here you see the top 6 words in a word list based on 7 interviews. There are 2,479 words altogether but in the screenshot we can only see the first few. The Freq. column shows how often each word cropped up (THE appeared 1,270 times in the 7 texts), and the % column tells us that 1,270 represents 5.52% of the running words in the 7 texts. The Texts column shows that THE comes in 7 texts, that is 100% of the texts used for the word list.

The Alphabetical listing also shows us some of the words but now they're in alphabetical order. ABLE comes 18 times altogether, and in 5 of the 7 texts. ABOUT, on the other hand, comes in all 7 texts.

Now let's examine the statistics.
In all 7 texts, there are 2,749 word types (as pointed out above). The total running words is 22,992. Each word is about 4.49 characters in length. There are 928 sentences altogether, on average 24.78 words in length. In the text of the interview with Alex Salmond, there are only 674 different word types and that interview is only just over 3,000 words in length. This is explained in more detail in the Statistics page.

Finally, here is a screenshot of the same word list sorted "reverse alphabetically". In the part which we can see, all the words end in -IC.
To do a reverse alphabetical sort, I had the Alphabetical window visible, then chose Edit | Reverse Word sort in the menu. To revert to an ordinary alphabetical sort, press F6.
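The reverse word sort and the word length sort are easy to picture as sort keys. The Python sketch below is an illustration only: the function names are hypothetical, and it uses plain character order, so it ignores the language-specific sorting that WordList applies.

```python
def reverse_word_sort(words):
    # Sort on each word spelt backwards, so words with the same ending group together.
    return sorted(words, key=lambda w: w[::-1])

def word_length_sort(words):
    # Sort by length first, then alphabetically within each length.
    return sorted(words, key=lambda w: (len(w), w))

words = ["MUSIC", "SING", "PUBLIC", "RING", "A", "TO", "AT"]
print(reverse_word_sort(words))  # the -IC words sit together, as do the -ING words
print(word_length_sort(words))
```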
See also: Consistency, Lemmatisation

9.28 WordSmith controller: WordList settings
These are found in the main Controller under Adjust Settings | WordList. This is because some of the choices -- e.g. Minimum & Maximum Settings -- may affect other Tools.
There are 2 sets: What you Get and What you See.

WHAT YOU GET
Word Length & Frequencies
See Minimum & Maximum Settings.
Standardised Type/Token #
See WordList Type/Token Information.
Tags
By default you get "words only, no tags". If you want to include tags in a word list, you need to set up a Tag File first. Then choose one of the options here.
Tags are not counted in any statistics based on a running word count or number of tokens or types. What you will see for each is its frequency, that frequency as a percentage of the running words excluding tags, and the number of texts it is in.
In the example here we see that BECAUSE is classified by the BNC either as a or a . (That's how the BNC classifies BECAUSE OF...)
For colours and tags see WordList and Tags.

WHAT YOU SEE
Case Sensitivity
Normally, you'll make a case-insensitive word list. If you wish to make a word list which distinguishes between the, The and THE, activate case sensitivity.

See also: Using Index Lists, Viewing Index Lists, WordList Help Contents, WordList and tags, Computing word list clusters.

10 Utility Programs
10.1 Convert Data from Previous Versions
10.1.1 Convert Data from Previous Versions
As WordSmith Tools develops, it has become necessary to store more data along with any given word-list, concordance etc. For example, data about which language(s) were selected for a concordance, notes now stored with every type of results file, etc. Therefore it has been necessary to supply a tool to convert data from the formats used in WS 1.0 to 3.0 to the new format for the current version.
This is the Data Converting tool. If you try to open a file made with a previous version you should be offered a chance to convert it first.

10.2 WebGetter
10.2.1 overview
The point of it
The idea is to build up your own corpus of texts, by downloading web pages with the help of a search engine.

What you do
Just type a word or phrase and press Go or .

How it works
WebGetter visits the Search Engine specified in the second box and downloads the first 100 sources or so. Basically it uses the Search Engine just as you do yourself, getting a list of useful references. Then it sends out a robot to visit each web address and download the web page in each case (not from the Search Engine's cache but from the original web-site). Quite a few robots may be out there searching for you at once -- the advantage of this is that one slow download doesn't hold all the others up.
After downloading a web page, that WebGetter robot checks it meets your requirements (in Settings). If the page is big enough, a file with a name very similar to the web address will be saved to your hard disk.
When it runs out of references, WebGetter re-visits the Search Engine and gets some more.

See also: Settings, Display, Limitations

10.2.2 settings
These are where the texts are to be stored. The folder you specify will act as a root. That is, if you specify c:\temp and search for "besteirol", results will be stored in c:\temp\besteirol. If you do another search on, say, "WordSmith Tools", results for that will go into c:\temp\WordSmithTools.
timeout: the number of seconds after which a WebGetter robot stops trying a given webpage if there's no response. Suggested value: 20 seconds.
max simultaneous: WebGetter works by sending robots out simultaneously, each one requesting a different web page. Suggested value: 20. That is, up to 20 are being downloaded at once.
language: you specify the language you require.
minimum file length (suggested 20Kbytes): the minimum size for each text file downloaded from the web. Small ones may just contain links to a couple of pictures and nothing much else.
minimum words (suggested: 300): after each download, WebGetter goes through the downloaded text file counting the number of words and won't save unless there are enough.
required words: you may optionally type in some words which you require to be present in each download; you can insist they all be present or any 1 of these.

Search Engines
Download a choice of search engines by pressing Engines. This gets the latest information about each search engine from www.lexically.net/downloads/searchengines.htm.

Advanced Options
If you work in an environment with a "Proxy Server", WebGetter will recognise this automatically and use the proxy unless you uncheck the relevant box. If in doubt ask your network administrator.

The grid of settings
This contains:
name: The Name to appear above, in the list of Search Engines
ignore: Websites not to visit when downloading (as opposed to requesting a list). That is, when WebGetter gets a page from Google, it only wants Google's list, not more Google web-pages.
URL: The URL where the Search Engine is found.
Searchstring: The search word syntax
Max: How many hits to try for on each contact
Next
Language: Required language
Other
The search word is specified more or less just as you do when you use the same Search Engine yourself. Few advanced settings for each Search Engine are used; you can try your own preferences by typing in the grid, in the Searchstring column. Learn each Search Engine's current settings by simply trying it and then adapt the Searchstring accordingly. Some Search Engines want to set cookies on your PC and this might cause a failure to download.
You can see the address line in the Advanced tab; WebGetter attempts to tell the Search Engine the search-word, the maximum number of hits to show per contact, what language to use, and how to get more.

See also: Display, Limitations

10.2.3 display
As WebGetter works, it shows the URLs visited. If greyed out, they were too small to be of use or haven't been contacted yet. If dark blue, they were saved to disk.
Above, you will see the bytes visited, and every time a file which meets your requirements is stored, you'll see the number of files and number of words go up. At the bottom, the current time and elapsed time.
There is a tab giving access to a list of the successfully downloaded files. Here is a partial list of what I got with a broadband connection, in 1 minute & 1 second, with the search term "history of the English language" (with quotes).
As you can see, about 1.3MB of web-pages were examined, and 90,000 words (1.1MB) were found worth saving, with the default settings (they each had to be at least 10K in size and have 300 words). In that time I got a couple of time-outs, presumably because 20 seconds isn't long enough for some websites or servers which are slow and ponderous.

See also: Settings, Limitations

10.2.4 limitations
Everything depends on the search engine and the search terms you use. The Internet is a huge noticeboard; lots of stuff on it is merely ads and catalogue prices etc. The search terms are collected by the search engines by examining terms inserted by the web page author. There is no guarantee that the web pages are really "about" the term you specify, though they should be roughly related in some way. Use the Settings to be demanding about what you download.

See also: Display

10.3 Languages Chooser
10.3.1 Overview
A tool for selecting Languages which you want to process. You will probably only need to do this once, when you first use WordSmith Tools.
How to get here
The Language Chooser is accessed from the main WordSmith Controller menu: Settings | Adjust Settings | Text and Languages | Other Languages.
What you will see may look like this:
5 languages have been chosen already. At the bottom you will see what the current font can handle, in terms of Windows ANSI or Unicode text. The Courier New font on the PC this was done on can handle characters in Windows for Western and Eastern Europe, Cyrillic etc., as well as several ranges within the Unicode standard.

See also: Language, Font, Sort Order, Other Languages, saving your choices

10.3.2 Language
How to get here
The Language Chooser is accessed from the main WordSmith Controller menu: Settings | Adjust Settings | Text and Languages | Other Languages.

What it does
The list of languages on the left shows all those which are supported by the PC you're using. If any of them are greyed, that's because although they are "supported" by your version of Windows, they haven't been installed in your copy of Windows. (To install more multilingual support, you will need your original Windows cdrom or may be able to find help on the Internet.)
On the right, there are the currently chosen languages for use with WordSmith. The default language should be marked #1 and others which you might wish to use with *.
For each Chosen Language, you can specify any symbols which can be included within a word, e.g. the apostrophe in English, where it makes more sense to think of "don't" as one word than as "don" and "t". You can also specify whether a hyphen separates words or not (e.g. whether "self-conscious" is to be considered as 2 words or 1).
To change the status of a chosen language, right-click. This user is about to make Russian the #1 default.
To delete any unwanted language, right-click and choose "demote".
To add a language, drag it from the left window to the right, then set the country and font you prefer for that particular language. Each time you change language, the list of fonts available changes, and the sorted words will change their appearance.
The window at the bottom shows which characters can be supported in Unicode or 1-byte format by the highlighted language. Some languages do not mark word-separators.

See also: Other Languages, saving your choices

10.3.3 Font
The Fonts window shows those available for each language, depending on fonts you have installed. You will need a font which can show the characters you need: there are plenty of specialised fonts to be found on the Internet. Unicode fonts can show a huge number of different characters, but require your text to be saved in Unicode format. If you change font, the list of characters available changes.
Click here for more on Unicode.

See also: Language, Sort Order, Other Languages, saving your choices

10.3.4 Sort Order
Sorting is done in accordance with the language chosen. (Spanish, Danish, etc. sort differently from English.)

The display
You will see 2 windows below "Resort" -- the one at the left contains some words in various languages; you can add your own. The cursor in the screenshot shows where a user is about to type, having already typed "(". If your keyboard won't let you type them in, paste from your own collection of texts.
The one at the right shows how these words get sorted according to the language you have selected.

See also: Language, Font, Other Languages, saving your choices

10.3.5 Other Languages
To work on a language not in the list, press Edit and base your new language name on one of the existing languages. Choose a font which can show the characters & symbols you want to include. Sort order is handled as for the language you base your new language on.
See also: Language, Font, Sort Order, saving your choices

10.3.6 saving your choices
Save your results before quitting, so that next time WordSmith Tools will know your preferences regarding fonts and your #1 default language and your subsidiary default languages and you won't need to run this again. Results will be in \wsmith5\language_choices.ini.

See also: Language, Font, Sort Order, Other Languages

10.4 Minimal Pairs
10.4.1 aim
A program for finding possible typos and pairs of words which are minimally different from each other (minimal pairs). For example, you may have a word list which contains ALEADY 5 and ALREADY 461, that is, your texts contain 5 instances where there is a possible misprint and 461 which are correct.
This program helps to find possible variants and typos and anagrams.

See also: requirements, choosing your files, output, rules and settings, running the program.

10.4.2 requirements
A word-list in text format. Each line should contain a word and its frequency separated by tabs, e.g.
THE    75,432
WAS    9,895
or
1    THE    75,432
2    WAS    9,895
You can make such a list using WordList. For example, select (highlight) the columns containing the word and its frequency, press the ".txt" button, then
Clear the "Number each line" box
Rows to save = "all" (but if it shows 0-xxx change 0 to 1)
Columns to save = "any highlighted"

See also: aim, choosing your files, output, rules and settings, running the program.

10.4.3 choosing your files
Choose your input word list (which must be in plain text format) by clicking the button at the right of the edit space and finding the word list .txt file. If it has numbered lines, check the ".txt is pre-numbered" box. If it has a header (WS3 will by default produce 3 lines of header information) make sure you have set the "Header lines to skip" box to the right number.
You must specify where to save your results.
The results will show all the typos and minimal pairs which the program finds.
Choose also
whether to number the list of results
whether to show the frequencies of possible typos
whether to show the rule which generated the result.

See also: aim, requirements, output, rules and settings, running the program.

10.4.4 output
An example of output is
418 ALTHOUGHT (7) ALTHOUGH(37975)
Here the lines are numbered, and the bracketed numbers mean that ALTHOUGHT occurred 7 times and ALTHOUGH 37,975 times.

An example using Dutch medical text, lower case:
136 aplasie (1) aplasia(1)[L]
137 apyogene (1) apyogeen(1)[S]
138 arachnoideales (1) arachnoidales(1)[I]
Here line 136 generated a 1-Letter difference, 137 a Swap and 138 an Insertion.

An example using Guardian newspaper, looking for anagrams:
35 AUDIE (7) ADIEU(43)[A]
36 ABASS (6) ASSAB(16)[A]
37 AGUIAR (6) AURIGA(11)[A]
38 ALRED'S (6) ADLER'S(18)[A]
39 ANDOR (6) ADORN(128)[A]

See also: aim, requirements, choosing your files, rules and settings, running the program.

10.4.5 rules and settings
Rules
Insertions (abxcd v. abcd)
This rule looks for 1 extra letter which may be inserted, e.g. HOWWEVER
Swapped letters (abcd v. acbd)
This rule looks for letters which have got swapped, e.g. HOVEWER
1 letter difference (abcd v. abxd)
This rule looks for a 1 letter difference, e.g. HOWEXER
Anagrams too (abcd v. adbc)
This rule looks for the same letters in a different order, e.g. HWVROEE

Settings:
end letters to ignore if at last letter
This rule allows you to specify any letters to ignore if at the end of the word, e.g. if you specify "s", the possibility of a typo when comparing ELEPHANT and ELEPHANTS will not be reported.
minimum start-of-word match
This setting (default = 1) allows you to assume that when looking for minimal pairs there is a part of each at the beginning which matches perfectly.
For example, when considering ALEADY, the program probably doesn't need to look beyond words beginning with A for minimal pairs. If the setting is 1, it will not find BLEADY as a minimal pair. To check all words, take the setting down to 0. The program will be 26 times slower as a result!
minimum word length
This setting specifies the minimum word length for the program to consider the possibility there is a typo. The default is 5, which means 4-letter words will be simply ignored. This is to speed up processing, and because most typos probably occur in longer words.
all words starting with ...
If you choose this option, the program will ignore the next setting (max. word frequency). Here you can type in a sequence such as F,G,H and if so, the program will take all words beginning F or G or H (whatever their frequency) and look for minimal pairs based on the rules and settings above.
max. word frequency (ignored if "all words starting with" is checked)
How frequent can a typo be? This will depend on how much text your word-list is based on. The default is 10, which means that any word which appears 11 times is assumed to be OK, not a typo.

Factory Defaults (restores default values)
Save Current Settings (saves your choices of file and rules)
Get Saved Settings (restores your last-saved choices)

See also: aim, requirements, choosing your files, output, running the program.

10.4.6 running the program
Press "Compute". You should then see your source text, with a few lines visible. Some of the rows and columns may be greyed and others white: move the column and row numbers till the real data are white and any headings or line-numbers are greyed out.
If you want to stop in the middle, press "Stop". The status bar at the bottom of the screen shows how many words have been found in the word-list, the time elapsed, and time estimated to completion of the whole task.
You can press "Results" to see your results file, when you have finished.
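For readers who want to see what the four Minimal Pairs rules amount to, here is a rough Python sketch. It is not the program's own code. The function names are hypothetical, the insertion rule is treated as working in either direction (so that ALEADY is matched with ALREADY), a swap is taken to mean any two letters exchanged (which is what the HOVEWER example shows), and the "end letters to ignore" and "all words starting with" settings are left out.

```python
def classify(a, b):
    """Return the rule letter (I, S, L or A) linking two words, or None."""
    if a == b:
        return None
    if abs(len(a) - len(b)) == 1:
        longer, shorter = (a, b) if len(a) > len(b) else (b, a)
        # Insertion: removing one letter from the longer word gives the shorter.
        for i in range(len(longer)):
            if longer[:i] + longer[i + 1:] == shorter:
                return "I"
        return None
    if len(a) != len(b):
        return None
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    if len(diffs) == 1:
        return "L"  # 1 letter difference
    if len(diffs) == 2:
        i, j = diffs
        if a[i] == b[j] and a[j] == b[i]:
            return "S"  # two letters swapped
    if sorted(a) == sorted(b):
        return "A"  # same letters, different order
    return None

def find_pairs(freqs, max_freq=10, min_len=5, start_match=1):
    """Compare each rare, long-enough word with every other word in the list."""
    results = []
    for cand, f in freqs.items():
        if f > max_freq or len(cand) < min_len:
            continue  # too frequent to be a typo, or too short to bother with
        for other, g in freqs.items():
            if cand[:start_match] != other[:start_match]:
                continue  # the beginnings must match
            rule = classify(cand, other)
            if rule:
                results.append((cand, f, other, g, rule))
    return results

freqs = {"ALEADY": 5, "ALREADY": 461, "ALTHOUGHT": 7, "ALTHOUGH": 37975}
for row in find_pairs(freqs):
    print(row)
```

Comparing every candidate with every other word is slow on a big list, which is why the start-of-word match matters: it cuts down the number of words each candidate has to be compared with.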
Finally, "Quit".

See also: aim, requirements, choosing your files, output, rules and settings

10.5 File Viewer
10.5.1 Using File Viewer
Aim
To help you examine files of various kinds to see what is in them. This might be in order
to see whether they're really in plain text format
to see whether there's something wrong with them, such as unusual characters which oughtn't to be there
to see whether they were formatted for Windows, Mac, or for Unix
to check out any hidden stuff in there. (A Word .doc for example will have lots of hidden stuff you don't see on the screen but is inside the file anyway, such as the name of the author, type of printer being used, etc.)
to find strings of words in a database, a spreadsheet or even a program file.
Here you can see the gory details of the text. Some characters are highlighted in different colours so you can see exactly how the text is formatted. (In the above case we can find out it wasn't originally produced on a Windows PC.)

Loading a "text"
a. Choose your file; if necessary click on the button at the right of the text-input box.
b. Press Show.

Format
The two options available are 1 byte or 2 bytes to represent each character-symbol in the text in question. You may need to alter this setting to see your text in a readable format.

The two windows
The left window shows how the "text" is built up. You can see each character as a number and, further to the right, as a character. The right window shows the text, line by line so you can read it. It isn't an editor and it doesn't word-wrap.

Searching
Just type in the search-word and press Search. The search is case sensitive and is not a "whole word" search.

Settings
Font
Choose the font in the font window. You may need to change font if you want to see Chinese etc. represented correctly.
Colours
The colour grid lets you see the number section ("hex") in special colours, so you can find the potential problems you're interested in. First select the character you want coloured. Click the left mouse button to change the foreground colour, or the right button to change the background colour. The character names are Unicode names.
Columns
o You can set the "hex" columns between 2 and 16.
o The text can be shown in anything between 10 and 100 columns.

10.6 File Utilities
10.6.1 index
This sub-program supplies a few file utilities for general use:
Compare Two Files
File Chunker
Find Duplicates
Rename
Find Holes: for "holes" in text files
Splitter
Joiner

10.6.2 Splitter
10.6.2.1 Splitter: index
Explanations
What is the Splitter sub-program and what's it for?
Filenames
Wildcards

See also: WordSmith Main Index

10.6.2.2 aim of Splitter
This is a sub-program for splitting large files into lots of small ones.
Splitter needs to know:

End of Text Separator
The symbol which will act as an end-of-text separator: eg. [FF] or or or !# or [FF*] or [FF?????]
Restrictions:
1 The end-of-text marker must occur at the beginning of a line in the original large file.
2 It is case sensitive: will not find .
3 The first character in the end-of-text separator may not be a wildcard such as #, * or ?.
4 * and # may occur only once each in the end-of-text separator.
Splitter will create a new file every time it encounters the end-of-text marker you've specified.

Destination Folder
Where you want the small files to be copied to. (You'll need write permission to access it if on a network.)

Required sizes
The minimum and maximum number of lines that your small files can have (default = 2 and 30,000). Only files within these limits will be saved. This feature is useful for extracting files from very large CD-ROM files. The default of 2 is to avoid getting little text files e.g. from newspaper News in Brief stories, but if you do want small texts, then set this to 1. A "line" means from one to the next.

Bracket first line
Whether or not you want the first line of each new text file to be bracketed inside < > marks. This is because often the first line after your end-of-text symbol will contain some kind of header. If you don't want it to insert < and > around the line, leave the checkbox un-checked.

Title Line
If you know which line of your texts always contains the title for the sub-text in question, set this counter to that number, otherwise leave it at 0.

See also: Joiner, Filenames, Wildcards, The buttons, Text Converter index.

10.6.2.3 Splitter: filenames
Splitter will create lots of small files based on your large one(s). It creates new filenames on the following basis:
A folder based on the name of the source file is created. Sub-folders are created if there are too many files for a folder.
If a title is detected, each file will contain the title plus a number and .txt. If there is no title, the filename will be the number + .txt added as a file extension.
Thus a large file called HELLO.DAT will split up into a number of small ones:
\HELLODAT\1.txt
\HELLODAT\2.txt
...
\HELLODAT\1\512.txt
etc.

Tips
1. Splitter will start numbering at 1 each session.
2. Note that the small files will probably take up a lot more room than the original large file did. This is because the disk operating system has a fixed minimum file size. A one-character text file will require this minimum size, which will probably be several thousand bytes in size. Even so, I suggest you keep your text files such that each file is a separate text, by using Splitter. When doing word lists and key words lists, though, do them in batches.
3. CD-ROM files when copied to your hard disk may be read-only. You can change this attribute using Text Converter.
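The way Splitter cuts up a large file can be pictured with a short Python sketch. This is an illustration under several assumptions, not Splitter's own logic: the function name split_text is hypothetical, the separator is matched literally (no wildcards), the separator line itself is dropped, only saved texts are numbered, and titles, first-line bracketing and sub-folders are left out. It returns the would-be file names and contents rather than writing anything to disk.

```python
from pathlib import PureWindowsPath

def split_text(lines, source_name, marker="[FF]", min_lines=2, max_lines=30000):
    """Split lines at each end-of-text marker and keep the texts of acceptable size."""
    # Folder name based on the source file name, e.g. HELLO.DAT -> HELLODAT.
    folder = source_name.replace(".", "")
    texts, current = [], []
    for line in lines:
        # The marker only counts at the beginning of a line, and is case sensitive.
        if line.startswith(marker):
            texts.append(current)
            current = []
        else:
            current.append(line)
    texts.append(current)  # whatever follows the last marker

    files = {}
    number = 0
    for text in texts:
        # Only texts within the required sizes are saved; empty ones never are.
        if max(min_lines, 1) <= len(text) <= max_lines:
            number += 1
            files[str(PureWindowsPath(folder) / f"{number}.txt")] = text
    return files

lines = ["[FF]", "Title one", "body a",
         "[FF]", "short",
         "[FF]", "Title two", "body b", "body c"]
for name, text in split_text(lines, "HELLO.DAT").items():
    print(name, text)
```

With the default minimum of 2 lines, the one-line text in the middle is skipped; setting min_lines to 1 would keep it.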

10.6.2.4 Splitter: wildcards
#
The hash symbol, #, is used as a wildcard to represent any number, so [FF#] would find [FF3] or [FF9987] but not [FF] or [FF 9] (because there's a space in it) or [FFhello].
*
The asterisk represents any string, so [FF* would find all of the above. * is used as the last character in the end-of-text symbol. It would find [FF anything at all up to the next .
^
The ^ mark represents any single letter, so [FF^^] would find [FFZQ] but none of the others.
?
The question mark represents any single character (including spaces, punctuation, letters), so [FF??] would find [FF 9] in the above examples, but none of the others.
To represent a genuine #, ^, ? or *, put each one in double quotes, eg. "?" "#" "^" "*".

See also: Settings

10.6.3 join text files
This is a sub-program for joining small text files into bigger ones. You might want this because you aren't interested in the different texts individually but are only interested in studying the patterns of a whole lot of texts.
When you choose Joiner you will see something like this:

End of text marker
The symbol which will act as an end-of-text separator: eg. [FF] or or or !# or [FF*] or [FF?????]. The end-of-text marker will come at the beginning of a line in the original large file. If it includes # this will be replaced by the number of the text as the texts are processed.

Folder with files to join
Where the small files you want to be merged are now. They will not get deleted -- you must merge them into the Destination folder.

and sub-folders too
Check this if you want to process sub-folders of the "folder with files to join".

file specifications
The kinds of text files you want to merge, eg. *.* or *.txt or *.txt;*.ctx.

Destination Folder
Where you want the small files to be copied and merged to. (You'll need write permission to access it if on a network.)

recreate same sub-folders as source
If checked, creates the same structure as in the source. In the example, all the sub-folders of d:\text\guardian_cleaned will be created below d:\text\guardian_joined.

one text for each folderful
If checked, a whole folderful of source texts will go into one file in the destination.

Max. size (Kbytes)
The maximum size in kilobytes that you want each merged text file to be. 1000 means you will get almost 1 megabyte of text into each. That is about 150,000 words if there are no tags and the text is in English. This only applies if one text for each folderful isn't checked.

Stop button
Does what it says on the caption.

See also: Splitter, Text Converter index.

10.6.4 compare two files
The point of it
The idea is to be able to check whether 2 files are similar or not. You may often make copies of files and a few weeks later cannot remember what they were. Or you have used File Chunker to copy a big file to floppies and want to be sure the copy is identical to the original.
This program checks whether
a) they are the same size
b) they have the same contents (it goes through both, byte by byte, checking whether they match)
c) they have the same attributes (file attributes can be "read only" [you cannot alter the file], "system" [a file which Windows thinks is central to your operating system], "hidden" [one which is so important that Bill Gates may be reluctant to even let you know it exists on your disk])
d) they have the same time & date.

How to do it
Specify your 2 files and simply press "Compare".

See also: file chunker, find duplicates, rename

10.6.5 file chunker
The point of it
The idea is to be able to cut up a big file into pieces, so that you can copy it to floppy disks or cdroms. Otherwise how can you get a 5MB file onto 3 or 4 floppy disks and transfer it to another pc? Naturally on the other pc, you will later want to restore the chunks to one file.

How to do it: to copy a file
1. Specify your "file to chunk" (the big one you want to copy)
2. Specify your "drive & folder" (where you want to copy the chunks to. If to A: you will be asked to put in a new formatted floppy for each chunk.)
3. Specify the "size of each chunk" (default = 1,400K, which fits on a floppy)
4. Specify whether to "compress while chunking" (compresses the file as it goes along)
5. Press "Copy".

How to do it: to restore a file
1. Specify your "first chunk" (the first chunk you made using this program)
2. Specify which folder to "restore to" (where you want the results)
3. Specify whether to "delete chunks afterwards" (if they are not needed)
4. Press "Restore".

See also: compare two files, find duplicates, rename

10.6.6 find duplicates
The point of it
The idea is to be able to check whether you have files with the same name in different folders. You may often make copies of files and a few weeks later cannot remember where they were. This program only checks whether the files it is comparing have the same name. (You could use Compare 2 Files to see whether they are in fact identical.) It handles lots of folders, the point being to locate unnecessarily duplicated files or confusing reuse of the same filenames.

How to do it
Specify your Folder 1 and simply press "Search". Find Duplicates will go through that folder and any sub-folders and will report any duplicates found. Or you can specify 2 different folders (e.g. on different drives) and the process compares one set with the other.

See also: compare two files, file chunker, rename

10.6.7 rename
The point of it
To rename a lot of files at once, in one or more folders. You may have files with excessively long names which do not suit certain applications. Or it is a pain to rename a lot of files one by one.

How to do it
Specify your Folder, whether sub-folders will also be processed, and the kinds of file you want to handle. The default is All files *.*.
Also specify a "mask for new name" and a starting number. For example, with a mask of SUN and start number 0, the first file found (let's say it was originally Quite_a_long_and_Complicated_file.txt) will be renamed SUN0.txt. The next file would be SUN1.txt, and so on. (If the next was Quite_a_long_and_Complicated_file.htm, that would become SUN2.htm.)

When you press "Find Files", you will see a list of all files meeting these choices. If you now press "Rename" each one will be renamed according to your settings.

See also: compare two files, file chunker, find duplicates

10.7 Text Converter

10.7.1 purpose

This program does a "Search & Replace" on virtually any number of files. It is very useful for going through large numbers of texts and re-formatting them as you prefer, e.g. taking out unnecessary spaces, ensuring only paragraphs have line breaks at their ends, changing accented characters, ensuring you have Windows symbols, etc.

converting text
For a simple search-and-replace you can type in the search item and a replacement; for more complex conversions, use a Conversion File so that Text Converter knows which symbols or strings to convert.

It operates under Windows and saves using the Windows character set, but will convert text using DOS or Windows character sets. You can use it to make your text files suitable for use with your Internet browser.

It does a "search and replace" much as in word-processors, but it can do this on lots of text files, one after the other. As it does so, it can also replace any number of strings, not just one.

Once the conversion file is prepared and Settings specified, the Text Converter will read each source file and either create a new version or replace the old one, depending on the over-write setting. You will be able to see the details of how many instances of each string were found and replaced overall.
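The batch behaviour described above (several replacements applied to each file in turn, with an overall count per search string, and either a copy or an over-write) can be pictured with a short Python sketch. This is only an illustration of the idea and not WordSmith's own code: the function name replace_in_files, its parameters and the UTF-8 encoding are all assumptions made for the example.

```python
from pathlib import Path


def replace_in_files(folder, replacements, out_dir=None, pattern="*.txt"):
    """Apply every (search, replacement) pair to every matching file.

    Hypothetical helper: if out_dir is None the source files are
    over-written, otherwise converted copies are written to out_dir.
    Returns how many instances of each search string were replaced overall.
    """
    totals = {search: 0 for search, _ in replacements}
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8")  # encoding is an assumption
        for search, replacement in replacements:
            totals[search] += text.count(search)
            text = text.replace(search, replacement)
        target = path if out_dir is None else Path(out_dir) / path.name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
    return totals
```

Writing to a separate out_dir corresponds to the safer "make copies" choice; passing no out_dir corresponds to over-writing the originals.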
filtering files
And/or you may need to make sure texts which meet certain criteria are put into the right folders.

Tip
The easiest way to ensure your text files are the way you want, especially if you have a very large number to convert, is to copy a few into a temporary folder and try out your conversion file with the Text Converter. You may find you've failed to specify some necessary conversions. Once you're sure everything is the way you want it, delete the temporary files.

See also: Text Converter Contents, The buttons

10.7.2 Text Converter: index

Explanations
What is the Text Converter and what's it for?
Getting Started...
Convert the text format
Filters
Sample Conversion File
Syntax
Conversion File

See also: WordSmith Main Index

10.7.3 Text Converter: extracting from files

The point of it...
The idea is to be able to extract something useful from within larger files. In the example below, I wanted to extract the headlines only from some newspaper text. I knew that the header for each text contained (date of publication mark-up) and that the headline ended , and I wanted only those chunks which contained the phrase Leading article:.

The results I got looked like this:

05 August 2001 The Observer 26 Comment: Leading article: Ealing's lessons: Time for steel from the peacemakers
05 August 2001 The Observer 26 Comment: Leading article: The free market can't house us all: Why Government has to intervene

Settings
containing: all non-blank lines in this box will be required. Leave it blank if you have no requirement that the chunk you want to extract contains any given word or phrase.
chunk marker: Leave blank, otherwise each chunk will be marked up as in the example above, if it begins with < and ends with >. The reason for this marker is to enable subsequent splitting.

10.7.4 Text Converter: settings

1. Choose Files (the top left tab).
Decide whether you want the program to process sub-folders of the one you choose. There is no limit to the number of files Text Converter can process in one operation.
2. Click on the Conversion tab, and:
3. Decide whether you want to make copies of the text files, or to over-write the originals. Obviously you must be confident of the changes to choose to over-write; copying however may mean a problem of storage space.
4. Specify what to convert, that is the search-words and what you want them to be replaced with. For a quick conversion you can simply type in a word you want to change and its replacement (e.g. Just one change so that responsable becomes responsible) or you can choose your own pre-prepared Conversion File.
5. Or in the Whole Files section you can choose simply to update legacy files in various ways, e.g. by choosing Dos to Windows, Unix to Windows, MS Word .doc to .txt, into Unicode, etc.
6. Or if you want simply to extract some text from your files, you should choose the Extract from files tab.
7. If you might want some files not to be converted, or simply don't want any conversions but instead to place files in appropriate sub-folders, choose the Filters tab.

If you choose Over-write Source texts, Text Converter will work more quickly and use less disk space, but of course you should be quite sure your conversion file codes are right before starting!

See copy to for details of how the folders get replicated in a copy operation.

Note that some space on your hard disk will be used even if you plan to over-write. The conversion process does its work, then if all is well the original file is deleted, and the new version copied. There has to be enough room in the destination folder for the largest of your new files; it is much quicker for it to be on the same drive as the source texts. If it isn't, your permission will be asked to use the same drive.
cutting out a header from each file
It can be useful to get a header removed. In the screenshot example, any text which contains will get all the beginning of the file up to that point cut out.

Press OK to start; you will see a list of results, as in the screenshot below. If you want to stop Text Converter at any time, click on the Cancel button or press Escape.

Right-click to see the source or the converted result file:

See also: Text Converter Contents.

10.7.5 Text Converter: syntax

The syntax for a Conversion File is:

Only lines beginning / or " are used. Others are ignored completely.

Every string for conversion is of the form "A" -> "B". That is, the original string, the one you're searching for, enclosed in double quotes, is followed by a space, a hyphen, the > symbol, and the replacement string.

Removing all tags
To remove all tags, choose "<*>" -> "" as your search string.

Control Codes
Control codes can be symbolised like this: {CHR(xxx)} where xxx is the number of the code. Examples: {CHR(13)} is a carriage-return, {CHR(10)} is a line-feed, {CHR(9)} is a tab, {CHR(12)} is a printer form-feed. To represent the line break which comes at the end of paragraphs and sometimes at the end of each line, you'd type {CHR(13)}{CHR(10)} which is carriage-return followed immediately by line-feed. Use {CHR(34)} if you need to refer to double inverted commas.

Wildcards (*, ?, # and ~)

*
You can use the asterisk as a wildcard. Thus "<*>" -> "" will delete any string in < > brackets from your text. "" will delete any string starting "", even if there are hundreds of characters between them. The default search distance is 1,000 characters, with a maximum of 25,000.
(The text is read chunk by chunk into a 30,000 character buffer, so the maximum will work fine at the start of the text; after this only 1,000 characters of search-space are guaranteed.) As deleting a lot of text can get rid of more text than you expect if the text is not properly marked up in the first place, you will probably need to over-ride the default search distance by specifying it in brackets, e.g. "". The asterisk may not be the first or last symbol between the double quotation marks in the search-string.

The asterisk also retains up to 1,000 characters. " " remembers all the characters up to > and can use them in the replacement: Thus "" -> "[section *]" will produce [section 1 They Meet Again] if the original has. " " will do the same thing but would allow up to 1,000 characters' search for the >.

#
Use # to symbolise any number. "" will find, , , etc. If # is in the replacement too, the exact same number will be used in the replacement. Thus " " -> "[section #]" will produce [section 468] if the original has.

?
The question mark stands for any single character, except a space. Up to ten ?s can be used in the replacement string to reproduce the character referred to by the ?s in the search-string.

~
The tilde means except. ~" " "<*>" -> "" means delete everything in between angle brackets, except a case of .

Use {CHR(42)} if you need to refer to *, {CHR(35)} for #, {CHR(63)} for ? and {CHR(126)} for ~.

Whole word, case Insensitive, Confirm, redundant Spaces

/C stops to confirm you wish to go ahead before each change.

/W does a whole word search (ensuring the alteration only happens if there's a word separator on either side) (/W "the" finds the but not other or then or bathe).

/I does a case insensitive search (/I "restaurant" -> "hotel" replaces restaurant with hotel and RESTAURANT with HOTEL and Restaurant with Hotel, i.e. respecting case as far as possible).

You can combine these, e.g. /IWC "the" -> "this"

/S cuts out all redundant spaces. That is, it will reduce any sequence of two or more spaces to one, and it also removes some common formatting problems such as a lone space after a carriage-return or before punctuation marks such as .,; and ). /S can be used on a line of its own or in combination with other searches.

Additions (/A, /T and {v})

/A means add text. /A "Ulan" START inserts Ulan at the start, /A "Bator" END inserts Bator at the end of the text. See \wsmith5\convert.txt to see one in use.

/T means add title. So /T "* " -> "*" looks for... and if it's found, inserts the wording given into the file. This will make your browser show the title at the top of the screen.

{v="} means remember this and use it in another line of the conversion file when you find {v}. "26 Dec." -> "Boxing Day" {v="Xmas"} stores the reference Xmas and "1 May" -> "Mayday" {v="after Easter"} stores after Easter for use in a later line, such as "/celebration/" -> "{v}". Assuming that your text has a mention of 26 Dec. and 1 May, this example, on finding /celebration/ in the text, will put Xmas if the most recent mention in the text was 26 Dec. and after Easter if the most recent mention was 1 May. See \wsmith5\convert.txt to see examples in use.

See also: Text Converter Contents.

10.7.6 Convert Text File Format

To convert a series of whole text files from one format to another, choose between these options:

These formats allow you to convert into formats which will be suited to text processing. (UTF8, a format which was devised for many languages some years ago when disk space was limited and character encoding was problematic, is generally not suitable. That's because it uses a variable number of bytes to represent the different characters. A to Z will be only 1 byte but for example Japanese characters may well need 2, 3 or even more bytes to represent one character.)

DOS to Windows: ... choose the "codepage" that your old DOS texts were encoded with, e.g. DOS 850 Multilingual.

Unix to Windows: ... Unix-saved texts don't use the same codes for end-of-paragraph as Windows-saved ones.

into Unicode: ... this is a better standard than ANSI as it allows many more characters to be used, suiting lots of languages. This is UTF16 Unicode, 2 bytes for each character.

from MS Word .doc ... like using "Save as Text" in Word.

HTML/BNC entities to characters ...
converts symbols which are hard to read such as &eacute; to ones like é

from column tagged, using <> except column ...
The Stuttgart Tree Tagger produces output like this:

word        pos   lemma
The         DT    the
TreeTagger  NP    TreeTagger
is          VBZ   be
easy        JJ    easy
to          TO    to
use         VB    use
.           SENT  .

If you set the column to 1, Text Converter will convert this to TheTreeTagger is easy to us

Lemmatised using ...
... converts each file using a lemma file. If your source text has "she was tired" and your lemma file has BE -> AM, WAS, WERE, IS, ARE, then you will get "she be tired" in your converted text file. Where your source text has "Was she tired?" you'll get "Be she tired?"

10.7.7 Text Converter: move if

This function allows you to specify a word or phrase, look for it in each file, and if it's found move that file into a new folder.

The point of it ...
Suppose you have a whole set of files some of which contain dialogues between Pip and Magwitch, others containing references to the Great Wall of China or the anatomy of fleas. You want those with the Pip-Magwitch dialogues and you want them to go into a folder called Expect.

How to do it
1. Click on the Filters tab (at the top).
2. Now check the Activated checkbox.
3. Specify a word or phrase the text must contain. This is case sensitive.
4. Choose whether that word or phrase has to be found anywhere in the text, anywhere before some other word or phrase, or between 2 different words or phrases.
5. Decide what happens if the conditions are met:
nothing,
copy to a certain folder, or
move to that folder, or
delete the file (careful!).

You can also decide to build a sub-folder based on the word or phrase you chose in #3. And you may have the program add .txt (useful if as with the BNC there are no file extensions).

See also: Text Converter Contents.
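As a rough picture of what the "move if" filter does, here is a minimal Python sketch. It covers only the simplest case (the phrase may occur anywhere in the text, and matching files are moved); the function name move_if_contains, the add_txt parameter and the UTF-8 encoding are assumptions made for the illustration, not part of WordSmith.

```python
import shutil
from pathlib import Path


def move_if_contains(source_dir, phrase, dest_dir, add_txt=False):
    """Move every file in source_dir whose text contains phrase to dest_dir.

    Hypothetical helper. The test is case sensitive, as in step 3 above.
    With add_txt=True, files that have no extension get .txt appended.
    Returns the names of the files that were moved.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(Path(source_dir).iterdir()):
        if not path.is_file():
            continue
        # Encoding is an assumption; undecodable bytes are replaced, not fatal.
        text = path.read_text(encoding="utf-8", errors="replace")
        if phrase in text:
            name = path.name + ".txt" if add_txt and not path.suffix else path.name
            shutil.move(str(path), str(dest / name))
            moved.append(name)
    return moved
```

For the example above one might call move_if_contains(folder, "Magwitch", "Expect"), where folder stands for wherever the source texts are kept.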
10.7.8 Text Converter: copy to

If you choose to copy the files you are converting (which is a lot safer than converting or filtering them in place), the new files created will be structured like this.

Suppose you are processing d:\texts\2007\literature and copying to c:\temp and suppose d:\texts\2007\literature contains this sort of thing:

d:\texts\2007\literature\shakespeare\hamlet.pdf
d:\texts\2007\literature\shakespeare\macbeth.pdf
...
d:\texts\2007\literature\shakespeare\poetry\sonnet1.pdf
d:\texts\2007\literature\shakespeare\poetry\sonnet2.pdf
...
d:\texts\2007\literature\french\victor hugo\miserables.pdf
d:\texts\2007\literature\french\poetry\baudelaire\le chat.pdf
...

you will get

c:\temp\shakespeare\hamlet.txt
c:\temp\shakespeare\macbeth.txt
...
c:\temp\shakespeare\poetry\sonnet1.txt
c:\temp\shakespeare\poetry\sonnet2.txt
...
c:\temp\french\victor hugo\miserables.txt
c:\temp\french\poetry\baudelaire\le chat.txt
...

In other words, for each file successfully converted or filtered, any directory structure beyond the starting point (d:\texts\2007\literature in the example above) will get appended to the destination.

10.7.9 Text Converter conversion file

Prepare your Text Converter conversion file using a plain text editor such as Notepad. You could use \wsmith5\convert.txt as a basis.

If you have accented characters in your original files, use the DOS editor to prepare the conversion file if they were originally written under DOS and a Windows editor if they were written in a Windows word-processor. Some Windows word processors can handle either format.

There can be any number of lines for conversion, and each one can contain two strings, delimited with " " quotes, each of up to 80 characters in length.

The Text Converter makes all changes in order, as specified in the Conversion File. Remember one alteration may well affect subsequent ones.
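To see why the order of lines matters, here is a small Python sketch that reads a conversion file and applies its lines in sequence. It is a simplified illustration, not the Text Converter's real parser: it assumes only plain "A" -> "B" lines, {CHR(xxx)} codes and a bare /S line (treated here as only squeezing runs of spaces), and it ignores wildcards and the other switches. The names parse_rules, convert and decode are invented for the example.

```python
import re

CHR_CODE = re.compile(r"\{CHR\((\d+)\)\}")
RULE = re.compile(r'^"(.*)" -> "(.*)"\s*$')


def decode(s):
    """Turn {CHR(xxx)} codes into the characters they stand for."""
    return CHR_CODE.sub(lambda m: chr(int(m.group(1))), s)


def parse_rules(conversion_text):
    """Collect rules in file order; lines not starting with / or " are ignored."""
    rules = []
    for line in conversion_text.splitlines():
        line = line.strip()
        if line == "/S":
            rules.append(("squeeze", None, None))
        elif line.startswith('"'):
            m = RULE.match(line)
            if m and m.group(1):  # skip malformed lines and empty search strings
                rules.append(("replace", decode(m.group(1)), decode(m.group(2))))
    return rules


def convert(text, rules):
    """Apply each rule in turn, so earlier changes feed into later ones."""
    for kind, search, replacement in rules:
        if kind == "squeeze":
            text = re.sub(r" {2,}", " ", text)  # simplified /S: spaces only
        else:
            text = text.replace(search, replacement)
    return text


first = parse_rules('"cat" -> "dog"\n"dog" -> "wolf"')
second = parse_rules('"dog" -> "wolf"\n"cat" -> "dog"')
print(convert("cat and dog", first))
print(convert("cat and dog", second))
```

With the first ordering every cat ends up as a wolf, because the second line also catches the dogs produced by the first line; with the lines swapped, the original cat stops at dog.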
Alterations that increase the original file
Most changes reduce the size of an original. But Text Converter will cope even if you need to increase the original file -- as long as there's disk space!

Tip
To get rid of the line breaks at line ends but not at paragraph ends, first examine your paragraph ends to see what is unique about them. If for example, paragraphs end with two line breaks, use the following lines in your conversion file:

"{CHR(13)}{CHR(10)}{CHR(13)}{CHR(10)}" -> "{%%}"
(this line replaces the two line breaks with {%%}.) (It could be any other unique combination. It'll be slightly faster if you make the search and the replacement the same length, as in this case, 4 characters)

"{CHR(13)}{CHR(10)}" -> " "
(this line replaces all other line breaks with a space, to keep words separate)

"{%%}" -> "{CHR(13)}{CHR(10)}{CHR(13)}{CHR(10)}"
(this line replaces the {%%} combination with the two line breaks, thus restoring the original paragraph structure)

/S
(this line cuts out all redundant spaces)

See also: sample conversion file, syntax, Text Converter Contents.

10.7.10 Text Converter: sample conversion file

You could copy all or part of this to the clipboard and paste it into Notepad.

[ comment line -- put whatever you like here, it'll be ignored ]
[ first a spelling correction ]
"responsable" -> "responsible"
[ now let's change brackets from < > to [ ] and { } to ( ) ]
"<" -> "["
">" -> "]"
"}" -> ")"
"{" -> "("
/S
[ that will clear all redundant spaces]

The file \wsmith4\convert.txt is a sample conversion file for use with British National Corpus text files.

See also: Text Converter Contents.

11 Viewer and Aligner

11.1 purpose

This is a program for showing your text or other files, highlighting words of interest. You will see them in plain text format, with tag mark-up shown or hidden as in your tag settings. There are a number of settings and options you can change.
Its main use is to produce an aligned version of 2 or more texts, with alternate sentences or paragraphs from each of them.

See also: Viewer & Aligner settings, Viewer & Aligner options

11.2 index

Explanations
What is the Viewer & Aligner and what's it for?
Settings
Viewing Options
What to do if it doesn't do what I want...
Searching for Short Sentences
Joining/Splitting
Aligning a Dual Text
Finding translation mis-matches
The technical side...

See also: WordSmith Main Index

11.3 aligning with Viewer

This feature aligns the sentences in two files. Translators need to study differences between an original and a translation. Other linguists might want it to study differences between two versions of a text in the same language. Students of different languages can use it as they might use dual language readings, to study closely the differences e.g. in word order. It helps you produce a new text which consists of the two files, with sentences interspersed. That way you can compare the translation with the original.

Example

Original:
Der Knabe sagte diesen Gedanken dem Schwesterchen, und diese folgte. Allein auch der Weg auf den Hals hinab war nicht zu finden. So klar die Sonne schien, ...
(from Stifter's Bergkristall, translated by Harry Steinhauer, in German Stories, Bantam Books 1961)

Translation:
The boy communicated this thought to his sister and she followed him. But the road down the neck could not be found either. Though the sun shone clearly, ...

Aligned text:
Der Knabe sagte diesen Gedanken dem Schwesterchen, und diese folgte.
The boy communicated this thought to his sister and she followed him.
Allein auch der Weg auf den Hals hinab war nicht zu finden.
But the road down the neck could not be found either.
So klar die Sonne schien, ...
Though the sun shone clearly, ...

An aligned text like this helps you identify additions and omissions, normalisations, style changes, word order preferences.
In this case the translator has chosen to avoid very close equivalence.

How to do it -- a Korean and English example
1. Read in your Korean text (e.g. KOREAN.TXT), and check that its sentences and paragraphs break the way you like. Try "Unusual Lines" to help identify oddities.
2. Save it and it will (by default) get your filename.VWR, e.g. KOREAN.VWR.
3. Do the same steps 1 and 2 for your English text -- you will now have e.g. ENGLISH.VWR.
4. Now open your KOREAN.VWR and then File | Merge with ENGLISH.VWR.
5. File | Save As -- Korean and English.ALI (multiple-language aligned file).

See also: Aligning and moving

11.4 aligning and moving

You may well want to alter sentence ordering. The translator may have used three sentences where the original had only one. You can also merge paragraphs.

adjusting by dragging with the mouse
To merge sentences or paragraphs, simply grab one and drag it up to the next one above in the same language. Or use the Join button. Or press F4.

To split a sentence or paragraph, choose the Split button or press Ctrl/F4.

Finally you will want to save (F2) the results.

See also: Viewer & Aligner contents

11.5 editing

While Viewer & Aligner is not a full word-processor, some editing facilities have been built in to help deal with common formatting problems:

Edit: opens up a window allowing you to edit the whole of the current sentence or paragraph.

Trim extra spaces: this goes through each sentence of the text, removing any redundant spaces -- where there are two or more consecutive spaces they will be reduced to one.

Find lower-case lines: this identifies cases where a sentence or paragraph does not start with a capital letter or number -- you will probably want to join it to the one above. This problem is common if the text has been saved as "text only with line breaks" (where a line break comes at the end of each line whether or not it is the end of a paragraph.)
Find short lines

You will then want to save (F2) your text.

You can also:
open a new file for viewing (you can open any number of text files within Viewer & Aligner)
copy a text file to the clipboard (select, then press Control+Ins)
print the whole or part of the currently active text file
search for words or phrases (press F12)

11.6 languages

Each Viewer file (.VWR) has its own language.

Each Aligner file (.ALI) has one language for each of the component sections. (They could all be the same, if for example you were analysing various different editions of a Shakespeare play they'd all be English.)

The set of languages available is that defined using the Languages Chooser. If you find you have read in a plain text without defining the language correctly, you can change the language to one of your previously defined languages by pressing the button visible at the top of Viewer & Aligner.

11.7 numbering sentences & paragraphs

You can use the Viewer & Aligner to make a copy of your text with all the sentences and/or paragraphs tagged with and. To do this, simply read the text file in, choose Edit | Insert Tags, then save it as a text file.

See also: Viewer & Aligner contents

11.8 options

Mode: Sentence/Paragraph
This switches between Sentence mode and Paragraph mode. In other words you can choose to view your text files with each row of the display taking up a sentence or a paragraph. Likewise, you can make a dual aligned text by interspersing either paragraphs or sentences. The other functions (e.g. joining, splitting) work in the same way in either mode.

Colours
The various texts in your aligned text will have different colours associated with them. Colours can be changed using the button.

11.9 reading in a plain text

In Viewer and Aligner, choose File | Open and select your plain text file. Ensure you have a suitable language chosen. Edit it, as necessary, e.g. splitting or merging paragraphs or sentences.
11.10 sentence joining and splitting

Joining
The easiest way to join two sentences is simply to drag the one you want to move onto its neighbour above. Or select the lower of the two and press F4 or use the Join button.

Splitting in two
To split a sentence, choose the Split button or press Ctrl/F4. You will get a list of the words. Click on the word which should end the sentence, then press OK.

example
This will insert the words which follow (I need others etc.) into a new line below.

See also: Viewer & Aligner contents

11.11 settings

1. What constitutes a "short" sentence or paragraph (default: less than 25 characters)
2. Whether you want to do a lower-case check when Finding Unusual Lines

The other settings are standard ones found in most of the Tools:
Colours
Font
Printing
Text Characteristics
Review all Settings

11.12 technical aspects

When is a sentence not a sentence?
There is no perfect mechanical way of determining sentence-breaks. For example, a heading may well have no final full stop but would normally not be considered part of the sentence which follows it. And a sentence may often have no final full stop, if what follows it is a list of items.

The algorithm used by Viewer & Aligner is: a sentence ends if a full-stop, question-mark or exclamation-mark (.?!) is immediately followed by one or more word separators and if the next non-punctuation symbol is a capital letter A..Z or an accented capital letter, a number or a currency symbol. The same routine is used as in WordList.

Consider this chunk from A Tale of Two Cities:

"Wo-ho!" said the coachman. "So, then! One more pull and you're at the top and be damned to you, for I have had trouble enough to get you to it! - Joe!"

Viewer & Aligner will mistakenly consider - Joe! as a separate sentence, but handles "Wo-ho!" said the coachman. as one: though the program would split it in two if the word after ho! had a capital letter (e.g. in Wild Bill, the coachman, said.)
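The rule just described can be sketched in a few lines of Python. This is a reconstruction from the prose above and not WordSmith's actual routine. Two assumptions are made: "word separators" are taken to be whitespace, and a closing quotation mark or bracket is allowed to sit between the terminator and the separator (which is what the "Wo-ho!" example seems to require). The names split_sentences and starts_sentence are invented for the example.

```python
import unicodedata

TERMINATORS = ".?!"
CLOSERS = "\"')]\u201d\u2019"  # assumption: closing quotes/brackets may follow .?!


def is_currency(ch):
    return unicodedata.category(ch) == "Sc"


def starts_sentence(ch):
    """Capital letter (accented or not), number, or currency symbol."""
    return ch.isupper() or ch.isdigit() or is_currency(ch)


def split_sentences(text):
    sentences, start, i, n = [], 0, 0, len(text)
    while i < n:
        if text[i] in TERMINATORS:
            j = i + 1
            while j < n and text[j] in CLOSERS:   # skip closing quotes/brackets
                j += 1
            k = j
            while k < n and text[k].isspace():    # one or more word separators
                k += 1
            if k > j:
                m = k
                # find the next non-punctuation symbol
                while m < n and not (text[m].isalnum() or is_currency(text[m])):
                    m += 1
                if m < n and starts_sentence(text[m]):
                    sentences.append(text[start:j].strip())
                    start = i = k
                    continue
        i += 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


passage = ('"Wo-ho!" said the coachman. "So, then! One more pull and you\'re at '
           'the top and be damned to you, for I have had trouble enough to get '
           'you to it! - Joe!"')
for sentence in split_sentences(passage):
    print(sentence)
```

Run on the Dickens passage, the sketch keeps "Wo-ho!" said the coachman. together and splits off - Joe!" as a piece of its own, which mirrors the behaviour described above.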
Viewer & Aligner cannot therefore be expected to handle all sentence boundaries exactly as you would. (I saw Mr. Smith. would be considered two sentences; several headings may be bundled together as one sentence.) For this reason you can choose Find Short Sentences to seek out any odd one-word sentences.

See also: Viewer & Aligner contents

11.13 translation mis-matches

Viewer & Aligner can help find cases where alignment has slipped (one sentence having been translated as two or three).

One method is to use the menu item Match by Capitals. This searches for matching proper nouns in the two versions: if say Paris is mentioned in sentence 25 of the source text and not in sentence 25 of the translation but in sentence 27, it is very likely that some slippage has occurred.

Viewer & Aligner will search forwards from the current text sentence on, and will tell you where there's a mis-match. You should then search back from that point to find where the sentences start to diverge. It may be useful to sample every 10 or every 20 to speed up the search for slippage.

When you find the problem, un-join or join and/or edit the text as appropriate, then save it.

See also: The technical side..., Finding unusual sentences, Viewer & Aligner contents

11.14 troubleshooting

Can't see the whole sentence or paragraph
Press to "auto-size" the lines in your display. This adjusts line heights according to the current highlighted column of data.

Can't see the whole text file
Press to "refresh" the display.

Don't like the colours
Change colours using . The colours initially used for each language version in the dual-language window are the same colours as used for primary sorting and secondary sorting in Concord.

See also: Viewer & Aligner contents

11.15 unusual sentences

It can be useful to seek unusually short sentences to see whether your originals have been handled as you want.
Because Viewer & Aligner uses full stops, question marks and exclamation marks as sentence-boundary indicators, you will find a string like "Hello! Paul! Come here!" is broken into 3 very short sentences. Depending on your purposes you may wish to consider these as one sentence, e.g. if a translator has translated them as one ("Oi, Paulo, venha cá!").

This function can also find lower-case lines: where a sentence or paragraph does not start with a capital letter or number -- you will probably want to join it to the one above. This problem is common if the text has been saved as "text only with line breaks" (where a line break comes at the end of each line whether or not it is the end of a paragraph.)

Seeking
Use the Find Unusual Toolbar menu item and then press Start Search. Viewer & Aligner will go to the next possibly problematic sentence or paragraph; you can then press Join Up (to join it to the one above), Join Down, or Skip.

"Case check" switches on or off the search for lower-case sentence starts. The number (25 in the example above) is for you to determine the number of characters counting as a short sentence or paragraph.

See also: Settings, The technical side..., Finding translation mis-matches, Viewer & Aligner contents

12 Reference

12.1 32-bit version

This version of WordSmith Tools is a complete re-write in comparison to the earlier 16-bit versions, with lots of changes "under the hood".

Some of the changes you will see are:
long filenames
better tag and entity handling including Tag Concordancing
previous work can still be used, but it should be re-saved in the 32-bit format. You will get a suggestion to "Update" a data file if it is still in the old format.
zip file handling
easier exporting of data to Microsoft Word and Excel
Unicode text handling, allowing more languages to be processed
possibility of altering the data as it comes in, e.g. for language-specific lemmatisation
the old limitations of 16,000 lines of data have gone. (The theoretical limit for a list of data is over 134 million lines.)

See also: Contact Addresses.

12.2 acknowledgements

WordSmith Tools has developed over a period of years. Originally each tool came about because I wanted a tool for a particular job in my work as an Applied Linguist. Early versions were written for DOS, then Windows™ came onto the scene.

One tool, Concord, had a slightly different history. It developed out of MicroConcord which Tim Johns and I wrote for DOS and which Oxford University Press published in 1993.
Concord has a lot of additional features in this Windows version and all the code has been re-written, but the essential features of the design were there in MicroConcord.

The first published version was written in Borland™ Pascal with the time-critical sections in Assembler. Subsequently the programs were converted to Delphi™ 16-bit; this is a 32-bit only version written in Delphi 7 and still using time-critical sections in Assembler.

I am grateful to

  lots of users who have made suggestions and given bug reports,
  generations of students and colleagues at the Department of English, University of Liverpool, and the MA Programme in Applied Linguistics at the Catholic University of São Paulo,
  Audrey Spina, Élodie Guthmann and Julia Hotter for their help with the French & German versions,

for their feedback on aspects of the suite (including bugs!), and suggestions as to features it should have. Researchers from many other countries have also acted as alpha-testers and beta-testers and I thank them for their patience and feedback.

I am also grateful to Nell Scott and other members of my family who have always given valuable support, feedback and suggestions.

Mike Scott

Feel free to email me at my contact address with any further ideas for developing WordSmith Tools.

12.3 API

It is possible to run the WordSmith routines from your own programs; for this an API is published at http://www.lexically.net/wordsmith/version5/API/API.htm. If you know a programming language, you can call a .dll which comes with WordSmith and ask it to create a concordance, a wordlist or a key words list, which you can then process to suit your own purposes.

See also: custom processing

12.4 bibliography

Aston, Guy, 1995, "Corpora in Language Pedagogy: matching theory and practice", in G. Cook & B. Seidlhofer (eds.) Principle & Practice in Applied Linguistics: Studies in honour of H.G. Widdowson, Oxford: Oxford University Press, 257-70.
Aston, Guy & Burnard, Lou, 1998, The BNC Handbook, Edinburgh: Edinburgh University Press.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan, 2000, Longman Grammar of Spoken and Written English, Harlow: Addison Wesley Longman.
Clear, Jeremy, 1993, "From Firth Principles: computational tools for the study of collocation", in M. Baker, G. Francis & E. Tognini-Bonelli (eds.), 1993, Text and Technology: in honour of John Sinclair, Philadelphia: John Benjamins, 271-92.
Dunning, Ted, 1993, "Accurate Methods for the Statistics of Surprise and Coincidence", Computational Linguistics, Vol. 19, No. 1, pp. 61-74.
Fillmore, Charles J. & Atkins, B.T.S., 1994, "Starting where the Dictionaries Stop: The Challenge of Corpus Lexicography", in B.T.S. Atkins & A. Zampolli, Computational Approaches to the Lexicon, Oxford: Clarendon Press, pp. 349-96.
Katz, Slava, 1996, "Distribution of Content Words and Phrases in Text and Language Modelling", Natural Language Engineering, Vol. 2, No. 1, pp. 15-59.
Murison-Bowie, Simon, 1993, MicroConcord Manual: an introduction to the practices and principles of concordancing in language teaching, Oxford: Oxford University Press.
Nakamura, Junsaku, 1993, "Statistical Methods and Large Corpora: a new tool for describing text types", in M. Baker, G. Francis & E. Tognini-Bonelli (eds.), 1993, Text and Technology: in honour of John Sinclair, Philadelphia: John Benjamins, 293-312.
Oakes, Michael P., 1998, Statistics for Corpus Linguistics, Edinburgh: Edinburgh University Press.
Scott, Mike, 1997, "PC Analysis of Key Words - and Key Key Words", System, Vol. 25, No. 2, pp. 233-45.
Sinclair, John M., 1991, Corpus, Concordance, Collocation, Oxford: Oxford University Press.
Stubbs, Michael, 1986, "Lexical Density: A Technique and Some Findings", in M. Coulthard (ed.) Talking About Text: Studies presented to David Brazil on his retirement, Discourse Analysis Monograph no. 13, Birmingham: English Language Research, Univ. of Birmingham, 27-42.
Stubbs, Michael, 1995, "Corpus Evidence for Norms of Lexical Collocation", in G. Cook & B. Seidlhofer (eds.) Principle & Practice in Applied Linguistics: Studies in honour of H.G. Widdowson, Oxford: Oxford University Press, 245-56.
Tuldava, J., 1995, Methods in Quantitative Linguistics, Trier: WVT Wissenschaftlicher Verlag Trier.
Youmans, Gilbert, 1991, "A New Tool for Discourse Analysis: the vocabulary-management profile", Language, Vol. 67, No. 4, pp. 763-89.
UCREL's log likelihood information

12.5 bugs

All computer programs contain bugs. You may have seen a "General Protection Fault" message when using big expensive drawing or word-processing packages.

If you see something like this, then you have an incompatibility between sections of WordSmith. You have probably downloaded a fresh version of some parts of WordSmith but not all, and the various sub-programs are in conflict. The solution is a fresh download; http://www.lexically.net/wordsmith/version5/faqs/updating_or_reinstalling.htm explains.

Otherwise you should get a report popping up, giving "General" information about your PC and "Details" about the fault. This information will help me to fix the problem and will be saved in a small text file called wordsmith.elf, concord.elf, wordlist.elf, etc. When you quit the program, you will be offered a chance to email this to me.

The first thing you'll see when one of these happens is something like this:

You may have to quit when you have pressed OK, or WordSmith may be able to cope despite the problem. Usually the offending program will be able to cope despite the bug, or you can go straight back into it without even needing to quit the main WordSmith Tools Controller, retrieve your saved results from disk, and resume. If that doesn't work, try quitting WordSmith Tools overall, or quit Windows and then start it up again.

When you press OK, your email program should have a message with a couple of attachments to send to me.
The email message will only get sent when you press Send in your email program. It is only sent to me and I will not pass it on to anyone else. Read it first if you are worried about revealing your innermost secrets ... it will tell me the operating system, the amount of RAM and hard disk space, the version of WordSmith, and some technical details of routines which it was going through when the crash occurred.

error messages

These warn you about problems which occur as the program works, e.g. if there's no room left on your disk, or you type in an impossible filename or a number containing a comma.

See also: logging, troubleshooting.

12.6 Character Sets

12.6.1 overview

You need "plain text" in WordSmith. Not Microsoft Word .doc files -- which contain text and a whole lot of other things too that you cannot normally see.

To handle a text in a computer, programs need to know how the text is encoded. In its processing, the software sees only a long string of numbers, and these have to match up with what you and I can recognise as "characters". For many languages like English with a restricted alphabet, encoding can be managed with only 1 "byte" per character. On the other hand a language like Chinese, which draws upon a very large array of characters, cannot easily be fitted to a 1-byte system. Hence the creation of other "multi-byte" systems.

Obviously if a text in English is encoded in a multi-byte way, it will make a bigger file than one encoded with 1 byte per character, and this is wasteful of disk and memory space. So, at the time of writing, 1-byte character sets are still in very widespread use. UTF-8 is a name for a multi-byte method, widely used for Chinese, Japanese etc.

In practice, your texts are likely to be encoded in a Windows 1-byte system, older texts in a DOS 1-byte system, and newer ones, especially in Chinese, Japanese, Greek, in Unicode.
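The difference in file size between 1-byte and multi-byte encodings can be checked with a short Python sketch. It is an illustration only and not part of WordSmith: the sample strings and the three codecs (cp1252 standing in for a Windows 1-byte set, utf-16-le for 2-byte Unicode, utf-8 for a variable-length multi-byte method) are assumptions chosen for the example.

```python
# Illustrative only: compare how many bytes the same text needs under
# a 1-byte Windows code page, 2-byte Unicode and UTF-8.

def encoded_sizes(text):
    """Return {codec: byte count}; None where the codec cannot encode the text."""
    sizes = {}
    for codec in ("cp1252", "utf-16-le", "utf-8"):
        try:
            sizes[codec] = len(text.encode(codec))
        except UnicodeEncodeError:
            # At least one character is not available in this character set.
            sizes[codec] = None
    return sizes

english = "word list"   # 9 characters, all in the basic alphabet
chinese = "词表"         # 2 Chinese characters

print(encoded_sizes(english))
print(encoded_sizes(chinese))
```

The English sample takes twice the space in the 2-byte encoding, while the Chinese sample cannot be stored in the 1-byte Windows set at all.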
What matters most to you is what each character looks like, but WordSmith cannot possibly sort words correctly, or even recognise where a word begins and ends, if the encoding is not correct. WordSmith has to know (or try to find out) which system your texts are encoded in. It can perform certain tests in the background. But as it doesn't actually understand the words it sees, it is much safer for you to define the character set in advance, especially if you process texts in German, Spanish, Russian, Greek, Polish, Japanese, Farsi, Arabic etc.

Three main kinds of character set, each with its own flavours, are Windows, DOS, and Unicode.

Tip

To check results after changing the code-page, select Choose Texts and View the file in question. While viewing you can change Text Characteristics until it looks right. If you can't get it to look right, you've probably not got a cleaned-up plain text file but one straight from a word-processor. In that case, take it back into the word-processor and save it again as a plain text file in Windows format, which is more up-to-date than DOS formats.

See also: Choosing Accents & Symbols, Accented characters; Choosing Language

12.6.2 accents & symbols

You may need to insert symbols and accented characters into your search-word, exclusion word or context word, etc. If you have the right keyboard set for your version of Windows this may be very easy -- if not, just choose the symbol in the main Controller by clicking. You will see which character has been selected with the current font (which affects which characters can be seen), and then you can paste the character into Concord.

See also: Choosing Language

12.6.3 ansi and ascii

ASCII text, ANSI text, Text Only and DOS text are all names for plain text. Most word-processors insert special hidden codes into text files to help them keep track of page numbers, bold type and so on.
WordSmith Tools can handle them anyway but you'll get cleaner results if you use plain text without the hidden codes.

If your source texts were saved as "Text Only with line breaks" there will probably be an <Enter> every 70 or 80 characters at the end of each text line. If they were saved as "Text Only", the <Enter>s will be equivalent to paragraph breaks. I recommend saving as "Text Only".

The Windows program Notepad (Start | Program Files | Accessories) makes plain text files or .txt files. It uses basic character sets e.g. A to Z, numbers and common punctuation symbols. The main difference is in the accented characters. For more on this, see character sets.

See also: HTML, SGML & XML.

12.6.4 DOS

DOS (text format before Windows) offered a range of character sets called "codepages". They all shared the same codes for the standard English alphabet (a, for example, is always code 97) and common punctuation symbols, but included varying symbols for box-drawing, foreign language accents, etc.

If you process texts in German, Spanish, Russian, Greek, Polish, etc. you may need to find out which codepage was used when the texts were originally typed. For example, an accented character may be coded one way in codepage 850 (Multilingual) but differently in codepage 860 (Portuguese), and may simply not be available at all in codepage 437 (the default codepage in the UK and USA). To alter or examine codepages, see a DOS manual or check the topic out on the web.

When it loads up, WordSmith Tools detects the current DOS code-page, so the codepage is only likely to need altering if you are using texts produced when another codepage was in use.

12.6.5 Windows

Windows character set codes are different from those in DOS or Unicode. (The £ symbol is code 156 in DOS but 163 in Windows.) In Windows 95 or later you can get non-Western fonts enabled via Microsoft Plus.
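The way one character gets different codes in different character sets can be checked with a few lines of Python. This is an illustrative sketch, not anything WordSmith does internally. The pound sign £ is used because it has code 156 in DOS codepages 437 and 850 and code 163 in Windows 1252; the second character, õ, is simply an assumed example of an accented letter that is missing from codepage 437.

```python
# Illustrative only: look up the numeric code of one character
# in several 1-byte character sets.

def code_in(char, codec):
    """Return the single-byte code of char in codec, or None if unavailable."""
    try:
        encoded = char.encode(codec)
    except UnicodeEncodeError:
        return None  # the character does not exist in this character set
    return encoded[0] if len(encoded) == 1 else None

for codec in ("cp437", "cp850", "cp860", "cp1252"):
    print(codec, "£ =", code_in("£", codec), " õ =", code_in("õ", codec))
```

A text typed under one codepage and read under another will therefore show the wrong character wherever the two tables disagree.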
If your texts were written using a Windows word-processor and saved as text in Windows, the accented characters will obey the Windows codes. You will have access to a few more symbols than in DOS (e.g. ™ and curly apostrophes).

Windows Western (1252) format includes: Anglo-Saxon, Basque, Catalan, Danish, Dutch, English, Middle English, Finnish, French, German, Icelandic, Italian, Norwegian, Old Norse, Portuguese, Spanish, Swedish
Windows Baltic (1257) format includes: Estonian, Latvian, Lithuanian
Windows Central European (1250) format includes: Albanian, Bosnian, Croatian, Czech, Hungarian, Polish, Romanian, Serbian, Slovak, Slovene, Upper Sorbian, Lower Sorbian
Windows Cyrillic (1251) format includes: ByeloRussian, Bulgarian, Macedonian, Russian, Serbian (1251), Ukrainian
Windows Greek (1253) handles Greek and Windows Turkish (1254) handles Turkish (what else?)

12.6.6 Unicode

A text format standard which uses 2 "bytes" per character. This allows for over 65,000 different characters and symbols to be displayed and makes it possible to show Chinese, Japanese, Cherokee and a whole lot of other languages. When choosing texts, you can press a button to test whether text files are encoded in Unicode.

There are a number of different "flavours" of Unicode as defined by the Unicode Consortium. MS Word (2003) offers

  Unicode
  Unicode (Big-Endian)
  Unicode (UTF-7)
  Unicode (UTF-8)

The last two are 1-byte versions, not really Unicode in my opinion. WordSmith wants the first of these but should automatically convert from any of the others.

12.6.7 UTF8

UTF8 is a name for a multi-byte character encoding method, widely used for Chinese, Japanese etc. WordSmith cannot handle UTF8, but you can convert UTF8 to Unicode first using Text Converter.

This is a format which was devised for many languages some years ago when disk space was limited and character encoding was problematic.
That's because it uses a variable number of bytes to represent the different characters. A to Z will be only 1 byte but, for example, Japanese characters may well need 2, 3 or even more bytes to represent one character. Quite a messy kludge.

12.7 clipboard

You can block an area of data, by using the cursor arrows and Shift, or the mouse, then press Ctrl/Ins or Ctrl/C to copy it to the clipboard. If you then go to a word processor, you can paste (or "paste special") the blocked area into your text. This is usually easier than saving as a text file (or printing to a file) and can also handle any graphic marks.

Example

1. Select some data. Here I have selected the first 5 lines (of 335) of a concordance, just the visible text, no Set or Filenames information.
2. Hold down Control and press Ins or C.

The data is now in the Windows "clipboard" ready for pasting into any other application, such as Excel, Word, Notepad, etc.

The data is automatically placed in the Clipboard in two different formats:

as a Picture

You will probably use this format for your dissertation and will have to in the case of plotted data. In this concordance, you get only the words visible in your concordance line (not the whole line).

This is a graphic which includes screen colours and graphic data. If you subsequently click on the graphic you will be able to alter the overall size of the graphic and edit each component word or graphic line (but not at all easily!).

To get this, in Word, I chose Edit | Paste Special | Picture (Enhanced Metafile). What you see in Word is very like what you see in Concord.

as plain text

Alternatively, you might want to paste as plain text because you want to edit the concordance lines, e.g. for classroom use, or because you want to put it into a spreadsheet such as MS Excel™ (which will be better if you have graphic data, such as Concord dispersion plots or KeyWords plots).
Here the concordance or other data is copied as plain text, with a tab between each column. The Windows plain text editor Notepad can only handle this data format. Microsoft Word will paste (using Shift-Ins or Ctrl-V) the data as text. It pastes in as many characters as you have set in the settings for save as text, the default being 80.

Here, the concordance lines are copied, but of course they don't line up very nicely and it's hard to see the search-word (this).

For the search-word to line up nicely, you should use a non-proportional font, such as Courier or Lucida Console, and it'll look like this.

Notice that at 10 point text in Lucida Console, the width of the text with 80 characters and the numbers at the left comes to over 18 cm. To avoid word-wrapping, I set the page format in Word to landscape. An alternative is to reduce the number of characters per line to, say, 50 or 60.

12.8 contact addresses

Downloads

You can get a more recent version at my website. There are also some free extra downloads (programs, word lists, etc.) there, and links to sources of free text corpora.

Screenshots

Visit http://www.lexically.net/wordsmith/version5/screenshots/index.html for screenshots of what WordSmith Tools can do. This may give you useful ideas for your own research and will give you a better idea of the limitations of WordSmith too!

Purchase

Visit http://www.lexically.net/wordsmith/purchasing.htm for details of suppliers.

Complaints & Suggestions

If you do not have the official OUP version but one from my website, please do not email OUP but me (Mike.Scott@liv.ac.uk). Please give me as full a description of the problem you need to tackle as you can, and details of the equipment too. Please don't include any attachments over 200K in size. I do try to help but cannot promise to...

12.9 date format

Japanese date format year_month_day_hour_minute.
At least it is logical, going from larger to smaller. Why aren't URLs organised in a logical order too?

12.10 Definitions

12.10.1 definitions

words

The word is defined as a sequence of valid characters with a word separator at each end.

Valid characters include all the letters from A to Z, plus all accented characters which can be used in the current character set, plus any user-defined acceptable characters to be included within a word (such as the apostrophe or hyphen).

A word can be of any length but for one to be stored in a word list, you may set the length you prefer (maximum of 50 characters) -- any which exceed your limit will get + tagged onto them at that point. You can decide whether or not to include words including numbers (e.g. $35.50) in text characteristics.

clusters

A cluster is a group of words which follow each other in a text. The term phrase is not used here because it has technical senses in linguistics which would imply a grammatical relation between the words in it. In WordList cluster processing or Concord cluster processing there can be no certainty of this, though clusters often do match phrases or idioms. See also: general cluster information.

sentences

The sentence is defined as the full-stop, question-mark or exclamation-mark (.?!) immediately followed by one or more word separators and then a capital letter in the current language, a number or a currency symbol. (For more discussion see Starts and Ends of Text Segments or Viewer & Aligner technical information.)

paragraphs

Paragraphs are user-defined. See Starts and Ends of Text Segments for further details.

headings

Headings are also user-defined -- see Starts and Ends of Text Segments.

See also: Setting Text Characteristics, Key-ness, Key key-word, Associate

12.10.2 word separators

Conventionally one assumes that one word is distinguished from the next by the presence of spaces at either end.
But WordSmith Tools also includes within word separators certain standard codes used by most word processors: page eject code (12), tabs (9), carriage return (13) and line feed (10), end-of-text (26). Besides, hyphens may optionally be considered to split words like self-access into two words.

Note that in Chinese and Japanese, which do not separate words in this way, any WordSmith functions which require word-separation will not work unless you get your texts previously tagged with word-separators.

12.11 demonstration version

The demonstration version of WordSmith Tools offers all the facilities of the complete suite, except that any screen which shows a list (of words in a wordlist, or concordance lines, etc.) is limited to a small number of lines which can be shown or printed. (If you save data, all of it will be saved; it's just that you can't see it all in the demo version.)

See also: Installing, Version Information, Contact Addresses.

12.12 edit v. type-in mode

Most windows allow you to press keys either to edit your data (edit mode), or to get quickly to a place in a list (type-in mode). Concordance windows use key presses also for setting categories for the data, or for blanking out the search word.

In type-in mode, your key-presses are supposed to help you get quickly to the list item you're interested in, e.g. by typing theocr to get to (or near to) theocracy in a word list. If you've typed in 5 letters and a match is found, the search stops.

Changing mode is done by right-clicking on the word Set and choosing from the menu which opens up.

See also: user-defined categories.
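The idea behind type-in mode, jumping to or near an item in an alphabetical list from the letters typed so far, can be sketched as a binary search. This is a hypothetical illustration and makes no claim about how WordSmith implements it; the function name type_in_jump and the sample word list are invented for the example.

```python
import bisect

def type_in_jump(sorted_words, typed):
    """Return the index of the first entry at or after the typed letters.

    Assumes sorted_words is already in alphabetical order.
    Returns None for an empty list.
    """
    if not sorted_words:
        return None
    index = bisect.bisect_left(sorted_words, typed)
    # If the typed letters sort after every entry, stay on the last line.
    return min(index, len(sorted_words) - 1)

words = sorted(["the", "theatre", "theocracy", "theology", "theory", "there"])
position = type_in_jump(words, "theocr")
print(words[position])  # theocracy
```

Because the typed letters need not match any entry exactly, the search lands on the nearest following item, which is what "to (or near to)" suggests.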
12.13 file types

The standard file-extensions used in WordSmith are

  .cnc             concordance file
  .lst             word list
  .mut             mutual information list
  .dcl             detailed consistency list
  .tokens, .types  word list index file
  .kws             key words file
  .kdb             key word database file
  .ali             aligner list
  .vwr             viewer list

WordSmith does not affect your Windows Registry, unlike most other programs. The reason is that changing the Registry can make a system slow down and become unstable; it also means that to remove WordSmith you can simply delete the folder it is in.

In the Controller's General settings, or on installing, however, you can if you wish associate (or disassociate) the current file-types with WordSmith in the Registry. The advantage of association is that Windows will know what Tool to open your data files with.

12.14 finding source texts

For some calculations the original source texts need to be available. For example, for Concord to show you more context than has been saved for each line, it'll need to re-read the source text. For KeyWords to calculate a dispersion plot, it needs to look at the source text to find out which KWs came near each other and compute positions of each KW in the text and KW links.

If you have moved or deleted the source file(s) in the meantime, this won't be possible.

See also: Editing filenames, Choosing source files.

12.15 folders; directories

Found in main Settings menu in all Tools. Default folders can be altered in WordSmith Tools or set as defaults in wordsmith.ini.

  Concordance Folder: for your concordance files.
  KeyWords Folder: for your key-word list files.
  WordList Folder: where you will usually save your word-list files.
  Texts Folder: where your text files are to be found.
  Downloaded Media: where your sound & video files will be stored after downloading the first time from the Internet.
  Settings: where your settings files (.ini files and some others) are kept.
If you write the name of a folder which doesn't exist, WordSmith Tools will create it for you if possible. (On a network, this will depend on whether you have rights to create folders and save files.)

If you change your Settings folder, you should let WordSmith copy any .ini and other settings files which have been created so that it can keep track of your language preferences, etc.

Note: in a network, drive names such as G:, H:, K: change according to which machine you're running from, so that what is G:\texts\my text.txt on one terminal may be H:\texts\my text.txt on another. Fortunately network drives also have names structured like this: \\computer_name\drive_name\. You will find that these names can be used by WordSmith, with the advantage that the same text files can be accessed again later.

Tip

Use different folders for the different functions in WordSmith Tools. In particular, you may end up making a lot of word lists and key word lists if you're interested in making databases of key words. It is theoretically possible to put any number of files into a folder, but accessing them seems to slow down after there are more than about 500 in a folder. Use the batch facility to produce very large numbers of word list or key words files. I would recommend using a \keywords folder to store .kdb files, and \keywords\genre1, \keywords\genre2, etc. for the .kws files for each genre.

See also: finding source texts.

12.16 formulae

For computing collocation strength, we can use

  the joint frequency of two words: how often they co-occur, which assumes we have an idea of how far away counts as "neighbours". (If you live in London, does a person in Liverpool count as a neighbour? From the perspective of Tokyo, maybe they do. If not, is a person in Oxford? Heathrow?)
  the frequency of word 1 altogether in the corpus
  the frequency of word 2 altogether in the corpus
  the span or horizons we consider for being neighbours
  the total number of running words in our corpus: total tokens

Mutual Information

Log to base 2 of (A divided by (B times C))
where
  A = joint frequency divided by total tokens
  B = frequency of word 1 divided by total tokens
  C = frequency of word 2 divided by total tokens

MI3

Log to base 2 of ((J cubed) times E divided by B)
where
  J = joint frequency
  F1 = frequency of word 1
  F2 = frequency of word 2
  E = J + (total tokens-F1) + (total tokens-F2) + (total tokens-F1-F2)
  B = (J + (total tokens-F1)) times (J + (total tokens-F2))

Z Score

(J - E) divided by the square root of (E times (1-P))
where
  J = joint frequency
  S = collocational span
  F1 = frequency of word 1
  F2 = frequency of word 2
  P = F2 divided by (total tokens - F1)
  E = P times F1 times S

Log Likelihood

Based on Oakes p. 170-2.

2 times (
  a Ln a + b Ln b + c Ln c + d Ln d
  - (a+b) Ln (a+b) - (a+c) Ln (a+c)
  - (b+d) Ln (b+d) - (c+d) Ln (c+d)
  + (a+b+c+d) Ln (a+b+c+d)
)
where
  a = joint frequency
  b = frequency of word 1
  c = frequency of word 2
  d = frequency of pairs involving neither w1 nor w2
and "Ln" means Natural Logarithm

See also: this link from Lancaster University, Mutual Information

12.17 HistoryList

History List: many of the combo-boxes in WordSmith, such as the one for choosing a search-word, remember what you type in, so you can look up earlier entries by pressing the down arrow at the right.

12.18 HTML, SGML and XML

These are formats for text exchange. The most well known is HTML, Hypertext Markup Language, used for distributing texts via the Internet. SGML is Standard Generalized Markup Language, used by publishers and the BNC; XML is Extensible Markup Language, intermediate between the other two.

All these standards use plain text with additional extra tags, mostly angle-bracketed. The point of inserting these tags is to add extra sorts of information to the text:

  1 a header supplying details of the authorship & edition
  2 how it should display
  3 what the important sections are (one tag marks a heading, another such as <body> marks the body of the text)
  4 how special symbols should display (&eacute; corresponds to é)

See also: Overview of Tags

12.19 hyphens

The character used to separate words. The item "self-help" can be considered as 2 words or 1 word, depending on Language Settings.

12.20 international versions

WordSmith can operate with a series of interfaces depending on the language chosen. If you choose French, the French interface is what you see in all of WordSmith.

See also: acknowledgements

12.21 limitations

The programs in WordSmith Tools can handle virtually unlimited amounts of text. They can read text from CD-ROMs, so giving access to corpora containing many millions of words. In practice, the limits are reached by a) storage and b) patience.

You can have as many copies of each Tool running at any one time as you like. Each one allows you to work on one set of data.

Tags to ignore or ones containing an asterisk can span up to 1,000 characters.

When searching for tags to determine whether your text files meet certain requirements, only the first 2 megabytes of text are examined. For Ascii that's 2 million characters, for Unicode 1 million.

Tip

Press F9 to see the "About" box -- it shows the version date and how much memory you have available. If you have too little memory left, try a) closing down some applications, b) closing WordSmith Tools and re-entering.

See also: Specific Limitations of each Tool

12.22 tool-specific limitations

Concord limitations

You can compute a virtually unlimited number of lines of concordance using Concord. Concord allows 80 characters for your search-word or phrase, though you can specify an unlimited number of concordance search-words in a search-word file.
Each concordance can store an unlimited number of collocates with a maximum horizon of 25 words to left and right of your search-word.

WordList limitations

A head entry can hold thousands of lemmas, but you can only join up to 20 items in one go using F4. Repeat as needed.
Detailed Consistency lists can handle up to 50 files.

KeyWords limitations

One key-word plot per key-word display. (If you want more, call up the same file in a new display window.)
Number of link-windows per key-word plot display: 20.
Number of windows of associates per key key-word display: 20.

Splitter limitations

Each line of a large text file can be up to 10,000 characters in length. That is, there must be an <Enter> from time to time!

Text Converter limitations

There can be up to 500 strings to search-and-replace for each. Each search-string and each replace-string can be up to 80 characters long. An asterisk must not be the first or last character of the search-string. When the asterisk is used to retain information, the limit is 1,000 characters.

Viewer & Aligner limitations

If you choose the View option when choosing texts, Viewer & Aligner will call up the first 10 source text files selected.

When choosing texts or jumping into the middle of a text (e.g. after choosing in Concord), Viewer & Aligner will only process 10,000 characters of each file, to speed things up in the case of very large files, but you can get it to "re-read" the file by refreshing the display, after which it will read the whole text.

See also: General Limitations

12.23 links between tools

Linkage with Word Processors, Spreadsheets etc.

All the windows showing lists or texts can easily copy selected information to the clipboard. (Use Ctrl+Ins to copy.)

Some windows also offer a button which sends any selected data straight to a new Microsoft Word™ document.

Where you see a URL (such as http://www.lexically.net) you can click to access your browser.

Links between the various Tools

The programs in WordSmith Tools are linked to each other via wordsmith.exe (the one which says "WordSmith Tools Controller" in its caption, and is found in the top-left corner of your screen). This handles all the defaults, such as colours, folders, fonts, stop lists, etc.

In general, if you press Control-C in WordList or KeyWords you'll go straight to a concordance, computed using the current word and using the current files. Press Control-W in Concord or KeyWords to start a wordlist using the current files.

Each Tool will send as much relevant information as possible to the Tool being called.
This will include: the current word (the one highlighted in the scrolling window) and the text files where any current information came from.

Example: after computing a word list based on 3 business texts, you discover that the word hopeful is more frequent than you had expected. You want to do a concordance on that word, using the same texts. Place the highlight on hopeful, hold down Control and press C. Now you can see whether hopeful is part of a 3-word cluster, or view a dispersion plot.

Example: after computing a key words database using 300 business texts, you discover that the word bid seems to be a key key-word, and that it's associated with company, shares etc. Place the highlight on bid, press Control-C and a concordance will be computed using the same 300 texts. Now you can check out the contexts: is bid a bid for power, or is it part of a tendering process?

Example: you have a concordance of green. Now press Control-W to generate a word list of the same text files. Press Control-K to compare this word list with a reference corpus list to see what the key words are in these text files.
12.24 keyboard shortcuts

scrolling windows:
Control-Home     to top of scrollable list
Control-End      to last line of list
if it's ordered alphabetically, type in your search-word
and if it scrolls horizontally:
Home             to left edge
End              to right edge
Control-Right    one word to right
Control-Left     one word to left

hotkeys:
Ctrl-C              call Concord from within another Tool
Ctrl-W              call WordList from within another Tool
Ctrl-Ins            copy blocked section to clipboard
Shift-cursor keys   block a section
F1                  help
F2                  save results
F3                  print results
F4                  join entries
F5                  mark entries for joining
F6                  re-sort
Ctrl/F6             reverse word sort
F7                  view source text
F8                  seek short sentences (Viewer)
F9                  About box (shows version-date and memory availability)
F12                 search within a list
Ctrl/M              Merge 2 word lists or KeyWords databases
Alt/H               access to Help sub-menus
Alt/W               access to Settings sub-menus
Alt/X               access to Window sub-menus
Alt/X               eXit the Tool
Ctrl/Z              Zap deleted lines

see also: Menu items and Buttons

12.25 long file names

This version of WordSmith handles long filenames correctly.

12.26 machine requirements

This version of WordSmith Tools is designed for machines with:
at least 256MB of RAM (you might be OK with 128 but probably not on Windows XP or later)
at least 40MB of hard disk space
Windows™ 98, NT, 2000, XP, Vista or later, or an emulator of one of these if using an Apple Mac or Unix system. It may also work on Windows 95 2nd edition, I don't know...

You will find it runs better on a faster machine, especially if there's plenty of RAM.

12.27 manual for WordSmith Tools

This help file exists in the form of a manual, which you get when you install. The file (wordsmith.pdf) is in Adobe Acrobat™ format. It has a table of contents and a fairly detailed index (which I used WordList and KeyWords to help me create). Most people find paper easier to deal with than help files!
You may find it useful to see screenshots of WordSmith in action: check out Contact Addresses.

12.28 menu and button options

These functions may or may not be visible in each Tool depending on the capacity of the Tool or the current window of data -- the one whose caption bar is highlighted.

advice: opens a window showing a map of WordSmith Tools, giving a view of where you are now and where you might go next; also offers advice depending on the Tool.
associates: opens a new window showing Associates.
auto-join: joins (lemmatises) automatically.
auto-size: re-sizes each line of a display so that each one shows as much data as it should. Most windows have lines of a fixed size but some, e.g. in Viewer, allow you to adjust row heights. This adjusts line heights according to the current highlighted column of data.
clumps: computes clumps in a keywords database.
regroup clumps: regroups the clumps.
clusters: computes concordance clusters.
collocates: shows collocates using concordance data.
compute: calculates a new column of data based on calculator functions and/or existing data.
redo collocates: recalculates collocates, e.g. after you've deleted concordance lines.
column totals: computes totals, min, max, mean, standard deviation for each column of numerical data.
concord: within KeyWords, WordList, starts Concord and concordances the highlighted word(s) using the original source text(s).
copy: allows you to copy your data to a variety of different places (the printer, a text file, the clipboard, etc.).
double columns: allows you to double the number of columns, so as to save paper when printing.
edit: allows editing of a list or searches for a word (type-in search).
edit or type-in mode: alternates between edit and type-in mode.
filenames: opens a new window showing the filenames from which the current data derived. If necessary you can edit them.
find files: finds any text files which contain all the words you've marked.
grow: increases the height of all rows to a fixed size. See shrink below.
help (also F1): opens WordSmith Help (this file) with context-sensitive help.
join: joins one entry to another, e.g. sentences in Viewer, words in WordList (lemmatisation).
layout: allows you to alter many settings for the layout: the colour of each column, whether to hide a column of data, typefaces and column widths.
links: computes links between words in a key-words plot.
mark: marks an entry for joining or finding files.
match lemmas: checks each item in the list against ones from a text file of lemmatised forms and joins any that match.
match list: matches up the entries in the current list against ones in a "match list file" or template, marking any found with (~).
mutual information: computes mutual information scores in a WordList index list.
new...: gets you started in the various Tools, e.g. to make a concordance, a word list, or a key words list.
open...: gives you a chance to choose a set of saved results.
patterns: computes collocation patterns.
play media: plays a media file.
plot: opens a new window showing a Concord dispersion plot or KeyWords plot.
print (also F3): previews your window data for printing; can print to file, which is equivalent to "save as text".
refresh: re-reads your text file (in Viewer) or re-draws the screen (in Print Preview).
remove duplicates: removes any duplicate concordance lines.
replace: search & replace, e.g. to replace drive or folder data, when editing file-names where the source texts have been moved.
re-sort: re-sorts lists (e.g. in frequency as opposed to alphabetical order) in Concord, KeyWords or WordList.
ruler: shows/hides vertical divisions in any list; text divisions in a KeyWords plot. Click ruler in a menu to turn on or off or change the number of ruler divisions for a plot.
save (also F2): saves your data using existing file-name; if it's a new file asks for file-name first.
save as: saves after asking you for a file-name.
save as text: saves as a .txt file: plain text.
search (also F12): searches within a list.
shrink: reduces the height of all rows to a smaller fixed height. See grow above.
skim: in Viewer, allows timed skimming through a text.
statistics: opens a new window showing detailed statistics.
statusbar: toggles on & off the "status bar" (at the bottom of a window, shows comments and the status of what has been done).
summary statistics: opens a new window showing summary statistics, e.g. proportion of lemmas to word-types.
swap columns for rows: swaps the columns and rows. WordList statistics are shown by default with the file data in each column. Click this button to swap the row data with the column data.
toolbar: toggles on & off a toolbar with the same buttons on it as the ones you chose when you customised popup menus.
unjoin: unjoins any entries that have been joined, e.g. lemmatised entries.
view source text: shows the source text and highlights any words currently selected in the list.
Microsoft Word™: sends formatted data to Word.
wordlist: within KeyWords, makes a word list using the current data.
zap: zaps any deleted entries.

see also: Keyboard Shortcuts, Customising popup menus.

12.29 numbers

Depending on Language and Text Settings, you might wish to include or exclude numbers from word lists.

12.30 plot dispersion value

The point of it
A dispersion value is the degree to which a set of values are uniformly spread. Think of rainfall in the UK -- generally fairly uniformly spread throughout the year. Compare with countries which have a rainy season. In linguistic terms, one might wish to know how the occurrences of a word like skull are distributed in Hamlet, and WordSmith has shown this in plot form since version 1. The dispersion value statistic gives mathematical support to this and makes comparisons easier.
How it is calculated
The plot dispersion calculated in KeyWords and Concord dispersion plots uses the first of the 3 formulae supplied in Oakes (1998: 190-191), which he reports as having been evaluated as the most reliable. Like the ruler, it divides the plot into 8 segments for this. It ranges from 0 to 1, with 0.9 or 1 suggesting very uniform dispersion and 0 or 0.1 suggesting "burstiness" (Katz, 1996).

See also: KeyWords plot, Concord dispersion plot.

12.31 RAM availability

The more RAM (chip memory) you have in your computer, the faster it will run and the more it can store. As it is working, each program needs to store results in memory. A word list of over 80,000 entries, representing over 4 million words of text, will take up roughly 3 Megabytes of memory. (In Finnish it would be much more.)

When memory is low, Windows will attempt to find room by putting some results in temporary storage on your hard disk. If this happens, you'll probably hear a lot of clicking as it puts data onto the disk and then reads it off again. You will probably hear some clicking anyway as most of the programs in WordSmith Tools access your original texts from the hard disk, but a constant barrage of thrashing shows you've reached your machine's natural limits.

You can find out how much storage you have available even in the middle of a process, by pressing F9 (the About option in the main Help menu of each program). The first line states the RAM availability. The other figures supplied concern Windows system resources: they should not be a problem but if they do go below about 20% you should save results, exit Windows and re-enter.

Theoretically, word lists and key word lists can contain up to 2,147,483,647 separate entries. Each of these words can have appeared in your texts up to 2,147,483,647 times. (This strange number 2,147,483,647, one less than half of 2 to the power 32, is the largest signed integer which can be stored in 32 bits; it is loosely called 2 Gigabytes.)
You are not likely to reach this theoretical limit: for the item the to have occurred 2,147,483,647 times in your texts, you would have processed about 30 thousand million words. (1 CD-ROM, containing only plain text, can hold about 100 million words, so this number represents some 300 CD-ROMs.) You would have run out of RAM long before this.

If you have 64MB of RAM or more you should be able to have a copy of a wordlist based on millions of words of text, and at the same time have a powerful word-processor and a text file in memory.

See also: speed

12.32 reference corpus

Reference Corpus
A corpus of text which you use for comparative purposes. For example, you might want to compare a given piece of text with the British National Corpus, a collection of 100 million words. Useful when computing key words.

In the Controller you can set your reference corpus word list for KeyWords and Concord to make use of. (That is, a word list created using the WordList tool.)

12.33 restore last file

By default, the last word list, concordance or key words listing that you saved or retrieved will be automatically restored on entry to WordSmith Tools. If the last Tool used is Concord, a list of your 10 most recent search-words will be saved too.

This feature can be turned off temporarily via a menu option or permanently in wordsmith.ini (in your \wsmith5 folder).

12.34 selecting multiple entries

To select more than one entry in a wordlist, concordance, key word list etc., hold down Control and select the rows you are interested in. To mark entries for joining in lemmatisation, you can choose Edit | Mark (F5) in the menu.

For example, to do a search from a wordlist of these items, I held down Control and clicked on FEB, FEBRUARY, FEBUARY and FEBURARY, then chose Edit | Concordance.

The resulting concordance shows the last two entries are indeed mis-spellings.

12.35 single words v. clusters

The point of it...
Clusters are words which are found repeatedly together in each other's company, in sequence. They represent a tighter relationship than collocates, more like multi-word units or groups or phrases. (I call them clusters because groups and phrases already have uses in grammar and because simply being found together in software doesn't guarantee they are true multi-word units.) Biber calls them "lexical bundles".

Language is phrasal and textual. It is not helpful to see it as a matter of selecting a word to fill a grammatical "slot" as implied by structural theories. Words keep company: the extreme example is idiom where they're bound tightly to each other, but all words have a tendency to cluster together with some others. These clustering relations may involve colligation (e.g. the relationship between depend and on), collocation, and semantic prosody (the tendency for cause to come with negative effects such as accident, trouble, etc.).

WordSmith Tools gives you two opportunities for identifying word clusters, in WordList and Concord. They use different methods. Concord only processes concordance lines, while WordList processes whole texts.

How Concord does it...
Suppose your text begins like this:

Once upon a time, there was a beautiful princess. She snored. But the prince didn't.

If you've chosen 2-word clusters, the text will be split up as follows:

Once upon
upon a
a time
(note not "time there" because of the comma)
there was
(etc.)

With a three-word cluster setting, it would store

Once upon a
upon a time
there was a
was a beautiful
a beautiful princess
But the prince
the prince didn't
(etc.)

That is, each n-word cluster will be stored, if it reaches n words in length, up to a punctuation boundary, marked by one of ; , . ! ? (It seems reasonable to suppose that a cluster does not cross clause boundaries and these punctuation symbols help mark clause boundaries.)
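The splitting rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not WordSmith's own code: the function name clusters is hypothetical, words are taken to be whitespace-separated, and the sketch works on a plain string, whereas Concord itself works on concordance lines.

```python
import re

# The punctuation marks listed above as cluster boundaries.
BOUNDARY = re.compile(r"[;,.!?]")

def clusters(text, n):
    """Return the n-word clusters of text that do not cross a boundary mark.

    Hypothetical helper for illustration only.
    """
    found = []
    # Cut the text into stretches lying between boundary marks.
    for stretch in BOUNDARY.split(text):
        words = stretch.split()
        # Slide an n-word window along the stretch; a stretch with
        # fewer than n words contributes no cluster at all.
        for i in range(len(words) - n + 1):
            found.append(" ".join(words[i:i + n]))
    return found

text = ("Once upon a time, there was a beautiful princess. "
        "She snored. But the prince didn't.")
print(clusters(text, 2))
print(clusters(text, 3))
```

With n set to 3, "She snored" yields nothing because it is only two words long, and no cluster spans the comma after "time", which matches the three-word listing above.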
12.36 speed

To make a wordlist on 4.2 million words used to take about 20 minutes on a 1993 vintage 486-33 with 8MB of RAM. The sorting procedure at the end of the processing took about 30 seconds. A 200MHz Pentium with 64MB of RAM handled over 1.7 million words per minute. On a 100MHz Pentium with 32MB of RAM this whole process took about 3 and a half minutes, working at over a million words a minute.

When concordancing, tests on the same Pentium 100, using one 55MB text file of 9.3 million words, and a quad-speed CD-ROM drive, showed

search-word   source      speed
quickly       CD-ROM      6 million words per minute (wpm)
quickly       hard disk   12 million wpm
the           CD-ROM      900,000 wpm
the           hard disk   1 million wpm
thez          CD-ROM      6 million wpm
thez          hard disk   16 million wpm

Tests using a set of text files ranging from 20K down to 4K, using quickly as the search-word, gave speeds of 2 million wpm rising with the longer files to 4 million wpm.

Making a word list on the same set of files gave an average speed of 800,000 wpm. On the 55MB text file the speed was around 1.35 million wpm.

These data suggest that factors which slow concordancing down are, in order, word frequency (the very frequent the was much slower than quickly or the non-existent thez), text file size (very small files of only 500 words or so (3K) will be processed about three times as slowly as big ones) and disk speed (the outdated quad speed CD-ROM being roughly half the speed of the 12ms hard disk).

When Concord finds a word it has to store the concordance line and collocates and show it (so that you can decide to suspend any further processing if you don't like the results or have enough already). This is a major factor slowing down the processing. Second, reading a file calls on the computer's file management system, which is quite slow in loading it, in comparison with Concord actually searching through it. Third, disk speeds are quite varied, floppy disks being much the worst for speed.
If processing seems excessively slow, close down as many programs as possible and run WordSmith Tools again. Or install more RAM. Get advice about setting Windows to run efficiently (virtual memory, disk caches, etc.). Use a large fast hard drive.

You can run other software while the programs are computing, but they will take up a lot of the processor's time. Shoot-em-up games may run too jerkily, but printing a document at the same time should be fine.

12.37 status bar

The bar at the bottom of a window, which lets you drag the whole window bigger or smaller, and which also shows a series of panels with information on the current data. The status bar can usually be revealed or hidden using a main menu option.

You can right-click on the panel to bring up a popup menu offering a choice between Edit, Type and Set.

12.38 tools for pattern-spotting

Tools are needed in almost every human endeavour, from making pottery to predicting the weather. Computer tools are useful because they enable certain actions to be performed easily, and this facility means that it becomes possible to do more complex jobs. It becomes possible to gain insights because when you can try an idea out quickly and easily, you can experiment, and from experimentation comes insight. Also, re-casting a set of data in a new form enables the human being to spot patterns.

This is ironic. The computer is an awful device for recognising patterns. It is good at addition, sorting, etc. It has a memory but it does not know or understand anything, and for a computer to recognise printed characters, never mind reading hand-writing, is a major accomplishment.

Nevertheless, the computer is a good device for helping humans to spot patterns and trends. That is why it is important to see computer tools such as these in WordSmith Tools in their true light. A tool helps you to do your job; it doesn't do your job for you.

Tool versus Product
Some software is designed as a product.
A game is self-contained, so is an electronic dictionary. A word-processor, spreadsheet or database, on the other hand, is a tool because it goes beyond its own borders: you use it to achieve something which the manufacturers could not possibly anticipate. WordSmith Tools, as their name states, are not products but tools. You can use them to investigate many kinds of pattern in virtually any texts written in a good range of different languages.

Insight through Transformation
No, this is not a religious claim! The claim I am making is psychological. It is through changing the shape of data, reducing it and then re-casting it in a different format, that the human capacity for noticing patterns comes to the fore. The computer cannot "notice" at all (if you input 2 into a calculator and then keep asking it to double it, it will not notice what you're up to and begin to do it automatically!). Human beings are good at noticing, and particularly good at noticing visual patterns. By transforming a text into a list, or by plotting keywords in terms of where they crop up in their source texts, the human user will tend to see a pattern. Indeed we cannot help it. Sometimes we see patterns where none was intended (e.g. in a cloud). There can be no guarantee that the pattern is "really there": it's all in the mind of the beholder.

WordSmith Tools are intended to help this process of pattern-spotting, which leads to insight. The tools in this kit are intended therefore to help you gain your own insights on your own data from your own texts.

Types of Tool
All tools take up positions on two scales: the scale of specialisation and the scale of permanence.

general-purpose ----------------- specialised

general-purpose
The spade is a digging tool which makes cutting and lifting soil easier than it otherwise would be. But it can also be used for shovelling sand or clearing snow. A sewing machine can be used to make curtains or handkerchiefs.
A word-processor is general-purpose.

specialised
A thimble is dedicated to the purpose of protecting the fingers when sewing and is rarely used for anything else. An overlock device is dedicated to sewing button-holes and hems: it's better at that job than a sewing machine but its applications are specialised. A spell-checker within a word-processor is fairly specialised.

temporary ----------------- permanent

temporary
The branch a gorilla uses to pull down fruit is a temporary tool. After use it reverts to being a spare piece of tree. A plank used as a tool for smoothing concrete is similar. It doesn't get labelled as a tool though it is used as one. This kind of makeshift tool is called "quebra-galho", literally branch-breaker, in Brazilian Portuguese.

permanent
A chisel is manufactured, catalogued and sold as a permanent tool. It has a formal label in our vocabulary. Once bought, it takes up storage room and needs to be kept in good condition.

The WordSmith Tools in this kit originated from temporary tools and have become permanent. They are intended to be general-purpose tools: this is the Swiss Army knife for lexis. They won't cut your fingers but you do need to know how to use them.

see also: Acknowledgements

12.39 version information

This help file is for the current version of WordSmith Tools. The version of WordSmith Tools is displayed in the About option (F9) which also shows your registered name and the amount of memory available. If you have a demonstration version this will be stated immediately below your name. Check the date in this box, which will tell you how up-to-date your current version is.

As suggestions are incorporated, improved versions are made available for downloading. Keep a copy of your registration code for updated versions. You can click on the WordSmith graphic in the About box to see your current code.

See also: 32-bit Version Differences, Demonstration Version, Contact Addresses.
12.40 zip files

Zip files are files which have been compressed in a standard way. WordSmith can now read and write to .zip files.

The point of it...
Apart from the obvious advantage of your files being considerably smaller than the originals were, the other advantage is that less disk space gets wasted. Any text file, even a short one containing only the word "hello", will take up on your disk something like 4,000 bytes or maybe up to 32,000 depending on your system. If you have 100 short files, you would be losing many thousands of bytes of space. If you "zip" 100 short files they may fit into just 1 such space. Zip files are used a lot in Internet transmissions because of these advantages. If you have a lot of word lists to store, it will be much more efficient to store them in one .zip file.

The "cost" of zipping is a) the very small amount of time this takes, b) the resulting .zip file can only be read by software which understands the standard format. There are numerous zip programs on the market, including PKZip™ and WinZip™. If you zip up a word list, these programs can unzip it but won't be able to do anything with the finished list. WordSmith can first unzip it and then show it to you.

How to do it...
Where you see an option to create a zip file, this can be checked, and the results will be stored where you choose but in zipped form with the .zip ending.

If you choose to open a zipped word list, concordance, text file, etc. and it contains more than one file within it, you will get a chance to decide which file(s) within it to open up. Otherwise the process will happen in the background and will not affect your normal WordSmith processing.

13 Troubleshooting

13.1 list of FAQs

See also: logging.

These are the Frequently Asked Questions. There's a much longer list of explanations under Error Messages.

Can't process apostrophes
Is this Russian, Greek or English?
strange symbols in display
It crashed
It doesn't even start!
It takes ages!
Keys don't respond
Line beyond demo limit
Mismatch between Concord and WordList results
No tags visible in concordance
Printing problem
Text is unreadable because of the colours
Too much or too little space between columns
Wordlist out of order
Won't slice pineapples

13.2 apostrophes not found

Apostrophes not processed
If your original text files were saved using Microsoft Word™, you may find Concord can't find apostrophes or quotation marks in them! This is because Word can be set to produce "smart" symbols. The ordinary apostrophe or inverted comma in this case will be replaced by a curly one, curling left or right depending on its position on the left or right of a word. These smart symbols are not the same as straight apostrophes or double quote symbols.

Solution: drag the symbol from the set below when entering your search word, or else replace them in your text files using Text Converter.

See also: settings

13.3 column spacing

column spacing is wrong
You can alter this by clicking on the layout button.

13.4 Concord tags problem

no tags visible in concordance
If you can't see any tags after asking for Nearest Tag in Concord, it is probably because the Tags to Ignore has the same format. For example, if Text to Ignore has <*>, any tags matching that pattern will be cut out of the concordance unless you specify them in a tag file.

Solution: specify the tag file and run the concordance again.

13.5 Concord/WordList mismatch

Concord/WordList mismatch
If WordList finds a certain number of occurrences of a (word list) cluster but Concord finds a different number, this is because the procedures are different. WordList proceeds word by word, ignoring punctuation (except for hyphens and apostrophes). When Concord searches for a (concordance) cluster it will take punctuation into account.

13.6 crashed

it crashed!
Solution: quit WordSmith Tools and enter again.
If that fails, quit Windows and try again. Or try logging.

The idea of Logging is to find out what is causing a crash. It is designed for when WS5 gets only part of the way through some process. As it proceeds, it keeps adding messages to the log about what it has found & done. When it crashes, it can't add any more messages! So if you examine the log you can see where it was up to. At that point, you may see the name of a text file that it opened. Examine that text: you might be able to see something strange about it, e.g. it has got corrupted.

13.7 demo limit

demo limit reached
You may have just downloaded, but you haven't yet supplied your registration details. To do this, go to the main WordSmith Tools window, and choose Settings | Register in the menu. If you haven't got the 20-character registration code, contact Lexical Analysis Software.

The only difference between a demonstration version and a full version is: with the latter you can see or print all the data, with the former you'll be able to see only about 25 lines of output.

13.8 funny symbols

weird symbols

funny symbols when using WordSmith Tools
1. Check your text files. Read them in Notepad. Do they contain lots of strange symbols? These may be hidden codes used by your usual word-processor. Solution: read them into your usual word-processor and Save As, with a new name, in plain text format, sometimes called "Text Only" or .txt.
2. Choose Texts, highlight the text file, and before pressing OK, press View. Does it contain strange symbols? Solution: change Text Settings; try going from one of the DOS character sets to Windows or vice-versa. The text was clean ASCII but WordSmith Tools thought it was Windows ANSI.
3. Funny symbols in a word list may well also be caused by mis-spellings in the original text files.

Greek, Russian, etc.
4. If the text is in Russian, Greek, etc. you will need an appropriate font, obtainable from your Windows CD or via the Microsoft website.
5.
If you have several lists open which use different character sets, and you change Font or Text Characteristics, the lists will all be updated to show the current font and character set, unless you first minimize any window which would be affected.

funny symbols when reading WordSmith data in another application
WordSmith Tools can "Save" or "Save As", and can also "Save as Text" by printing to a file.

"Save" and "Save As" will store the file in a format for re-use by WordSmith. This format is not suitable for reading into a word processor. The idea is simply for you to store your work so that you can return to it another day.

"Save as Text", on the other hand, means saving as plain text, by "printing" to a file. This function is useful if you don't want to print to paper from WordSmith but instead take the data into a spreadsheet, or word processor such as Microsoft Word. It is usually quicker to copy the selected text into the clipboard.

13.9 illegible colours

text unreadable because of colours
Solution: in Settings, choose Colours. You can now set the colours which suit your computer monitor. Monochrome settings are available.

13.10 keys don't respond

Keys don't respond
If a key press does nothing, it is probably because the wrong window has the focus. As you know, Windows is designed to let users open up a number of programs at once on the same screen, so each window will respond to different key-press combinations. You can see which window has the focus because its caption is coloured differently from all the others.

The solution is to click anywhere within the window which you want to use, then press the key you wanted.

13.11 pineapple-slicing

won't slice a pineapple
"Propose to any Englishman any principle, or any instrument, however admirable, and you will observe that the whole effort of the English mind is directed to find a difficulty, a defect, or an impossibility in it.
If you speak to him of a machine for peeling a potato, he will pronounce it impossible: if you peel a potato with it before his eyes, he will declare it useless, because it will not slice a pineapple." Charles Babbage, 1852.

(Babbage was the father of computing, a 19th Century inventor who designed a mechanical computer, a mass of brass levers and cog-wheels. But in order to make it, he needed much greater accuracy than existing technology provided, and had all sorts of problems, technical and financial. He solved most of the former but not the latter, and died before he was able to see his Difference Engine working. The proof that his design was correct was shown later, when working versions were made. The difficulties he encountered in getting support from his government weren't exclusively English.)

13.12 printer didn't print

printing problem
If your printing comes out with one or more columns blank but others printed correctly, you may have a printer which can only manage black and white and not shades of grey. In the Controller, change the setting (Adjust Settings | General) to monochrome.

13.13 too slow

It takes ages
If you're processing a lot of text and you have an ancient PC with little memory and a hard disk that Noah bought from a man in the market for a rainy day, it might take ages. You'll hear a lot of clicks coming from the hard disk when memory is low.

Solution: get a faster computer, or speed things up by installing more memory (which makes a big difference), by defragmenting your hard drive, by using a disk cache, or by adjusting virtual memory settings. If you're running WordSmith Tools on a network, check with the network administrator whether performance is significantly degraded because of network access.

Solution 2: quit all programs you don't need. That can restore a lot of system memory.

Solution 3: quit Windows and start again. That can restore a lot of system memory.
Solution 4: save and read from the local hard disk, not the network.

13.14 won't start

it doesn't even start
Yikes!

13.15 word list out of order

wordlist out of order
Words are sorted according to Microsoft routines which depend on the language. If you process Spanish but leave the Language settings to "English", you will get results which are not in correct Spanish order (e.g. LL will come just before LM).

Solution: choose your language and re-compute the wordlist.

14 Error Messages

14.1 list of error messages

List of Error Messages
See also: Troubleshooting.

Can only save WORDS as ASCII
Can't call other Tool
Can't make folder as that's an existing filename
Can't merge list
Can't read file
Character set reset to ... to suit ...
Concordance file is faulty
Concordance stop list file not found
Conversion file not found
Destination folder not found
Disk problem: File not saved
Dispersions go with concordances
Drive not valid
Failed to access Internet
Failed to create new folder name
File access denied
File contains none of the tags specified
File not found
Filenames must differ!
Full drive:\folder name needed
function not working properly yet
INI file not found
Invalid Concordance file
Invalid file name
Invalid Keywords Database file
Invalid Keywords file
Invalid Wordlist Comparison file
Invalid Wordlist file
Joining limit reached: join & try again
Key words file is faulty
Keywords Database file is faulty
Limit of 500 file-based search-words reached
Links between Tools disrupted
Match list details not specified
Must be a number
Network registration running elsewhere or vice-versa
No access to text file: in use elsewhere?
No associates found No clumps identified No clusters found No collocates found No concordance entries found No concordance stop list words No deleted lines to Zap No entries in Keywords Database 243 WordSmith Tools 2007 Mike Scott No Key Words found No key words to plot No keyword stop list words No lemma list words No match list words No room for computed variable No statistics available No stop list words No such file(s) found No tag list words Not a valid number No wordlists selected Original text file needed but not found Registration string is not correct Registration string must be 20 letters long Short of Memory! Source Folder file(s) not found Stop list file not found Stop list file not read Tag file not found Tag list file not read This function is not yet ready! This is a demo version This program needs Windows 95 or greater To stop getting this annoying message, Update from Demo in setup.exe Too many ignores (50 limit) Too many sentences (8000 limit) Two files needed Truncating at xx words -- tag list file has more! Unable to merge Keywords Databases Why did my search fail? Word list file not found Wordlist comparison file is faulty Word-list file is faulty WordSmith Tools has expired: get another WordSmith Tools already running WordSmith version mis-match xx days left 14.2 .ini file not found .ini file not found On starting up, WordSmith looks for the wordsmith.ini file which holds your current defaults. If you've removed or renamed it, restore it. This file should be in the same folder as the Tools are in. 14.3 base list error base list error WordSmith is trying to access an word or concordance line above or below the top or bottom of the data computed. This is a bug. 244Error Messages 2007 Mike Scott 14.4 can only save words as ASCII Can only save WORDS as Plain Text WordSmith Tools can't save graphics as a text file. If you get this error message, you can only save this type of data by copying to the clipboard and pasting it into your word-processor. 
14.5 can't call other tool
Can't call other Tool
Inter-Tool communication has got disrupted. Save your work first. Then, if necessary, close down WordSmith Tools altogether, then start the main wordsmith.exe program again.

14.6 can't make folder as that's an existing filename
Can't make folder as that's an existing filename
If you already have a file called C:\TEMP\FRED, you can't make a sub-folder of C:\TEMP called FRED. Choose a new name.

14.7 can't compute key words as languages differ
Can't compute key words as languages differ
Key words can only be computed if both the text file and the reference corpus are in the same primary language. You can compute KWs using 2 different varieties of English or 2 different varieties of Spanish, but not between English and French.

14.8 can't merge list with itself!
Can't merge list with itself
You can only merge 1 word list or key word database with 1 other at a time. Select (by clicking while holding down the Control key) 2 file-names in the list of files.

14.9 can't read file
Can't read file
If this happens when starting up WordSmith Tools, there is probably a component file missing. One example is sayings.txt, which holds sayings that appear in the main Controller window. If you've deleted it, I suggest you use Notepad to start a new sayings.txt and put one blank line in it.
If you get this message at another time, something has gone wrong with a disk reading operation. The file you're trying to read in may be corrupted. This happens easily if you often handle very large files, especially if it's a long time since you last ran Scandisk to check whether any clusters in your files have got lost. See your DOS or Windows manual for help on fragmentation.

14.10 character set reset to to suit
Character set reset to to suit
Prior to version 2.00.07, WordSmith Tools handled fewer character sets and languages than it does now.
Accordingly, data saved in the format used before that version may not "know" what language it was based on. If you get this message when opening up an old WordSmith data file, it's because WordSmith doesn't know what language it derived from. Through gross linguistic imperialism, it will by default assume that the language is English! If the data are okay, just click the save button so that next time it will "know" which language it's based on. If not, reset the language to the one you want in the Controller, Adjust Settings | Text, then re-save the list.

14.11 concordance file is faulty
Concordance file is faulty
Each type of file created by WordSmith Tools has its own default filename extension (e.g. .CNC, .LST) and its own internal structure. If you have another file with the same extension produced by another program, this will not be compatible. It would not be sensible to rename a .CNC file to .TXT, or vice-versa! WordSmith has detected that the file you're calling up wasn't produced by the current version of Concord.

14.12 concordance stop list file not found
Concordance stop list file not found
You typed in the name of a non-existent file. If typing in a filename, remember to include the full drive and folder as well as the filename itself.

14.13 confirmation messages: okay to re-read
Okay to re-read?
A confirmation message. To proceed, Viewer & Aligner will now re-read the disk file. This will affect any alterations you've already made to the display. You may wish to save first and then try again later. Also, Viewer & Aligner will try to read the whole text file. If you have a very big file on a slow CD-ROM drive, this will take some time.

14.14 conversion file not found
Conversion file not found
You typed in the name of a non-existent file. If typing in a filename, remember to include the full drive and folder as well as the filename itself.
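
The advice to "include the full drive and folder" can be illustrated with a minimal Python sketch. This is not part of WordSmith Tools: the function name has_full_drive_and_folder and the example filenames are made up for illustration. It only checks the form of the name you typed, not whether the file actually exists.

```python
from pathlib import PureWindowsPath


def has_full_drive_and_folder(filename: str) -> bool:
    """Hypothetical helper: True only if the name starts with a drive
    (or network share) followed by a folder path from the root."""
    # PureWindowsPath applies Windows naming rules on any operating system
    # and never touches the disk, so this is a check on spelling only.
    return PureWindowsPath(filename).is_absolute()


# Example names (assumptions, not files that WordSmith ships with).
for name in (r"C:\TEMP\FRED.TXT", "FRED.TXT", r"\TEMP\FRED.TXT"):
    verdict = "full name" if has_full_drive_and_folder(name) else "drive or folder missing"
    print(f"{name}: {verdict}")
```

A name such as FRED.TXT on its own is the kind that is likely to produce a "not found" message, because the program has no way to tell which drive and folder you meant.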
14.15 destination folder not found
Destination folder not found
WordSmith couldn't find that folder; perhaps it's mis-spelt.

14.16 disk problem -- file not saved
Disk problem: File not saved
Something has gone wrong with a disk writing operation. Perhaps there's not enough room on the drive. If so, delete some files on that drive.

14.17 dispersions go with concordances
Dispersions go with concordances
They can't be saved separately.

14.18 drive not valid
Drive not valid
WordSmith is unable to access this drive. This could happen if you attempt to access a disk drive which doesn't exist, e.g. drive P: where your drives include A:, C:, D: and E:.

14.19 failed to access Internet
Failed to access Internet
This function relies on a) your having an Internet browser on your computer, b) your system "associating" an Internet URL ending .htm with that browser.

14.20 failed to create new folder name
Failed to create new folder
A folder and a file cannot have the same name. If you already have a file called C:\TEMP\FRED, you can't make a sub-folder of C:\TEMP called FRED. Choose a new name.

14.21 failed to read file
Failed to Read
This may have happened because your disk filing system has got screwed up. This is especially likely to occur if you often use large files in your word processor. I would recommend you to run System Tools | Scandisk.

14.22 failed to save file
Failed to Save
Maybe because you had the same file open in another program or another instance of the Tool you're running. If so, close it and try again. Or because the folder you're saving to is a read-only folder on a network, or because the disk is full, or because your disk filing system has got screwed up. This last problem is quite common, actually, and is especially likely to occur if you often use large files in your word processor. In that case run Programs | Accessories | System Tools | Disk Defragmenter.
If you're working on a network, you will be able to save on certain drives and folders but not others; the solution is to try again on a memory stick or a hard disk drive which you do have the right to save to.

14.23 file access denied
File Access Denied
Maybe the file you want is already in use by another program. You'll find most word-processors label any text files open in them as "in use", and won't let other programs access them even just to read them. Close the text file down in your word processor.

14.24 file contains none of the tags specified
File contains none of the tags specified
You specified tags, but none of them were found.

14.25 file has "holes"
File has "holes"
Your text file is defective. It may well contain useful text, but it also contains at least one unrecognised character such as character(0). The problem could have arisen because it was transferred from one system to another, part of the disk is corrupted, or else maybe the file contains unrecognised graphics, or else it is not a plain text file but e.g. a Word document.
You will see the context where the problem occurred and will be told roughly how far into the text it was detected. WordSmith can proceed if you wish but you get a chance to skip the text.
You can solve this problem -- which will come each time you choose that text file -- by reading the text file into a word processor and re-saving it as a plain .txt file. Also, in File Utilities there is a tool for finding such files.

14.26 file not found
File not found
This message, like Original Text not found, can appear when WordSmith needs to access the original source text used when a list was created, but cannot find it. Have you deleted or moved it? If the file is still available, you may be able to edit the filenames in the filename window of this list.
Or the message may come after you've supplied the filename yourself. You may have mis-typed it. Is it a Windows 95 or NT long filename?
If typing in a filename, remember to include the full drive and folder as well as the filename itself.

14.27 filenames must differ!
Filenames must differ
You can't compare a file with itself.

14.28 folder is read-only
For some purposes, WordSmith needs to save files, e.g. lists of results you have made, so that you can get at recent files again. To do this it needs a place where your network or operating system lets you save. Usually c:\wsmith5 is fine, but in some institutional settings drive c: may be "read-only". If you see this message, choose Adjust Settings | Folders | Settings and select there a folder where you can write as well as read.

14.29 for use on X machine only
For use on pc named XXX only
The software was registered for use on another PC. If you get this message, please re-install as appropriate.

14.30 form incomplete
Form incomplete
You tried to close a form where one or more of the blanks needed to be filled in before WordSmith could proceed.

14.31 full drive & folder name needed
Full drive:\folder name needed
When typing in a filename, remember to include the full drive and folder as well as the filename itself.

14.32 function not working properly yet
function not working properly yet
This is a function under development, still not fully implemented.

14.33 invalid concordance file
Invalid Concordance file
Each type of file created by WordSmith Tools has its own default filename extension (e.g. .CNC, .LST) and its own internal structure. If you have another file with the same extension produced by another program, this will not be compatible. It would not be sensible to rename a .CNC file to .TXT, or vice-versa! WordSmith has detected that the file you're calling up wasn't produced by the current version of Concord.

14.34 invalid file name
Invalid file name
Filenames may not contain spaces or certain symbols such as ? and *.
In Windows before Windows 95 they had to be restricted to 8 letters and a dot and three more, too. Try again.

14.35 invalid KeyWords database file
Invalid Keywords Database file
Each type of file created by WordSmith Tools has its own default filename extension (e.g. .KWS, .KDB) and its own internal structure. If you have another file with the same extension produced by another program, this will not be compatible. It would not be sensible to rename a .KDB file to .TXT, or vice-versa! WordSmith has detected that the file you're calling up wasn't produced for a database by the current version of KeyWords.

14.36 invalid KeyWords calculation
Invalid Keywords calculation
For KeyWords to calculate the key-words in a text file by comparing it with a reference corpus, both must be in the same language and both must be sorted in the same way (alphabetical order, ascending). If you see this message you are trying to compute KWs without meeting these criteria.
Solution: open each word-list and check to see it is OK and that it is sorted alphabetically in the same way. Check they have both been made with the same language settings and if necessary re-compute one or both of them.

14.37 invalid WordList comparison file
Invalid Wordlist Comparison file
Each type of file created by WordSmith Tools has its own default filename extension (e.g. .LST, .CNC) and its own internal structure. If you have another file with the same extension produced by another program, this will not be compatible. It would not be sensible to rename a .CNC file to .TXT, or vice-versa! WordSmith has detected that the file you're calling up wasn't produced as a comparison file by WordList.

14.38 invalid WordList file
Invalid Wordlist file
Each type of file created by WordSmith Tools has its own default filename extension (e.g. .LST, .CNC) and its own internal structure.
If you have another file with the same extension produced by another program, this will not be compatible. It would not be sensible to rename a .LST file to .TXT, or vice-versa! WordSmith has detected that the file you're calling up wasn't produced by the current version of WordList.

14.39 joining limit reached
Joining limit reached: join & try again
Only a certain number of words can be lemmatised in one operation. If you reach the limit and get this message,
1. lemmatise by pressing F4,
2. place the highlight on the head entry again,
3. press F5 and carry on lemmatising by pressing F5 on each entry you wish to attach to the head entry,
4. when you've done, press F4 to join them up.

14.40 KeyWords database file is faulty
Keywords Database file is faulty
Each type of file created by WordSmith Tools has its own default filename extension (e.g. .KDB, .KWS) and its own internal structure. If you have another file with the same extension produced by another program, this will not be compatible. It would not be sensible to rename a .KDB file to .TXT, or vice-versa! WordSmith has detected that the file you're calling up wasn't produced for a database of keywords, by the current version of KeyWords.

14.41 KeyWords file is faulty
Key words file is faulty
Each type of file created by WordSmith Tools has its own default filename extension (e.g. .KWS, .KDB) and its own internal structure. If you have another file with the same extension produced by another program, this will not be compatible. It would not be sensible to rename a .KWS file to .TXT, or vice-versa! WordSmith has detected that the file you're calling up wasn't produced by the current version of KeyWords.

14.42 limit of file-based search-words reached
Limit of search-words reached
No more than 15 search-words can be processed at once, unless you use a file of search words to tell Concord to do them in a batch, where the limit is 500.
14.43 links between Tools disrupted
Links between Tools disrupted
WordSmith Tools Controller or an individual Tool has tried to call another Tool and failed. There may have been a fault in another program you're running or a shortage of memory. As inter-tool communication links are vital in this suite, you should exit WordSmith and re-enter.

14.44 match list details not specified
Match list details not specified
You pressed the Match List button but then failed to choose a valid match list file or else to type in a template for filtering. Try again.

14.45 must be a number
Must be a number
You typed in something other than a number. Be especially careful with lower-case L and 1, and O (the letter) instead of 0 (the number).

14.46 mutual information incompatible
Mutual information list is incompatible
A mutual information list derives from an index file, and knows which index file it derives from when computed. Normally when it opens up, it opens up the corresponding index file too. If that index file is not found on your PC or has been renamed, you will see this message. The mutual information can still be accessed but a) what you see in terms of Frequency and Alphabetical lists refers to a different index file, and b) it will not be possible to get concordances directly from the listing.

14.47 network registration used elsewhere
Network registration running elsewhere or vice-versa
The registration for use on a network is not valid for use on a stand-alone PC, and vice-versa. If you get this message, please re-install as appropriate.

14.48 no access to text file - in use elsewhere?
No access to text file: in use elsewhere?
The file cannot be accessed. Perhaps another application is using it. If so, close down the file in that other application and try again.

14.49 no associates found
No associates found
Alter settings (Settings | Min & Max Frequencies) and try again.
14.50 no clumps identified
No clumps identified
Alter settings and try again.

14.51 no clusters found
No clusters found
Alter the settings (Settings | Clusters) and try again. There were too few concordance lines to find the minimum number needed, or the cluster length was too great.

14.52 no collocates found
No collocates found
In the Controller, alter the settings (Adjust Settings | Concord | Min. Frequency) and try again. There were too few concordance lines to find the minimum number needed.

14.53 no concordance entries
No concordance entries found
If you got no concordance entries, either a) there really aren't any in your text(s), b) there's a problem with the specification of what you're seeking, or c) there's a problem with the text selection.
Check how you've spelt the search-word and context word. If you're using accented text, check the format of your texts. If you're using a search-word file, ensure this was prepared using a plain Windows word-processor such as Notepad.
Have you specified any wildcards (* and ?) accurately? If you are looking for a question-mark, you may have put "?" correctly but remember that question-marks usually come at the ends of words, so you will need *"?".
Tip
Bung in an asterisk or two. You're more likely to find book* than book.

14.54 no concordance stop list words
No concordance stop list words

14.55 no deleted lines to zap
No deleted lines to Zap
You pressed Alt-Z but hadn't any deleted lines to zap. No harm done.

14.56 no entries in KeyWords database
No entries in Keywords Database
Alter settings and try again.

14.57 no key words found
No Key Words found
Alter settings and try again. The minimum frequency is set too high and/or the p value too small for any key words to be detected. For very short texts a minimum frequency of 2 may be needed.

14.58 no key words to plot
No key words to plot
Had you deleted them all?
14.59 no KeyWords stop list words
No keyword stop list words
WordSmith either failed to read your stop-list file or it was empty.

14.60 no lemma list words
No lemma match list words
WordSmith either failed to read your lemma list file or it was empty.

14.61 no match list words
No match list words
WordSmith either failed to read your match list file, or it was empty, or you forgot to check the action to be taken (one option is None). Or you tried to match up using a list of words, or a template, when the current column has only numbers. Or else there really aren't any like those you specified!

14.62 no room for computed variable
No room for computed variable
There isn't enough space for the variable you're trying to compute.

14.63 no statistics available
No statistics available
Some types of word list created by WordSmith Tools, e.g. a word list of a key words database, have words in alphabetical and frequency order but no statistics on the original text files. You cannot therefore call the statistics up in WordList. You might also see this message if the statistics file you're trying to call up is corrupted.

14.64 no stop list words
No stop list words
WordSmith either failed to read your stop-list file or it was empty.

14.65 no such file(s) found
No such file(s) found
You typed in the name of a non-existent file. If typing in a filename, remember to include the full drive and folder as well as the filename itself.

14.66 no tag list words
No tag list words
WordSmith either failed to read your tag file or it was empty.

14.67 no word lists selected
No word lists selected
For WordSmith to know which word lists to compare, you need to select them, by clicking on one in each folder. If you've changed your mind, press Cancel.

14.68 not a valid number
Not a valid number
Either you've just typed in, or else WordSmith Tools has just attempted to read (e.g.
from wordsmith.ini, the defaults file), something which is expected to be a number but wasn't. Computers will not see the capital O as equivalent to the number 0. Or else there is a number but accompanied by some other letters or symbols, e.g. 30. If this happens when WordSmith is starting up, check out the wordsmith.ini file for mistakes.

14.69 not a WordSmith file
The file you are trying to open is not a WordSmith Tools file. WordSmith makes files containing your results, files whose names end in .LST, .CNC, .KWS, etc. These are in WordSmith's own format and cannot be opened up by Microsoft Word -- likewise a plain text file or a Word .doc cannot usually be read in by WordSmith as a data file, but only as a text file for processing.
See also: Converting Data from Previous Versions

14.70 not a current WordSmith file
Not a Current WordSmith File
The file you are trying to open was made using WordSmith but either it's a file made using version 1-3 or it's a file made with the beta version of WordSmith 5 and the format has had to change (sorry!). If the former, you may be able to convert it using the Converter.

14.71 nothing activated
Nothing activated
Some forms have choices labelled "Activated" which you can switch on and off. If they are un-checked, you can still see what they would be but WordSmith will ignore them.

14.72 original text file needed but not found
Original text file(s) needed but not found
To proceed, WordSmith needed to find the original text file which the list was based on. But it has been moved or renamed. Or if on a network, your network connection is not mapped, or the network is down ...or else the right disk or CD-ROM is not in the drive!

14.73 printer needed
WordSmith needs a printer driver to be installed, even if you never actually print anything. You don't need to buy a printer or to switch a printer on, but the Print Preview function in Concord, WordList, KeyWords etc.
does need to know what sort of paper size you would print to. If you get a message complaining that no printer has been installed, choose Start | Settings | Printers & Faxes and install a default printer (any printer will do) in Windows.

14.74 registration code in wrong format
Registration code must be as in this example
2 letters or numbers, a dot, then 4 numbers, dot, 4 numbers etc.
Example: XX.1234.5678.9012.3456 (dots every 4 letters)

14.75 registration is not correct
Registration is not correct
It doesn't match up with what's required for a full updated version! The old registration code in earlier versions is no longer in use. WordSmith will still run but in Demonstration Version mode.

14.76 short of memory
Short of Memory!
An operation could not be completed because of shortage of RAM.

14.77 source folder file(s) not found
Source Folder file(s) not found
You typed in the name of a non-existent file. If typing in a filename, remember to include the full drive and folder as well as the filename itself.

14.78 stop list file not found
Stop list file not found
You typed in the name of a non-existent file. If typing in a filename, remember to include the full drive and folder as well as the filename itself.

14.79 stop list file not read
Stop list file not read
Something has gone wrong with a disk reading operation. The file you're trying to read in may be corrupted. This happens easily if you often handle very large files, especially if it's a long time since you last ran Scandisk to check whether any clusters in your files have got lost. See your DOS or Windows manual for help on fragmentation.

14.80 tag file not found
Tag File not found
You typed in the name of a non-existent file. If typing in a filename, remember to include the full drive and folder as well as the filename itself.

14.81 tag file not read
Tag list file not read
Something has gone wrong with a disk reading operation.
The file you're trying to read in may be corrupted. This happens easily if you often handle very large files, especially if it's a long time since you last ran Scandisk to check whether any clusters in your files have got lost. See your DOS or Windows manual for help on fragmentation.

14.82 this function is not yet ready
This function is not yet ready!
Temporary message, for functions which are still being tested.

14.83 this is a demo version
This is a demo version
You will probably want to upgrade to the full version.

14.84 this program needs Windows 98 or greater
This program needs Windows 98 or better
As of Version 4.0 and above, this is a 32-bit program (and a 32-bit help file).

14.85 to stop getting this message ...
Get an update. This is "annoyware" for the demonstration version.

14.86 too many requests to ignore matching clumps
The limit is 50. Do any remaining joining manually.

14.87 too many sentences
The limit is 8,000. Do the task in pieces.

14.88 truncating at xx words -- tag list file has more
The tag list file has more entries than the current limit. Or else it isn't a tag list file at all!

14.89 two files needed
You need to select 2 files for this procedure. Select (by clicking while holding down the Control key) 2 file-names in the list of files.

14.90 unable to merge Keywords Databases
Perhaps there wasn't enough RAM to carry out the merge.

14.91 why did my search fail?
The standard search function (F12) for a list of data operates on the currently highlighted column. If you want to search within data from another column, click in that column first.
By default, a search is "whole word". Use * at either end of the word or number you're searching for if you want to find it, e.g. in any data consisting of more than one word. (The advantage of the asterisk system is that it allows you to specify either a prefix or a suffix or both, unlike the standard Windows search "whole word" option.)
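
To see why a search can fail without asterisks, here is a minimal Python sketch of a "whole word" search with optional * at either end. It is only an illustration under assumptions, not WordSmith's own search routine: the names cell_matches and search_column and the sample column are hypothetical, matching is assumed to ignore case, and Python's fnmatch also gives special meaning to ? and [ ], which may differ from what WordSmith does.

```python
from fnmatch import fnmatchcase


def cell_matches(cell: str, pattern: str) -> bool:
    """Hypothetical helper: the pattern must account for the whole entry;
    '*' stands for any run of characters (including none)."""
    # Lower-casing both sides is an assumption made for this sketch.
    return fnmatchcase(cell.lower(), pattern.lower())


def search_column(column, pattern):
    """Return the entries of one column that the pattern matches."""
    return [cell for cell in column if cell_matches(cell, pattern)]


# A made-up column of data, as if it were the highlighted column.
column = ["book", "bookshop", "text book", "notebook", "1852"]

print(search_column(column, "book"))    # ['book']
print(search_column(column, "book*"))   # ['book', 'bookshop']
print(search_column(column, "*book"))   # ['book', 'text book', 'notebook']
print(search_column(column, "*book*"))  # ['book', 'bookshop', 'text book', 'notebook']
```

Without an asterisk the multi-word entry "text book" is missed; an asterisk in front acts as a suffix search, one behind acts as a prefix search, and one at each end finds the word anywhere in the entry.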
14.92 word list file is faulty
Each type of file created by WordSmith Tools has its own default filename extension (e.g. .LST, .KWS) and its own internal structure. If you have another file with the same extension produced by another program, this will not be compatible. It would not be sensible to rename a .CNC file to .TXT, or vice-versa! WordSmith has detected that the file you're calling up wasn't produced by the current version of WordList.

14.93 word list file not found
You typed in the name of a non-existent file. If typing in a filename, remember to include the full drive and folder as well as the filename itself.

14.94 WordList comparison file is faulty
Each type of file created by WordSmith Tools has its own default filename extension (e.g. .LST, .KWS) and its own internal structure. If you have another file with the same extension produced by another program, this will not be compatible. It would not be sensible to rename a .CNC file to .TXT, or vice-versa! WordSmith has detected that the file you're calling up wasn't produced as a comparison file by WordList.

14.95 WordSmith Tools already running
Don't try to start WordSmith Tools again if it's already running. Just Alt-tab back to the instance which is running. (You can, however, have several copies of each tool running at once.)

14.96 WordSmith Tools expired
Message for limited period users only. Your version of WordSmith Tools has passed its validity and is now in demo mode. Download another from the Internet.

14.97 WordSmith version mis-match
Since the various Tools are linked to each other, it is important to ensure that the component files are compatible with each other. If you get this message it is because one or more components is dated differently from the others. Solution: download those you need from one of the contact websites.

14.98 XX days left
Message for limited period users only. At the end of this time WordSmith will revert to demo mode.
Index 258 2007 Mike Scott Index - . .DOC to plain text 195 .ini files 55 - 2 25 lines 238 - 3 32-bit version 208 - A about option 223 accents 212 accents & symbols 212 accents window 19 accessing previous results 51 accurate sort in WordList 162 acknowledgements 208 add to text 120 add value to corpus 74 adding notes to data 19 adjust settings 19 adjusting with mouse 202 advanced concordance settings 95 advanced settings 20 aligning 201 alignment 46 altering your data 36 alternative search words 112 alt-tab 61 annotate source texts 74 ansi 212 API 209 apostrophes in sorting 109 Application programming interface 209 ascii 212 associate defined 119 associated entries 143 associates 119 asterisk 112 auto-joining lemmas 136 autoload tag file 65 automated file-based concordancing 102 - B Babbage 239 Baltic 26 batch choosing 120 batch processing 23 batch processing and Excel 23 batch processing: file-names 23 batch processing: folders 23 bibliography 209 blanking out entries 80 BNC handling of sentences and headings 73 BNC Sampler version 218 BNC: selecting between texts 69 BNC: selecting within texts 70 BNC: tag file 71 BNC: text format 221 boolean and/not 99 boolean or 112 bracket first line 183 browsing original 93 bugs 210 burstiness 229 buttons 226 - C calculating a plot 128 calculation of KeyWords 127 call a concordance 123 calling other tools 224 cannot compare word-lists in different languages 248 can't see Concord tags 237 case sensitivity 112 CD-ROM version: defaults 55 CD-ROM: speed 232 CD-ROM: storage 229 Central European 26 WordSmith Tools259 2007 Mike Scott changing colours 30 changing font 44 changing from edit to type-in mode 218 character sets 211 characters in save as text 113 characters within a word 60 Charles Babbage 239 check current version 16 Cherokee 213 Chinese 213 chi-square 127 Choose Languages: overview 5 choosing files from standard dialogue box 30 choosing texts 27 choosing your files 120 class instructions 30 clear previous selection 27 
clipboard 214
clumps 120
clumps: regrouping 130
cluster: definition 217
clusters 231
clusters in KeyWords 121
cocoa tags 65
codepages 211
codes 211
collocates 86
collocates: display 84
collocates: highlighting in concordance 83
collocates: horizons 82
collocates: minimum frequency 82
collocates: sorting 111
collocation associates 119
collocation patterns 107
collocation: settings 82
collocation: specifications 82
coloured tags in WordList 163
colours 30
colours in tags 71
column headings 43
column tagged conversion 195
column totals 32
column width 46
columns in printing 45
comparing wordlists 138
comparison display 139
compute new column of data 33
Concord: categories 81
Concord: clusters 87
Concord: collocation 86
Concord: creating exercises 80
Concord: index 79
Concord: limitations 223
Concord: multiple search-words 102
Concord: nearest tag 104
Concord: overview 4, 79
Concord: patterns 107
Concord: saving and printing 91
Concord: sorting 109
Concord: sound and video 93
Concord: source text file 93
Concord: starting tips 11
Concord: stretching the display to see more 93
Concord: text segments 111
Concord: uniform plot 90
Concord: viewing options 92
Concord: what you see and can do 93
Concord: wildcards 112
Concord: zapping unwanted lines 101
concordance batch processing 95
concordance display 93
concordance display: highlighting collocates 83
concordance settings 95
concordancing on tags 98
Concord's save as characters 114
confirmation messages: okay to re-read 245
consistency analysis (detailed) 141
consistency analysis (simple) 142
consistency lists: sorting 157
contact addresses 216
context word 99
contextual frequency sort 109
controller (wshell.exe) 4
convert data from old version 171
convert from UTF-8 20
converter 188
copy choices 33
copy data to Word 214
copy: all 33
copy: selective 33
copy: specify 33
correcting filenames 56
couldn't merge KW databases 256
count data frequencies 34
crash 210
creating a database 123
custom .dll file 36
custom column headings 43
custom processing 36
custom settings 39
custom settings for BNC tags 67
customising menus 20
cut spaces 113
cutting line starts 70
Cyrillic 26
- D
data as text file 129
database construction 123
database statistics 126
date format 217
decimal places 46
defaults 55
defining multimedia tags 66
definition of associate 119
definition of key key-word 125
definition of key-ness 125
definitions 217
deleting entries 62
demonstration version 218
Dickens text 27, 50
directories 219
disambiguation 120
dispersion 90
dispersion plot: sorting 111
displaying comparisons 139
DOS codes 213
DOS to Windows 195
download new version 16
dual-text aligning with Viewer 201
duplicate concordance lines 108
dynamic concordancing 103
- E
edit mode 218
editing column headings 43, 46
editing concordances 101
editing WordList entries 41
end of heading marker 60
end of paragraph marker 60
end of sentence marker 60
end of text separator 183
end-of-text symbols 184
English 239
Entities to characters 195
entity references 64
error messages 242
error messages: .ini file not found 243
error messages: base list error 243
error messages: can only save words as ASCII 244
error messages: can't call other tool 244
error messages: can't make folder as that's an existing filename 244
error messages: can't merge list with itself! 244
error messages: can't read file 244
error messages: character set reset to to suit 244
error messages: concordance file is faulty 245
error messages: concordance stop list file not found 245
error messages: conversion file not found 245
error messages: destination folder not found 245
error messages: disk problem -- file not saved 245
error messages: dispersions go with concordances 246
error messages: drive not valid 246
error messages: failed to access Internet 246
error messages: failed to create new folder 246
error messages: failed to read file 246
error messages: failed to save 246
error messages: file access denied 246
error messages: file contains "holes" 247
error messages: file contains none of the tags specified 247
error messages: file not found 247
error messages: filenames must differ 247
error messages: form incomplete 248
error messages: full drive & folder name needed 248
error messages: function not working properly yet 248
error messages: invalid concordance file 248
error messages: invalid file name 248
error messages: invalid KeyWords database file 248
error messages: invalid KeyWords file 248
error messages: invalid WordList comparison file 249
error messages: invalid WordList file 249
error messages: joining limit reached 249
error messages: KeyWords database file is faulty 249
error messages: KeyWords file is faulty 249
error messages: limit of file-based search-words reached 249
error messages: links between Tools disrupted 250
error messages: match list 250
error messages: must be a number 250
error messages: network registration used elsewhere 250
error messages: no access to text file - in use elsewhere? 250
error messages: no associates found 250
error messages: no clumps identified 251
error messages: no clusters found 251
error messages: no collocates found 251
error messages: no concordance entries found 251
error messages: no concordance stop list words 251
error messages: no deleted lines to zap 251
error messages: no entries in KeyWords database 251
error messages: no key words found 252
error messages: no key words to plot 252
error messages: no KeyWords stop list words 252
error messages: no lemma list words 252
error messages: no match list words 252
error messages: no room for computed variable 252
error messages: no statistics available 252
error messages: no stop list words 252
error messages: no such file(s) found 253
error messages: no tag list words 253
error messages: no word lists selected 253
error messages: not a valid number 253
error messages: not a WordSmith file 253
error messages: nothing activated 253
error messages: original text file needed but not found 254
error messages: printer needed but not found 254
error messages: registration string is not correct 254
error messages: registration string must be 20 letters long 254
error messages: short of memory 254
error messages: source folder file(s) not found 254
error messages: stop list file not found 254
error messages: stop list file not read 255
error messages: tag file not found 255
error messages: tag file not read 255
error messages: the program needs Windows 98 or greater 255
error messages: this function is not yet ready 255
error messages: this is a demo version 255
example 124
Excel 52
exercises 80
exiting 52
expiry date 257
export to spreadsheet etc. 52
extracting from text files 189
- F
favourite texts 26
file associations 218
File Utilities: compare 2 files 186
File Utilities: file chunker 187
File Utilities: find duplicates 187
File Utilities: index 183
File Utilities: overview 5
File Utilities: rename 188
File Viewer 181
file-based lemmatisation 143
file-based search-words or phrases 102
filenames 225
filenames display 58
filenames: editing 56
file-types 218
filtering 48
finding a word 57
finding by typing 57
finding entries 150
finding relevant files 43
finding source texts 219
first use of WordSmith 50
folders 219
folders created using text converter 197
follow-up concordancing 103
fonts 44
for use on pc named XXX 247
format 46
formulae 220
frequency of happi* 34
full lemma processing 132
- G
general settings 45
get favourite text selection 26
getting started 2
getting started with Concord 11
getting started with KeyWords 12
getting started with WordList 13
globality of plot 229
Greek 26
greek font 44
grow and shrink 93
- H
handling multiple windows 61
handling tag-types 65
heading marker 60
headings 46
headings (specifying) 161
headings: definition 217
headings: start & end 73
hide tags 113
hide words 113
highlighting collocates in concordance 83
history list 51, 112
holes in file 247
horizons 82
hotkey combinations 225
how many words 223
how much text 223
how to build a database 123
HTML XML 221
HTML & SGML tags 65
HTML headers: cutting out 67
HTML/BNC entities to characters 195
hyphen treatment 221
hyphens 60
- I
idioms 231
illegible 239
importing text into a word list 158
index lists: uses 144
information about WordSmith version 234
installing WordSmith Tools 15
instructions folder 16
interface 222
international versions 222
Internet Explorer 225
Into Unicode 195
introduction to WordSmith Tools 2
inverted commas 237
it won't do what I want 237
- J
Japanese 213
joiner 185
joining entries 143
joining text files 185
- K
key key word defined 125
key key-words 126
key words example 124
keyboard 225
key-ness defined 125
keys for searching 150
KeyWords database 126
keywords minimal processing 132
KeyWords: advice 126
KeyWords: calculation 127
KeyWords: clusters 121
KeyWords: display 131
KeyWords: index 118
KeyWords: limitations 223
KeyWords: links 127
KeyWords: overview 5
KeyWords: purpose 118
KeyWords: sorting 130
KeyWords: starting tips 12
KeyWords: tips 126
- L
language 26
Languages Chooser: font 177
Languages Chooser: language 175
Languages Chooser: other languages 178
Languages Chooser: overview 174
Languages Chooser: saving settings 178
Languages Chooser: sort order 177
layout 46
lemma file 137
lemma list 137
lemma matching: WordList 137
lemmas 143
lemmatising source texts 195
lemmatising with custom .dll 36
limitations 223
links between tools 224
list of buttons 226
localisation 222
locating entry-types 150
log file to trace problems 20
log likelihood 127
Log Likelihood score 155
logging 20
long file names 225
- M
machine requirements 226
make a word list from keywords data 128
making a tag file 71
making Wordlist Index 146
manual for WordSmith Tools 226
marking 143
marking context-word in txt 91
marking search-word in txt 91
mark-up 64
mark-up types 64
match list 48
memory usage 229
menu choices 226
menu shortcuts 20
merge concordances 139
merge wordlists 139
MI score 155
MI3 score 155
Microsoft Word 214
Minimal Pairs 178
Minimal Pairs: aim 178
Minimal Pairs: choosing files 179
Minimal Pairs: output 179
Minimal Pairs: overview 6
Minimal Pairs: requirements 178
Minimal Pairs: rules and settings 180
Minimal Pairs: running the program 180
modify source texts 74
moving sentences 202
multimedia concordancing 93
multimedia tags 66
multiple file analysis 126
multiple lists 23
multi-word unit 74
mutual information scores 151
mutual information screen 155
mutual information: computing 153
- N
nag message 255
nearest tag 104
negative keyness 125
negative keywords 132
network defaults 55
network settings 16
network version 16
networks: defaults 55
new in version 5 4
new user 50
n-grams in WordList 147
not a current WordSmith file 253
notes 19
number of concordance entries 113
number sort 109
numbering: paragraphs 203
numbering: sentences 203
numbers 60
numbers: how treated 229
- O
online screenshots 4
options for defaults 55
ordering details 218
over-writing 190
Oxford University Press 218
- P
p value 128
paragraph marker 60
paragraph numbering 203
paragraph: start & end 73
paragraphs (specifying) 161
paragraphs: definition 217
partial save 56
patterns: highlighting in concordance 83
percentages v. raw numbers 116
phrases 231
plot dispersion value 229
plot display 129
plots and links 127
plotting key words 128
popup menu 20
Portuguese 26
potato-peeling machine 239
previous lists 51
price 218
print preview 51
printer settings 45
printing 51
programming WordSmith 209
purple marks 93
purpose of Splitter 183
purpose of Text Converter 188
purpose of Viewer 200
- Q
quitting 52
quotation marks 237
- R
RAM availability 229
random deletion of entries 52
range 141, 142
raw numbers 116
raw numbers v. percentages 116
reduce data to N entries 52
reference corpus 230
registry 218
regrouping clumps 130
remove duplicates 108
rename numerous files 188
re-ordering 62
re-ordering word lists 41
repeated concordance lines 108
replacing 190
report on a crash 210
research uses 79
re-sorting a word list 162
re-sorting: collocates 111
re-sorting: Concord 109
re-sorting: consistency lists 157
re-sorting: dispersion plot 111
re-sorting: KeyWords 130
restore last file 230
restore last work 45
restricted search 99
ruler 129
Russian 26
russian font 44
- S
save as HTML 52
save as text 52
save as XML 52
save favourite text file set 26
save layout 46
save part of data 56
saving defaults 55
saving results 56
search & replace 56
search by typing 218
search word syntax 112
searching by typing 57
searching for a word or part of a word 57
searching using menu 150
section tag 71
section: start & end 73
selecting between texts 69
selecting multiple entries 230
selecting within texts 70
sentence marker 60
sentence numbering 203
sentence only 113
sentence: start & end 73
sentences (specifying) 161
sentences: definition 217
Set column 81
setting up a training session 30
shortcuts 225
show help at startup 55
show help file 45
single words 231
slash 112
slow 240
sorting tags 104
sorting: Concord 109
sorting: KeyWords 130
sorting: WordList 162
sound & video tagged files 93
sound file tags 66
source texts 219
source texts conversion 195
source texts: modify 74
specific limitations 223
speed 232
Splitter 183
Splitter: filenames 184
Splitter: index 183
Splitter: overview 6
Splitter: symbols 184
Splitter: wildcards 184
splitting 203
standardised or mean type/token ratio 160
start and end of sentence 73
statistics 157
statistics of a database 126
status bar 226, 233
statusbar 45
stop lists 58
stoplist.cod 197
stopping 59
storage 229
store text files 27
student use 79
summary statistics 34
suspending processing 59
symbols 212
- T
tag concordancing 98
tag context 95
tag file 71
tag types 64
tagged text 64
tags as selectors 67
tags in WordList 163
tags to exclude 71
tags to retain 71
tags: overview 64
teacher instructions 30
teaching uses 79
text characteristics 60
Text Converter: asterisk 193
Text Converter: conversion file 197
Text Converter: cutting header 190
Text Converter: extracting 189
Text Converter: folders 190
Text Converter: index 189
Text Converter: limitations 223
Text Converter: move if 196
Text Converter: overview 6
Text Converter: removing all tags 193
Text Converter: sample conversion file 198
Text Converter: settings 190
Text Converter: syntax 193
Text Converter: wildcards 193
text file: use to build a word list 158
text formats 60
text segments in Concord 111
texts: choosing 27
texts: more texts 27
the ~ operator 99
tie-breaking 109
too many requests to ignore matching clumps 255
too many sentences 255
toolbar 45, 226
tools for pattern-spotting 233
training students 30
troubleshooting 237
troubleshooting: accented symbols 238
troubleshooting: apostrophes not found 237
troubleshooting: colours unreadable 239
troubleshooting: column spacing 237
troubleshooting: Concord tags problem 237
troubleshooting: Concord/WordList mismatch 238
troubleshooting: crashed 238
troubleshooting: curly quotation marks 237
troubleshooting: demo limit 238
troubleshooting: keys don't respond 239
troubleshooting: pineapple-slicing 239
troubleshooting: printer won't print 239
troubleshooting: quotation marks not found 237
troubleshooting: smart quotations 237
troubleshooting: takes ages 240
troubleshooting: Viewer 205
troubleshooting: weird symbols 238
troubleshooting: won't start 240
troubleshooting: WordList out of order 240
truncating at xx words 256
two files needed 256
Two word-list analysis 119
type/token ratios 160
typeface 46
type-in mode 218
type-in search 57
types of tag 64
- U
undefined tags 113
Unicode codes 213
university or school work 30
Unix to Windows 195
unjoining 143
unmarking 143
unreadable 239
updater.exe 15
updating your version 15
user-defined categories 81
user-defined categories: saving 74
user-defined process 36
UTF16 195
UTF8 195, 214
UTF-8 214
- V
value-added annotation 74
version 4 differences 208
Version Checker: overview 7
version checking 16
version date 234
version francaise 222
version mis-match 257
Viewer 200
Viewer: aligning the sentences 202
Viewer: colours 203
Viewer: editing 202
Viewer: languages 202
Viewer: limitations 223
Viewer: overview 8
Viewer: reading in your plain text 203
Viewer: sentence joining 203
Viewer: settings 204
Viewer: technical aspects 204
Viewer: translation mis-matches 205
Viewer: unusual sentences 206
Viewer: viewing options 203
viewing original text file 93
- W
WebGetter: display 173
WebGetter: limitations 174
WebGetter: overview 8, 171
WebGetter: settings 171
what is a concordance 80
What's new 4
whole word search 112
why did search fail? 256
why won't it... 237
window management 61
Windows 2000 226
Windows 95 filenames 225
Windows 98 226
Windows character set codes 213
Windows NT 226
Windows Vista 226
Windows XP 226
word list file not found 256
word list is faulty 256
word patterns 107
word separators 218
word: definition 217
WordList comparison file faulty 256
WordList index lists: viewing 144
WordList overview 5
WordList: altering entries 41
WordList: case sensitivity 161
WordList: clusters 147
WordList: create using text file 158
WordList: index 135
WordList: limitations 223
WordList: minimum & maximum settings 161
WordList: purpose 135
WordList: sort 240
WordList: sort order 162
WordList: starting tips 13
WordList: tags 163
WordList: the basic display 164
WordSmith already running 256
WordSmith controller: Concord: settings 113
WordSmith controller: KeyWords settings 132
WordSmith controller: WordList settings 167
WordSmith Tools: installation 15
WordSmith Tools: manual 226
WordSmith version 234
wshell.exe (controller) 4
wshell.ini and networks 16
- X
XX days left 257
- Y
Yasumasa Someya 137
- Z
Z score 155
zapping 62
zip files 235