IB047 Unix Text Tools
Pavel Rychlý
pary@fi.muni.cz
2014 Mar 3

Tradition
■ Unix has had tools for text processing from the very beginning (1970s)
■ Small, simple tools, each doing only one operation
■ Pipe (pipeline): a powerful mechanism for combining tools

Short Description of Basic Text Tools
cat     concatenate files and print on the standard output
head    output the first part (few lines) of files
tail    output the last part (few lines) of files
sort    sort lines of text files
uniq    remove duplicate lines from a sorted file
comm    compare two sorted files line by line
wc      print the number of newlines, words, and bytes in files
cut     remove sections (columns) from each line of files
join    join lines of two files on a common field
paste   merge lines of files
tr      translate or delete characters

Short Description of Basic Text Tools (continued)
egrep   print lines matching a pattern
(g)awk  pattern scanning and processing language
sed     stream editor, used for substring replacement
        use perl -p for extended regular expressions

Text Tools Documentation
info    run info and select from a menu, or open a topic directly
        ■ info coreutils
        ■ info head, info sort, ...
        ■ info gawk
man     ■ man 7 regex
        ■ man grep, man awk, man tail, ...
--help  most tools display a short help message when given the --help option
        ■ sort --help, uniq --help, ...
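
The column-oriented tools in the list above (cut, paste, join, comm) are not used in the later examples, so a small sketch may help. The file names freq.tsv and pos.tsv and their contents are made up for illustration; the only assumption about the tools is that join and comm expect sorted input, which is why the locale is fixed to C.

```shell
#!/bin/sh
# Sketch: cut, paste, join and comm on two tiny, made-up files.
export LC_ALL=C                      # byte-wise ordering, so sort/join/comm agree
tmp=$(mktemp -d)                     # scratch directory for the sample files

# Hypothetical sample data: "word<TAB>frequency" and "word<TAB>part of speech".
printf 'apple\t3\nbanana\t7\ncherry\t1\n' >"$tmp/freq.tsv"
printf 'apple\tnoun\ncherry\tnoun\nrun\tverb\n' >"$tmp/pos.tsv"

# cut: keep only the first tab-separated column of each file.
cut -f1 "$tmp/freq.tsv" >"$tmp/words1"
cut -f1 "$tmp/pos.tsv" >"$tmp/words2"

# paste: glue the lines of two files side by side (tab-separated).
pasted=$(paste "$tmp/words1" "$tmp/words2")

# join: combine lines that share the same first field (both inputs are sorted).
joined=$(join -t "$(printf '\t')" "$tmp/freq.tsv" "$tmp/pos.tsv")

# comm -12: print only the lines that occur in both sorted files.
common=$(comm -12 "$tmp/words1" "$tmp/words2")

printf '%s\n' "$pasted" "--" "$joined" "--" "$common"
rm -r "$tmp"
```

Here join keeps only the words present in both files, while comm -12 reports the same intersection for plain one-column lists.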

Unix Text Tools Packages
Where to find them
■ set of system tools
■ different sets and different features/options on each Unix type
■ GNU textutils
■ GNU coreutils = textutils + shellutils + fileutils
■ other GNU packages: grep, sed, gawk
■ installed on all Linux machines
■ on Windows: install mingw32/cygwin, then coreutils, grep, ...

Text Tools Usage
■ command line tools: enter a command in a terminal (console) window
■ command name followed by options and arguments
■ options start with -
■ quote spaces and metacharacters: ' " $
■ redirect input and output from/to files using <, >
■ pipe into less to view a result without saving it

Text Tools Example 1
task       Convert a plain text file to vertical text (one token per line)
input      plain.txt
output     plain.vert
solutions  tr -s ' ' '\n' <plain.txt >plain.vert
           tr -sc a-zA-Z0-9 '\n' <plain.txt >plain.vert
           perl -ne 'print "$&\n" while /(\w+|[^\w\s]+)/g' \
               plain.txt >plain.vert

Text Tools Example 2
task       Create a word list
input      vertical text
output     list of all unique words with frequencies
solutions  sort plain.vert | uniq -c >dict
           sort plain.vert | uniq -c | sort -rn

Text Tools Example 3
task       Corpus/list size
input      vertical text/word list
output     number of tokens/different words
solutions  wc -l plain.vert
           wc -l dict
           grep -c -i '^[a-z0-9]*$' plain.vert

Text Tools Example 4
task       Create a list of bigrams
input      vertical text
output     list of bigrams
solution   tail -n +2 plain.vert | paste plain.vert - | \
               sort | uniq -c >bigram

Text Tools Example 5
task       Filtering
input      word list
output     selected values from the word list
solutions  grep '^[0-9]*$' dict
           awk '$1 > 100' dict

Text Tools Debugging
■ data-driven programming
■ cut the pipeline and display partial results
■ try a single command with a test input

Text Tools Exercise
task       Find all words in a word list that differ only by s/z alternation:
           apologize/apologise
solutions  tr s z <dict | sort | uniq -d >szalte

Text Tools Exercises
■ Find all words in a word list that differ only by s/z alternation, where each variant has a frequency higher than 50
■ and display their frequencies
■ Find all words that occur in the word list only capitalized (names).
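
One possible approach to the exercises, given as a sketch rather than the intended solution. It assumes the word list has the "frequency word" layout produced by uniq -c; the sample entries and the variable names sz_pairs and names are invented for illustration. The idea is to attach a normalized key (every s replaced by z) to each frequent word, sort by that key, and report neighbouring lines that share it.

```shell
#!/bin/sh
# Sketch for the exercises, assuming a word list in "uniq -c" layout.
export LC_ALL=C                      # deterministic, byte-wise sort order
tmp=$(mktemp -d)

# Hypothetical word list: frequency, then word.
printf '%s\n' ' 120 apologise' ' 300 apologize' '  60 organise' \
              '  40 organize' '  80 size' ' 200 zebra' \
              '  70 Prague' '  90 The' ' 500 the' >"$tmp/dict"

# Exercises 1 and 2: s/z variants that both occur more than 50 times,
# printed together with their frequencies.
#   awk #1: keep frequent words and prepend the key (word with s -> z)
#   sort  : bring lines with the same key next to each other
#   awk #2: print two consecutive lines that share a key
sz_pairs=$(
  awk '$1 > 50 { key = $2; gsub(/s/, "z", key); print key, $2, $1 }' "$tmp/dict" |
  sort |
  awk 'NR > 1 && ($1 "") == prevkey { print prev, "/", $2, $3 }
       { prevkey = $1 ""; prev = $2 " " $3 }'   # "" forces string comparison
)

# Exercise 3: words containing an upper-case letter whose all-lower-case
# form never occurs in the list (a rough approximation of "names").
names=$(
  awk '{ w = $2; l = tolower(w)
         if (w == l) lower[l] = 1; else cand[w] = l }
       END { for (w in cand) if (!(cand[w] in lower)) print w }' "$tmp/dict" |
  sort
)

printf '%s\n' "$sz_pairs" "--" "$names"
rm -r "$tmp"
```

In the sample data organise/organize is not reported because organize falls below the threshold, and The is not treated as a name because the also occurs. Following the debugging advice, each stage can be checked by cutting the pipeline after it and inspecting the partial result.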

XML processing
■ XML is text
■ use the same tools (textutils, grep, sort, ...)
■ API
   ■ SAX: Simple API for XML
   ■ DOM: Document Object Model
■ an analogy of "text tools" for XML

XML API: SAX
■ Simple API for XML
■ event-driven computation
■ events
   ■ begin/end of an element
   ■ element attribute
   ■ text
■ a method/function is called for each event
■ minimal resources required

XML API: DOM
■ Document Object Model
■ an XML document is represented by a tree
■ methods for accessing items of a document
■ methods for editing (making changes)
■ everything is held in main memory
■ good for random access
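
Since XML is text, the same pipelines apply to it. The sketch below counts element names and extracts token text from a small, made-up document. The file sample.xml, its w/s/doc elements and the variable names are assumptions for illustration, and grep -o is a GNU/BSD extension rather than a POSIX option. Pattern matching of this kind breaks on comments, CDATA sections or markup nested inside the matched element, which is where a SAX or DOM parser is the safer choice.

```shell
#!/bin/sh
# Sketch: treating XML as plain text with grep, sed, sort and uniq.
export LC_ALL=C
tmp=$(mktemp -d)

# Hypothetical annotated text: sentences (s) containing words (w).
cat >"$tmp/sample.xml" <<'EOF'
<doc>
<s><w pos="N">cats</w> <w pos="V">sleep</w></s>
<s><w pos="N">dogs</w></s>
</doc>
EOF

# Frequency list of element names: match each start tag, strip the "<",
# then count with the usual sort | uniq -c | sort -rn idiom.
tag_counts=$(
  grep -o '<[A-Za-z][A-Za-z0-9]*' "$tmp/sample.xml" |
  tr -d '<' | sort | uniq -c | sort -rn |
  awk '{ print $2, $1 }'               # normalize to "name count"
)

# Vertical text from the XML: take every <w ...>text</w> element and
# delete the tags, leaving one token per line.
tokens=$(
  grep -o '<w[ >][^<]*</w>' "$tmp/sample.xml" | sed 's/<[^>]*>//g'
)

printf '%s\n' "$tag_counts" "--" "$tokens"
rm -r "$tmp"
```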