VB036 - English 2 Textbook _ Authors: Martin Dvořák, Kateřina Repová, Ivana Túlajová All the texts in this textbook have been taken from Encyclopedia of Database Systems (Ozsu, M. Tamer; Liu, Ling (Eds.) 2009. Approx. 4100 p. 60 illus. ISBN: 978-0-387-49616-0) and abridged. Tento projekt je spDlufinjn:nván Evropským sniiálním fondem a stát nim rozpočtem České republiky, • EVROPSKÁ UNIE IT*1* M NI£"EnSTVO S-ÍOLSTVÍ. M -ADEÍE A TĚLOVÝCHOV P" I INVESTICE DO ROZVOJE VZDĚLÁVÁNÍ Phonetics Vowels Long Vowels il sheep Short Vowels I e ship head ai farm ae hat 9 above UI coo U foot 3«, mother (US) 31 horse D sock (UK) worm (US) 31 bird cup Consonants Voiced b book Z zoo j yes Voiceless P s pen say d 3 w t S day vision we town she 9 m k give jump moon cat cheese V very I look n name f fish ô r n e the run sing think Diphthongs ei day OU nose (US) ai eye 13 ear (UK) 31 boy ea hair (UK) au mouth U3 pure (UK) 3U nose (UK) Other Symbols h [|aend] hand \ ['bA\.&] butter (US) D ['kwaes.B] croissant (UK) u [,in.flu'en.za] influenza ['haepj] happy ['lit.!] little al, 3m, 3n can be pronounced either: dl or!, etc.: ['lei.bH] = ['lei.bal] = ['lei.b!] r linking r is pronounced only before a vowel in British English: [foir] four: [foiraep.lz] four apples ' main stress [,ek.spek'tei.J~3n] expectation , secondary stress [^|'tel] retell Syllable division ['SIS.tBITl] System Cambridge Dictionary Online [Internet]. c2010 [cited 2010 Feb 24]. Phonetics. Available from: . Annotation-based Image Retrieval XIN-JING WANG, LEI ZHANG Microsoft Research Asia, Beijing, China 5 Definition Given (i) a textual query and (ii) a set of images and their annotations (phrases or keywords) annotation-10 based image retrieval systems retrieve images according to the matching score of the query and the corresponding annotations. There are three levels of queries according to Eakins [7]: 15 • Level 1: Retrieval by primitive features such as color, texture, shape or the spatial location of image elements, typically querying by an example, i.e., "find pictures like this." • Level 2: Retrieval by derived features, with 20 some degree of logical inference. For example, "find a picture of a flower." • Level 3: Retrieval by abstract attributes, involving a significant amount of high-level reasoning about the purpose of the objects or 25 scenes depicted. This includes retrieval of named events, of pictures with emotional or religious significance, etc., e.g., "find pictures of a joyful crowd." 30 Together, levels 2 and 3 are referred to as semantic image retrieval, which can also be regarded as annotation-based image retrieval. Historical Background 35 There are two frameworks of image retrieval [6]: annotation-based (or more popularly, text-based) and content-based. The annotation-based approach can be tracked back to the 1970s. In such systems, the images 40 are manually annotated by text descriptors, which are used by a database management system (DBMS) to perform image retrieval. There are two disadvantages with this approach. The first is that a considerable level of human labor is required for manual annotation. The 45 second is that because of the subjectivity of human perception, the manually labeled annotations may not converge. To overcome the aforementioned disadvantages, content-based image retrieval (CBIR) was introduced in the early 1980s. In CBIR, images are 50 indexed by their visual content, such as color, texture, shapes. In the past decade, several commercial products and experimental prototype systems were developed, such as QBIC, Photobook, Virage, VisualSEEK, Netra, SIMPLIcity. 55 However, the discrepancy between the limited descriptive power of low-level image features and the richness of user semantics, which is referred to as the "semantic gap" bounds the performance of CBIR. On the other hand, due to the explosive growth of visual 60 data (both online and offline) and the phenomenal success in Web search, there has been increasing expectation for image search technologies. For these reasons, the main challenge of image retrieval is understanding media by bridging the semantic gap 65 between the bit stream and the visual content interpretation by humans [3]. Hence, the focus is on automatic image annotation techniques. Foundations 70 The state-of-the-art image auto-annotation techniques include four main categories [3,6]: (i) using machine learning tools to map low-level features to concepts, (ii) exploring the relations between image content and 75 the textual terms in the associated metadata, (iii) generating semantic template (ST) to support high-level image retrieval, (iv) making use of both the visual content of images and the textual information obtained from the Web to learn the annotations. 80 Machine Learning Approaches A typical approach is using Support Vector Machine (SVM) as a discriminative classifier over image low-85 level features. Though straightforward, it has been shown effective in detecting a number of visual concepts. Recently there has been a surge of interest in leveraging and handling relational data, e.g., images 90 and their surrounding texts. Blei et al. [1] extend the Latent Dirichlet Allocation (LDA) model to the mix of words and images and proposed a Correlation LDA model. 95 Relation Exploring Approaches Another notable direction for annotating image visual content is exploring the relations among image content and the textual terms in the associated metadata. Such metadata are abundant, but are often incomplete and noisy. By exploring the co-occurrence relations among the images and the words, the initial labels may be filtered and propagated from initial labeled images to additional relevant ones in the same collection [3]. Jeon et al. [5] proposed a cross-media relevance model to learn the joint probabilistic distributions of the words and the visual tokens in each image, which are then used to estimate the likelihood of detecting a specific semantic concept in a new image. 100 105 110 3 Semantic Template Approaches Though it is not yet widely used in the techniques mentioned above, Semantic Template (ST) is a 115 promising approach in annotation-based image retrieval (a map between high-level concept and low-level visual features). Chang and Chen [2] show a typical example of ST, in which a visual template is a set of icons or example 120 scenes/objects denoting a personalized view of concepts such as meetings, sunset, etc. The generation of a ST is based on user definition. For a concept, the objects, their spatial and temporal constraints, and the weights of each feature of each object are specified. 125 This initial query scenario is provided to the system, and then through the interaction with users, the system finally converges to a small set of exemplar queries that "best" match (maximize the recall) the concept in the user's mind. 130 Large-Scale Web Data Supported Approaches Good scalability to a large set of concepts is required in ensuring the practicability of image annotation. On 135 the other hand, images from the Web repositories, e.g., Web search engines or photo sharing sites, come with free but less reliable labels. In [9], a novel search-based annotation framework was proposed to explore such Web-based resources. Fundamentally, it is to 140 automatically expand the text labels of an image of interest, using its initial keyword and image content. The process of [9] is shown in Fig. 1. It contains three stages: the text-based search stage, the content-based search stage, and the annotation learning stage, which 145 are differentiated using different colors (black, brown, blue) and labels (A., B., C.).When a user submits a query image as well as a query keyword, the system first uses the keyword to search a large-scale Web image database (2.4 million images crawled from 150 several Web photo forums), in which images are associated with meaningful but noisy descriptions, as tagged by "A." in Fig. 1. The intention of this step is to select a semantically relevant image subset from the original pool. 155 Visual feature-based search is then applied to further filter the subset and save only those visually similar images (the path labeled by "B." in Fig. 1). By these means, a group of image search results which are both semantically and visually similar to the query image 160 are obtained. Finally, based on the search results, the system collects their associated textual descriptions and applies the Search Result Clustering (SRC) algorithm to group the images into clusters. 165 (Abridged) Figure 1: (Lake) 170 Recommended Reading 1. Blei D. and Jordan M.I.Modeling Annotated Data. In Proc. 26th Annual Int. ACM SIGIR Conf. on 175 Research and Development in Information Retrieval, 2003, pp. 127-134. 2. Chang S.-F, Chen W., and Sundaram H. Semantic Visual Templates: Linking Visual Features to Semantics. In Proc. Int. Conf. on Image Processing, 180 Vol. 3. 1998, pp. 531-534. 3. Chang S.-F., Ma W.-Y., and Smeulders A. Recent Advances and Challenges of Semantic Image/Video Search. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 2007, pp. 1205-1208. 185 4. Eakins J. and Graham M. Content-based image retrieval, Technical Report, University of Northumbria at Newcastle, 1999. 5. Jeon J., Lavrenko V., and Manmatha R. Automatic Image Annotation and Retrieval Using Cross-Media Relevance Models, In 190 Proc. 26th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2003, pp. 119-126. 6. Liu Y., Zhang D., Lu G., and MaW.-Y. A survey of content-based image retrieval with high-level 195 semantics. Pattern Recognition., 40(1):262 -282, 2007. 7. Long F., Zhang H.J., and Feng D.D. Fundamentals of content-based image retrieval. In Multimedia Information Retrieval and Management, D. Feng (eds.). Springer, 2003. 200 8. Rui Y., Huang T.S., and Chang S.-F. Image retrieval: current techniques, promising directions, and open issues, J. Visual Commun. Image Represent. 10(4):39-62, 1999. 9. Wang X.-J., Zhang L., Jing F., and Ma W.-Y. 205 AnnoSearch: Image Auto-Annotation by Search, Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2006, pp. 1483-1490. 10. Zhuang Y., Liu X., and Pan Y. Apply Semantic Template to Support Content-based Image Retrieval. 4 210 In Proc. SPIE, Storage and Retrieval for Media Databases, vol. 3972, December 1999, pp. 442^149. 5 Answer the following questions: 1) Describe retrieval by primitive features. 2) What is meant by abstract attributes in the context of retrieving images'? 3) What is meant by semantic image retrieval? Why is it called semantic? 4) What is annotation and what are its disadvantages? 5) What are the machine learning tools good for in image retrieval? 6) Describe the term of relation exploring approach. 7) What is the generation of semantic template based on? Match the following terms and their definitions: 1) semantic gap 2) SVM 3) metadata 4) CBIR a) a machine learning approach b) data about data c) the discrepancy between the limited descriptive potential of low-level image features and the richness of user's description d) getting back stored data objects byinspecting their visual characteristic, suchas color, texture, shapes. Mark the following statements as true or false: 1) Retrieval by derived features can be based on color of the picture. 2) The content-based approach uses descriptors to retrieve images. 3) SVM works on low-level features. 4) Images from the Web repositories come with highly reliable labels. 5) Search Result Clustering refers to grouping information about images. 6 Vocabulary abundant [a'bAn.cbnt] H"- hojný, překypující query ['kwia.ri] © ['kwir.i] - dotaz annotation [.aen.ai/'tei.J^n] © [-a-] =J - anotace classifier ['klae.si.faia] ^')- klasifikátor co-occurrence [kaua'ka.rans] © [koua'ka.rans] - společný výskyt descriptor [dis'kripta] - deskriptor, popisek discriminative [dis'kriminativ] © [dis'krimanativ] ^'j - rozlišující, schopný rozlišovat exemplar [ig'zem.pla:1'] © [-plair] - typický příklad, vzor, model explosive [ik'splau.siv] ©[-'splou-] - explozivní, výbušný hence [henřs] - tudíž (formální) incomplete [,in.kam'pli:t] - nekompletní likelihood ['lai.kli.hud] ^ - pravděpodobnost machine learning [ma'/iin] ['l3:.nin] © [1?:-] - strojové učení metadata ['metadeita] - metadata, data popisující jiná data noisy f'riDi.zi] H') - obsahující šum, hlučný phenomenal [fa'nom.i.n3!] © [-'na:.ma-] - úžasný, výjimečný pool [pu:l] ^')- úložiště, zásoba, bazén relational [ri'leifanal] ^')- relační, vztahový retrieval [n'triival] ^')- získávání, vyhledávání scenario [si'na:.ri.au] © [ sa'ner.i.ou] A') - scénář semantic [si'maen.tik] © [-tik-] 4 - sémantický, významový subset ['sAb.set] =J - podmnožina token ['tau.kan] © ['tou-] A') - znamení, znak, symbol to annotate st ['aen.aa.teit] © [-a-] &J - anotovat něco, vybavit anotací to bridge st [brid3] =J - překlenout něco to leverage st ['Ii:.v3r.id3] ©[ 'Iev.av.1d3] - využívat k užitku to overcome st (overcome, overcame, overcome) [,au.va'l - vrstva to alleviate [a'lii.vi.eit] ^')-odlehčit, ulevit to dedicate ['ded.i.keit] &J - věnovat, vyhradit to distinguish [dťstirj.gwif] =J - rozlišit, odlišit to execute ['ek.si.kjuit] =J - provést to focus on ['fau.kas on] © ['fou- am] ^') - zaměřit se na, soustředit se na to handle [haen.dI] &J - zabývat se, zacházet to offload [.Dflaud] © ['aif.loud] =J - odlehčit, snížit Phrases in relation to [rťlei.J"sn] =J - ve vztahu k prior to [praiar] © [prair] - před 11 Image Representation VALERIE GOUET-BRUNET CNAM Paris, Paris, France 5 Definition In computer science, the representation of an image can take many forms. Mostly it refers to the way that 10 the conveyed information, such as color, is coded digitally and how the image is stored, i.e. how an image file is structured. Several open or patented standards were proposed to create, manipulate, store and exchange digital images. They describe the format 15 of image files, the algorithms of image encoding such as compression as well as the format of additional information often called metadata. The visual content of the image can also take part in its representation. The more recent concept has provided new approaches 20 to representation and new standards, gathered together into the discipline named content-based image indexing. Historical Background 25 The first use of digital images began in the early 1920s with the technological development of facsimile transmission, and in particular with the Bartlane cable transmission system that was the first system that 30 translated pictures into a digital code for efficient picture transmission. Later, in the 1960s and 1970s, the advent of digital image technology was closely tied to the development of government programs for space exploration and espionage and also to medical research 35 with the invention of computerized axial tomography. With the availability of the CCD image sensors (charge-coupled device), the private sector also began to make significant contributions to the development of digital cameras. 40 In the mid-1980s, image format TIFF was created by the company Aldus with the aim of agreeing on a common file format for bitmapped images issued from scanners. In parallel, researchers at Xerox PARC had developed the first laser printer and had recognized the 45 need for a standard means of defining page images. After several fruitless attempts, Adobe Systems proposed the PostScript language in 1982, quickly adapted for driving laser printers. Since then, other file formats dedicated to digital images were also 50 proposed, such as GIF (1987), JPEG (1992), PDF (1993), PNG (1995) and SVG (1998). Today, a great effort is made to propose standard formats able to preserve data integrity for archiving purposes, to migrate easily to future technologies as well as to 55 provide efficient compression for access and dissemination. In 2000, the standard JPEG 2000 was proposed. Moreover, ambitious programs, MPEG-7 and MPEG-21, started in the late 1990s, are focusing on the 60 harmonization of methods for representing, storing, sharing and accessing multimedia contents (text, audio, image and video) in a unified framework. Foundations 65 Basics of Image Representation Digital images can be classified into two main categories: vector graphics and bitmapped images (also called raster images). Vector images are geometrical 70 2D objects created with drawing software or CAD (computer-aided design) systems. They are represented by geometrical primitives such as points, lines, curves, and shapes or polygons, which are all based upon mathematical equations. Unlike bitmaps that are 75 resolution-dependent, vector images are scalable, which means that the scale at which they are shown will not affect their appearance. Such images are dedicated to the representation of images with simple content, such as diagrams, icons or logos. 80 A bitmapped image is composed of a set of dots or squares, called pixels (for picture elements), arranged in a matrix of columns and rows. Each pixel has a specific color or shade of gray, and in combination with neighboring pixels it creates the illusion of a 85 continuous tone image. Unlike human vision, sensors that capture images are not limited to the visual band of the electromagnetic spectrum and digital images can cover almost the entire spectrum, ranging from gamma to radio waves. 90 Managing bitmapped images requires the choice and manipulation of several parameters such as: Color model. A color model is an abstract mathematical model describing the way colors can be represented as tuples of numbers. From this model, the 95 combination of dedicated primary colors provides all the colors possible that are embedded in the corresponding color space. RGB, CMYK, CIELAB and CIELUV are the most known color models and spaces. 100 Dynamic range. The dynamic range of a digital image (also called color depth) determines the maximum range of gray level or color values carried by each pixel. The number of bits used to represent each pixel determines how many colors can appear in the image. 105 Photographic-quality images are usually associated with 24-bit dynamic range, such as in the JPEG format. Resolution. Resolution expresses the density of elements, pixels for instance, within a specific area. This term does not have any sense when dealing with 110 digital images as files, but it applies when associating a digital image with a physical support, such as display on a screen, printing on a printer or capture with a 12 scanner. Resolution is classically represented in terms of dpi (dots per inch) unit, which was originally the 115 unit adopted for printing. Appearance of bitmapped images, which are made up of a fixed grid of pixels, clearly depends on the resolution chosen, unlike vector images that are scalable and then have the same appearance whatever 120 the dimensions chosen for visualization. Image Compression Image compression is the process of shrinking the size of digital image files. Compression algorithms are 125 especially characterized by two factors: compression ratio and generational integrity. Compression ratio is the ratio of compressed image size to uncompressed size and generational integrity refers to the ability for a compression scheme to prevent or mitigate loss of 130 data, and therefore image quality, through multiple cycles of compression and decompression. Lossless compression ensures that the image data is retained, as with Run-length encoding, Huffman coding and LZW coding. On the other hand, lossy compression schemes 135 involve intentionally sacrificing the quality of stored images by selectively discarding pieces of data. Run-Length Encoding (RLE) is probably the simplest form of lossless data compression: sequences in which the same data value occurs in many consecutive data 140 elements are stored as a single data value and count. Image data is normally run-length encoded in a sequential process that treats the image data as a ID stream, line by line, column by column or diagonally in a zigzag fashion. Common digital image formats for 145 run-length encoded data include TGA, PCX. It is possible with BMP, TIFF and JPEG. Huffman Coding, created in 1951 by David A. Huffman, is an entropy encoding algorithm used for lossless data compression. The basic idea of this 150 algorithm is to code with few digits the most common input symbols of a document. Each symbol is encoded by using a variable-length code table, where the codes are defined according to the estimated probability of occurrence for each possible symbol. Today, Huffman 155 coding is often used during the final process of some other compression methods such as JPEG and MP3. LZW Coding Lempel-Ziv-Welch (LZW) is a lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch and published in 160 1984. The compression algorithm builds a string translation table that is based on fixed-length codes (usually 12-bit). As the system character serially examines the document if the string read is not stored in the table, a new code is created in the table and 165 associated with this string. Otherwise, the current string is encoded with an existing code. This algorithm became very widely used after it became part of the GIF image format in 1987. In contrast to other compression techniques such as JPEG, it allows 170 preserving very sharp edges, suitable for line art images often stored in GIF format. JPEG Compression JPEG is the most common image format used for compressing and storing by digital cameras and other photographic image capture devices. 175 ' 'JPEG'' stands for Joint Photographic Experts Group, the name of the committee that created the standard, which was approved in 1994 as ISO 10918-1 standard. This algorithm stands on the representation of the image in the frequency domain by using a two- 180 dimensional DCT (Discrete Cosine Transform) that describes the variability of the signal in terms of low-level and high-level frequencies. The human eye notices small differences in brightness over a relatively large area, but does not distinguish the exact strength 185 of a high frequency brightness variation very well. Consequently, the amount of information in the high frequency components of the DCT can be neglected without drastically affecting perceptual image quality: the DCT components are divided by factors of a 190 quantization matrix that increase with the spatial frequency, and then rounded to the nearest integer. This is the main lossy operation in the whole process. Typically, many of the higher frequency components are rounded to zero and the other components become 195 small numbers, which take many fewer bits to store. File Formats There are a lot of image formats for vector graphics as well as for bitmapped images. Most of them are open 200 standards and patent expired for the others. Vector image formats contain a geometric description of the objects which can be rendered smoothly at any desired size. Among the most common formats, there are: 205 - SVG (Scalable Vector Graphics) is an open XML based standard created in 1998 and developed by the World Wide Web Consortium to address the need for a versatile, scriptable and all-purpose vector format for the web and otherwise. 210 - EPS (Encapsulated PostScript) is a standard file format created by Adobe Systems in the mid-1980s. It follows DSC (Document Structuring Conventions) rules that are a set of standards for PostScript. File formats abound for bitmapped images, but many 215 digital imaging projects have settled on the formula of TIFF, JPEG, GIF and also PNG files. - TIFF, for Tagged Image File Format, is a file format for storing images such as photographs as well as graphics. It was originally created by the company 220 Aldus, was then under the control of Adobe Systems and is now in the public domain. TIFF supports several lossless and lossy techniques of image compression, such as LZW, Huffman coding and JPEG. The ability to store image data in a lossless format makes TIFF 13 225 files a useful method for archiving images and preservation purposes. - JPEG, for Joint Photographers Experts Group, is a file format that was developed specifically for high quality compression of photographic images in a 24-bit 230 RGB color model. It is generally employed for online presentation and dissemination; the associated lossy algorithm for compression makes it inappropriate for archiving purposes. The file format associated is JFIF (JPEG File Interchange 235 Format, 1992), a public domain storage format for JPEG compressed images. Unlike TIFF, JFIF does not allow for the storage of associated metadata, a failing that has led to the development of SPIFF (Still Picture Interchange File Format), which is now the 240 international standard. - GIF, for Graphics Interchange Format, is an 8-bit image format for indexed colors that was introduced by CompuServe in 1987 and has since come into widespread usage for art images such as diagrams or 245 logos with a limited numbers of colors. In 1995 CompuServe proposed the PNG format (Portable Network Graphics) as a replacement for the GIF format without patent license. PNG offers a better and lossless compression technique called DEFLATE 250 (that combines LZ77 with Huffman coding). Since 2003, it has been an international standard. Today, the status of TIFF as the de facto standard format for archival digital image files is challenged by other formats such as PNG and JPEG 2000 that are 255 able to preserve data integrity as well as to provide efficient compression ratios for access and dissemination. Metadata Representation 260 Metadata are commonly defined as "data about data." They constitute the documentation or a structured description associated with a document. Image files automatically include a certain amount of metadata that are stored in an area of the file defined by the file 265 format and called the header but information may also be stored externally. In the widely used TIFF format, the term "tagged" indicates that developers can define and apply dedicated tags to enable them to include their own 270 proprietary information (called "private tags") inside a TIFF file without causing problems of compatibility. More recently, Exif format (Exchangeable image file format) was created by the Japan Electronic Industries Development Association. The latest version was 275 published in 2002 and while the specification is not currently maintained by any industry or standards organization, its use by camera manufacturers is nearly universal. Exif fields are generated at the creation of the image and should not be modified after with the 280 aim of including additional information like title or keywords. To do this, other formats are recommended, such as XMP (extensible Metadata Platform). This last is an XML-based standard for creating, processing and storing standardized, extensible and proprietary 285 metadata, created by Adobe Systems in 2001. XMP metadata can be embedded into a significant number of popular file formats. It is used in PDF and other image formats such as JPEG, JPEG 2000, GIF, PNG, TIFF and EPS. 290 In parallel to generic standards, standards dedicated to specific image applications also exist, such as DICOM (Digital Imaging and Communications in Medicine). This is a standard created in 1992 and widely adopted by hospitals, for handling, storing, printing and 295 transmitting information in medical imaging. It includes a file format definition and a network communications protocol. Metadata constitute the documentation of all aspects of digital files essential to their persistence, usefulness 300 and access. Images without appropriate metadata may become hard to view, migrate to new technology, or to access among large volumes of images. When annotation is inappropriate or is missing, the representation of the visual content of images by image 305 analysis may be an interesting alternative. Image Content Representation Born in the early 1990s, Content-Based Image Indexing (CBIR) is a discipline that exploits 310 techniques of image analysis and databases. Indexing an image by its content consists of automatically extracting structures that describe the visual content relevantly for the considered application. These structures can describe the visual content of an image 315 globally or locally by characterizing its distribution of color, shape and texture, or parts or objects of the image. The visual structures exhibited are considered as the index of the image, they are digitally represented with one or several multidimensional vectors called 320 signature of the image. Key Applications JPEG 2000 is a standard that gathers an image file 325 format and an algorithm of image compression, created by the Joint Photographic Experts Group committee in 2000. The coding algorithm of JPEG 2000 is similar to the JPEG one. It mainly differs in the use of wavelets instead of a DCT. Wavelets provide a decomposition 330 of the image into a pyramid of sub-images which store different levels of resolution of the image. They can be of two types, according to the objective: (1) a Daubechies wavelet transform that requires quantization to reduce the amount of bits representing 335 data, as JPEG does, and then imposes lossy compression; 14 (2) a rounded version of Le Gall wavelet transform, that uses only integer coefficients and then does not require quantization, providing lossless coding. 340 The aim of this standard is not only improving compression performance but also adding features, among which are transmission error resilience and region of interest (ROI). This last offers the opportunity of storing parts of the same picture using 345 different quality. Some parts of particular interest such as faces can be stored with higher quality, to the detriment of other ones where low quality/high compression can be tolerated. The JPEG 2000 standard defines two file formats that support embedded XML 350 metadata: JP2, which supports simple XML, and JPX, which has a more robust XML system based on an embedded metadata initiative of the International Imaging Industry Association (the DIG35 specification). 355 Future Directions Today, there is a need of a unified framework for the creation, representation, storage, access, delivery, 360 management and protection of multimedia contents. New standards for managing these contents are the international standards MPEG-21 and MPEG-7. MPEG-21 is a standard started in 1999 by MPEG (Moving Picture Experts Group) and now normalized 365 as ISO/TEC 21000. Its main objectives are to define an open framework for multimedia applications and more precisely to provide a standardized structure for various media contents and to facilitate their access, delivery, management and protection. MPEG-7 is 370 another ISO/IEC standard started by MPEG in 1998 and formally called Multimedia Content Description Interface. One of its main objectives is to provide unified and efficient searching, filtering and content identification methods for these media. 375 (Abridged) Recommended Reading 380 1. Adobe XMP main page: http://www.adobe.com/products/xmp/index.html. 5 2. Datta R. Joshi D. Li J. and Wang J. Z. Image retrieval: Ideas, influences, and trends of the new age. ACM Comput. Surv. 40 (2), 5 April 2008. 385 3. File format info. http://www.fileformat.info/format/could.htm. 3 4. Ian S., Burnett, Pereira F., Van de Walle R., and Koenen R. The MPEG-21 Book. Wiley, 462 pages, 6 May 2006. 390 5. Manjunath B.S., Salembier P., and Sikora T. Introduction to MPEG-7: Multimedia Content Description Interface. Wiley & Sons, 6 April 2002. 6. Murray J. D. and van Ryper W. Encyclopedia of Graphics File Formats. O'Reilly, 1152 pages, 2nd 395 edition, 1996. 7. Oleg S., Pianykh. Digital Imaging and Communications in Medicine (DICOM): A Practical Introduction and Survival Guide. Springer, 2008. 8. Salomon D., Motta G., and Bryant D. Data 400 Compression: The Complete Reference. Springer, 4th edition, 2006. 9. Taubman D. S. and Marcellin M. W. JPEG 2000: Image Compression Fundamentals, Standards and Practice. Kluwer International Series in Engineering 405 and Computer Science, Sees 642, 52001. 10. Wallace G. K. The JPEG still picture compression standard. Commun. of the ACM, 34(4):30-44, 3 April 1991. 15 Answer the following questions: 1) What are the two main categories of digital images and what is the difference between them? 2) What are the key factors in determining image quality? 3) What is compression and what types of compression are mentioned in the text? 4) Name some types of lossless compression. 5) What is JPEG and what is it used for? 6) Name some vector image formats. 7) What are file formats in which bitmap data can be saved? 8) What term is used for "data about data"? 9) Why do we need content-based image indexing? 10) In what way does JPEG 2000 differ from the JPEG standard? 11) What is the standard for multimedia applications? Match the following terms with their definitions: 1) Run-length encoding (RLE) 2) Bitmapped image 3) Content-based information retrieval 4) Color model 5) Lossy compression 6) TIFF 7) JPEG 2000 f) Large runs of consecutive identical data values replaced by a simple code with the data value and length of the run g) Tagged Image File Format Mark the following statements as true or false: 1) Vector graphics are images that are completely described using mathematical definitions. 2) Vector drawings can usually be scaled without any loss in quality. 3) Common raster images include 2- and 3-D architectural drawings, flow charts, logos and fonts. 4) The Lempel-Ziv (LZ) compression methods are among the most popular algorithms for lossy storage. 5) JPEG 2000 is a newer version of the JPEG format that compresses images via wavelets. 6) PNG (Portable Network Graphics) was a predecessor of the GIF format for transfer and display images online. a) A file format that uses wavelet compression b) An attempt to describe color in a mathematical, predictable and reproducible way c) An image made up of a given number of pixels, each with a specific color value, laid out in a grid d) Technology that is able to retrieve images on the basis of machine-recognizable visual criteria e) Reduction in file size that involves permanent loss of information 16 Vocabulary advent ['aed.vent], [-vant] sj - příchod, nástup band [baend] sj - pásmo column ['kol.am] © ['kai.lam] si - sloupec consecutive [kan'sek.ju.tiv] © [-t iv] H') - následný cosine ['kau.sain] © ['kou-] M 1 - kosinus curve [k3:v] © [ k?:v] M■>) - křivka dedicated ['ded.i.kei.tid] ® [-tid] - zaměřený, věnovaný detriment ['det.ri.mant] si - škoda, újma dimension [,dai'men.ř/3n] «J - rozměr dissemination [di'sem.i.neif3n] ^ - šíření edge [ed3] A '■ - hrana formula ['fDi.mju.la] © f'fDir-] H 5 - vzorec, program, plán framework ['freim.w3:k] © [-w?:k] M - rámec geometrical primitives [,d3i:.a'met.nk.al] ^ ) ['pri.mi.tivz] - geometrické útvary grid [grid] =i - mřížka charge coupled device [t/aid^] © [tja:rd3] M : ['kAp.lt] =i [di'vais] si - zařízení s vázanými náboji intentionally [in'ten.f3n.ali] si - záměrně line [lain] =i - čára, přímka lossless f'loslis] =i - bezeztrátový lossy ['losi] si - ztrátový means [mi:nz] - prostředek objective [ab'd3ek.tiv] 4 0 - cíl occurrence [a'kAr.3nfs] ® [-'k?:-] =i - výskyt perceptual [pa'sep.Ju.al] =i -vnímavostní persistence [pa'sis.t3nfs] © [paH <4 - stálost, vytrvalost point [p3int] =i - bod polygon ['polygon] © ['pa:.Ii.gain] H - mnohoúhelník, polygon probability [.prob.a'bil.i.ti] ©[,pra:.ba'bil.9.ti] - pravděpodobnost proprietary [pra'praia.tri] © [-ter.i] H>> -vlastnický, patentovaný replacement for [ri'pleis.mant] ^ - náhrada za representation [,rep.n.zen'tei.J3n] si - zobrazení, vyobrazení resilience [rťzil.i.anf s] =i - odolnost row [rau] © [ rou] si -řádek run-length coding [rAn] M■') [lerjXr 0] H ! [kaudirj] ® [koudirj] - kódování délkou běhu scalable ['skei.la.bl] =í - škálovatelný scriptable ['skřip.ta.bl] - skriptovatelný shade [feid] ^ - odstín shape [feip] ^) - tvar, útvar smoothly ['smu:5.li] - hladce strength [strerj9] - síla, intenzita string [strirj] ^ - řetězec to allow for [a'lau] ^ - počítat s čím to associate with [a'sau.si.eit] ® [-sou-] 4 - spojovat s to be made up of [meid] =i - být složen to capture ['kaep.tJV] ® [-tfa*] 4 - zachytit to convey [kan'vei] =J - sdělit, vyjádřit to differ In st ['dif.ar] ® [-a*] ^ - Ušit se v něčem to discard [dťskaid] © [-'skaird] *J - odhodit to distinguish [di'stin.gwif] =4 - rozlišit to embed st in st [im'bed] si - zapustit, zasadit do to employ [im'pbi] si - použít, využít to exploit [ik'spbit] ^1 ^ - využít, zužitkovat to facilitate [fa'sil.i.teit] ^ - usnadnit to focus on st ['fau.kas] ® ['fou-] si - zaměřit se na to challenge ['tfael.ind3] =i - zpochybnit, vznést námitky to migrate [maťgreit] ^) ^')- přesunout, přemístit to mitigate ['mit.i.geit] ® ['mít-] si - zmírnit to neglect st [ni'glekt] si - zanedbat, opomenout to preserve [pri'z3:v] ® [-'z?:v] ^ - uchovat to render ['ren.dar] ® [-da*] *t - zobrazit to retain [n'tein] A - uchovat to round [raund] =i - zaokrouhlit to sacrifice ['saek.n.fais] =i - obětovat to settle on st ['set.I] ® ['set-] *J - rozhodnout se pro to shrink [frirjk] si -zmenšit to stand for [staend] ^ ■) - znamenat to tie [tai] =i - vázat se, souviset to treat st as [tri it] si - považovat co za tuple [tju:pal] - n-tice unlike [An'laik] =i - na rozdíl versatile [V3:.sa.tail] © [Vyi.sa.tH] ^ ) <') -všestranný, univerzální 17 Phrases in a zigzag fashion ['zig.zaeg] ^ ) ['faej>n] -klikatě, cikcak in contrast to ['kon.traist] © ['kain.traest] =J -naproti tomu, na rozdíl in parallel to ['paer.a.lel] © ['per-] ^') - souběžně in particular [pa'tik.ju.lar] © [p^'tik.ja.la^] ^')-obzvláště widely used - hojně využívaný with the aim of agreeing - s cílem shodnout se 18 Processor Cache PETER BONCZ CWI, Amsterdam, The Netherlands 5 Definition To hide the high latencies of DRAM access, modern computer architecture now features a memory 10 hierarchy that besides DRAM also includes SRAM cache memories, typically located on the CPU chip. Memory access first checks these caches, which takes only a few cycles. Only if the needed data is not found, an expensive memory access is needed. 15 Key Points CPU caches are SRAM memories located on the CPU chip, intended to hide the high latency of accessing off- 20 chip DRAM memory. Caches are organized in cache lines (typically 64 bytes). In a fully-associative cache, each memory line can be stored in any location of the cache. To make checking the cache fast, however, CPU caches tend to have limited associativity, such that 25 storage of a particular cache line is possible in only 2 or 4 locations. Thus only 2 or 4 locations need to be checked during lookup (these are called 2- resp. 4-way associative caches). The cache hit ratio is determined by the spatial and temporal locality of the memory 30 access generated by the running program(s). Cache misses can either be compulsory misses (getting the cache lines of all used memory once), capacity misses (caused by the cache being too small to keep all multiply used lines in cache), or conflict misses (due to 35 the limited associativity of the cache). Most modern CPUs have at least three independent caches: an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a Translation Lookaside Buffer(TLB) 40 used to speed up virtual-to-physical address translation for both executable instructions and data. The TLB is not organized in cache lines; it simply holds pairs of (virtual, logical) page mappings, typically a fairly limited amount (e.g., 64). In practice, this means that 45 algorithms that repeatedly touch memory in more than 64 pages (whose size is often 4 KB) shortly after each other, run into TLB thrashing. This problem can sometimes be mitigated by setting a large virtual memory page size, or by using special large OS pages 50 (sometimes supported in the CPU with a separate, smaller, TLB for large pages). Another issue is the tradeoff between latency and hit rate. Larger caches have better hit rates but longer latency. To address this tradeoff, many computers use 55 multiple levels of cache, with small fast caches backed up by larger slower caches. Multi-level caches generally operate by checking the smallest Level 1 (LI) cache first; if it hits, the processor proceeds at high speed. If the smaller cache misses, the next larger 60 cache (L2) is checked, and so on, before external memory is checked. As the latency difference between main memory and the fastest cache has become larger, some processors have begun to utilize as many as three levels of on-chip cache. 65 For multi-CPU and multi-core systems, the fact that some of the higher levels of cache are not shared, yet provide coherent access to shared memory, causes additional cache-coherency inter-core communication to invalid stale copies of cache lines on other cores 70 when one core modifies it. In multi-core CPUs, an important issue is that cache level is shared among all cores - this cache level is on the one hand a potential hot-spot for cache conflicts, on the other hand provides an opportunity for very fast inter-core data exchange. 75 In case of sequential data processing, the memory controller or memory chipset in modern computers often detects this access pattern and starts requesting the subsequent cache lines in advance. This is called hardware prefetching. Prefetching effectively allows 80 hiding compulsory cache misses. Without prefetching, the effective memory bandwidth would equate cache line size divided by memory latency (e.g., 64/50ns = 1.2 GB/s). Thanks to hardware prefetching, modern computer architectures reach four times that on 85 sequential access. Modern CPUs also offer explicit prefetching instructions, which a software writer can exploit to perform (non-sequential) memory access in advance, hiding their latency. In database systems, such software prefetching has successfully been used 90 in making hash-table lookup faster (e.g., in hash-join and hash aggregation). In database systems, a series of cache-conscious data storage layouts (e.g., DSM and PAX) have been proposed to improve cache line usage. Also, a number 95 of cache-conscious query processing algorithms, such as cache-partitioned hash join and hash-join using memory prefetching, have been studied. In the area of data structures and theoretical computer science, there has recently been interest in cache-oblivious 100 algorithms that regardless of the exact parameters of the memory hierarchy (number of levels, cache size, cache line sizes and latencies) perform well. (Abridged) 105 19 Answer the following questions: 1) What kinds of independent caches are mentioned in the text? 2) What is the cache hit ratio determined by? 3) What kinds of cache misses are mentioned in the text? 4) What are cache-oblivious algorithms? 5) How do multi-level caches operate? 6) What enables a faster hash-table lookup? Match the following terms with their definitions: 1) hardware prefetching 2) cache miss 3) fully-associative cache 4) translation loookaside buffer a) this type of cache speeds up virtual-to-physical address translation b) requesting the subsequent memory lines in advance c) a situation when the requested data does not occur in the cache d) in this type of cache each memory line can be stored in any location of the cache Mark the following statements as true or false: 1) Modern computer architecture does not include DRAM, only SRAM. 2) A big problem with DRAM is its high latency. 3) A translation lookaside buffer is organized in cache lines. 4) Larger caches have worse hit rates but lower latency. Vocabulary Coherent [kaa 'hia.rant] ©[ kou'hir.snt] ^0 - souvislý, soudržný Compulsory [kam'pAl.ssr.i] © [ -sa*-] ^ - povinný conscious ['komfTas] © [ 'ka:n-] - vědom si něčeho fairly ['fea.li] © [ 'fer-] ^ - celkem, dost hotspot [/?DtspDt] © [hDitspoit] - ohnisko independent [,in.di'pen.dsnt] =J - nezávislý invalid [in'vael.id] =J - neplatný latency ['lei.t3nf .si] sj - zpoždění oblivious [a'bliv.i.as] - zapomnětlivý, zapomínající, nedbající (něčeho) ratio ['rei.Ji.au] © [ -ou] - poměr sequential [sťkwen.Jal] &J - následující, postupný spatial ['spei.J3!] prostorový stale [steil] ^') - zde zastaralý subsequent ['sAb.si.kwant] - následující, příští table [Vei.bl] ^)) - tabulka temporal ['tem.psr.3l] © [ -pa*.al] ^') - časový thrashing [Oraejirj] - zahlcení paměti to equate [ťkweit] - rovnat se to fetch [fetj] - přinést to mitigate ['mit.i.geit] © [ 'mít-] - zmírnit to multiply ['mAl.ti.plai] © [ -t i-] ^') - násobit to propose [pra'pauz] © [ -'pouz] ^') - navrhnout to tend to [tend] ^ ■) - mít tendenci k to utilize ['jui.ti.laiz] © [ -t al.ai-] ^-') - využít, upotřebit trade-off ['treid.of] © [ -a:f] ^) - kompromis Phrases in advance [ad'va:nfs] © [-Vaenřs] H : - dopředu, předběžně, v předstihu regardless of [rťga:d.las] © [-'ga:rd-] ^ - bez ohledu na 21 Spatial Data Mining SHASHI SHEKHAR, JAMES KANG, VIJAY GANDHI 5 University of Minnesota, Minneapolis, MN, USA Definition Spatial data mining is the process of discovering 10 nontrivial, interesting, and useful patterns in large spatial datasets. The most common spatial pattern families are co-locations, spatial hotspots, spatial outliers, and location predictions. 15 Historical Background Spatial data mining research began several decades ago when practitioners and researchers noticed that critical assumptions in classical data mining and statistics were 20 violated by spatial datasets. First, whereas classical datasets often assume that data are discrete, spatial data were observed to reside in continuous space. For example, classical data mining and statistical methods may use market-basket datasets (e.g., history of 25 Walmart's transactions), where each item-type in a transaction is discrete. However, "transactions" are not natural in continuous spatial datasets, and decomposing space across transactions leads to loss of information about neighbor relationships between 30 items across transaction boundaries. In addition, spatial data often exhibits heterogeneity (i.e., no places on the Earth are identical), whereas classical data mining techniques often focus on spatially stationary global patterns (i.e., ignoring spatial variations across 35 locations). Finally, one of the common assumptions in classical statistical analysis is that data samples are independently generated. However, this assumption is generally false when analyzing spatial data, because spatial data tends to be highly self-correlated. For 40 example, people with similar characteristics, occupation and background tend to cluster together in the same neighborhoods. In spatial statistics this tendency is called spatial auto-correlation. Ignoring spatial auto-correlation when analyzing data with 45 spatial characteristics may produce hypotheses or models that are inaccurate or inconsistent with the data set. Thus, classical data mining algorithms often perform poorly when applied to spatial data sets. Better methods are needed to analyze spatial data to detect 50 spatial patterns. Foundations The spatial data mining literature has focused on four 55 main types of spatial patterns: (i) spatial outliers, which are spatial locations showing a significant difference from their neighbors; (ii) spatial co-locations, or subsets of event types that tend to be found more often together throughout space than other 60 subsets of event types; (iii) location predictions, that is, information that is inferred about locations favored by an event type based on other explanatory spatial variables; and (iv) spatial hotspots, unusual spatial groupings of events. The remainder of this section 65 presents a general overview of each of these pattern categories. A spatial outlier is a spatially referenced object whose non-spatial attribute values differ significantly from 70 those of other spatially referenced objects in its spatial neighborhood. For example, consider spatial outliers, detected in traffic measurements for sensors on highway I-35W (North bound) in the Minneapolis-St. Paul area, for a 24-h time period. Station 9 may be 75 considered a spatial outlier as it exhibits inconsistent traffic flow compared with its neighboring stations. Once a spatial outlier is identified, one may proceed with diagnosis. For example, the sensor at Station 9 may be diagnosed as malfunctioning. Spatial attributes 80 are used to characterize location, neighborhood, and distance. Non-spatial attribute dimensions are used to compare a spatially referenced object to its neighbors. Spatial statistics literature provides two kinds of bipartite multidimensional tests, namely graphical tests 85 and quantitative tests. Graphical tests, such as Variogram clouds and Moran scatterplots, are based on the visualization of spatial data and highlight spatial outliers. Quantitative methods provide a precise test to distinguish spatial outliers from the remainder of data. 90 Spatial co-location pattern discovery finds frequently co-located subsets of spatial event types given a map of their locations. Spatial co-location is a generalization of a classical data mining pattern family 95 called association rules, since transactions are not natural in spatial datasets, and partitioning space across transactions leads to loss of information about neighbor relationships between items near transaction boundaries. Additional details about co-location 100 interest measures, e.g. participation index and K functions, and mining algorithms are described in [2]. Location prediction is concerned with the discovery of a model to infer preferred locations of a spatial phenomenon from the maps of other explanatory 105 spatial features. For example, ecologists may build models to predict habitats for endangered species using maps of vegetation, water bodies, climate, and other related species. For example, consider an example of a dataset used in building a location prediction model for 110 red-winged blackbirds in the Darr and Stubble wetlands on the shores of Lake Erie in Ohio, USA. This dataset consists of nest location, distance to open 22 water, vegetation durability and water depth maps. Classical prediction methods may be ineffective in this 115 problem due to the presence of spatial auto-correlation. Spatial data mining techniques that capture the spatial autocorrelation of nest location such as the Spatial Autoregressive Model (SAR) [1] and Markov Random Fields based Bayesian Classifiers (MRF-BC) are used 120 for location prediction modeling. Spatial Hotspots are unusual spatial groupings of events that tend to be much more closely related than other events. Examples of spatial hotspots can be 125 incidents of crime in a city or outbreaks of a disease. Hotspot patterns have properties of clustering as well as anomalies from classical data mining. However, hotspot discovery [4] remains a challenging area of research due to variation in shape, size, density of 130 hotspots and underlying space (e.g., Euclidean or spatial networks such as roadmaps). Additional challenges arise from the spatio-temporal semantics such as emerging hotspots, displacement etc. 135 Key Applications Spatial data mining and the discovery of spatial patterns has applications in a number of areas. Detecting spatial outliers is useful in many applications 140 of geographic information systems and spatial databases, including the domains of public safety, public health, climatology, and location-based services. As noted earlier, for example, spatial outlier applications may be used to identify defective or out of 145 the ordinary (i.e., unusually behaving) sensors in a transportation system. Spatial co-location discovery is useful in ecology in the analysis of animal and plant habitats to identify co-locations of predator-prey species, symbiotic species, or fire events with fuel and 150 ignition sources. Location prediction may provide applications toward predicting the climatic effects of El Nino on locations around the world. Finally, identification of spatial hotspots can be used in crime prevention and reduction, as well as in epidemiological 155 tracking of disease. (Abridged) Recommended Reading 160 1. Cressie N.A. Statistics for Spatial Data (Revised Edition). Wiley, New York, NY, 1993. 2. Huang Y., Shekhar S., and Xiong H. Discovering co-location patterns from spatial datasets: a general 165 approach. IEEE Trans. Knowl. Data Eng. (TKDE), 16(12): 1472-1485, 2004. 3. Shekhar S., Schrater P., Vatsavai R., Wu W., and Chawla S. Spatial contextual classification and prediction models for mining geospatial data. IEEE 170 Trans. Multimed. (special issue on Multimedia Databases), 4(2):174-188, 2002. 4. US Department of Justice - Mapping and Analysis for Public Safety report. Mapping Crime: Understanding Hot Spots, 2005 175 (http://www.ncjrs.gov/pdffiles l/nij/209393.pdf). 5. Shekhar S. and Chawla S. A Tour of Spatial Databases. Prentice Hall, 2003. 6. Longley P.A., Goodchild M., Maquire D.J., and Rhind D.W. Geographic Information Systems and 180 Science. Wiley, 2005. 7. Shekhar S., Zhang P., Huang Y., and Vatsavai R. Trend in spatial data mining. In Data Mining: Next Generation Challenges and Future Directions, H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha (eds.). 185 AAAI/MIT Press, 2003. 8. Solberg A.H., Taxt T., and Jain A.K. A Markov random field model for classification of multisource satellite imagery. IEEE Trans. Geosci. Remote Sens., 34(1): 100-113, 1996. 190 9. Kou Y., Lu C.T., and Chen D. Algorithms for spatial outlier detection. In Proc. 2003 IEEE Int. Conf. on Data Mining, 2003, pp. 597-600. 10. Shekhar S., Lu C.T., and Zhang P. A unified approach to detecting spatial outliers. Geolnformatica, 195 7(2):139-166,2003. 11. Mamoulis N., Cao H., and Cheung D.W. Mining frequent spatio- temporal sequential patterns. In Proc. 2003 IEEE Int. Conf. on Data Mining, 2005, pp. 82-89. 23 Answer the following questions: 1) What is spatial data mining? 2) What are the basic differences between classical datasets and spatial datasets? 3) Why are classical data mining algorithms unsuitable for spatial datasets? 4) What does it mean that spatial data tends to be highly self-correlated? How would you explain spatial auto correlation? 5) Name the four main types of spatial patterns. 6) Explain a spatial outlier and give a simple example. 7) What are spatial and non-spatial attributes? 8) Which pairs are frequently co-located? 9) Give some examples of hotspots. 10) Name some areas where spatial data mining is applied. Mark the following statements as true or false: 1) Many of the relationships on spatial data are implicit. 2) Spatial autocorrelation means that nearby things are more different than distant things. 3) In spatial data mining data samples are not independent. 4) Classical data mining techniques perform well when applied to spatial data sets. 5) Spatial data mining is based on the same assumptions as classical data mining. Match the following terms with their definitions: 1) Spatial data mining (SDM) 2) Spatial outlier 3) Location prediction 4) Spatial co-location 5) Hotspot analysis 6) Spatial autocorrelation 7) Variogram clouds and Moran scatterplots a) Presence of two or more spatial objects at the same location or at significantly close distances from each other b) Process of finding unusually dense event clusters across time and space c) Process of discovering interesting, useful, non-trivial patterns from large spatial datasets d) Spatial outlier detection algorithms based on visualization e) Observation which appears to be inconsistent with its neighborhood f) Prediction of events occurring at particular geographic locations g) Spatial data values are influenced by values in their immediate vicinity 24 Vocabulary anomaly [a'rmm.a.li] © [-'nai.ma-] - odchylka assumption [a'sAm/?.J"3n] ftj - předpoklad, domněnka boundary ['baun.dar.i] [-dri] © [-daH H') - hranice, mez defective [dťfek.tiv] ^í> - vadný, chybný discrete [di'skriit] ^')- nespojitý, oddělený, samostatný displacement [dťspleis.mant] přemístění, posun endangered [in'dein.d3ad] © [-d3^d] - ohrožený grouping ['grui.pin] ^ ■) - seskupení habitat ['haeb.i.taet] přirozené prostředí, domov hypothesis (pi. hypotheses) [hai'pDfj.a.sis] © [-'pa:.9a-] 4>), pi. [hai'poe.a.siiz] 4>) -hypotéza, předpoklad, domněnka ignition [ig'nij>n] ^ i - vznícení, vzplanutí incident ['inř.si.dsnt] ^')- případ, incident, příhoda, událost inconsistent [,in.kan'sis.t3nt] M - neslučitelný, jsoucí v rozporu namely ['neim.li] ^) - a to, jmenovitě nontrivial [non'triv.i.al] - podstatný, důležitý out of the ordinary [aut] ^) [Di.di.na.ri] © ['D:r.d3n.er-] ^) - neobvyklý, zvláštní outbreak ['aut.breik] ^■) - vypuknutí outlier ['aut.lai.a] =J -co leží mimo pattern ['paet.3n] © ['paet .a^n] ^>)-vzorec, způsob, průběh phenomenon (pi. phenomena) [fa'nom.i.nan] © [-'na:.ma.na:n] M'>, pi. [fa'nom.i.na] - jev predator ['pred.a.tar] © [-t&] M■■) 4') - dravec, predator prey [prei] - kořist remainder [ri'mein.d9r] © [-da^] - zbytek species ['spi:./i:z] ^) - druh stationary ['stei./3n.sr.i] © [-/a.ner-] 4') - neměnný, nehybný throughout space [9ru:'aut] 4) [speis] po celém prostoru to be concerned with st [kan's3:nd] © [ -'s?:nd] - zabývat se čím to cluster ['klAs.tar] © [-ta^] A - shlukovat se, seskupit to co-locate [kau.lau'keit] © [kou.lou'keit] ^))- vyskytovat se společně to decompose [,di:.kam'pauz] ®[-'pouz] 4) - rozložit to favor ['fei.var] © [ -va^] M •) - podporovat to highlight ['hai.lait] M$>- zvýraznit, poukázat to infer [in'f3ir] © [ -'f?:] M% - dedukovat, vyvozovat to note [naut] © [ nout] ^') - upozornit, poukázat to partition [pa:'tij>n] © [pair-] ^ - rozdělit to proceed [praa'siid] © [prou-] 4 - pokračovat, postupovat to reference ['ref>r.snfs] © [-aH ^) - odkazovat to tend [tend] ^ í - mít sklon, tendenci, být náchylný to track [traek] =J - sledovat to violate [Vaia.leit] Hí- porušit, nedodržet Phrases due to - kvůli either - or - buď - nebo from the standpoint ['staend.pDint] z hlediska in infant stages ['in.fant] ^ - na počátku vývoje 25 Stemming CHRIS D. PAICE Lancaster University, Lancaster, UK 5 Definition Stemming is a process by which word endings or other affixes are removed or modified in order that word 10 forms which differ in non-relevant ways may be merged and treated as equivalent. A computer program which performs such a transformation is referred to as a stemmer or stemming algorithm. The output of a stemming algorithm is known as a stem. 15 Historical Background The need for stemming first arose in the field of information retrieval (IR), where queries containing 20 search terms need to be matched against document surrogates containing index terms. With the development of computer-based systems for IR, the problem immediately arose that a small difference in form between a search term and an index term could 25 result in a failure to retrieve some relevant documents. Thus, if a query used the term "explosion" and a document was indexed by the term "explosives," there would be no match on this term (whether or not the document would actually be retrieved would 30 depend on the logic and remaining terms of the query). The first stemmer for the English language to be fully described in the literature was developed in the late 1960s by Julie Beth Lovins [11]. This has now been largely superseded by the Porter stemmer [14], which 35 is probably the most widely used, and the Paice/Husk stemmer [12]. Stemmers have also been developed for a wide variety of other languages. Foundations 40 Definitions In an IR context, the process of taking two distinct words, phrases or other expressions and treating them as semantically equivalent is referred to as conflation. 45 The two expressions need not be precisely synonymous, but they must refer to the same core concept (compare "computed" and "computable"). In this article, the term "practically equivalent" is used to mean that, for the purposes of a particular 50 application, the words may as well be taken as equivalent. The term conflation is sometimes used as though it is equivalent to stemming, but it is in fact a much broader concept, since it includes (i) cases where the strings concerned are multi-word expressions, as in 55 "access time" and "times for access", and (ii) cases where the strings are not etymologically related, as in "index term" and "descriptor". In case (i) special string matching techniques may be used, whereas in case (ii) reference to a dictionary or thesaurus is 60 necessary. The present account deals exclusively with the conflation of single etymologically related words. There are various possible approaches to word conflation, including the following. 65 1. Direct matching. In this method, the character sequences of two words are compared directly, and a similarity value is computed. The words are then considered to match if their mutual similarity exceeds a predefined threshold. To give a simple example, the 70 first six letters of the words "exceeds" and "exceeded" are the same, so these words together contain 12 matching letters out of 15. Hence, a similarity of 12/15 = 0.80 can be computed. Use of a threshold (say, 0.70) allows a decision as to whether 75 the words can be considered equivalent. With such a method, setting the threshold is problematic. Thus, the similarity between "exceeds" and "excess" is 0.62, which is below the stated threshold. However, allowing for this by lowering the threshold to 0.60 80 would cause "excess" and "except" (similarity 0.67) to be wrongly conflated. 2. Lexical conflation. In this case a thesaurus or dictionary is used to decide whether two words are 85 equivalent. Obviously, this method can be used even for etymologically unrelated words. A problem here is obtaining a suitably comprehensive and up-to-date thesaurus, and one which explicitly lists routine variants such as plurals. 90 3. Cluster-based conflation. This method, investigated by Xu and Croft [15], involves creating clusters of practically equivalent words by analyzing the word associations in a large representative text corpus. Each 95 query word is then supplemented by adding in the other words in its cluster. In contrast to method (2), the clusters created are specific to the text collection in question. However, the creation of the clusters can be very time-consuming. 100 4. N-gram conflation. In this method, each word is decomposed into a collection of N-letter fragments (N-grams), and a similarity is computed between the N-gram collections of two words; a threshold is then 105 applied to decide whether the words are equivalent. This approach was pioneered by Adamson and Boreham[l], who used sets of bigrams, where N = 2. For example, after eliminating duplicates and sorting into order, "exceeds" can be represented by the 110 bigram set {ce, ds, ed, ee, ex, xc} and "exceeded" by {ce, de, ed, ee, ex, xc}. Out of 7 distinct bigrams here, 26 5 are shared between the two words; hence a similarity of 5/7 = 0.712 can be computed. 115 5. Stemming. Stemming refers to the removal of any suffixes (and sometimes other affixes) from an input word to produce a stem. Two words are then deemed to be equivalent if their stems are identical. This method is much favored because it is fast: all words 120 can be reduced to stems on input to the system, and simple string matching used thereafter. The remainder of this article focuses on stemming in this narrow sense. 125 Prefixes and Infixes In English, stemmers are usually designed for removing suffixes from words. The removal of "intimate" prefixes such as "intro-," "pro-" and 130 "con-" generally results in words being wrongly conflated (consider "intro-duction," "pro-duction" and "con-duction"). However, there may be a case for removing looser prefixes such as "hyper-" or "macro-." Also, prefix 135 removal may be desirable in certain domains with highly artificial vocabularies, such as chemistry and medicine. As explained below, there are some languages in which removal or replacement of prefixes, or even infixes, is in fact essential. 140 Performance and Evaluation Since stemmers were originally developed to aid the operation of information retrieval systems, it was 145 natural that they were first assessed in terms of their effect on retrieval performance, as well as on "dictionary compression" rates. Researchers were frustrated to find that the effects on retrieval performance for English language material were small 150 and often negative [10]. Removal of "-s" and other regular inflectional endings might be modestly helpful, but use of heavier stemming could easily result in a loss of performance [7]. Stemming errors are of two kinds: understemming, in 155 which a pair of practically equivalent words are not conflated, and overstemming, in which two semantically distinct words are wrongly conflated. Non-English Stemmers 160 Stemming is appropriate for most (though not all) natural languages, and appears to be especially beneficial for highly inflected languages [9]. There is neither space nor need to describe non-English 165 stemmers here, except to note that some languages exhibit much greater structural complexity, and this warrants special approaches. Thus, a typical Arabic word consists of a root verb of three (or occasionally four or five) consonants (e.g., "k-t-b" for "to write"), 170 into which various prefixes, infixes and suffixes are inserted to produce specific variant forms ("katabna": "we wrote" and "kitab": "book"). Some researchers have concentrated on extracting the correct root from a word [3], but Aljlayl and Frieder 175 have demonstrated that better retrieval performance is obtained by using a simpler "light stemming" approach, in which only the most frequent suffixes and prefixes are removed [4]. Their results showed that extraction of roots causes unacceptable levels of 180 overstemming. Key Applications As noted earlier, stemmers are routinely used in 185 information retrieval systems to control vocabulary variability. They also find use in a variety of other natural language tasks, especially when it is required to aggregate mentions of a concept within a document or set of documents. For example, stemmers may be used 190 in constructing lexical chains within a text. Stemming can also have a role to play in the standardization of data for input to a data warehouse. (Abridged) 195 Recommended Reading 1. Adamson G.W. and Boreham J. The use of an association measure based on character structure to 200 identify semantically related pairs of words and document titles. Inf. Process. Manage., 10(7/8):253-260, 1974. 2. Ahmad F., Yusoff M., and Sembok M.T. Experiments with a stemming algorithm for Malay 205 words. J. Am. Soc. Inf. Sci. Technol., 47(12):909-918, 1996. 3. Al-Sughaiyer LA. and Al-Kharashi LA. Arabic morphological analysis techniques: a comprehensive survey. J. Am. Soc. Inf. Sci. Technol., 55(3):189-213, 210 2004. 4. Aljlayl M. and Frieder O. On Arabic search: Improving the retrieval effectiveness via a light stemming approach. In , 2002, pp. 340-347. 5. Bacchin M., Ferro N., and Melluci M. A 215 probabilistic model for stemmer generation. Inf. Process. Manage., 41(1):121-137, 2005. 6. Frakes W.B. and Fox C.J. Strength and similarity of affix removal stemming algorithms. SIGIR Forum, 37(l):26-30, 2003 (Spring 2003). 220 7. Harman D. How effective is suffixing? J. Am. Soc. Inf. Sci., 42(1):7-15, 1991. 27 8. Hull D. A Stemming algorithms: a case study for detailed evaluation. J. Am. Soc. Inf. Sci., 47(l):70-84, 1996. 225 9. Krovetz R. Viewing morphology as an inference process. Artificial Intelligence, 118(l/2):277-294, 2000. 10. Lennon M., Pierce D.S., Tarry B.D., and Willett P. An evaluation of some conflation algorithms for 230 information retrieval. J. Inf. Sci., 3:177-183, 1981. 11. Lovins J.B. Development of a stemming algorithm. Mech. Transl. Comput. Linguist., 11:22-31, 1968. 12. Paice CD. Another stemmer. SIGIR Forum, 24(3):56-61, 1990. 235 13. Paice CD. A method for the evaluation of stemming algorithms based on error counting. J. Am. Soc. Inf. Sci., 47(8):632-649, 1996. 14. Porter M.F. An algorithm for suffix stripping. Program, 14(3):130-137, 1980. 240 15. Xu J. and CroftW.B. Corpus-based stemming using co-occurrence of word variants. ACM Trans. Inf. Syst., 16(1):61-81, 1998. 28 Answer the following questions: 1) How would you describe stemming? What is its purpose? 2) What often resulted in a failure to retrieve relevant documents during searches in the past? 3) What is conflation? 4) Is there any difference between conflation and stemming? 5) What tools have to be used when strings are not etymologically related? 6) Describe direct matching. 7) What does the term of threshold refer to in the text? 8) What is the disadvantage of cluster-based conflation? 9) What are bigrams? 4) Hyper- in hyperactive is a suffix. 5) The term affix covers both prefix and suffix. 6) Stemming appears beneficial for highly inflected languages. 7) The light-stemming approach is based on removing the least frequent affixes. Match the following terms and their definitions: 1) lexical conflation 2) cluster-based conflation 3) N-gram conflation 4) understemming 5) overstemming a) a method using a corpus of texts b) a method based on bigrams c) a situation where more-or-less equivalent words are not conflated d) a method using a dictionary or thesaurus e) a situation where two semantically distinct words are wrongly conflated Mark the following statements as true or false: 1) During the conflation, the expressions need to be synonymous. 2) The words mother and father are etymologically related. 3) In stemming, two words are considered equivalent provided their stems are identical. 29 Vocabulary account [a'kaunt] ^) - výčet; účet actual ['aek.tju.al] [-tju-] [-tful] M$>- vlastní actually ['aek.tju.9.li] [-tju-] [-tju.li] H:>) - vlastně affix ['aef.iks] ^')- affix (předpona, přípona) algorithm ['ael.ga.ri.ô3m] ^) - algoritmus bigram ['baigraem] - bigram (skupina dvou písmen, slabik či slov) cluster ['klAs.t9r] © [-ta*] - hrozen, skupina, klastr comprehensive [.kDm.prťhenf.siv] © [,ka:m-] - komplexní, obsáhlý conflation [kan'fleif3n] Hi - spojování compression rate [kam'prej.3n] [reit] - kompresivita consonant ['kon.sa.nant] © ['ka:n-] - souhláska core [kDír] © [kDír] M') - jádro; jádřinec corpus ['kDi.pas] © ['kDír-] sj - korpus, tělo; soubor textů distinct [dťstirjkt] - různý rozdílný duplicate ['dju:.pli.kat] © ['du:-] 4 - duplikát; duplikovaný (srovnej výslovnost s „to duplicate" ['djui.pli.keit] ®['du:-] 4») inflectional [in'flekfanal] A - skloňovací, skloňující, skloňovatelný equivalent [i'kwiv.3l.ant] =í - ekvivalentní etymological [,et.i'mDl.a.d3ikal] © [-'ma:.la-] 4) - etymologický, vztahující se k původu slova exclusive [ik'sklui.siv] sj -výhradní failure ['fei.ljar] © [-Ija*] 4> - neúspěch hence [henřs] 4>) - tudíž however [,hau'ev.ar] © [-&] =J - však, avšak identical [aťden.ti.k3!] © [-ta-] sj - identický, stej ný lexical ['lek.si.k3!] =J - lexikální lexical chains ['lek.si.k3!] ^') [tjein] ^ ■) - lexikální řetězce loose [lu:s] 4>) - volný mutual ['mjui.tju.al] 4)-vzájemný predefined [prirdi'faind] © [pri:da'faind] - předem definovaný prefix ['prii.fiks] 4>) - předpona query ['kwia.ri] © [ 'kwir.i] 4) 4) - dotaz remainder [ri'mein.dar] © [-da*] &J - zbytek root [ru:t] 4) 4) - kořen routine [ru:'ti:n] 4>)- obvyklý semantic [sťmaen.tik] © [-t ik] ^■) - sémantický, významový stem [stem] ^i> - kmen; stopka surrogate ['sAr.a.gat] ®['s?:-] A1)- náhradník; náhradní suffix ['sAf.iks] 4>) - přípona synonymous [sťnon.i.mas] © [-'na:.na-] 4>) - podobného významu thereafter [,5ea'ra:f.tar] © [,5er'aef.ta*] 4>) - poté thesaurus [9a'so:ras] 4) - tesaurus threshold ['9reJ./?auld] © [-/?ould] 4>) - práh thus [5as] M■>) - tak, a tak to aggregate st ['aeg.ri.geit] =J - (na)hromadit něco to aid st [eid] A S - napomáhat něčemu to arise (arise, arose, arisen) [a'raiz] 4>) -objevit se; vyvstat to assess st [a'ses] =J - hodnotit něco to conflate [kan'fleit] 4) - spojit, spojovat to decompose st [,di:.kam'pauz] © [-'pouz] 4>) - rozložit něco to deem [di:m] A ■) - považovat to duplicate st ['djui.pli.keit] © ['du:-] ^ 5 - duplikovat něco to eliminate st [ťlim.i.neit] &J - eliminovat něco to exceed st [ik'siid] M •) - překročit něco to exhibit st [ig'zib.it] 4) - vykazovat něco to extract st [ik'straekt] =í - extrahovat, vytáhnout něco to favor st ['fei.var] ® [-va*] =J - dávat něčemu přednost 30 to investigate st [in'ves.ti.geit] M S - vyšetřovat něco to merge st [m3:d3] © [ m?:d3] =í - spojit něco, sloučit to obtain st [ab'tein] - získat něco to pioneer [,paia'niar] © [-'nir] ^ ■) - razit cestu to retrieve st [rťtriiv] M$ - vyhledat, vyzvednout něco to supersede st [,su:.pa'si:d] © [-paH ^') - nahradit něco to supplement st ['sAp.li.mant] =J - doplnit něco to treat st [triit] =í - zacházet s něčím to warrant st [Wr.ant] © ['wDír-] M') - opravňovat něco variability [.vea.ri.abil .i.ti] © [,ver.i.a'bil.a.t i] -variabilita variant [Vea.ri.ant] © [Ver.i-] ^ - varianta warehouse ['wea.haus] © ['wer-] M - skladiště whereas [wea'raez] © [wer'aez] M >) - kdežto Phrases In contrast to st ['kon.traist] © ['kain.traest] -Oproti něčemu Obviously, ... [Db.vi.a.sli] © ['a:b-] 4>) -Samozřejmě 31 Storage Devices KALADHAR VORUGANTI Network Appliance, Sunnyvale, CA, USA 5 Definition One of the goals of database, file and block storage systems is to store data persistently. There are many 10 different types of persistent storage devices technologies such as disks, tapes, DVDs, and Flash. The focus of this write-up is on the design trade-offs, from a usability standpoint, between these different types of persistent storage devices and not on the 15 component details of these different technologies. Historical Background From a historical standpoint, tapes were the first type 20 of persistent storage followed by disks, CDs, DVDs, and Flash. Newer types of memory technologies such as PRAM and MRAM are still in their infant stages. These newer non-volatile memory technologies promise DRAM access speeds and packaging 25 densities, but these technologies are still too expensive with respect to cost/gigabyte. Foundations 30 - Tapes/Tape Libraries: Tape readers/tape head, tape library, tape robot, and tape cartridge are the key components of a tape subsystem. Tapes provide the best storage packaging density in comparison to other types of persistent storage devices. Tapes do not 35 provide random access to storage. Data on tapes can be stored either in compressed or uncompressed format. Unlike disks, tape cartridges can be easily transported between sites. Most organizations typically migrate data from older tape cartridges to newer tape cartridges 40 every 5 years to prevent data loss due to material degradation. One can employ disk based caches in front of tape subsystems in order to allow for tapes to handle bursty traffic. Tapes that provide Write-Once, Read Many (WORM) characteristics are also available. 45 WORM tapes are useful in data compliance environments where regulations warrant guarantees that a piece of data has not been altered. DLT and LTO are currently the two dominant tape technologies in the market. Technology-wise both these standards have 50 minor differences. Finally, from a pure media cost standpoint, tapes are less expensive (cost per gigabyte) than disks and other forms of persistent media. - Disks/Storage Controllers/NAS Boxes: Disks are the 55 most widely used form of persistent storage media. Disks are typically accessed by enterprise level applications when they are packaged as part of the processing server box (direct attached storage model), or are part of a network attached storage box (NAS) 60 and accessed via NAS protocols or, are packaged as part of a storage controller box and accessed via storage area network protocols (SAN). The current trend is for protocol consolidation, where the same storage controller provides support for both SAN and 65 NAS protocols. Typically, the size of the storage controllers can vary from a few terabytes to hundreds of terabytes (refrigerator sized storage controllers). A storage controller typically consists of redundant processors, protocol processing network cards, and 70 RAID processing adapter cards. The disks are connected to each other via either arbitrated loop or switched networks. Storage controllers also contain multi-gigabyte volatile caches. Disks are also packaged as part of laptops. 75 There is a marked difference in the manufacturing process, and testing process between the enterprise class disks and commodity laptop class disks. Disks vary in their form factor, rotational speed, storage capacity, number of available ports, and the protocols 80 used to access them. Currently, serial SCSI, parallel SCSI, serial ATA and parallel ATA, Fiber Channel, and SSA are the different protocols in use for accessing disks. Lower RPM and disk idle mode are new disk spin-down modes that allow disks to 85 consume less power when they are not actively being used. - DVD/Juke Boxes: DVDs and CDs are optical storage media that provide random access and WORM 90 capabilities. Only recently, the multiple erase capacity of an individual CD or DVD was less than the capacity of a single disk drive or tape cartridge. DVDs can store more data than a CD, and a high definition DVD can store more data than a DVD. There are numerous 95 competing standards for CDs, DVDs and high definition DVDs, however, format agnostic DVD players and DVD writers are emerging. Usage of DVDs is more prominent in the consumer space rather than in the enterprise space. A juke box system allows 100 one to access a library of CDs or DVDs. DVDs have slower access speeds than most types of disks. - Flash/SSDs/Hybrid Disks: Flash is memory technology that has non-volatile characteristics. Flash memory has slower read times than DRAM. Moreover, 105 it has much slower write times than DRAM. One has to perform an erase operation before one can re-use a flash memory location. One can only perform a limited number of erase operations. Thus, the number of write operations determines the Flash memory life. 110 SLC and MLC are the two different NAND flash technologies. SLC can be erased a greater number of times, and it has faster access times than MLC based 32 flash. NAND flash has faster write and erase times than NOR flash. NOR flash has faster read times than 115 NAND flash. NAND flash is used to store large amounts of data, whereas NOR flash is used to store executable code. People are using MLC flash in cameras and digital gadgets, and are using SLC flash as part of solid state 120 disks (SSDs). SSDs provide block level access interface (SCSI), and they contain a controller that performs flash wear leveling and block allocation. Hybrid disks that contain a combination of disks and Flash are emerging. Hybrid disks provide a Flash 125 cache in front of the disk media. One typically can store meta-data or recently used data in the flash portion of hybrid disks to save on power consumption. That is, one does not have to spin up the disk. Flash storage provides much better random access speeds 130 than disk based storage. Key Applications Tapes are being used primarily for archival purposes 135 because they provide good sequential read/write times. Disks are the media of choice for most on-line applications. Optical media (CDs, DVDs) are popular in the consumer electronic space. Flash based SSDs are popular for those workloads that exhibit random IOs. 140 Disks are being used in laptops, desktops and storage servers (SANs, NAS, DAS). Tape based WORM media and content addressable based disk storage are providing WORM media capabilities in tape and disk technologies, and thus, these technologies can be used 145 to also store compliance/regulatory data. (Abridged) Recommended Reading 150 1. Anderson D., Dykes J., and Riedel E. More than an interface- SCSI versus ATA. In Proc. Second Annual Conf. on FAST. San Francisco, CA, 2003. 2. Toigo J. Holy Grail of Network Storage 155 Management. Prentice Hall, Englewood Cliffs, NJ, 2003. 3. Voruganti K., Menon J., and Gopisetty S. Land below a DBMS. SIGMOD Rec, 33(l):64-70, 2004. Answer the following questions: Mark the following statements as true or false: 1) Name some of the persistent storage devices 1) Portable and inexpensive to purchase, tapes are often used for backing up or archiving data. mentioned in the text. 2) What are the advantages and disadvantages of tapes? 3) Where are WORM tapes useful? 4) Name the two dominant tape technologies that 2) Flash is memory technology that has volatile characteristics. 3) WORM disks can be written only once. are currently on the market. 5) What is the most widely used form of 4) NAND has significantly lower storage capacity than NOR. use for accessing disks? 7) What spin-down modes are used to save on persistent storage medium? 6) What are the protocols which are currently in 5) MLC is used in solid state drives (SSDs) and SLC is used in consumer appliances like cameras, media players, cell phones, etc. power consumption? 8) What are the characteristics of some of the 6) An SSD (solid-state drive or solid-state disk) is a storage device that stores persistent data on solid-state flash memory. optical storage media? 9) What is flash technology? 10) Where is flash memory used? 7) In hybrid disks the flash memory handles the data most frequently written to or retrieved from storage. Match the following terms with their definitions: 1) A library 2) A disk cache 3) Fibre channel 4) A storage area network (SAN) 5) DLT (digital linear tape) 6) A jukebox a) A mechanism for improving the time it takes to read from or write to a hard disk b) A unit in which optical disk drives are mounted c) A high-speed special-purpose network (or sub-network) that interconnects different kinds of data storage devices d) A collection of physical storage media such as tapes or disks and a way to access them e) A form of magnetic tape and drive system used for computer data storage and archiving f) A type of high speed interconnection 34 Vocabulary access time ['aek.ses] ^') [taim] M') - přístupová doba bursty traffic [baisti] [traefik] =J - shlukový provoz, shluky dat cache [kaef] ^■■) - skrýš, úkryt competing [kam'piitin] - konkurenční data compliance ['dei.ta]© [-ta] 4) 4') [kam'plai.anfs] ^ i - ochrana dat před přepsáním wear leveling [wear] © [wer] 4') ['lev.alin] - strategie zápisu a mazání (vyrovnání opotřebení) focus on st ['fau.kas] © ['fou-] H 5 - zaměření format agnostic ['fDi.maet]© ffDir-] 4->) [aeg'nos.tik] © [-'nai.stik] 4')- přehrávající jakýkoli formát gadget ['gaed3.1t] 43)- přístroj, zařízení goal [gaul] © [goul] 4') - cíl high definition [hai] 4) [.def.ťnif.an] 4') - vysoké rozlišení idle ['ai.dl] 4>) - nečinný life [laif] 4-') - životnost marked [ma:kt] © [ mairkt] 4') 41) - výrazný non-volatile [nonVol.a.tail] © [non' vai.la.t al] 4') - stálý, stabilní numerous ['njui.ma.ras] © [ 'nu:-] =J - početný, četný persistent [pa'sis.tsnt] © [paH 4-í - stálý, trvalý read time [rnd] =J [taim] 4í> - přístupová doba pro čtení solid state disk(SSD) ['sol.id] © ['sai.lid] =J [steit] 4>) [disk] 4'J - pevný disk na bázi paměťových čipů standpoint ['staend.pDint] 4)- stanovisko, hledisko to access st ['aek.ses] 4á - dostat se k, mít přístup k to allocate ['ael.a.keit] 4') - přidělit to alter ['Dl.tar] © [ 'ail.t a*] 4>) - změnit to compete [kam'piit] sj - soutěžit to determine [dťt3:.min] ® [ -tr:-] =í - určovat, udávat to emerge [i'm3:d3] © [ -'m?:d3] 4í> - objevit se to handle ['haen.dl] 4) - zvládnout, vypořádat se to package ['paek.id3] 4>>- zabalit to spin down [spin] ftj [daun] ^ ■) - zpomalit otáčení to spin up [spin] ^') [Ap] 4í> - zrychlit otáčení to vary [Vea.ri] © [ Ver.i] 4:') - lišit se, různit se to warrant [Wr.3nt] © ['wDír-] 4') - zaručit trade-off ['treid.of] © [-a:f] =í - kompromis via [vaia] © [Vii.a] 4:') 4') - přes, prostřednictvím wise (suffix) [waiz] =J - pokud jde o, hovoříme-li o write time [rait] 4') [taim] 4í> - přístupová doba zápisu 35 Storage Protection KENICHIWADA Hitachi Limited, Tokyo, Japan 5 Definition Storage protection is a kind of data protection for data stored in a storage system. The stored data can be lost 10 or becomes inaccessible due to, mainly, a failure in storage component hardware (such as a hard disk drive or controller), a disastrous event, an operator's mistake, or intentional alteration or erasure of the data. Storage protection provides the underlying foundation 15 for high availability and disaster recovery. Historical Background In 1956, IBM shipped the first commercial storage that 20 had a hard disk drive. To protect data from bit errors on disk platters, the hard disk drive commonly uses cyclic redundancy check (CRC) and an error-correcting code (ECC). CRC and ECC cannot protect data from a whole disk 25 failure in which an entire disk becomes inaccessible (for example, because of a disk head crash). The IBM 3990, which was shipped in the 1980s, had the replication functionality in which two identical copies of data were maintained on separate media. This 30 approach protected data from this kind of failure. Replication functionality can be implemented in many other layers of the computer system. Most DBMS support database replication. Some file systems and Logical Volume Managers have file or volume 35 replication functionality. Further, many storage systems and storage virtualization appliances support volume replication functionality. RAID (Redundant Array of Inexpensive Disks) is another technology for protecting data from whole disk 40 failure. D. Patterson et al. published a paper "A Case for Redundant Arrays of Inexpensive Disks (RAID)" in June 1988 at the SIGMOD conference [6]. This paper introduced a five level data protection scheme. The term RAID was adopted from this paper, but 45 currently RAID is an acronym for Redundant Arrays of Independent Disks. It is noted that the patent covering RAID level 5 technology was issued in 1978 [5]. RAID level 1 is a kind of replication. RAID level 2 to 5 can reduce the capacity required to protect data 50 against disk drive failure than replication, but it is limited to protect disk drive failure. Replication, on the other hand, can be used to protect databases, file systems and logical volume. Further replication can be used for disaster recovery, if data are replicated 55 remotely. Foundations Hard disk drives commonly use Reed-Solomon code 60 [7] to correct bit errors. Data in hard disk drives is usually stored in fixed length blocks. Controllers in hard disk drives calculate ECC for each block and record it associated with the original data. When data are read the controller checks data integrity using ECC. 65 CRC can be used with ECC for detecting bit errors and/or reducing the possibility of correction error. Most DBMS support database replication with master/slave relation between the original and the replica. The master process updates and transfers it to 70 the slave. This type of replication can provide high availability to the client of the DBMS in case of storage system failures as well as server failures. Another type of database replication is multi-master, which is mostly used to provide high performance 75 parallel processing. Both types can be either synchronous or asynchronous replication. In synchronous replication, updates made in original are guaranteed in the replica, note there may be some delay in asynchronous replication. 80 Volume replication by storage system is also widely accepted as data protection. There are synchronous and asynchronous replications, the same as database replication. Asynchronous volume replication is often used for long distance remote replication. It may 85 prevent performance degradation caused by replication delay, but could cause some data loss in case of recovery. Synchronous replication, on the other hand, may provide no data loss recovery, but may cause performance degradation due to replication delay. 90 Volume replication is also used within a local data center for online backup. Backup servers use replica volume for backup when original volume is online. To support this, a storage system can pause update delegation from original to replica volume. 95 RAID (Redundant Array of Independent Disks) is a set of disks from one or more commonly accessible disk subsystems, combined with a body of control software, in which part of the physical storage capacity is used to store redundant information about user data stored on 100 the remainder of the storage capacity. The term RAID refers to a group of storage schemes that divide and replicate data among multiple disks, to enhance the availability of data at desired cost and performance levels. A number of standard schemes have evolved 105 which are referred to as levels. Originally, five RAID levels were introduced [6], but many more variations have evolved. Currently, there are several sublevels as well as many non-standard levels. There are trade-offs among RAID levels in terms of performance, cost and 110 reliability. 36 Key Applications 115 Storage protection is essential to achieve business continuity and legal compliance with adequate performance, cost, and reliability. (Abridged) 120 Recommended Reading 1. ANSI. NFPA1600 Standard on Disaster/Emergency Management and Business Continuity Programs. 125 2. BSI. BS25999; Business Continuity Management. 3. Houghton A. Error Coding for Engineers. Kluwer Academic Publications, Hingham, MA, 2001. 4. Keeton K., Santos C, Beyer D., Chase J., and Wilkes J. Designing for disasters. In Proc. 3rd 130 USENIX Conf. on File and Storage Technologies, 2004. 5. Ouchi N.K. (IBM Corporation). System for recovering data stored in failed memory unit. US Patent 4,092,732, 1978. 135 6. Patterson D., Gibson G, and Katz R. A case for redundant arrays of inexpensive disks (RAID). In 1988. 7. Sweeney P. Error Control Coding From Theory to Practice. Wiley, New York, 2002. 140 8. http://www.sec.gov/ Answer the following questions: 1) How can data loss occur? 2) What are CRC and ECC? 3) What does RAID stand for? 4) How many RAID levels were originally introduced? 5) What is Reed-Solomon code used for? 6) Besides disaster recovery, what is storage protection good for? 7) What is the basic difference between RAID 1 and the other RAID levels? Match the following terms with their definitions: 1) synchronous replication 2) asynchronous replication 3) RAID level 4) data integrity check a) in this approach, changes in the original can take some time to reflect in the replica b) a way to determine whether data is corrupted c) in this approach, changes in the original are immediately reflected in the replica d) a specific strategy of distributing data over multiple disks Mark the following statements as true or false: 1) When data is read, the controller checks data integrity using CRC. 2) Replication can also be used to protect databases and file systems. 3) Remote replication often employs the asynchronous volume replication approach. Vocabulary appliance [a'plai.anf s] =J - přístroj array [a'rei] 4) - pole, sada compliance [kam'plai.anf s] sj - shoda, dodržení degradation [,deg.ra'dei.J"an] sj - pokles, zhoršení disastrous [di'za:.stras] © [-'zaes.tras] - katastrofální, neblahý disk platter [disk] 4> ['plaet.ar] © ['plaet M 4>) - disková plotna erasure [i'rei.39r] © [-3^] M - smazání failure [fei.ljar] © [Ij*] 4* - selhání inaccessible [.in.ak'ses.i.bl] =J - nepřístupný, nedostupný integrity [in'teg.ra.ti] © [-tj] 4% - celistvost, neporušenost intentional [in'ten.J"an.al] sj - záměrný loss [Ids] © [la:s] 4$ - ztráta recovery [n'k/w^r.i] © [-&H =J - obnova redundancy [ri'dAn.danf .si] =J - nadbytečnost replica [Yep.li.ka] 4% - kopie, duplikát scheme [ski:m] =J - schéma, soustava to evolve [i'vdIv] © [-Va:lv/] 4$ - vyvinout se volume [vDl.juim] © [Va:l-] =í - objem; svazek Phrases due to [dju:] © [du:] 4 í - kvůli; způsobený (čím) 39 Video YINGLI IBM T.J. Watson Research Center, Hawthorne, 5 NY, USA Definition Video, which means ' T see'' in Latin, is an electronic 10 representation of a sequence of images or frames, put together to simulate motion and interactivity. From the producer's perspective, a video delivers information created from the recording of real events to be processed simultaneously by a viewer's eyes and ears. 15 For most of time, a video also contains other forms of media such as audio. Video is also referred to as a storage format for moving pictures as compared to image, audio, graphics and animation. 20 Historical Background Video technology was first developed for television systems, but it has been further developed in many 25 formats to allow for consumer video recordings. Generally speaking, there are two main types of video: analog video and digital video. Analog videos are usually recorded as PAL (Phase Alternating Line) or NTSC (National Television System Committee) 30 electric signals following the VHS (Video Home System) standard and stored in magnetic tapes. Digital videos, on the contrary, are usually captured by digital cameras and stored in digital video formats such as DVD (Digital Versatile Disc), QuickTime and MPEG- 35 4 (Moving Picture Experts Group). Launched in September 1976, VHS became a standard format for consumer recording and viewing by the 1990s. Since then, it has dominated both home and commercial video markets. In March 1997, the DVD 40 format was introduced to American consumers, which gradually pulled consumers away from VHS in the following years due to its much better quality. In June 2003, the DVD's market share exceeded that of the VHS for the first time. Since then, it has been steadily 45 expanding its consumer market, and by July 2006, most major film studios have stopped releasing new movie titles in VHS format, opting for DVD-only releases. Now, VHS is gradually disappearing from both rental and retail stores, and DVD has dominated 50 the whole commercial market. Nevertheless, VHS is still popular for home recording of television programs, due to the large installed base and the lower cost of VHS recorders and tape. For the last few decades, as video technology quickly 55 advances and the cost of storage devices rapidly decreases, digital videos have become widely available in diverse application areas such as medicine, remote sensing, entertainment, education and online information services. This has thus led to very active 60 research in various video-related areas. Foundations The last three decades have witnessed a significant 65 amount of research efforts on various aspects of video technologies. Roughly speaking, they fall into the following three general categories: video representation, video content analysis, and video application. Specifically, video representation deals 70 with the way a video is represented, in another word, the file format. Video content analysis, on the other hand, aims to automatically structure and ultimately understand the video by analyzing its underlying content. Due to the difficult nature of this problem, 75 such process usually involves the analysis of multiple media modalities including visual, audio and text information. Finally, video application applies what it has learned from the analysis engine, and facilitates various types of content access including video 80 browsing, summarization and retrieval. A brief discussion on each of these three research domains is given below. Video Representation 85 A video sequence with accompanying sound track can occupy a vast amount of storage space when represented in digital format. As estimated in [6], a 1-min video clip could possibly occupy up to 448 MB. 90 Consequently, compression has been playing an important role in modern schemes for video representation. A wide variety of methods have been proposed to compress the video stream. Nevertheless, almost all of 95 them build their approaches upon the fact that video data contains both spatial and temporal redundancy. Specifically, to reduce the spatial redundancy, an intra-frame compression is applied which registers differences between parts of a single frame. Such a 100 task is more closely related to image compression. Likewise, to reduce the temporal redundancy, an inter-frame compression is exploited which registers differences between neighboring frames. This involves discrete Cosine transform (DCT), motion 105 compensation and other techniques. Some popular video compression mechanisms include H.261, H.263, H.264, MPEG-l,MPEG-2, MPEG-4 and MJPEG (Motion-Joint Photographic Experts Group). Specifically, H.261 is a 1990 ITU-T 110 (Telecommunication Standardization Sector of International Telecommunication Union) video coding standard originally designed for transmission over 40 ISDN lines. Later on, H.263 and H.264, which provide more capabilities and mainly target at video- 115 conferencing applications, were standardized in 1995 and 2003, respectively. In 1998, the Moving Picture Experts Group (MPEG) was formed to establish an international standard for the coded representation of moving pictures and associated audio on digital storage 120 media. Currently, there have been three established MPEG standards from this effort: MPEG-1, MPEG-2, and MPEG-4. Each of them targets at different commercial applications. For instance, MPEG-1 is usually used as the Video CD (VCD) format, MPEG-2 125 for High Definition Television (HDTV), and MPEG-4 for streaming video applications. Finally, to facilitate mobile appliances such as digital cameras, MJPEG was developed in the 1990s which uses intra-frame coding technology that is very similar to those used in MPEG- 130 1 or MPEG-2. However, it does not use inter-frame prediction, which on the one hand, results in a loss of compression capability, yet on the other hand, it makes the degree of compression capability independent of the amount of motion in the scene. Moreover, it also 135 eases video editing as simple editing can now be performed at any frame. Video Content Analysis 140 Video is a type of rich media as it often consists of other media types such as audio and text. Consequently, research on video content analysis can be grouped into three classes: visual content analysis, audio content analysis, and audiovisual content 145 analysis. A general goal of video content analysis is to extract the underlying video structure so as to facilitate convenient and nonlinear content access. Yet a more aggressive goal is to automatically understand video semantics so as to support applications such as video 150 summarization and retrieval that require an in-depth understanding of the video content. Video Application 155 Besides the large amount of research efforts on video content analysis, there are also many attentions on studying various video applications. After all, making the bulky and unstructured video content convenient and efficient to access, present, share, search and 160 deliver is the ultimate goal of the entire research community in this area. (Abridged) 165 Recommended Reading 1. Cheung S. and Zakhor A. Efficient video similarity measurement with video signature. IEEE Trans. Circ. Syst. Video Tech., 13(l):59-74, 2003. 170 2. Li Y. and Dorai C. SVM-based audio classification for instructional video analysis. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2004. 3. Li Y. and Kuo C.-C. Video Content Analysis Using 175 Multimodal Information: for Movie Content Extraction, Indexing and Representation. Kluwer, MA, USA, 2003. 4. Li Y., Narayanan S., and Kuo C.-C. Content-based movie analysis and indexing based on audiovisual 180 cues. IEEE Trans. Circ. Syst. Video Tech., 14(8): 1073-1085, 2004. 5. Mahmood T.S. and Srinivasan S. Detecting topical events in digital video. In Proc. 8th ACM Int. Conf. on Multimedia, 2000, pp. 85-94. 185 6. Mitchell J., Pennebaker W., Fogg C, and LeGall D. MPEG Video Compression Standard. Chapman & Hall, New York, NY, USA, 1992. 7. MPEG Requirements Group, MPEG-7 Applications Document v.8, ISO/MPEG N2860, MPEG Vancouver 190 Meeting, July 1999. 8. MPEG Requirements Group, MPEG-7 Context, Objectives and Technical Roadmap, ISO/MPEG N2861, MPEG Vancouver Meeting, July 1999. 9. MPEG Requirements Group, MPEG-7 195 Requirements Document V.15, ISO/MPEG N4317, MPEG Sydney Meeting, July 2001. 10. Nock H., Adams W., Iyengar G., Lin C, Naphade M., Neti C, Tseng B., and Smith J. User-trainable video annotation using multimodal cues. In Proc. 26th 200 Annu. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2003, pp. 403-404. 11. Oh J. and Hua K. Efficient and cost-effective techniques for browsing and indexing large video 205 databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000, pp. 415^-26. 12. Pfeiffer S., Lienhart R., Fischer S., and Effelsberg W. Abstracting digital movies automatically. J. Vis. Comm. Image Represent., 7(4):345-353, 1996. 210 13. Yan R., Hauptmann A., and Jin R. Negative paeudo-relevance feedback in content-based video retrieval. In Proc. 11th ACM Int. Conf. on Multimedia, 2003, pp. 343-346. 14. Yeung M., Yeo B., and Liu B. Extracting story 215 units from long programs for video browsing and navigation. In 1996, pp. 296-305. 15. Zhang T. and Kuo C.-C. Audio content analysis for on-line audiovisual data segmentation. IEEE Trans. Speech Audio Process., 9(4):441^157, 2001. 220 16. Zheng W., Li J., Si Z., Lin F., and Zhang B. and Using highlevel semantic features in video retrieval. In Image and Video Retrieval. Springer, Berlin Heidelberg, New York, 2006, pp. 370-379. 41 Answer the following questions: 1) What is video ? 2) What is the difference between analog and digital video? 3) What makes VHS still popular for home recording? 4) What is meant by remote sensing? 5) What do the terms video representation, video content analysis, and video application refer to? 6) What do most compression formats build on? 7) What is MPEG-1? 8) What does the term "rich media" refer to? Match the following terms and their definitions: 1) video representation 2) video content analysis 3) intra-frame compression 4) inter-frame compression 5) likewise a) a way of reducing spatial redundancy b) deals with the file format c) a way to reduce temporal redundancy d) involves structuring the video e) means the same as "similarly" Mark the following statements as true or false: 1) Video was first developed for home use. 2) H.261 is a video coding standard originally designed for transmission over ISDN lines. 3) MPEG-4 is used as a high definition television standard. 4) MJPEG has been designed for use in mobile appliances. 5) MJPEG has nothing in common with MPEG-1 and MPEG-2 formats 6) A general goal of video content analysis is to facilitate convenient and linear content access. Vocabulary analysis [a'nael.a.sis] 4 i - analýza, pi. analyses appliance [a'plai.anfs] =J - zařízení audio ['di.di.au] © ['ai.di.ou] =J - audio, zvuk audiovisual [,D:.di.au'vi3.u.al] © [,a:.di.ou-] - audiovizuální capability [.kei.pa'bil.i.ti] © [-a.t i] 4) - schopnost compression [kam'prej>n] 4% - komprese consumer [kan'sjui.ma1'] © [-'sui.ma*] 4 5 - spotřebitel discrete cosine transform [di'skri it] ftj ['kau.sain] © ['kou-] 4 [traens'fDim] © [-'fDirm] 4') - diskrétní kosinová transformace diverse [dai'v3is] © [ di'vns] 4) - rozdílný domain [dai/'mein] © [dou-] sj -doména due to st [dju:] © [ du:] 4 - kvůli něčemu format ['fDi.maet] © f'fDir-] 4-b - formát frame [freim] 4$>- rám, rámec linear ['lin.i.ar] © [-&] 4 - lineární market share ['ma:.kit] © ['mair-] 4') [/ear] © [Jer] 4 - podíl na trhu modality - modalita motion ['mau./3n] © ['mou-] - pohyb nevertheless [,nev.a.Sales] ©[-a*-] &J - nicméně non-linear [non'lin.i.a] - nelineární present ['prezent] - současný (compare the pronunciation with that of verb to present [pri'zent] ^:J)) research [ri's3:tf] © [ 'ri:.s3:tj] 4') 4) - výzkum retrieval [n'triival] =J -vyhledávání, vyzvedávání sequence ['sii.kwanřs] =í - sekvence spatial and temporal redundancy [spei./9l] 4 ) ['tem.p3r.3l] ®[-pa*.al] 4 [ri'dAn.d3nt .si] 4') - prostorová a časová nadbytečnost (redundance) steady ['sted.i] - stálý thus [5as] 4') - tak, a tak to aim to do st [eim] 4% - snažit se něco dělat, být zaměřen na dělání něčeho to browse st [brauz] 4$ - listovat, procházet něčím to capture st ['kaep.tJV] ® [-tTa*] =J - zachytit něco to deal with st [dial] 4 - zabývat se něčím to develop [di'vel.ap] 4$ - vyvinout, vyvíjet to disappear [,dis.a'piar] © [-'pír] 4$ - zmizet, mizet to dominate ['dom.i.neit] © ['dai.ma-] 4 ■) - dominovat to ease st [i:z] 4 i - udělat něco jednodušší, zjednodušit to exceed st [iksiid] 4^ - přesahovat, překročit něco to extract st [ik'straekt] . - extrahovat, vytáhnout něco to facilitate st [fa'sil.i.teit] 4i> - zjednodušit něco, umožnit něco to fall into st [fDil] © [ fail] 4$ - spadat do něčeho to involve st [in'volv] © [-Va:lv] 4 - zahrnovat něco to launch st [binřj] © [ lainřj] 4 - vypustit něco, vydat něco to occupy ['ok.ju.pai] © ['ai.kju-] 4$ - zabírat, okupovat to opt for st [opt] © [a:pt] 4 'j - rozhodnout se pro něco, zvolit něco to present st [pri'zent] &J - prezentovat něco to refer to st [ri'fa:] 4$ - odkazovat na něco, označovat něco to release st [ri'liis] - vypustit, vydat něco to simulate ['sim.ju.leit] © [-t id] 4$ - simulovat, napodobovat to witness st ['wit.nas] 4 - být něčemu svědkem ultimate ['Al.ti.mat] © [-ta-] =S - konečný, nejzazší underlying [.An.da'lai.in] © [-da*-] 4 5 - základní 43 Phrases consequently ['kDnf.si.kwant.li] © ['kainř-] -následně generally speaking - obecně řečeno independent of st [,in.di'pen.dsnt] ^■) - nezávislý na něčem likewise [laik.waiz] =J - podobně on the contrary ['kon.tra.ri] © ['kain.tre-] ^:)) - naopak on the one hand on the other hand ... - na jedné straně na druhé straně ... to play an important role in st - hrát důležitou úlohu v něčem 44