VB036 - English 2 Textbook
_
Authors: Martin Dvořák, Kateřina Repová, Ivana Túlajová
All the texts in this textbook have been taken from Encyclopedia of Database Systems (Ozsu, M. Tamer; Liu, Ling (Eds.) 2009. Approx. 4100 p. 60 illus. ISBN: 978-0-387-49616-0) and abridged.
Tento projekt je spDlufinjn:nván Evropským sniiálním fondem a stát nim rozpočtem České republiky,
•
EVROPSKÁ UNIE
IT*1*
M NI£"EnSTVO S-ÍOLSTVÍ.
M -ADEÍE A TĚLOVÝCHOV       P" I
INVESTICE  DO ROZVOJE VZDĚLÁVÁNÍ
Phonetics
Vowels
Long Vowels il sheep Short Vowels
I e
ship head
ai farm
ae hat
9 above
UI
coo
U foot
3«,     mother (US)
31 horse
D
sock (UK) worm (US)
31
bird
cup
Consonants
Voiced
b book Z zoo j yes
Voiceless
P
s
pen say
d
3 w
t S
day
vision
we
town she
9 m
k
give
jump
moon
cat
cheese
V very I look n name
f fish
ô r
n
e
the run sing
think
Diphthongs
ei day
OU    nose (US)
ai eye
13     ear (UK)
31 boy
ea    hair (UK)
au mouth U3    pure (UK)
3U    nose (UK)
Other Symbols
h     [|aend] hand
\      ['bA\.&] butter (US)
D ['kwaes.B] croissant (UK) u     [,in.flu'en.za] influenza
['haepj] happy ['lit.!] little
al, 3m, 3n can be pronounced either: dl or!, etc.: ['lei.bH] = ['lei.bal] = ['lei.b!]
r linking r is pronounced only before a vowel in British English: [foir] four: [foiraep.lz] four apples
' main stress [,ek.spek'tei.J~3n] expectation , secondary stress [^|'tel] retell
Syllable division ['SIS.tBITl] System Cambridge Dictionary Online [Internet]. c2010 [cited 2010 Feb 24]. Phonetics.
Available from: <http://dictionary.cambridge.org/help/phonetics.asp>.
Annotation-based Image Retrieval
XIN-JING WANG, LEI ZHANG Microsoft Research Asia, Beijing, China
5
Definition
Given (i) a textual query and (ii) a set of images and their annotations (phrases or keywords) annotation-10 based image retrieval systems retrieve images according to the matching score of the query and the corresponding annotations. There are three levels of queries according to Eakins [7]:
15 • Level 1: Retrieval by primitive features such as color, texture, shape or the spatial location of image elements, typically querying by an example, i.e., "find pictures like this."
• Level 2: Retrieval by derived features, with 20 some   degree   of   logical   inference. For
example, "find a picture of a flower."
• Level 3: Retrieval by abstract attributes, involving a significant amount of high-level reasoning about the purpose of the objects or
25 scenes depicted. This includes retrieval of
named events, of pictures with emotional or religious significance, etc., e.g., "find pictures of a joyful crowd."
30 Together, levels 2 and 3 are referred to as semantic image retrieval, which can also be regarded as annotation-based image retrieval.
Historical Background
35
There are two frameworks of image retrieval [6]: annotation-based (or more popularly, text-based) and content-based. The annotation-based approach can be tracked back to the 1970s. In such systems, the images
40 are manually annotated by text descriptors, which are used by a database management system (DBMS) to perform image retrieval. There are two disadvantages with this approach. The first is that a considerable level of human labor is required for manual annotation. The
45 second is that because of the subjectivity of human perception, the manually labeled annotations may not converge. To overcome the aforementioned disadvantages, content-based image retrieval (CBIR) was introduced in the early 1980s. In CBIR, images are
50 indexed by their visual content, such as color, texture, shapes. In the past decade, several commercial products and experimental prototype systems were developed, such as QBIC, Photobook, Virage, VisualSEEK, Netra, SIMPLIcity.
55 However, the discrepancy between the limited descriptive power of low-level image features and the richness of user semantics, which is referred to as the "semantic gap" bounds the performance of CBIR. On the other hand, due to the explosive growth of visual
60 data (both online and offline) and the phenomenal success in Web search, there has been increasing expectation for image search technologies. For these reasons, the main challenge of image retrieval is understanding media by bridging the semantic gap
65 between the bit stream and the visual content interpretation by humans [3]. Hence, the focus is on automatic image annotation techniques.
Foundations
70
The state-of-the-art image auto-annotation techniques include four main categories [3,6]: (i) using machine learning tools to map low-level features to concepts, (ii) exploring the relations between image content and 75 the textual terms in the associated metadata, (iii) generating semantic template (ST) to support high-level image retrieval, (iv) making use of both the visual content of images and the textual information obtained from the Web to learn the annotations.
80
Machine Learning Approaches
A typical approach is using Support Vector Machine (SVM) as a discriminative classifier over image low-85 level features. Though straightforward, it has been shown effective in detecting a number of visual concepts.
Recently there has been a surge of interest in leveraging and handling relational data, e.g., images 90 and their surrounding texts. Blei et al. [1] extend the Latent Dirichlet Allocation (LDA) model to the mix of words and images and proposed a Correlation LDA model.
95  Relation Exploring Approaches
Another notable direction for annotating image visual content is exploring the relations among image content and the textual terms in the associated metadata. Such metadata are abundant, but are often incomplete and noisy. By exploring the co-occurrence relations among the images and the words, the initial labels may be filtered and propagated from initial labeled images to additional relevant ones in the same collection [3]. Jeon et al. [5] proposed a cross-media relevance model to learn the joint probabilistic distributions of the words and the visual tokens in each image, which are then used to estimate the likelihood of detecting a specific semantic concept in a new image.
100
105
110
3
Semantic Template Approaches
Though it is not yet widely used in the techniques mentioned above, Semantic Template (ST) is a 115 promising approach in annotation-based image retrieval (a map between high-level concept and low-level visual features).
Chang and Chen [2] show a typical example of ST, in which a visual template is a set of icons or example
120 scenes/objects denoting a personalized view of concepts such as meetings, sunset, etc. The generation of a ST is based on user definition. For a concept, the objects, their spatial and temporal constraints, and the weights of each feature of each object are specified.
125 This initial query scenario is provided to the system, and then through the interaction with users, the system finally converges to a small set of exemplar queries that "best" match (maximize the recall) the concept in the user's mind.
130
Large-Scale Web Data Supported Approaches
Good scalability to a large set of concepts is required in ensuring the practicability of image annotation. On
135 the other hand, images from the Web repositories, e.g., Web search engines or photo sharing sites, come with free but less reliable labels. In [9], a novel search-based annotation framework was proposed to explore such Web-based resources. Fundamentally, it is to
140   automatically expand the text labels of an image of interest, using its initial keyword and image content. The process of [9] is shown in Fig. 1. It contains three stages: the text-based search stage, the content-based search stage, and the annotation learning stage, which
145 are differentiated using different colors (black, brown, blue) and labels (A., B., C.).When a user submits a query image as well as a query keyword, the system first uses the keyword to search a large-scale Web image database (2.4 million images crawled from
150 several Web photo forums), in which images are associated with meaningful but noisy descriptions, as tagged by "A." in Fig. 1. The intention of this step is to select a semantically relevant image subset from the original pool.
155 Visual feature-based search is then applied to further filter the subset and save only those visually similar images (the path labeled by "B." in Fig. 1). By these means, a group of image search results which are both semantically and visually similar to the query image
160 are obtained. Finally, based on the search results, the system collects their associated textual descriptions and applies the Search Result Clustering (SRC) algorithm to group the images into clusters.
165 (Abridged)
Figure 1:
(Lake)
170
Recommended Reading
1. Blei D. and Jordan M.I.Modeling Annotated Data. In Proc. 26th Annual Int. ACM SIGIR Conf. on
175 Research and Development in Information Retrieval, 2003, pp. 127-134.
2. Chang S.-F, Chen W., and Sundaram H. Semantic Visual Templates: Linking Visual Features to Semantics. In Proc. Int. Conf. on Image Processing,
180  Vol. 3. 1998, pp. 531-534.
3. Chang S.-F., Ma W.-Y., and Smeulders A. Recent Advances and Challenges of Semantic Image/Video Search. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 2007, pp. 1205-1208.
185 4. Eakins J. and Graham M. Content-based image retrieval, Technical Report, University of Northumbria at Newcastle, 1999. 5. Jeon J., Lavrenko V., and Manmatha R. Automatic Image Annotation and Retrieval Using Cross-Media Relevance Models, In
190 Proc. 26th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2003, pp. 119-126.
6. Liu Y., Zhang D., Lu G., and MaW.-Y. A survey of content-based    image   retrieval    with high-level
195   semantics. Pattern Recognition., 40(1):262 -282, 2007.
7. Long F., Zhang H.J., and Feng D.D. Fundamentals of content-based image retrieval. In Multimedia Information Retrieval and Management, D. Feng (eds.). Springer, 2003.
200 8. Rui Y., Huang T.S., and Chang S.-F. Image retrieval: current techniques, promising directions, and open issues, J. Visual Commun. Image Represent. 10(4):39-62, 1999.
9. Wang X.-J., Zhang L., Jing F., and Ma W.-Y. 205   AnnoSearch: Image Auto-Annotation by Search, Proc.
IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2006, pp. 1483-1490.
10. Zhuang Y., Liu X., and Pan Y. Apply Semantic Template to Support Content-based Image Retrieval.
4
210   In Proc.  SPIE, Storage and Retrieval for Media Databases, vol. 3972, December 1999, pp. 442^149.
5
Answer the following questions:
1) Describe retrieval by primitive features.
2) What is meant by abstract attributes in the context of retrieving images'?
3) What is meant by semantic image retrieval? Why is it called semantic?
4) What is annotation and what are its disadvantages?
5) What are the machine learning tools good for in image retrieval?
6) Describe the term of relation exploring approach.
7) What is the generation of semantic template based on?
Match the following terms and their definitions:
1) semantic gap
2) SVM
3) metadata
4) CBIR
a) a machine learning approach
b) data about data
c) the discrepancy between the limited descriptive potential of low-level image features and the richness of user's description
d) getting back stored data objects byinspecting their visual characteristic, suchas color, texture, shapes.
Mark the following statements as true or false:
1) Retrieval by derived features can be based on color of the picture.
2) The content-based approach uses descriptors to retrieve images.
3) SVM works on low-level features.
4) Images from the Web repositories come with highly reliable labels.
5) Search Result Clustering refers to grouping information about images.
6
Vocabulary
abundant [a'bAn.cbnt] H"- hojný, překypující
query ['kwia.ri] © ['kwir.i] - dotaz
annotation [.aen.ai/'tei.J^n] © [-a-] =J - anotace
classifier ['klae.si.faia] ^')- klasifikátor
co-occurrence [kaua'ka.rans]
© [koua'ka.rans] - společný výskyt
descriptor [dis'kripta]     - deskriptor, popisek
discriminative [dis'kriminativ]
© [dis'krimanativ] ^'j - rozlišující, schopný
rozlišovat
exemplar [ig'zem.pla:1'] © [-plair]     - typický
příklad, vzor, model
explosive [ik'splau.siv] ©[-'splou-]
- explozivní, výbušný
hence [henřs]     - tudíž (formální)
incomplete [,in.kam'pli:t]     - nekompletní
likelihood ['lai.kli.hud] ^ - pravděpodobnost
machine learning [ma'/iin] ['l3:.nin]
© [1?:-] - strojové učení
metadata ['metadeita]     - metadata, data
popisující jiná data
noisy f'riDi.zi] H') - obsahující šum, hlučný phenomenal [fa'nom.i.n3!] © [-'na:.ma-]
- úžasný, výjimečný
pool [pu:l] ^')- úložiště, zásoba, bazén relational [ri'leifanal] ^')- relační, vztahový retrieval [n'triival] ^')- získávání, vyhledávání scenario [si'na:.ri.au] © [ sa'ner.i.ou] A') - scénář semantic [si'maen.tik] © [-tik-] 4 - sémantický, významový
subset ['sAb.set] =J - podmnožina
token ['tau.kan] © ['tou-] A') - znamení, znak,
symbol
to annotate st ['aen.aa.teit] © [-a-] &J - anotovat
něco, vybavit anotací
to bridge st [brid3] =J - překlenout něco
to leverage st ['Ii:.v3r.id3] ©[ 'Iev.av.1d3]
- využívat k užitku
to overcome st (overcome, overcame, overcome) [,au.va'l<Am] © [,ou.va*-] 4)- překonat něco to propagate ['prop.a.geit] © ['pra:.pa-]
- rozšiřovat, množit
to retrieve [n'triiv] aj - získávat, vyhledávat visual [Vi3.u.al] M : - vizuální
7
Client-Server DBMS
M. TAMER OZSU
University of Waterloo, Waterloo, ON, Canada
5
Definition
Client-server DBMS (database management system) refers to an architectural paradigm that separates 10   database functionality between client machines and servers.
Historical Background
15 The original idea, which is to offload the database management functions to a special server, dates back to the early 1970s [1]. At the time, the computer on which the database system was run was called the database machine, database computer, or backend computer,
20   while the computer that ran the applications was called the host computer. More recent terms for these are the database server and application server, respectively. The client-server architecture, as it appears today, has become a popular architecture around the beginning of
25 the 1990s [2]. Prior to that, the distribution of database functionality assumed that there was no functional difference between the client machines and servers (i.e., an earlier form of today's peer-to-peer architecture). Client-server architectures are believed
30 to be easier to manage than peer-to-peer systems, which has increased their popularity.
Foundations
35 Client-server DBMS architecture involves a number of database client machines accessing one or more database server machines. The general idea is very simple and elegant: distinguish the functionality that needs to be provided and divide these functions into
40 two classes: server functions and client functions. This provides a two-level architecture that makes it easier to manage the complexity of modern DBMSs and the complexity of distribution.
In client-server DBMSs, the database management 45   functionality is shared between the clients and the server(s) (Fig. 1). The server is responsible for the bulk of the data management tasks as it handles the storage, query   optimization,   and   transaction management (locking and recovery). The client, in addition to the 50   application and the user interface, has a DBMS client module that is responsible for managing the data that are cached to the client, and (sometimes) managing the transaction locks that may have been cached as well. It is also possible to place consistency checking of user 55   queries at the client side, but this is not common since it requires the replication of the system catalog at the
60
65
70
75
80
client machines. The communication between the clients and the server(s) is at the level of SQL statements: the clients pass SQL queries to the server without trying to understand or optimize them; the server executes these queries and returns the result relation to the client. The communication between clients and servers is typically over a computer network.
	User	Application	
.E E	interface	program	
Dperat syste	Client DBMS		
	Communication software		
SQL queries w
Result relation
0 p Q	Communication software
	Semantic data controller
r a t	Query optimizer
	Transaction manager
i n	Recovery manager
g	Runtime support processor
	System
Database
85
Client-Server DBMS. Figure 1. Client-server reference
In the model discussed above, there is only one server which is accessed by multiple clients. This is referred to as multiple client-single server architecture [3]. There are a number of advantages of this model. As indicated above, they are simple; the simplicity is primarily due to the fact that data management responsibility is delegated to one server. Therefore, from a data management perspective, this architecture is similar to centralized databases although there are some (important) differences from centralized systems in the way transactions are executed and caches are managed. A second advantage is that they provide predictable performance. This is due to the movement of non-database functions to the clients, allowing the server to focus entirely on data management. This, however, is also the cause of the major disadvantage of client-server systems. Since the data management functionality is centralized at one server, the server becomes a bottleneck and these systems cannot scale very well.
The disadvantage of the simple client-server systems
8
are partially alleviated by a more sophisticated architecture where there are multiple servers in the system (the so-called multiple client-multiple server
90 approach). In this case, two alternative management strategies are possible: either each client manages its own connection to the appropriate server or each client knows of only its "home server", which then communicates with other servers as required. The
95 former approach simplifies server code, but loads the client machines with additional responsibilities, leading to what has been called "heavy client" systems. The latter approach, on the other hand, concentrates the data management functionality at the 100 servers. Thus, the transparency of data access is provided at the server interface, leading to "light clients."
The integration of workstations in a distributed environment enables an extension of the client-server
105 architecture and provides for a more efficient function distribution. Application programs run on workstations, called application servers, while database functions are handled by dedicated computers, called database servers. The clients run the user interface.
110 This leads to the present trend in three-tier distributed system architecture, where sites are organized as specialized servers rather than as general-purpose computers (Fig. 2).
Client		Client	
			
Network
120 application server to be dedicated to one or a few applications, while database servers operate in the multiple server fashion discussed above.
Key Applications
125
Many of the current database applications employ either a two-layer client-server architecture or the three-layer application-server approach.
130 (Abridged)
Recommended Reading
1. Canaday R.H., Harrisson R.D., Ivie E.L., Rydery 135   J.L., and Wehr L.A. A back-end computer for data base
management. Comm. ACM, 17(10):575-582, 1974.
2. Orfali R., Harkey D., and Edwards J. Essential Client/Server Survival Guide, Wiley, New York, 1994.
3. Ozsu M.T and Valduriez R Principles of Distributed 140   Database Systems, 2nd edn., Prentice-Hall, Englewood
Cliffs, NJ, 1999.
Application server
			
			Network
	Database server		
Client-Server DBMS. Figure 2. Database server approach.
115 The application server approach (indeed, an n-tier distributed approach) can be extended by the introduction of multiple database servers and multiple application servers, as can be done in client-server architectures. In this case, it is common for each
9
Answer the following questions:
1) What is a host computer?
2) What is the basic idea behind the client-server architecture?
3) Give examples of server functions.
4) Give examples of client functions.
5) What is SQL?
6) What is the major disadvantage of simple client-server systems?
7) What is a multiple client - multiple server approach?
8) What management strategies are possible with the approach in no. 7?
9) What is a three-tier distributed system architecture?
Match the following terms with their definitions:
1) query optimization
2) heavy client
3) result relation
4) transaction management
a) the return of query execution
b) a technique of handling concurrent access to data
c) a technique of finding the best execution plan to satisfy requests
d) a computer that manages its own connection to the appropriate server
Mark the following statements as true or false:
1) Peer-to-peer systems are easier to manage than client-server architectures.
2) Consistency checking of user queries is invariably placed at the server side.
3) Multiple client - single server architectures provide predictable performance.
4) Nowadays, there is a tendency to organize sites as specialized servers rather than general purpose computers.
10
Vocabulary
bulk [bAlk] ^■') - množství, objem cache [kaef] ^') - vyrovnávací paměť consistency [kan'sis.t3nf.si] =J - konzistence, neporušenost
entirely [in'taia.li] © [-'tair-j - úplně, zcela extension [ik'sten.f J"sn] =í - rozšíření, nástavba host [haust] © [houst] ^')- hostitel, hostitelský module ['modjuil] © ['ma:.d3u:l]     - modul, jednotka
optimization [op.ti.mai'zeij^n] - optimalizace paradigm ['paer.a.daim] © ['per-]
- paradigma, model
predictable [pri'dik.ta.bl]     - předvídatelný
primarily [praťmer.i.li] ^')-v prvé řadě
tier [tiar] © [tir] ^í> - vrstva
to alleviate [a'lii.vi.eit] ^')-odlehčit, ulevit
to dedicate ['ded.i.keit] &J - věnovat, vyhradit
to distinguish [dťstirj.gwif] =J - rozlišit, odlišit
to execute ['ek.si.kjuit] =J - provést
to focus on ['fau.kas on] © ['fou- am] ^')
- zaměřit se na, soustředit se na
to handle [haen.dI] &J - zabývat se, zacházet to offload [.Dflaud] © ['aif.loud] =J - odlehčit, snížit
Phrases
in relation to [rťlei.J"sn] =J - ve vztahu k prior to [praiar] © [prair]     - před
11
Image Representation
VALERIE GOUET-BRUNET CNAM Paris, Paris, France
5
Definition
In computer science, the representation of an image can take many forms. Mostly it refers to the way that
10 the conveyed information, such as color, is coded digitally and how the image is stored, i.e. how an image file is structured. Several open or patented standards were proposed to create, manipulate, store and exchange digital images. They describe the format
15 of image files, the algorithms of image encoding such as compression as well as the format of additional information often called metadata. The visual content of the image can also take part in its representation. The more recent concept has provided new approaches
20 to representation and new standards, gathered together into the discipline named content-based image indexing.
Historical Background
25
The first use of digital images began in the early 1920s with the technological development of facsimile transmission, and in particular with the Bartlane cable transmission system that was the first system that
30 translated pictures into a digital code for efficient picture transmission. Later, in the 1960s and 1970s, the advent of digital image technology was closely tied to the development of government programs for space exploration and espionage and also to medical research
35 with the invention of computerized axial tomography. With the availability of the CCD image sensors (charge-coupled device), the private sector also began to make significant contributions to the development of digital cameras.
40 In the mid-1980s, image format TIFF was created by the company Aldus with the aim of agreeing on a common file format for bitmapped images issued from scanners. In parallel, researchers at Xerox PARC had developed the first laser printer and had recognized the
45 need for a standard means of defining page images. After several fruitless attempts, Adobe Systems proposed the PostScript language in 1982, quickly adapted for driving laser printers. Since then, other file formats   dedicated   to   digital   images   were also
50 proposed, such as GIF (1987), JPEG (1992), PDF (1993), PNG (1995) and SVG (1998). Today, a great effort is made to propose standard formats able to preserve data integrity for archiving purposes, to migrate easily to future technologies as well as to
55 provide efficient compression for access and dissemination.
In 2000, the standard JPEG 2000 was proposed. Moreover, ambitious programs, MPEG-7 and MPEG-21, started in the late 1990s, are focusing on the 60 harmonization of methods for representing, storing, sharing and accessing multimedia contents (text, audio, image and video) in a unified framework.
Foundations
65
Basics of Image Representation
Digital images can be classified into two main categories: vector graphics and bitmapped images (also called raster images). Vector images are geometrical
70 2D objects created with drawing software or CAD (computer-aided design) systems. They are represented by geometrical primitives such as points, lines, curves, and shapes or polygons, which are all based upon mathematical  equations.  Unlike bitmaps  that are
75 resolution-dependent, vector images are scalable, which means that the scale at which they are shown will not affect their appearance. Such images are dedicated to the representation of images with simple content, such as diagrams, icons or logos.
80 A bitmapped image is composed of a set of dots or squares, called pixels (for picture elements), arranged in a matrix of columns and rows. Each pixel has a specific color or shade of gray, and in combination with neighboring pixels it creates the illusion of a
85 continuous tone image. Unlike human vision, sensors that capture images are not limited to the visual band of the electromagnetic spectrum and digital images can cover almost the entire spectrum, ranging from gamma to radio waves.
90  Managing bitmapped images requires the choice and manipulation of several parameters such as: Color   model.   A   color   model   is   an abstract mathematical model describing the way colors can be represented as tuples of numbers. From this model, the
95 combination of dedicated primary colors provides all the colors possible that are embedded in the corresponding color space. RGB, CMYK, CIELAB and CIELUV are the most known color models and spaces.
100 Dynamic range. The dynamic range of a digital image (also called color depth) determines the maximum range of gray level or color values carried by each pixel. The number of bits used to represent each pixel determines how many colors can appear in the image.
105 Photographic-quality images are usually associated with 24-bit dynamic range, such as in the JPEG format. Resolution. Resolution expresses the density of elements, pixels for instance, within a specific area. This term does not have any sense when dealing with
110 digital images as files, but it applies when associating a digital image with a physical support, such as display on a screen, printing on a printer or capture with a
12
scanner. Resolution is classically represented in terms of dpi (dots per inch) unit, which was originally the
115   unit adopted for printing.
Appearance of bitmapped images, which are made up of a fixed grid of pixels, clearly depends on the resolution chosen, unlike vector images that are scalable and then have the same appearance whatever
120   the dimensions chosen for visualization.
Image Compression
Image compression is the process of shrinking the size of digital image files. Compression algorithms are
125 especially characterized by two factors: compression ratio and generational integrity. Compression ratio is the ratio of compressed image size to uncompressed size and generational integrity refers to the ability for a compression scheme to prevent or mitigate loss of
130 data, and therefore image quality, through multiple cycles of compression and decompression. Lossless compression ensures that the image data is retained, as with Run-length encoding, Huffman coding and LZW coding. On the other hand, lossy compression schemes
135   involve intentionally sacrificing the quality of stored images by selectively discarding pieces of data. Run-Length Encoding (RLE) is probably the simplest form of lossless data compression: sequences in which the same data value occurs in many consecutive data
140 elements are stored as a single data value and count. Image data is normally run-length encoded in a sequential process that treats the image data as a ID stream, line by line, column by column or diagonally in a zigzag fashion. Common digital image formats for
145   run-length encoded data include TGA, PCX. It is possible with BMP, TIFF and JPEG. Huffman  Coding,  created in  1951  by David A. Huffman, is an entropy encoding algorithm used for lossless data compression. The basic idea of this
150 algorithm is to code with few digits the most common input symbols of a document. Each symbol is encoded by using a variable-length code table, where the codes are defined according to the estimated probability of occurrence for each possible symbol. Today, Huffman
155   coding is often used during the final process of some other compression methods such as JPEG and MP3. LZW Coding Lempel-Ziv-Welch (LZW) is a lossless data compression  algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch and published in
160 1984. The compression algorithm builds a string translation table that is based on fixed-length codes (usually 12-bit). As the system character serially examines the document if the string read is not stored in the table, a new code is created in the table and
165 associated with this string. Otherwise, the current string is encoded with an existing code. This algorithm became very widely used after it became part of the GIF image format in 1987. In contrast to other
compression techniques  such as JPEG, it allows
170  preserving very sharp edges, suitable for line art images often stored in GIF format. JPEG Compression JPEG is the most common image format used for compressing and storing by digital cameras and other photographic image capture devices.
175 ' 'JPEG'' stands for Joint Photographic Experts Group, the name of the committee that created the standard, which was approved in 1994 as ISO 10918-1 standard. This algorithm stands on the representation of the image in the frequency domain by using a two-
180 dimensional DCT (Discrete Cosine Transform) that describes the variability of the signal in terms of low-level and high-level frequencies. The human eye notices small differences in brightness over a relatively large area, but does not distinguish the exact strength
185 of a high frequency brightness variation very well. Consequently, the amount of information in the high frequency components of the DCT can be neglected without drastically affecting perceptual image quality: the DCT components are divided by factors of a
190 quantization matrix that increase with the spatial frequency, and then rounded to the nearest integer. This is the main lossy operation in the whole process. Typically, many of the higher frequency components are rounded to zero and the other components become
195   small numbers, which take many fewer bits to store.
File Formats
There are a lot of image formats for vector graphics as well as for bitmapped images. Most of them are open 200   standards and patent expired for the others.
Vector image formats contain a geometric description of the objects which can be rendered smoothly at any desired size. Among the most common formats, there are:
205 - SVG (Scalable Vector Graphics) is an open XML based standard created in 1998 and developed by the World Wide Web Consortium to address the need for a versatile, scriptable and all-purpose vector format for the web and otherwise.
210   - EPS (Encapsulated PostScript) is a standard file format created by Adobe Systems in the mid-1980s. It follows DSC (Document Structuring Conventions) rules that are a set of standards for PostScript. File formats abound for bitmapped images, but many
215   digital imaging projects have settled on the formula of TIFF, JPEG, GIF and also PNG files. - TIFF, for Tagged Image File Format, is a file format for storing images such as photographs as well as graphics. It was originally created by the company
220 Aldus, was then under the control of Adobe Systems and is now in the public domain. TIFF supports several lossless and lossy techniques of image compression, such as LZW, Huffman coding and JPEG. The ability to store image data in a lossless format makes TIFF
13
225 files a useful method for archiving images and preservation purposes.
- JPEG, for Joint Photographers Experts Group, is a file format that was developed specifically for high quality compression of photographic images in a 24-bit
230 RGB color model. It is generally employed for online presentation and dissemination; the associated lossy algorithm for compression makes it inappropriate for archiving purposes. The file format associated is JFIF (JPEG File Interchange
235 Format, 1992), a public domain storage format for JPEG compressed images. Unlike TIFF, JFIF does not allow for the storage of associated metadata, a failing that has led to the development of SPIFF (Still Picture Interchange   File   Format),   which   is   now the
240   international standard.
- GIF, for Graphics Interchange Format, is an 8-bit image format for indexed colors that was introduced by CompuServe in 1987 and has since come into widespread usage for art images such as diagrams or
245   logos with a limited numbers of colors.
In 1995 CompuServe proposed the PNG format (Portable Network Graphics) as a replacement for the GIF format without patent license. PNG offers a better and lossless compression technique called DEFLATE
250   (that combines LZ77 with Huffman coding). Since 2003, it has been an international standard. Today, the status of TIFF as the de facto standard format for archival digital image files is challenged by other formats such as PNG and JPEG 2000 that are
255 able to preserve data integrity as well as to provide efficient compression ratios for access and dissemination.
Metadata Representation
260 Metadata are commonly defined as "data about data." They constitute the documentation or a structured description associated with a document. Image files automatically include a certain amount of metadata that are stored in an area of the file defined by the file
265 format and called the header but information may also be stored externally.
In the widely used TIFF format, the term "tagged" indicates that developers can define and apply dedicated tags to enable them to include their own
270 proprietary information (called "private tags") inside a TIFF file without causing problems of compatibility. More recently, Exif format (Exchangeable image file format) was created by the Japan Electronic Industries Development Association. The latest version was
275 published in 2002 and while the specification is not currently maintained by any industry or standards organization, its use by camera manufacturers is nearly universal. Exif fields are generated at the creation of the image and should not be modified after with the
280   aim of including additional information like title or
keywords. To do this, other formats are recommended, such as XMP (extensible Metadata Platform). This last is an XML-based standard for creating, processing and storing standardized, extensible and proprietary 285 metadata, created by Adobe Systems in 2001. XMP metadata can be embedded into a significant number of popular file formats. It is used in PDF and other image formats such as JPEG, JPEG 2000, GIF, PNG, TIFF and EPS.
290 In parallel to generic standards, standards dedicated to specific image applications also exist, such as DICOM (Digital Imaging and Communications in Medicine). This is a standard created in 1992 and widely adopted by hospitals,  for handling,  storing,  printing and
295 transmitting information in medical imaging. It includes a file format definition and a network communications protocol.
Metadata constitute the documentation of all aspects of digital files essential to their persistence, usefulness
300 and access. Images without appropriate metadata may become hard to view, migrate to new technology, or to access among large volumes of images. When annotation is inappropriate or is missing, the representation of the visual content of images by image
305   analysis may be an interesting alternative.
Image Content Representation
Born in the early 1990s, Content-Based Image Indexing   (CBIR)   is   a   discipline   that exploits
310 techniques of image analysis and databases. Indexing an image by its content consists of automatically extracting structures that describe the visual content relevantly for the considered application. These structures can describe the visual content of an image
315 globally or locally by characterizing its distribution of color, shape and texture, or parts or objects of the image. The visual structures exhibited are considered as the index of the image, they are digitally represented with one or several multidimensional vectors called
320   signature of the image.
Key Applications
JPEG 2000 is a standard that gathers an image file 325   format and an algorithm of image compression, created by the Joint Photographic Experts Group committee in 2000. The coding algorithm of JPEG 2000 is similar to the JPEG one. It mainly differs in the use of wavelets instead of a DCT. Wavelets provide a decomposition 330   of the image into a pyramid of sub-images which store different levels of resolution of the image. They can be of two types, according to the objective: (1) a Daubechies wavelet transform that requires quantization to reduce the amount of bits representing 335   data,   as   JPEG   does,   and   then   imposes lossy compression;
14
(2) a rounded version of Le Gall wavelet transform, that uses only integer coefficients and then does not require quantization, providing lossless coding.
340 The aim of this standard is not only improving compression performance but also adding features, among which are transmission error resilience and region of interest (ROI). This last offers the opportunity of storing parts of the same picture using
345 different quality. Some parts of particular interest such as faces can be stored with higher quality, to the detriment of other ones where low quality/high compression can be tolerated. The JPEG 2000 standard defines two file formats that support embedded XML
350 metadata: JP2, which supports simple XML, and JPX, which has a more robust XML system based on an embedded metadata initiative of the International Imaging Industry Association (the DIG35 specification).
355
Future Directions
Today, there is a need of a unified framework for the creation,  representation,  storage,  access, delivery,
360   management and protection of multimedia contents. New standards for managing these contents are the international standards MPEG-21 and MPEG-7. MPEG-21 is a standard started in 1999 by MPEG (Moving Picture Experts Group) and now normalized
365 as ISO/TEC 21000. Its main objectives are to define an open framework for multimedia applications and more precisely to provide a standardized structure for various media contents and to facilitate their access, delivery, management and protection. MPEG-7 is
370 another ISO/IEC standard started by MPEG in 1998 and formally called Multimedia Content Description Interface. One of its main objectives is to provide unified and efficient searching, filtering and content identification methods for these media.
375
(Abridged)
Recommended Reading
380   1. Adobe XMP main page:
http://www.adobe.com/products/xmp/index.html. 5 2. Datta R. Joshi D. Li J. and Wang J. Z. Image retrieval: Ideas, influences, and trends of the new age. ACM Comput. Surv. 40 (2), 5 April 2008.
385   3. File format info.
http://www.fileformat.info/format/could.htm. 3 4. Ian S., Burnett, Pereira F., Van de Walle R., and Koenen R. The MPEG-21 Book. Wiley, 462 pages, 6 May 2006.
390 5. Manjunath B.S., Salembier P., and Sikora T. Introduction to MPEG-7: Multimedia Content Description Interface. Wiley & Sons, 6 April 2002.
6. Murray J. D. and van Ryper W. Encyclopedia of Graphics File Formats. O'Reilly, 1152 pages, 2nd
395   edition, 1996.
7. Oleg S., Pianykh. Digital Imaging and Communications in Medicine (DICOM): A Practical Introduction and Survival Guide. Springer, 2008.
8. Salomon D., Motta G.,  and Bryant D. Data 400   Compression: The Complete Reference. Springer, 4th
edition, 2006.
9. Taubman D. S. and Marcellin M. W. JPEG 2000: Image Compression Fundamentals, Standards and Practice. Kluwer International Series in Engineering
405   and Computer Science, Sees 642, 52001.
10. Wallace G. K. The JPEG still picture compression standard. Commun. of the ACM, 34(4):30-44, 3 April 1991.
15
Answer the following questions:
1) What are the two main categories of digital images and what is the difference between them?
2) What are the key factors in determining image quality?
3) What is compression and what types of compression are mentioned in the text?
4) Name some types of lossless compression.
5) What is JPEG and what is it used for?
6) Name some vector image formats.
7) What are file formats in which bitmap data can be saved?
8) What term is used for "data about data"?
9) Why do we need content-based image indexing?
10) In what way does JPEG 2000 differ from the JPEG standard?
11) What is the standard for multimedia applications?
Match the following terms with their definitions:
1) Run-length encoding (RLE)
2) Bitmapped image
3) Content-based information retrieval
4) Color model
5) Lossy compression
6) TIFF
7) JPEG 2000
f) Large runs of consecutive identical data values replaced by a simple code with the data value and length of the run
g) Tagged Image File Format
Mark the following statements as true or false:
1) Vector graphics are images that are completely described using mathematical definitions.
2) Vector drawings can usually be scaled without any loss in quality.
3) Common raster images include 2- and 3-D architectural drawings, flow charts, logos and fonts.
4) The Lempel-Ziv (LZ) compression methods are among the most popular algorithms for lossy storage.
5) JPEG 2000 is a newer version of the JPEG format that compresses images via wavelets.
6) PNG (Portable Network Graphics) was a predecessor of the GIF format for transfer and display images online.
a) A file format that uses wavelet compression
b) An attempt to describe color in a mathematical, predictable and reproducible way
c) An image made up of a given number of pixels, each with a specific color value, laid out in a grid
d) Technology that is able to retrieve images on the basis of machine-recognizable visual criteria
e) Reduction in file size that involves permanent loss of information
16
Vocabulary
advent ['aed.vent], [-vant] sj - příchod, nástup band [baend] sj - pásmo column ['kol.am] © ['kai.lam] si - sloupec consecutive [kan'sek.ju.tiv] © [-t iv] H')
- následný
cosine ['kau.sain] © ['kou-] M 1 - kosinus curve [k3:v] © [ k?:v] M■>) - křivka dedicated ['ded.i.kei.tid] ® [-tid] - zaměřený, věnovaný
detriment ['det.ri.mant] si - škoda, újma dimension [,dai'men.ř/3n] «J - rozměr dissemination [di'sem.i.neif3n] ^ - šíření edge [ed3] A '■ - hrana formula ['fDi.mju.la] © f'fDir-] H 5 - vzorec, program, plán
framework ['freim.w3:k] © [-w?:k] M - rámec
geometrical primitives [,d3i:.a'met.nk.al] ^ )
['pri.mi.tivz] - geometrické útvary
grid [grid] =i - mřížka
charge coupled device [t/aid^] © [tja:rd3] M :
['kAp.lt] =i  [di'vais] si - zařízení s vázanými
náboji
intentionally [in'ten.f3n.ali] si - záměrně
line [lain] =i - čára, přímka
lossless f'loslis] =i - bezeztrátový
lossy ['losi] si - ztrátový
means [mi:nz] - prostředek
objective [ab'd3ek.tiv] 4 0 - cíl
occurrence [a'kAr.3nfs] ® [-'k?:-] =i - výskyt
perceptual [pa'sep.Ju.al] =i -vnímavostní
persistence [pa'sis.t3nfs] © [paH <4 - stálost,
vytrvalost
point [p3int] =i - bod
polygon ['polygon] © ['pa:.Ii.gain] H
- mnohoúhelník, polygon
probability [.prob.a'bil.i.ti] ©[,pra:.ba'bil.9.ti]
- pravděpodobnost
proprietary [pra'praia.tri] © [-ter.i] H>> -vlastnický, patentovaný
replacement for [ri'pleis.mant] ^ - náhrada za representation [,rep.n.zen'tei.J3n] si - zobrazení, vyobrazení
resilience [rťzil.i.anf s] =i - odolnost
row [rau] © [ rou] si -řádek
run-length coding [rAn] M■') [lerjXr 0] H !
[kaudirj] ® [koudirj] - kódování délkou běhu
scalable ['skei.la.bl] =í - škálovatelný
scriptable ['skřip.ta.bl] - skriptovatelný
shade [feid] ^ - odstín
shape [feip] ^) - tvar, útvar
smoothly ['smu:5.li] - hladce
strength [strerj9] - síla, intenzita
string [strirj] ^ - řetězec
to allow for [a'lau] ^ - počítat s čím
to associate with [a'sau.si.eit] ® [-sou-] 4
- spojovat s
to be made up of [meid] =i - být složen to capture ['kaep.tJV] ® [-tfa*] 4 - zachytit to convey [kan'vei] =J - sdělit, vyjádřit to differ In st ['dif.ar] ® [-a*] ^ - Ušit se v něčem to discard [dťskaid] © [-'skaird] *J - odhodit to distinguish [di'stin.gwif] =4 - rozlišit to embed st in st [im'bed] si - zapustit, zasadit do
to employ [im'pbi] si - použít, využít
to exploit [ik'spbit] ^1   ^ - využít, zužitkovat
to facilitate [fa'sil.i.teit] ^ - usnadnit
to focus on st ['fau.kas] ® ['fou-] si
- zaměřit se na
to challenge ['tfael.ind3] =i - zpochybnit, vznést námitky
to migrate [maťgreit] ^) ^')- přesunout, přemístit
to mitigate ['mit.i.geit] ® ['mít-] si - zmírnit
to neglect st [ni'glekt] si - zanedbat, opomenout
to preserve [pri'z3:v] ® [-'z?:v] ^ - uchovat
to render ['ren.dar] ® [-da*] *t - zobrazit
to retain [n'tein] A - uchovat
to round [raund] =i - zaokrouhlit
to sacrifice ['saek.n.fais] =i - obětovat
to settle on st ['set.I] ® ['set-] *J - rozhodnout
se pro
to shrink [frirjk] si -zmenšit
to stand for [staend] ^ ■) - znamenat
to tie [tai] =i - vázat se, souviset
to treat st as [tri it] si - považovat co za
tuple [tju:pal] - n-tice
unlike [An'laik] =i - na rozdíl
versatile [V3:.sa.tail] © [Vyi.sa.tH] ^ ) <')
-všestranný, univerzální
17
Phrases
in a zigzag fashion ['zig.zaeg] ^ ) ['faej>n] -klikatě, cikcak
in contrast to ['kon.traist] © ['kain.traest] =J -naproti tomu, na rozdíl
in parallel to ['paer.a.lel] © ['per-] ^') - souběžně in particular [pa'tik.ju.lar] © [p^'tik.ja.la^] ^')-obzvláště
widely used - hojně využívaný
with the aim of agreeing - s cílem shodnout se
18
Processor Cache
PETER BONCZ
CWI, Amsterdam, The Netherlands
5
Definition
To hide the high latencies of DRAM access, modern computer architecture now features a memory 10 hierarchy that besides DRAM also includes SRAM cache memories, typically located on the CPU chip. Memory access first checks these caches, which takes only a few cycles. Only if the needed data is not found, an expensive memory access is needed.
15
Key Points
CPU caches are SRAM memories located on the CPU chip, intended to hide the high latency of accessing off-
20 chip DRAM memory. Caches are organized in cache lines (typically 64 bytes). In a fully-associative cache, each memory line can be stored in any location of the cache. To make checking the cache fast, however, CPU caches tend to have limited associativity, such that
25 storage of a particular cache line is possible in only 2 or 4 locations. Thus only 2 or 4 locations need to be checked during lookup (these are called 2- resp. 4-way associative caches). The cache hit ratio is determined by the spatial and temporal locality of the memory
30   access generated by the running program(s).
Cache misses can either be compulsory misses (getting the cache lines of all used memory once), capacity misses (caused by the cache being too small to keep all multiply used lines in cache), or conflict misses (due to
35   the limited associativity of the cache).
Most modern CPUs have at least three independent caches: an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a Translation Lookaside Buffer(TLB)
40 used to speed up virtual-to-physical address translation for both executable instructions and data. The TLB is not organized in cache lines; it simply holds pairs of (virtual, logical) page mappings, typically a fairly limited amount (e.g., 64). In practice, this means that
45 algorithms that repeatedly touch memory in more than 64 pages (whose size is often 4 KB) shortly after each other, run into TLB thrashing. This problem can sometimes be mitigated by setting a large virtual memory page size, or by using special large OS pages
50 (sometimes supported in the CPU with a separate, smaller, TLB for large pages).
Another issue is the tradeoff between latency and hit rate. Larger caches have better hit rates but longer latency. To address this tradeoff, many computers use 55 multiple levels of cache, with small fast caches backed up  by  larger  slower  caches.  Multi-level caches
generally operate by checking the smallest Level 1 (LI) cache first; if it hits, the processor proceeds at high speed. If the smaller cache misses, the next larger
60 cache (L2) is checked, and so on, before external memory is checked. As the latency difference between main memory and the fastest cache has become larger, some processors have begun to utilize as many as three levels of on-chip cache.
65 For multi-CPU and multi-core systems, the fact that some of the higher levels of cache are not shared, yet provide coherent access to shared memory, causes additional cache-coherency inter-core communication to invalid stale copies of cache lines on other cores
70 when one core modifies it. In multi-core CPUs, an important issue is that cache level is shared among all cores - this cache level is on the one hand a potential hot-spot for cache conflicts, on the other hand provides an opportunity for very fast inter-core data exchange.
75 In case of sequential data processing, the memory controller or memory chipset in modern computers often detects this access pattern and starts requesting the subsequent cache lines in advance. This is called hardware prefetching. Prefetching effectively allows
80 hiding compulsory cache misses. Without prefetching, the effective memory bandwidth would equate cache line size divided by memory latency (e.g., 64/50ns = 1.2 GB/s). Thanks to hardware prefetching, modern computer  architectures  reach  four times  that on
85 sequential access. Modern CPUs also offer explicit prefetching instructions, which a software writer can exploit to perform (non-sequential) memory access in advance, hiding their latency. In database systems, such software prefetching has successfully been used
90 in making hash-table lookup faster (e.g., in hash-join and hash aggregation).
In database systems, a series of cache-conscious data storage layouts (e.g., DSM and PAX) have been proposed to improve cache line usage. Also, a number 95 of cache-conscious query processing algorithms, such as cache-partitioned hash join and hash-join using memory prefetching, have been studied. In the area of data structures and theoretical computer science, there has recently been interest in cache-oblivious 100 algorithms that regardless of the exact parameters of the memory hierarchy (number of levels, cache size, cache line sizes and latencies) perform well.
(Abridged)
105
19
Answer the following questions:
1) What kinds of independent caches are mentioned in the text?
2) What is the cache hit ratio determined by?
3) What kinds of cache misses are mentioned in the text?
4) What are cache-oblivious algorithms?
5) How do multi-level caches operate?
6) What enables a faster hash-table lookup?
Match the following terms with their definitions:
1) hardware prefetching
2) cache miss
3) fully-associative cache
4) translation loookaside buffer
a) this type of cache speeds up virtual-to-physical address translation
b) requesting the subsequent memory lines in advance
c) a situation when the requested data does not occur in the cache
d) in this type of cache each memory line can be stored in any location of the cache
Mark the following statements as true or false:
1) Modern computer architecture does not include DRAM, only SRAM.
2) A big problem with DRAM is its high latency.
3) A translation lookaside buffer is organized in cache lines.
4) Larger caches have worse hit rates but lower latency.
Vocabulary
Coherent [kaa 'hia.rant] ©[ kou'hir.snt] ^0 - souvislý, soudržný
Compulsory [kam'pAl.ssr.i] © [ -sa*-] ^ - povinný conscious ['komfTas] © [ 'ka:n-]     - vědom si něčeho
fairly ['fea.li] © [ 'fer-] ^ - celkem, dost
hotspot [/?DtspDt] © [hDitspoit] - ohnisko
independent [,in.di'pen.dsnt] =J - nezávislý
invalid [in'vael.id] =J - neplatný
latency ['lei.t3nf .si] sj - zpoždění
oblivious [a'bliv.i.as]     - zapomnětlivý,
zapomínající, nedbající (něčeho)
ratio ['rei.Ji.au] © [ -ou]     - poměr
sequential [sťkwen.Jal] &J - následující, postupný
spatial ['spei.J3!] prostorový
stale [steil] ^') - zde zastaralý
subsequent ['sAb.si.kwant]     - následující, příští
table [Vei.bl] ^)) - tabulka
temporal ['tem.psr.3l] © [ -pa*.al] ^') - časový
thrashing [Oraejirj] - zahlcení paměti
to equate [ťkweit]     - rovnat se
to fetch [fetj]     - přinést
to mitigate ['mit.i.geit] © [ 'mít-]     - zmírnit
to multiply ['mAl.ti.plai] © [ -t i-] ^') - násobit
to propose [pra'pauz] © [ -'pouz] ^') - navrhnout
to tend to [tend] ^ ■) - mít tendenci k
to utilize ['jui.ti.laiz] © [ -t al.ai-] ^-') - využít,
upotřebit
trade-off ['treid.of] © [ -a:f] ^) - kompromis
Phrases
in advance [ad'va:nfs] © [-Vaenřs] H : - dopředu, předběžně, v předstihu
regardless of [rťga:d.las] © [-'ga:rd-] ^  - bez ohledu na
21
Spatial Data Mining
SHASHI SHEKHAR, JAMES KANG, VIJAY GANDHI 5   University of Minnesota, Minneapolis, MN, USA
Definition
Spatial data mining is the process of discovering 10 nontrivial, interesting, and useful patterns in large spatial datasets. The most common spatial pattern families are co-locations, spatial hotspots, spatial outliers, and location predictions.
15  Historical Background
Spatial data mining research began several decades ago when practitioners and researchers noticed that critical assumptions in classical data mining and statistics were
20 violated by spatial datasets. First, whereas classical datasets often assume that data are discrete, spatial data were observed to reside in continuous space. For example, classical data mining and statistical methods may use market-basket datasets  (e.g., history of
25 Walmart's transactions), where each item-type in a transaction is discrete. However, "transactions" are not natural in continuous spatial datasets, and decomposing space across transactions leads to loss of information  about  neighbor relationships between
30 items across transaction boundaries. In addition, spatial data often exhibits heterogeneity (i.e., no places on the Earth are identical), whereas classical data mining techniques often focus on spatially stationary global patterns   (i.e.,   ignoring   spatial   variations across
35 locations). Finally, one of the common assumptions in classical statistical analysis is that data samples are independently generated. However, this assumption is generally false when analyzing spatial data, because spatial data tends to be highly self-correlated. For
40 example, people with similar characteristics, occupation and background tend to cluster together in the same neighborhoods. In spatial statistics this tendency is called spatial auto-correlation. Ignoring spatial auto-correlation when analyzing data with
45 spatial characteristics may produce hypotheses or models that are inaccurate or inconsistent with the data set. Thus, classical data mining algorithms often perform poorly when applied to spatial data sets. Better methods are needed to analyze spatial data to detect
50   spatial patterns.
Foundations
The spatial data mining literature has focused on four 55   main types of spatial patterns: (i) spatial outliers, which are spatial locations showing a significant
difference from their neighbors; (ii) spatial co-locations, or subsets of event types that tend to be found more often together throughout space than other
60 subsets of event types; (iii) location predictions, that is, information that is inferred about locations favored by an event type based on other explanatory spatial variables; and (iv) spatial hotspots, unusual spatial groupings of events. The remainder of this section
65 presents a general overview of each of these pattern categories.
A spatial outlier is a spatially referenced object whose non-spatial attribute values differ significantly from
70 those of other spatially referenced objects in its spatial neighborhood. For example, consider spatial outliers, detected in traffic measurements for sensors on highway I-35W (North bound) in the Minneapolis-St. Paul area, for a 24-h time period. Station 9 may be
75 considered a spatial outlier as it exhibits inconsistent traffic flow compared with its neighboring stations. Once a spatial outlier is identified, one may proceed with diagnosis. For example, the sensor at Station 9 may be diagnosed as malfunctioning. Spatial attributes
80 are used to characterize location, neighborhood, and distance. Non-spatial attribute dimensions are used to compare a spatially referenced object to its neighbors. Spatial statistics literature provides two kinds of bipartite multidimensional tests, namely graphical tests
85 and quantitative tests. Graphical tests, such as Variogram clouds and Moran scatterplots, are based on the visualization of spatial data and highlight spatial outliers. Quantitative methods provide a precise test to distinguish spatial outliers from the remainder of data.
90
Spatial co-location pattern discovery finds frequently co-located subsets of spatial event types given a map of their locations. Spatial co-location is a generalization of a classical data mining pattern family 95 called association rules, since transactions are not natural in spatial datasets, and partitioning space across transactions leads to loss of information about neighbor relationships between items near transaction boundaries.   Additional   details   about co-location
100   interest measures, e.g. participation index and K functions, and mining algorithms are described in [2]. Location prediction is concerned with the discovery of a model to infer preferred locations of a spatial phenomenon from the maps of other explanatory
105 spatial features. For example, ecologists may build models to predict habitats for endangered species using maps of vegetation, water bodies, climate, and other related species. For example, consider an example of a dataset used in building a location prediction model for
110 red-winged blackbirds in the Darr and Stubble wetlands on the shores of Lake Erie in Ohio, USA. This dataset consists of nest location, distance to open
22
water, vegetation durability and water depth maps. Classical prediction methods may be ineffective in this
115 problem due to the presence of spatial auto-correlation. Spatial data mining techniques that capture the spatial autocorrelation of nest location such as the Spatial Autoregressive Model (SAR) [1] and Markov Random Fields based Bayesian Classifiers (MRF-BC) are used
120   for location prediction modeling.
Spatial Hotspots are unusual spatial groupings of events that tend to be much more closely related than other events. Examples of spatial hotspots can be
125   incidents of crime in a city or outbreaks of a disease.
Hotspot patterns have properties of clustering as well as anomalies from classical data mining. However, hotspot discovery [4] remains a challenging area of research due to variation in shape, size, density of
130 hotspots and underlying space (e.g., Euclidean or spatial networks such as roadmaps). Additional challenges arise from the spatio-temporal semantics such as emerging hotspots, displacement etc.
135  Key Applications
Spatial data mining and the discovery of spatial patterns has applications in a number of areas. Detecting spatial outliers is useful in many applications
140 of geographic information systems and spatial databases, including the domains of public safety, public health, climatology, and location-based services. As noted earlier, for example, spatial outlier applications may be used to identify defective or out of
145 the ordinary (i.e., unusually behaving) sensors in a transportation system. Spatial co-location discovery is useful in ecology in the analysis of animal and plant habitats to identify co-locations of predator-prey species, symbiotic species, or fire events with fuel and
150 ignition sources. Location prediction may provide applications toward predicting the climatic effects of El Nino on locations around the world. Finally, identification of spatial hotspots can be used in crime prevention and reduction, as well as in epidemiological
155   tracking of disease.
(Abridged)
Recommended Reading
160
1. Cressie N.A. Statistics for Spatial Data (Revised Edition). Wiley, New York, NY, 1993.
2. Huang Y., Shekhar S., and Xiong H. Discovering co-location patterns from spatial datasets: a general
165 approach. IEEE Trans. Knowl. Data Eng. (TKDE), 16(12): 1472-1485, 2004.
3. Shekhar S., Schrater P., Vatsavai R., Wu W., and Chawla   S.   Spatial   contextual   classification and
prediction models for mining geospatial data. IEEE 170   Trans.   Multimed.   (special   issue   on Multimedia Databases), 4(2):174-188, 2002.
4. US Department of Justice - Mapping and Analysis for Public Safety report. Mapping Crime: Understanding Hot Spots, 2005
175   (http://www.ncjrs.gov/pdffiles l/nij/209393.pdf).
5. Shekhar S. and Chawla S. A Tour of Spatial Databases. Prentice Hall, 2003.
6. Longley P.A., Goodchild M., Maquire D.J., and Rhind D.W. Geographic Information Systems and
180   Science. Wiley, 2005.
7. Shekhar S., Zhang P., Huang Y., and Vatsavai R. Trend in spatial data mining. In Data Mining: Next Generation Challenges and Future Directions, H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha (eds.).
185   AAAI/MIT Press, 2003.
8. Solberg A.H., Taxt T., and Jain A.K. A Markov random field model for classification of multisource satellite imagery. IEEE
Trans. Geosci. Remote Sens., 34(1): 100-113, 1996. 190   9. Kou Y., Lu C.T., and Chen D. Algorithms for spatial outlier detection. In Proc. 2003 IEEE Int. Conf. on Data Mining, 2003, pp. 597-600.
10. Shekhar S., Lu C.T., and Zhang P. A unified approach to detecting spatial outliers. Geolnformatica,
195  7(2):139-166,2003.
11. Mamoulis N., Cao H., and Cheung D.W. Mining frequent spatio- temporal sequential patterns. In Proc. 2003 IEEE Int. Conf. on Data Mining, 2005, pp. 82-89.
23
Answer the following questions:
1) What is spatial data mining?
2) What are the basic differences between classical datasets and spatial datasets?
3) Why are classical data mining algorithms unsuitable for spatial datasets?
4) What does it mean that spatial data tends to be highly self-correlated? How would you explain spatial auto correlation?
5) Name the four main types of spatial patterns.
6) Explain a spatial outlier and give a simple example.
7) What are spatial and non-spatial attributes?
8) Which pairs are frequently co-located?
9) Give some examples of hotspots.
10) Name some areas where spatial data mining is applied.
Mark the following statements as true or false:
1) Many of the relationships on spatial data are implicit.
2) Spatial autocorrelation means that nearby things are more different than distant things.
3) In spatial data mining data samples are not independent.
4) Classical data mining techniques perform well when applied to spatial data sets.
5) Spatial data mining is based on the same assumptions as classical data mining.
Match the following terms with their definitions:
1) Spatial data mining (SDM)
2) Spatial outlier
3) Location prediction
4) Spatial co-location
5) Hotspot analysis
6) Spatial autocorrelation
7) Variogram clouds and Moran scatterplots
a) Presence of two or more spatial objects at the same location or at significantly close distances from each other
b) Process of finding unusually dense event clusters across time and space
c) Process of discovering interesting, useful, non-trivial patterns from large spatial datasets
d) Spatial outlier detection algorithms based on visualization
e) Observation which appears to be inconsistent with its neighborhood
f) Prediction of events occurring at particular geographic locations
g) Spatial data values are influenced by values in their immediate vicinity
24
Vocabulary
anomaly [a'rmm.a.li] © [-'nai.ma-]     - odchylka assumption [a'sAm/?.J"3n] ftj - předpoklad, domněnka
boundary ['baun.dar.i] [-dri] © [-daH H')
- hranice, mez
defective [dťfek.tiv] ^í> - vadný, chybný discrete [di'skriit] ^')- nespojitý, oddělený, samostatný
displacement [dťspleis.mant] přemístění, posun
endangered [in'dein.d3ad] © [-d3^d]
- ohrožený
grouping ['grui.pin] ^ ■) - seskupení habitat ['haeb.i.taet]        přirozené prostředí, domov
hypothesis (pi. hypotheses) [hai'pDfj.a.sis] © [-'pa:.9a-] 4>), pi. [hai'poe.a.siiz] 4>) -hypotéza, předpoklad, domněnka ignition [ig'nij>n] ^ i - vznícení, vzplanutí incident ['inř.si.dsnt] ^')- případ, incident, příhoda, událost
inconsistent [,in.kan'sis.t3nt] M - neslučitelný, jsoucí v rozporu
namely ['neim.li] ^) - a to, jmenovitě nontrivial [non'triv.i.al] - podstatný, důležitý out of the ordinary [aut] ^) [Di.di.na.ri] © ['D:r.d3n.er-] ^) - neobvyklý, zvláštní outbreak ['aut.breik] ^■) - vypuknutí outlier ['aut.lai.a] =J -co leží mimo pattern ['paet.3n] © ['paet .a^n] ^>)-vzorec, způsob, průběh
phenomenon (pi. phenomena) [fa'nom.i.nan] © [-'na:.ma.na:n] M'>, pi. [fa'nom.i.na]     - jev predator ['pred.a.tar] © [-t&] M■■) 4') - dravec, predator
prey [prei]     - kořist
remainder [ri'mein.d9r] © [-da^]     - zbytek species ['spi:./i:z] ^) - druh stationary ['stei./3n.sr.i] © [-/a.ner-] 4')
- neměnný, nehybný
throughout space [9ru:'aut] 4) [speis] po celém prostoru
to be concerned with st [kan's3:nd]
© [ -'s?:nd] - zabývat se čím
to cluster ['klAs.tar] © [-ta^] A - shlukovat se,
seskupit
to co-locate [kau.lau'keit] © [kou.lou'keit]
^))- vyskytovat se společně to decompose [,di:.kam'pauz] ®[-'pouz] 4) - rozložit
to favor ['fei.var] © [ -va^] M •) - podporovat to highlight ['hai.lait] M$>- zvýraznit, poukázat to infer [in'f3ir] © [ -'f?:] M% - dedukovat, vyvozovat
to note [naut] © [ nout] ^') - upozornit, poukázat to partition [pa:'tij>n] © [pair-] ^ - rozdělit to proceed [praa'siid] © [prou-] 4 - pokračovat, postupovat
to reference ['ref>r.snfs] © [-aH ^) - odkazovat to tend [tend] ^ í - mít sklon, tendenci, být náchylný
to track [traek] =J - sledovat
to violate [Vaia.leit] Hí- porušit, nedodržet
Phrases
due to - kvůli
either - or - buď - nebo
from the standpoint ['staend.pDint] z hlediska in infant stages ['in.fant] ^ - na počátku vývoje
25
Stemming
CHRIS D. PAICE
Lancaster University, Lancaster, UK
5
Definition
Stemming is a process by which word endings or other affixes are removed or modified in order that word 10 forms which differ in non-relevant ways may be merged and treated as equivalent. A computer program which performs such a transformation is referred to as a stemmer or stemming algorithm. The output of a stemming algorithm is known as a stem.
15
Historical Background
The need for stemming first arose in the field of information retrieval (IR), where queries containing
20 search terms need to be matched against document surrogates containing index terms. With the development of computer-based systems for IR, the problem immediately arose that a small difference in form between a search term and an index term could
25 result in a failure to retrieve some relevant documents. Thus, if a query used the term "explosion" and a document was indexed by the term "explosives," there would be no match on this term (whether or not the document would actually be retrieved would
30   depend on the logic and remaining terms of the query). The first stemmer for the English language to be fully described in the literature was developed in the late 1960s by Julie Beth Lovins [11]. This has now been largely superseded by the Porter stemmer [14], which
35 is probably the most widely used, and the Paice/Husk stemmer [12]. Stemmers have also been developed for a wide variety of other languages.
Foundations
40
Definitions
In an IR context, the process of taking two distinct words, phrases or other expressions and treating them as semantically equivalent is referred to as conflation.
45 The two expressions need not be precisely synonymous, but they must refer to the same core concept (compare "computed" and "computable"). In this article, the term "practically equivalent" is used to mean that, for the purposes of a particular
50 application, the words may as well be taken as equivalent. The term conflation is sometimes used as though it is equivalent to stemming, but it is in fact a much broader concept, since it includes (i) cases where the strings concerned are multi-word expressions, as in
55 "access time" and "times for access", and (ii) cases where the strings are not etymologically related, as in
"index term" and "descriptor". In case (i) special string matching techniques may be used, whereas in case (ii) reference to a dictionary or thesaurus is 60 necessary. The present account deals exclusively with the conflation of single etymologically related words. There are various possible approaches to word conflation, including the following.
65 1. Direct matching. In this method, the character sequences of two words are compared directly, and a similarity value is computed. The words are then considered to match if their mutual similarity exceeds a predefined threshold. To give a simple example, the
70 first six letters of the words "exceeds" and "exceeded" are the same, so these words together contain 12 matching letters out of 15. Hence, a similarity of 12/15 = 0.80 can be computed. Use of a threshold (say, 0.70) allows a decision as to whether
75 the words can be considered equivalent. With such a method, setting the threshold is problematic. Thus, the similarity between "exceeds" and "excess" is 0.62, which is below the stated threshold. However, allowing for this by lowering the threshold to 0.60
80 would cause "excess" and "except" (similarity 0.67) to be wrongly conflated.
2. Lexical conflation. In this case a thesaurus or dictionary is used to decide whether two words are
85 equivalent. Obviously, this method can be used even for etymologically unrelated words. A problem here is obtaining a suitably comprehensive and up-to-date thesaurus, and one which explicitly lists routine variants such as plurals.
90
3. Cluster-based conflation. This method, investigated by Xu and Croft [15], involves creating clusters of practically equivalent words by analyzing the word associations in a large representative text corpus. Each
95 query word is then supplemented by adding in the other words in its cluster. In contrast to method (2), the clusters created are specific to the text collection in question. However, the creation of the clusters can be very time-consuming.
100
4. N-gram conflation. In this method, each word is decomposed into a collection of N-letter fragments (N-grams), and a similarity is computed between the N-gram collections of two words; a threshold is then
105 applied to decide whether the words are equivalent. This approach was pioneered by Adamson and Boreham[l], who used sets of bigrams, where N = 2. For example, after eliminating duplicates and sorting into order, "exceeds" can be represented by the
110 bigram set {ce, ds, ed, ee, ex, xc} and "exceeded" by {ce, de, ed, ee, ex, xc}. Out of 7 distinct bigrams here,
26
5 are shared between the two words; hence a similarity of 5/7 = 0.712 can be computed.
115 5. Stemming. Stemming refers to the removal of any suffixes (and sometimes other affixes) from an input word to produce a stem. Two words are then deemed to be equivalent if their stems are identical. This method is much favored because it is fast: all words
120 can be reduced to stems on input to the system, and simple string matching used thereafter. The remainder of this article focuses on stemming in this narrow sense.
125  Prefixes and Infixes
In English, stemmers are usually designed for removing suffixes from words. The removal of "intimate" prefixes such as "intro-," "pro-" and 130 "con-" generally results in words being wrongly conflated (consider "intro-duction," "pro-duction" and "con-duction").
However, there may be a case for removing looser prefixes such as "hyper-" or "macro-." Also, prefix 135 removal may be desirable in certain domains with highly artificial vocabularies, such as chemistry and medicine. As explained below, there are some languages in which removal or replacement of prefixes, or even infixes, is in fact essential.
140
Performance and Evaluation
Since stemmers were originally developed to aid the operation of information retrieval systems, it was
145 natural that they were first assessed in terms of their effect on retrieval performance, as well as on "dictionary compression" rates. Researchers were frustrated to find that the effects on retrieval performance for English language material were small
150 and often negative [10]. Removal of "-s" and other regular inflectional endings might be modestly helpful, but use of heavier stemming could easily result in a loss of performance [7].
Stemming errors are of two kinds: understemming, in 155   which a pair of practically equivalent words are not conflated,    and   overstemming,    in   which two semantically distinct words are wrongly conflated.
Non-English Stemmers
160
Stemming is appropriate for most (though not all) natural languages, and appears to be especially beneficial for highly inflected languages [9]. There is neither space nor need to describe non-English 165 stemmers here, except to note that some languages exhibit much greater structural complexity, and this warrants special approaches. Thus, a typical Arabic
word consists of a root verb of three (or occasionally four or five) consonants (e.g., "k-t-b" for "to write"),
170   into which various prefixes, infixes and suffixes are inserted to produce specific variant forms ("katabna": "we wrote" and "kitab": "book"). Some researchers have concentrated on extracting the correct root from a word [3], but Aljlayl and Frieder
175 have demonstrated that better retrieval performance is obtained by using a simpler "light stemming" approach, in which only the most frequent suffixes and prefixes are removed [4]. Their results showed that extraction of roots causes unacceptable levels of
180 overstemming.
Key Applications
As noted earlier, stemmers are routinely used in 185 information retrieval systems to control vocabulary variability. They also find use in a variety of other natural language tasks, especially when it is required to aggregate mentions of a concept within a document or set of documents. For example, stemmers may be used 190 in constructing lexical chains within a text. Stemming can also have a role to play in the standardization of data for input to a data warehouse.
(Abridged)
195
Recommended Reading
1. Adamson G.W. and Boreham J. The use of an association measure based on character structure to
200 identify semantically related pairs of words and document titles. Inf. Process. Manage., 10(7/8):253-260, 1974.
2. Ahmad F., Yusoff M., and Sembok M.T. Experiments with a stemming algorithm for Malay
205 words. J. Am. Soc. Inf. Sci. Technol., 47(12):909-918, 1996.
3. Al-Sughaiyer LA. and Al-Kharashi LA. Arabic morphological analysis techniques: a comprehensive survey. J. Am. Soc. Inf. Sci. Technol., 55(3):189-213,
210 2004.
4. Aljlayl M. and Frieder O. On Arabic search: Improving the retrieval effectiveness via a light stemming approach. In , 2002, pp. 340-347.
5. Bacchin  M.,  Ferro  N.,   and  Melluci  M. A 215   probabilistic  model  for  stemmer  generation. Inf.
Process. Manage., 41(1):121-137, 2005.
6. Frakes W.B. and Fox C.J. Strength and similarity of affix removal stemming algorithms. SIGIR Forum, 37(l):26-30, 2003 (Spring 2003).
220 7. Harman D. How effective is suffixing? J. Am. Soc. Inf. Sci., 42(1):7-15, 1991.
27
8. Hull D. A Stemming algorithms: a case study for detailed evaluation. J. Am. Soc. Inf. Sci., 47(l):70-84, 1996.
225 9. Krovetz R. Viewing morphology as an inference process. Artificial Intelligence, 118(l/2):277-294, 2000.
10. Lennon M., Pierce D.S., Tarry B.D., and Willett P. An evaluation of some conflation algorithms for
230   information retrieval. J. Inf. Sci., 3:177-183, 1981.
11. Lovins J.B. Development of a stemming algorithm. Mech. Transl. Comput. Linguist., 11:22-31, 1968.
12. Paice CD. Another stemmer. SIGIR Forum, 24(3):56-61, 1990.
235   13. Paice CD. A method for the evaluation of
stemming algorithms based on error counting. J. Am.
Soc. Inf. Sci., 47(8):632-649, 1996.
14. Porter M.F. An algorithm for suffix stripping.
Program, 14(3):130-137, 1980. 240   15. Xu J. and CroftW.B. Corpus-based stemming using
co-occurrence of word variants. ACM Trans. Inf. Syst.,
16(1):61-81, 1998.
28
Answer the following questions:
1) How would you describe stemming? What is its purpose?
2) What often resulted in a failure to retrieve relevant documents during searches in the past?
3) What is conflation?
4) Is there any difference between conflation and stemming?
5) What tools have to be used when strings are not etymologically related?
6) Describe direct matching.
7) What does the term of threshold refer to in the text?
8) What is the disadvantage of cluster-based conflation?
9) What are bigrams?
4) Hyper- in hyperactive is a suffix.
5) The term affix covers both prefix and suffix.
6) Stemming appears beneficial for highly inflected languages.
7) The light-stemming approach is based on removing the least frequent affixes.
Match the following terms and their definitions:
1) lexical conflation
2) cluster-based conflation
3) N-gram conflation
4) understemming
5) overstemming
a) a method using a corpus of texts
b) a method based on bigrams
c) a situation where more-or-less equivalent words are not conflated
d) a method using a dictionary or thesaurus
e) a situation where two semantically distinct words are wrongly conflated
Mark the following statements as true or false:
1) During the conflation, the expressions need to be synonymous.
2) The words mother and father are etymologically related.
3) In stemming, two words are considered equivalent provided their stems are identical.
29
Vocabulary
account [a'kaunt] ^) - výčet; účet actual ['aek.tju.al] [-tju-] [-tful] M$>- vlastní actually ['aek.tju.9.li] [-tju-] [-tju.li] H:>) - vlastně affix ['aef.iks] ^')- affix (předpona, přípona) algorithm ['ael.ga.ri.ô3m] ^) - algoritmus bigram ['baigraem] - bigram (skupina dvou písmen, slabik či slov)
cluster ['klAs.t9r] © [-ta*]     - hrozen, skupina, klastr
comprehensive [.kDm.prťhenf.siv] © [,ka:m-]
- komplexní, obsáhlý
conflation [kan'fleif3n] Hi - spojování compression rate [kam'prej.3n] [reit]
- kompresivita
consonant ['kon.sa.nant] © ['ka:n-]
- souhláska
core [kDír] © [kDír] M') - jádro; jádřinec corpus ['kDi.pas] © ['kDír-] sj - korpus, tělo; soubor textů
distinct [dťstirjkt]     - různý rozdílný duplicate ['dju:.pli.kat] © ['du:-] 4 - duplikát; duplikovaný (srovnej výslovnost s „to duplicate" ['djui.pli.keit] ®['du:-] 4») inflectional [in'flekfanal] A - skloňovací, skloňující, skloňovatelný equivalent [i'kwiv.3l.ant] =í - ekvivalentní etymological [,et.i'mDl.a.d3ikal] © [-'ma:.la-] 4)
- etymologický, vztahující se k původu slova exclusive [ik'sklui.siv] sj -výhradní failure ['fei.ljar] © [-Ija*] 4> - neúspěch hence [henřs] 4>) - tudíž
however [,hau'ev.ar] © [-&] =J - však, avšak identical [aťden.ti.k3!] © [-ta-] sj - identický, stej ný
lexical ['lek.si.k3!] =J - lexikální
lexical chains ['lek.si.k3!] ^') [tjein] ^ ■) - lexikální
řetězce
loose [lu:s] 4>) - volný
mutual ['mjui.tju.al] 4)-vzájemný
predefined [prirdi'faind] © [pri:da'faind]
- předem definovaný
prefix ['prii.fiks] 4>) - předpona
query ['kwia.ri] © [ 'kwir.i] 4) 4) - dotaz
remainder [ri'mein.dar] © [-da*] &J - zbytek
root [ru:t] 4) 4) - kořen
routine [ru:'ti:n] 4>)- obvyklý
semantic [sťmaen.tik] © [-t ik] ^■) - sémantický,
významový
stem [stem] ^i> - kmen; stopka
surrogate ['sAr.a.gat] ®['s?:-] A1)- náhradník;
náhradní
suffix ['sAf.iks] 4>) - přípona
synonymous [sťnon.i.mas] © [-'na:.na-] 4>)
- podobného významu
thereafter [,5ea'ra:f.tar] © [,5er'aef.ta*] 4>) - poté
thesaurus [9a'so:ras] 4) - tesaurus
threshold ['9reJ./?auld] © [-/?ould] 4>) - práh
thus [5as] M■>) - tak, a tak
to aggregate st ['aeg.ri.geit] =J - (na)hromadit
něco
to aid st [eid] A S - napomáhat něčemu to arise (arise, arose, arisen) [a'raiz] 4>) -objevit se; vyvstat
to assess st [a'ses] =J - hodnotit něco
to conflate [kan'fleit] 4) - spojit, spojovat
to decompose st [,di:.kam'pauz] © [-'pouz] 4>)
- rozložit něco
to deem [di:m] A ■) - považovat
to duplicate st ['djui.pli.keit] © ['du:-] ^ 5
- duplikovat něco
to eliminate st [ťlim.i.neit] &J - eliminovat něco to exceed st [ik'siid] M •) - překročit něco to exhibit st [ig'zib.it] 4) - vykazovat něco to extract st [ik'straekt] =í - extrahovat, vytáhnout něco
to favor st ['fei.var] ® [-va*] =J - dávat něčemu přednost
30
to investigate st [in'ves.ti.geit] M S - vyšetřovat něco
to merge st [m3:d3] © [ m?:d3] =í - spojit něco, sloučit
to obtain st [ab'tein]     - získat něco
to pioneer [,paia'niar] © [-'nir] ^ ■) - razit cestu
to retrieve st [rťtriiv] M$ - vyhledat, vyzvednout
něco
to supersede st [,su:.pa'si:d] © [-paH ^') - nahradit něco
to supplement st ['sAp.li.mant] =J - doplnit něco
to treat st [triit] =í - zacházet s něčím
to warrant st [Wr.ant] © ['wDír-] M') - opravňovat
něco
variability [.vea.ri.abil .i.ti] © [,ver.i.a'bil.a.t i] -variabilita
variant [Vea.ri.ant] © [Ver.i-] ^ - varianta warehouse ['wea.haus] © ['wer-] M - skladiště whereas [wea'raez] © [wer'aez] M >) - kdežto
Phrases
In contrast to st ['kon.traist] © ['kain.traest] -Oproti něčemu
Obviously, ... [Db.vi.a.sli] © ['a:b-] 4>) -Samozřejmě
31
Storage Devices
KALADHAR VORUGANTI
Network Appliance, Sunnyvale, CA, USA
5
Definition
One of the goals of database, file and block storage systems is to store data persistently. There are many
10 different types of persistent storage devices technologies such as disks, tapes, DVDs, and Flash. The focus of this write-up is on the design trade-offs, from a usability standpoint, between these different types of persistent storage devices and not on the
15   component details of these different technologies.
Historical Background
From a historical standpoint, tapes were the first type 20 of persistent storage followed by disks, CDs, DVDs, and Flash. Newer types of memory technologies such as PRAM and MRAM are still in their infant stages. These newer non-volatile memory technologies promise DRAM access speeds and packaging 25 densities, but these technologies are still too expensive with respect to cost/gigabyte.
Foundations
30 - Tapes/Tape Libraries: Tape readers/tape head, tape library, tape robot, and tape cartridge are the key components of a tape subsystem. Tapes provide the best storage packaging density in comparison to other types of persistent storage devices. Tapes do not
35 provide random access to storage. Data on tapes can be stored either in compressed or uncompressed format. Unlike disks, tape cartridges can be easily transported between sites. Most organizations typically migrate data from older tape cartridges to newer tape cartridges
40 every 5 years to prevent data loss due to material degradation. One can employ disk based caches in front of tape subsystems in order to allow for tapes to handle bursty traffic. Tapes that provide Write-Once, Read Many (WORM) characteristics are also available.
45 WORM tapes are useful in data compliance environments where regulations warrant guarantees that a piece of data has not been altered. DLT and LTO are currently the two dominant tape technologies in the market. Technology-wise both these standards have
50 minor differences. Finally, from a pure media cost standpoint, tapes are less expensive (cost per gigabyte) than disks and other forms of persistent media.
- Disks/Storage Controllers/NAS Boxes: Disks are the 55   most widely used form of persistent storage media. Disks  are typically  accessed by enterprise level
applications when they are packaged as part of the processing server box (direct attached storage model), or are part of a network attached storage box (NAS)
60 and accessed via NAS protocols or, are packaged as part of a storage controller box and accessed via storage area network protocols (SAN). The current trend is for protocol consolidation, where the same storage controller provides support for both SAN and
65 NAS protocols. Typically, the size of the storage controllers can vary from a few terabytes to hundreds of terabytes (refrigerator sized storage controllers). A storage controller typically consists of redundant processors, protocol processing network cards, and
70 RAID processing adapter cards. The disks are connected to each other via either arbitrated loop or switched networks. Storage controllers also contain multi-gigabyte volatile caches. Disks are also packaged as part of laptops.
75 There is a marked difference in the manufacturing process, and testing process between the enterprise class disks and commodity laptop class disks. Disks vary in their form factor, rotational speed, storage capacity, number of available ports, and the protocols
80 used to access them. Currently, serial SCSI, parallel SCSI, serial ATA and parallel ATA, Fiber Channel, and SSA are the different protocols in use for accessing disks. Lower RPM and disk idle mode are new disk spin-down modes  that  allow disks to
85 consume less power when they are not actively being used.
- DVD/Juke Boxes: DVDs and CDs are optical storage media  that provide random  access  and WORM
90 capabilities. Only recently, the multiple erase capacity of an individual CD or DVD was less than the capacity of a single disk drive or tape cartridge. DVDs can store more data than a CD, and a high definition DVD can store more data than a DVD. There are numerous
95 competing standards for CDs, DVDs and high definition DVDs, however, format agnostic DVD players and DVD writers are emerging. Usage of DVDs is more prominent in the consumer space rather than in the enterprise space. A juke box system allows 100 one to access a library of CDs or DVDs. DVDs have slower access speeds than most types of disks.
- Flash/SSDs/Hybrid Disks: Flash is memory technology that has non-volatile characteristics. Flash memory has slower read times than DRAM. Moreover,
105   it has much slower write times than DRAM.
One has to perform an erase operation before one can re-use a flash memory location. One can only perform a limited number of erase operations. Thus, the number of write operations determines the Flash memory life.
110 SLC and MLC are the two different NAND flash technologies. SLC can be erased a greater number of times, and it has faster access times than MLC based
32
flash. NAND flash has faster write and erase times than NOR flash. NOR flash has faster read times than 115 NAND flash. NAND flash is used to store large amounts of data, whereas NOR flash is used to store executable code.
People are using MLC flash in cameras and digital gadgets, and are using SLC flash as part of solid state
120 disks (SSDs). SSDs provide block level access interface (SCSI), and they contain a controller that performs flash wear leveling and block allocation. Hybrid disks that contain a combination of disks and Flash are emerging. Hybrid disks provide a Flash
125 cache in front of the disk media. One typically can store meta-data or recently used data in the flash portion of hybrid disks to save on power consumption. That is, one does not have to spin up the disk. Flash storage provides much better random access speeds
130   than disk based storage.
Key Applications
Tapes are being used primarily for archival purposes 135   because they provide good sequential read/write times. Disks are the media of choice for most on-line applications. Optical media (CDs, DVDs) are popular in the consumer electronic space. Flash based SSDs are popular for those workloads that exhibit random IOs. 140   Disks are being used in laptops, desktops and storage servers (SANs, NAS, DAS). Tape based WORM media and content addressable based disk storage are providing WORM media capabilities in tape and disk technologies, and thus, these technologies can be used 145   to also store compliance/regulatory data.
(Abridged)
Recommended Reading
150
1. Anderson D., Dykes J., and Riedel E. More than an interface- SCSI versus ATA. In Proc. Second Annual Conf. on FAST. San Francisco, CA, 2003.
2. Toigo   J.   Holy   Grail   of  Network Storage 155   Management. Prentice Hall, Englewood Cliffs, NJ,
2003.
3. Voruganti K., Menon J., and Gopisetty S. Land below a DBMS. SIGMOD Rec, 33(l):64-70, 2004.
Answer the following questions:
Mark the following statements as true or false:
1)   Name some of the persistent storage devices
1)   Portable and inexpensive to purchase, tapes are often used for backing up or archiving data.
mentioned in the text. 2)   What are the advantages and disadvantages of
tapes?
3) Where are WORM tapes useful?
4) Name the two dominant tape technologies that
2)   Flash is memory technology that has volatile characteristics.
3)   WORM disks can be written only once.
are currently on the market. 5)   What is the most widely used form of
4)   NAND has significantly lower storage capacity than NOR.
use for accessing disks? 7)   What spin-down modes are used to save on
persistent storage medium? 6)   What are the protocols which are currently in
5)   MLC is used in solid state drives (SSDs) and SLC is used in consumer appliances like cameras, media players, cell phones, etc.
power consumption? 8)   What are the characteristics of some of the
6) An SSD (solid-state drive or solid-state disk) is a storage device that stores persistent data on solid-state flash memory.
optical storage media?
9) What is flash technology?
10) Where is flash memory used?
7)   In hybrid disks the flash memory handles the data most frequently written to or retrieved from storage.
Match the following terms with their definitions:
1) A library
2) A disk cache
3) Fibre channel
4) A storage area network (SAN)
5) DLT (digital linear tape)
6) A jukebox
a) A mechanism for improving the time it takes to read from or write to a hard disk
b) A unit in which optical disk drives are mounted
c) A high-speed special-purpose network (or sub-network) that interconnects different kinds of data storage devices
d) A collection of physical storage media such as tapes or disks and a way to access them
e) A form of magnetic tape and drive system used for computer data storage and archiving
f) A type of high speed interconnection
34
Vocabulary
access time ['aek.ses] ^') [taim] M') - přístupová doba
bursty traffic [baisti] [traefik] =J - shlukový
provoz, shluky dat
cache [kaef] ^■■) - skrýš, úkryt
competing [kam'piitin] - konkurenční
data compliance ['dei.ta]© [-ta] 4) 4')
[kam'plai.anfs] ^ i - ochrana dat před přepsáním
wear leveling [wear] © [wer] 4') ['lev.alin]
- strategie zápisu a mazání (vyrovnání opotřebení) focus on st ['fau.kas] © ['fou-] H 5 - zaměření format agnostic ['fDi.maet]© ffDir-] 4->) [aeg'nos.tik] © [-'nai.stik] 4')- přehrávající jakýkoli formát
gadget ['gaed3.1t] 43)- přístroj, zařízení
goal [gaul] © [goul] 4') - cíl
high definition [hai] 4) [.def.ťnif.an] 4') - vysoké
rozlišení
idle ['ai.dl] 4>) - nečinný
life [laif] 4-') - životnost
marked [ma:kt] © [ mairkt] 4') 41) - výrazný
non-volatile [nonVol.a.tail] © [non' vai.la.t al] 4')
- stálý, stabilní
numerous ['njui.ma.ras] © [ 'nu:-] =J - početný, četný
persistent [pa'sis.tsnt] © [paH 4-í - stálý, trvalý read time [rnd] =J [taim] 4í> - přístupová doba pro čtení
solid state disk(SSD) ['sol.id] © ['sai.lid] =J [steit] 4>) [disk] 4'J - pevný disk na bázi paměťových čipů
standpoint ['staend.pDint] 4)- stanovisko, hledisko
to access st ['aek.ses] 4á - dostat se k, mít přístup k
to allocate ['ael.a.keit] 4') - přidělit
to alter ['Dl.tar] © [ 'ail.t a*] 4>) - změnit
to compete [kam'piit] sj - soutěžit
to determine [dťt3:.min] ® [ -tr:-] =í - určovat,
udávat
to emerge [i'm3:d3] © [ -'m?:d3] 4í> - objevit se to handle ['haen.dl] 4) - zvládnout, vypořádat se to package ['paek.id3] 4>>- zabalit
to spin down [spin] ftj [daun] ^ ■) - zpomalit otáčení
to spin up [spin] ^') [Ap] 4í> - zrychlit otáčení to vary [Vea.ri] © [ Ver.i] 4:') - lišit se, různit se to warrant [Wr.3nt] © ['wDír-] 4') - zaručit trade-off ['treid.of] © [-a:f] =í - kompromis via [vaia] © [Vii.a] 4:') 4') - přes, prostřednictvím wise (suffix) [waiz] =J - pokud jde o, hovoříme-li o
write time [rait] 4') [taim] 4í> - přístupová doba zápisu
35
Storage Protection
KENICHIWADA
Hitachi Limited, Tokyo, Japan
5
Definition
Storage protection is a kind of data protection for data stored in a storage system. The stored data can be lost
10   or becomes inaccessible due to, mainly, a failure in storage component hardware (such as a hard disk drive or  controller),   a  disastrous  event,  an operator's mistake, or intentional alteration or erasure of the data. Storage protection provides the underlying foundation
15   for high availability and disaster recovery.
Historical Background
In 1956, IBM shipped the first commercial storage that 20 had a hard disk drive. To protect data from bit errors on disk platters, the hard disk drive commonly uses cyclic redundancy check (CRC) and an error-correcting code (ECC).
CRC and ECC cannot protect data from a whole disk 25   failure in which an entire disk becomes inaccessible (for example, because of a disk head crash). The IBM 3990, which was shipped in the 1980s, had the replication functionality in which two identical copies of data were maintained on separate media. This 30   approach protected data from this kind of failure. Replication functionality can be implemented in many other layers of the computer system. Most DBMS support database replication. Some file systems and Logical  Volume  Managers  have  file  or volume 35   replication   functionality.   Further,   many storage systems and storage virtualization appliances support volume replication functionality.
RAID (Redundant Array of Inexpensive Disks) is another technology for protecting data from whole disk
40 failure. D. Patterson et al. published a paper "A Case for Redundant Arrays of Inexpensive Disks (RAID)" in June 1988 at the SIGMOD conference [6]. This paper introduced a five level data protection scheme. The term RAID was adopted from this paper, but
45   currently RAID is an acronym for Redundant Arrays of Independent Disks. It is noted that the patent covering RAID level 5 technology was issued in 1978 [5]. RAID level 1 is a kind of replication. RAID level 2 to 5 can reduce the capacity required to protect data
50 against disk drive failure than replication, but it is limited to protect disk drive failure. Replication, on the other hand, can be used to protect databases, file systems and logical volume. Further replication can be used for disaster recovery, if data are replicated
55 remotely.
Foundations
Hard disk drives commonly use Reed-Solomon code
60 [7] to correct bit errors. Data in hard disk drives is usually stored in fixed length blocks. Controllers in hard disk drives calculate ECC for each block and record it associated with the original data. When data are read the controller checks data integrity using ECC.
65   CRC can be used with ECC for detecting bit errors and/or reducing the possibility of correction error. Most   DBMS   support   database   replication with master/slave relation between the original and the replica. The master process updates and transfers it to
70 the slave. This type of replication can provide high availability to the client of the DBMS in case of storage system failures as well as server failures. Another type of database replication is multi-master, which is mostly used to provide high performance
75 parallel processing. Both types can be either synchronous or asynchronous replication. In synchronous replication, updates made in original are guaranteed in the replica, note there may be some delay in asynchronous replication.
80 Volume replication by storage system is also widely accepted as data protection. There are synchronous and asynchronous replications, the same as database replication. Asynchronous volume replication is often used for long distance remote replication. It may
85 prevent performance degradation caused by replication delay, but could cause some data loss in case of recovery. Synchronous replication, on the other hand, may provide no data loss recovery, but may cause performance degradation due to replication delay.
90 Volume replication is also used within a local data center for online backup. Backup servers use replica volume for backup when original volume is online. To support this, a storage system can pause update delegation from original to replica volume.
95 RAID (Redundant Array of Independent Disks) is a set of disks from one or more commonly accessible disk subsystems, combined with a body of control software, in which part of the physical storage capacity is used to store redundant information about user data stored on 100 the remainder of the storage capacity. The term RAID refers to a group of storage schemes that divide and replicate data among multiple disks, to enhance the availability of data at desired cost and performance levels. A number of standard schemes have evolved 105 which are referred to as levels. Originally, five RAID levels were introduced [6], but many more variations have evolved. Currently, there are several sublevels as well as many non-standard levels. There are trade-offs among RAID levels in terms of performance, cost and 110 reliability.
36
Key Applications
115 Storage protection is essential to achieve business continuity and legal compliance with adequate performance, cost, and reliability.
(Abridged)
120
Recommended Reading
1. ANSI. NFPA1600 Standard on Disaster/Emergency Management and Business Continuity Programs. 125   2. BSI. BS25999; Business Continuity Management.
3. Houghton A. Error Coding for Engineers. Kluwer Academic Publications, Hingham, MA, 2001.
4. Keeton K., Santos C, Beyer D., Chase J., and Wilkes  J.  Designing  for disasters.  In Proc. 3rd
130 USENIX Conf. on File and Storage Technologies, 2004.
5. Ouchi N.K. (IBM Corporation). System for recovering data stored in failed memory unit. US Patent 4,092,732, 1978.
135 6. Patterson D., Gibson G, and Katz R. A case for redundant arrays of inexpensive disks (RAID). In 1988.
7. Sweeney P. Error Control Coding From Theory to Practice. Wiley, New York, 2002. 140   8. http://www.sec.gov/
Answer the following questions:
1) How can data loss occur?
2) What are CRC and ECC?
3) What does RAID stand for?
4) How many RAID levels were originally introduced?
5) What is Reed-Solomon code used for?
6) Besides disaster recovery, what is storage protection good for?
7) What is the basic difference between RAID 1 and the other RAID levels?
Match the following terms with their definitions:
1) synchronous replication
2) asynchronous replication
3) RAID level
4) data integrity check
a) in this approach, changes in the original can take some time to reflect in the replica
b) a way to determine whether data is corrupted
c) in this approach, changes in the original are immediately reflected in the replica
d) a specific strategy of distributing data over multiple disks
Mark the following statements as true or false:
1) When data is read, the controller checks data integrity using CRC.
2) Replication can also be used to protect databases and file systems.
3) Remote replication often employs the asynchronous volume replication approach.
Vocabulary
appliance [a'plai.anf s] =J - přístroj
array [a'rei] 4) - pole, sada
compliance [kam'plai.anf s] sj - shoda, dodržení
degradation [,deg.ra'dei.J"an] sj - pokles, zhoršení
disastrous [di'za:.stras] © [-'zaes.tras]
- katastrofální, neblahý
disk platter [disk] 4> ['plaet.ar] © ['plaet M 4>)
- disková plotna
erasure [i'rei.39r] © [-3^] M - smazání failure [fei.ljar] © [Ij*] 4* - selhání inaccessible [.in.ak'ses.i.bl] =J - nepřístupný, nedostupný
integrity [in'teg.ra.ti] © [-tj] 4% - celistvost, neporušenost
intentional [in'ten.J"an.al] sj - záměrný
loss [Ids] © [la:s] 4$ - ztráta
recovery [n'k/w^r.i] © [-&H =J - obnova
redundancy [ri'dAn.danf .si] =J - nadbytečnost
replica [Yep.li.ka] 4% - kopie, duplikát
scheme [ski:m] =J - schéma, soustava
to evolve [i'vdIv] © [-Va:lv/] 4$ - vyvinout se
volume [vDl.juim] © [Va:l-] =í - objem; svazek
Phrases
due to [dju:] © [du:] 4 í - kvůli; způsobený (čím)
39
Video
YINGLI
IBM T.J. Watson Research Center, Hawthorne, 5  NY, USA
Definition
Video, which means ' T see'' in Latin, is an electronic 10 representation of a sequence of images or frames, put together to simulate motion and interactivity. From the producer's perspective, a video delivers information created from the recording of real events to be processed simultaneously by a viewer's eyes and ears. 15 For most of time, a video also contains other forms of media such as audio.
Video is also referred to as a storage format for moving pictures as compared to image, audio, graphics and animation.
20
Historical Background
Video technology was first developed for television systems, but it has been further developed in many
25 formats to allow for consumer video recordings. Generally speaking, there are two main types of video: analog video and digital video. Analog videos are usually recorded as PAL (Phase Alternating Line) or NTSC   (National   Television   System Committee)
30 electric signals following the VHS (Video Home System) standard and stored in magnetic tapes. Digital videos, on the contrary, are usually captured by digital cameras and stored in digital video formats such as DVD (Digital Versatile Disc), QuickTime and MPEG-
35   4 (Moving Picture Experts Group).
Launched in September 1976, VHS became a standard format for consumer recording and viewing by the 1990s. Since then, it has dominated both home and commercial video markets. In March 1997, the DVD
40 format was introduced to American consumers, which gradually pulled consumers away from VHS in the following years due to its much better quality. In June 2003, the DVD's market share exceeded that of the VHS for the first time. Since then, it has been steadily
45 expanding its consumer market, and by July 2006, most major film studios have stopped releasing new movie titles in VHS format, opting for DVD-only releases. Now, VHS is gradually disappearing from both rental and retail stores, and DVD has dominated
50 the whole commercial market. Nevertheless, VHS is still popular for home recording of television programs, due to the large installed base and the lower cost of VHS recorders and tape.
For the last few decades, as video technology quickly 55   advances and the cost of storage devices rapidly decreases, digital videos have become widely available
in diverse application areas such as medicine, remote sensing,    entertainment,    education    and online information services. This has thus led to very active 60  research in various video-related areas.
Foundations
The last three decades have witnessed a significant
65 amount of research efforts on various aspects of video technologies. Roughly speaking, they fall into the following three general categories: video representation, video content analysis, and video application. Specifically, video representation deals
70 with the way a video is represented, in another word, the file format. Video content analysis, on the other hand, aims to automatically structure and ultimately understand the video by analyzing its underlying content. Due to the difficult nature of this problem,
75 such process usually involves the analysis of multiple media modalities including visual, audio and text information. Finally, video application applies what it has learned from the analysis engine, and facilitates various  types  of content  access  including video
80 browsing, summarization and retrieval. A brief discussion on each of these three research domains is given below.
Video Representation
85
A video sequence with accompanying sound track can occupy a vast amount of storage space when represented in digital format. As estimated in [6], a 1-min video clip could possibly occupy up to 448 MB. 90 Consequently, compression has been playing an important role in modern schemes for video representation.
A wide variety of methods have been proposed to compress the video stream. Nevertheless, almost all of 95 them build their approaches upon the fact that video data contains both spatial and temporal redundancy. Specifically, to reduce the spatial redundancy, an intra-frame compression is applied which registers differences between parts of a single frame. Such a
100 task is more closely related to image compression. Likewise, to reduce the temporal redundancy, an inter-frame compression is exploited which registers differences between neighboring frames. This involves discrete     Cosine     transform     (DCT), motion
105 compensation and other techniques. Some popular video compression mechanisms include H.261, H.263, H.264, MPEG-l,MPEG-2, MPEG-4 and MJPEG (Motion-Joint Photographic Experts Group). Specifically,     H.261      is     a      1990 ITU-T
110 (Telecommunication Standardization Sector of International Telecommunication Union) video coding standard originally designed for transmission over
40
ISDN lines. Later on, H.263 and H.264, which provide more   capabilities   and   mainly   target   at video-
115 conferencing applications, were standardized in 1995 and 2003, respectively. In 1998, the Moving Picture Experts Group (MPEG) was formed to establish an international standard for the coded representation of moving pictures and associated audio on digital storage
120 media. Currently, there have been three established MPEG standards from this effort: MPEG-1, MPEG-2, and MPEG-4. Each of them targets at different commercial applications. For instance, MPEG-1 is usually used as the Video CD (VCD) format, MPEG-2
125 for High Definition Television (HDTV), and MPEG-4 for streaming video applications. Finally, to facilitate mobile appliances such as digital cameras, MJPEG was developed in the 1990s which uses intra-frame coding technology that is very similar to those used in MPEG-
130 1 or MPEG-2. However, it does not use inter-frame prediction, which on the one hand, results in a loss of compression capability, yet on the other hand, it makes the degree of compression capability independent of the amount of motion in the scene. Moreover, it also
135 eases video editing as simple editing can now be performed at any frame.
Video Content Analysis
140 Video is a type of rich media as it often consists of other media types such as audio and text. Consequently, research on video content analysis can be grouped into three classes: visual content analysis, audio   content   analysis,   and   audiovisual content
145 analysis. A general goal of video content analysis is to extract the underlying video structure so as to facilitate convenient and nonlinear content access. Yet a more aggressive goal is to automatically understand video semantics so as to support applications such as video
150 summarization and retrieval that require an in-depth understanding of the video content.
Video Application
155 Besides the large amount of research efforts on video content analysis, there are also many attentions on studying various video applications. After all, making the bulky and unstructured video content convenient and efficient to access, present, share, search and
160 deliver is the ultimate goal of the entire research community in this area.
(Abridged)
165  Recommended Reading
1. Cheung S. and Zakhor A. Efficient video similarity
measurement with video signature. IEEE Trans. Circ. Syst. Video Tech., 13(l):59-74, 2003. 170 2. Li Y. and Dorai C. SVM-based audio classification for instructional video analysis. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2004.
3. Li Y. and Kuo C.-C. Video Content Analysis Using 175   Multimodal    Information:    for    Movie Content
Extraction, Indexing and Representation. Kluwer, MA, USA, 2003.
4. Li Y., Narayanan S., and Kuo C.-C. Content-based movie analysis and indexing based on audiovisual
180 cues. IEEE Trans. Circ. Syst. Video Tech., 14(8): 1073-1085, 2004.
5. Mahmood T.S. and Srinivasan S. Detecting topical events in digital video. In Proc. 8th ACM Int. Conf. on Multimedia, 2000, pp. 85-94.
185 6. Mitchell J., Pennebaker W., Fogg C, and LeGall D. MPEG Video Compression Standard. Chapman & Hall, New York, NY, USA, 1992.
7. MPEG Requirements Group, MPEG-7 Applications Document v.8, ISO/MPEG N2860, MPEG Vancouver
190  Meeting, July 1999.
8. MPEG Requirements Group, MPEG-7 Context, Objectives and Technical Roadmap, ISO/MPEG N2861, MPEG Vancouver Meeting, July 1999.
9. MPEG     Requirements     Group, MPEG-7 195   Requirements Document V.15, ISO/MPEG N4317,
MPEG Sydney Meeting, July 2001.
10. Nock H., Adams W., Iyengar G., Lin C, Naphade M., Neti C, Tseng B., and Smith J. User-trainable video annotation using multimodal cues. In Proc. 26th
200 Annu. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2003, pp. 403-404.
11. Oh J. and Hua K. Efficient and cost-effective techniques for browsing and indexing large video
205 databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000, pp. 415^-26.
12. Pfeiffer S., Lienhart R., Fischer S., and Effelsberg W. Abstracting digital movies automatically. J. Vis. Comm. Image Represent., 7(4):345-353, 1996.
210 13. Yan R., Hauptmann A., and Jin R. Negative paeudo-relevance feedback in content-based video retrieval. In Proc. 11th ACM Int. Conf. on Multimedia, 2003, pp. 343-346.
14. Yeung M., Yeo B., and Liu B. Extracting story 215   units from long programs for video browsing and
navigation. In 1996, pp. 296-305.
15. Zhang T. and Kuo C.-C. Audio content analysis for on-line audiovisual data segmentation. IEEE Trans. Speech Audio Process., 9(4):441^157, 2001.
220 16. Zheng W., Li J., Si Z., Lin F., and Zhang B. and Using highlevel semantic features in video retrieval. In Image and Video Retrieval. Springer, Berlin Heidelberg, New York, 2006, pp. 370-379.
41
Answer the following questions:
1) What is video ?
2) What is the difference between analog and digital video?
3) What makes VHS still popular for home recording?
4) What is meant by remote sensing?
5) What do the terms video representation, video content analysis, and video application refer to?
6) What do most compression formats build on?
7) What is MPEG-1?
8) What does the term "rich media" refer to?
Match the following terms and their definitions:
1) video representation
2) video content analysis
3) intra-frame compression
4) inter-frame compression
5) likewise
a) a way of reducing spatial redundancy
b) deals with the file format
c) a way to reduce temporal redundancy
d) involves structuring the video
e) means the same as "similarly"
Mark the following statements as true or false:
1) Video was first developed for home use.
2) H.261 is a video coding standard originally designed for transmission over ISDN lines.
3) MPEG-4 is used as a high definition television standard.
4) MJPEG has been designed for use in mobile appliances.
5) MJPEG has nothing in common with MPEG-1 and MPEG-2 formats
6) A general goal of video content analysis is to facilitate convenient and linear content access.
Vocabulary
analysis [a'nael.a.sis] 4 i - analýza, pi. analyses appliance [a'plai.anfs] =J - zařízení audio ['di.di.au] © ['ai.di.ou] =J - audio, zvuk audiovisual [,D:.di.au'vi3.u.al] © [,a:.di.ou-]
- audiovizuální
capability [.kei.pa'bil.i.ti] © [-a.t i] 4) - schopnost compression [kam'prej>n] 4% - komprese consumer [kan'sjui.ma1'] © [-'sui.ma*] 4 5
- spotřebitel
discrete cosine transform [di'skri it] ftj ['kau.sain] © ['kou-] 4  [traens'fDim] © [-'fDirm] 4')
- diskrétní kosinová transformace diverse [dai'v3is] © [ di'vns] 4) - rozdílný domain [dai/'mein] © [dou-] sj -doména due to st [dju:] © [ du:] 4 - kvůli něčemu format ['fDi.maet] © f'fDir-] 4-b - formát frame [freim] 4$>- rám, rámec
linear ['lin.i.ar] © [-&] 4 - lineární market share ['ma:.kit] © ['mair-] 4') [/ear] © [Jer] 4 - podíl na trhu modality - modalita
motion ['mau./3n] © ['mou-]     - pohyb nevertheless [,nev.a.Sales] ©[-a*-] &J - nicméně non-linear [non'lin.i.a] - nelineární present ['prezent]     - současný (compare the pronunciation with that of verb to present [pri'zent] ^:J))
research [ri's3:tf] © [ 'ri:.s3:tj] 4') 4) - výzkum retrieval [n'triival] =J -vyhledávání, vyzvedávání sequence ['sii.kwanřs] =í - sekvence spatial and temporal redundancy [spei./9l] 4 ) ['tem.p3r.3l] ®[-pa*.al] 4   [ri'dAn.d3nt .si] 4')
- prostorová a časová nadbytečnost (redundance) steady ['sted.i]     - stálý
thus [5as] 4') - tak, a tak
to aim to do st [eim] 4% - snažit se něco dělat, být zaměřen na dělání něčeho
to browse st [brauz] 4$ - listovat, procházet něčím
to capture st ['kaep.tJV] ® [-tTa*] =J - zachytit něco
to deal with st [dial] 4 - zabývat se něčím to develop [di'vel.ap] 4$ - vyvinout, vyvíjet to disappear [,dis.a'piar] © [-'pír] 4$ - zmizet, mizet
to dominate ['dom.i.neit] © ['dai.ma-] 4 ■) - dominovat
to ease st [i:z] 4 i - udělat něco jednodušší, zjednodušit
to exceed st [iksiid] 4^ - přesahovat, překročit něco
to extract st [ik'straekt] . - extrahovat, vytáhnout něco
to facilitate st [fa'sil.i.teit] 4i> - zjednodušit něco, umožnit něco
to fall into st [fDil] © [ fail] 4$ - spadat do něčeho to involve st [in'volv] © [-Va:lv] 4 - zahrnovat něco
to launch st [binřj] © [ lainřj] 4 - vypustit něco, vydat něco
to occupy ['ok.ju.pai] © ['ai.kju-] 4$ - zabírat, okupovat
to opt for st [opt] © [a:pt] 4 'j - rozhodnout se pro něco, zvolit něco
to present st [pri'zent] &J - prezentovat něco to refer to st [ri'fa:] 4$ - odkazovat na něco, označovat něco
to release st [ri'liis]     - vypustit, vydat něco to simulate ['sim.ju.leit] © [-t id] 4$ - simulovat, napodobovat
to witness st ['wit.nas] 4 - být něčemu svědkem ultimate ['Al.ti.mat] © [-ta-] =S - konečný, nejzazší
underlying [.An.da'lai.in] © [-da*-] 4 5 - základní
43
Phrases
consequently ['kDnf.si.kwant.li] © ['kainř-] -následně
generally speaking - obecně řečeno independent of st [,in.di'pen.dsnt] ^■) - nezávislý na něčem
likewise [laik.waiz] =J - podobně
on the contrary ['kon.tra.ri] © ['kain.tre-] ^:)) -
naopak
on the one hand     on the other hand ... - na
jedné straně     na druhé straně ...
to play an important role in st - hrát důležitou
úlohu v něčem
44