Cluster 3.0 Manual for Windows, Mac OS X, Linux, Unix

Michael Eisen; updated by Michiel de Hoon
Software copyright © Stanford University 1998-99

This manual was originally written by Michael Eisen. It is only partially complete and is a work in progress. The manual was updated in 2002 by Michiel de Hoon, University of Tokyo, Human Genome Center.

This is the manual for Cluster 3.0. Cluster was originally written by Michael Eisen while at Stanford University. We have modified the k-means clustering algorithm in Cluster, and extended the algorithm for Self-Organizing Maps to include two-dimensional rectangular grids. The Euclidean distance and the city-block distance were added as new distance measures between gene expression data. The proprietary Numerical Recipes routines, which were used in the original version of Cluster/TreeView, have been replaced by open source software. Cluster 3.0 is available for Windows, Mac OS X, Linux, and Unix.

November 5, 2002.
Michiel de Hoon
Human Genome Center, University of Tokyo

Table of Contents

1 Introduction
2 Loading, filtering, and adjusting data
  2.1 Loading Data
  2.2 Filtering Data
  2.3 Adjusting Data
    2.3.1 Log transformation
    2.3.2 Mean/Median Centering
    2.3.3 Normalization
3 Distance/Similarity measures
  3.1 Distance measures based on the Pearson correlation
  3.2 Non-parametric distance measures
  3.3 Distance measures related to the Euclidean distance
    3.3.1 Euclidean distance
    3.3.2 City-block distance
  3.4 Missing values
  3.5 Calculating the distance matrix
4 Clustering techniques
  4.1 Hierarchical Clustering
    4.1.1 Centroid Linkage Clustering
    4.1.2 Single Linkage Clustering
    4.1.3 Complete Linkage Clustering
    4.1.4 Average Linkage Clustering
    4.1.5 Weighting
    4.1.6 Ordering of Output File
    4.1.7 Output Files
  4.2 The k-means Clustering Algorithm
  4.3 Self-Organizing Maps
  4.4 Principal Component Analysis
5 Running Cluster 3.0 as a command line program
6 TreeView
7 Code Development Information
8 Bibliography

1 Introduction

Cluster and TreeView are programs that provide a computational and graphical environment for analyzing data from DNA microarray experiments, or other genomic datasets. The program Cluster can organize and analyze the data in a number of different ways. TreeView allows the organized data to be visualized and browsed.

This manual is intended as a reference for using the software, and not as a comprehensive introduction to the methods employed. Many of the methods are drawn from standard statistical cluster analysis. There are excellent textbooks available on cluster analysis, which are listed in the bibliography at the end. The bibliography also contains citations for recent publications in the biological sciences, especially genomics, that employ methods similar to those used here.

2 Loading, filtering, and adjusting data

Data can be loaded into Cluster by choosing Load data file under the File menu. A number of options are provided for adjusting and filtering the data you have loaded. These functions are accessed via the Filter Data and Adjust Data tabs.

2.1 Loading Data

The first step in using Cluster is to import data. Currently, Cluster only reads tab-delimited text files in a particular format, described below. Such tab-delimited text files can be created and exported in any standard spreadsheet program, such as Microsoft Excel. An example datafile can be found under the File format help item in the Help menu. This contains all the information you need for making a Cluster input file.

By convention, in Cluster input tables rows represent genes and columns represent samples or observations (e.g. a single microarray hybridization). For a simple timecourse, a minimal Cluster input file would look like the example sketched below. Each row (gene) has an identifier that always goes in the first column. Here we are using yeast open reading frame codes. Each column (sample) has a label that is always in the first row; here the labels describe the time at which a sample was taken. The first column of the first row contains a special field that tells the program what kind of objects are in each row. In this case, YORF stands for yeast open reading frame. This field can be any alphanumeric value. It is used in TreeView to specify how rows are linked to external websites. The remaining cells in the table contain data for the appropriate gene and sample. The 5.8 in row 2 column 4 means that the observed data value for gene YAL001C at 2 hours was 5.8. Missing values are acceptable and are designated by empty cells (e.g. YAL005C at 2 hours).
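For illustration, a minimal input file of this form might look like the following sketch. Apart from the 5.8 value for YAL001C at 2 hours and the empty cell for YAL005C at 2 hours, which are taken from the description above, the identifiers and values shown here are hypothetical placeholders; columns are aligned here for readability, but in the actual file they are separated by tabs:

    YORF     0 hours  1 hour   2 hours  4 hours
    YAL001C  1.0      1.8      5.8      2.4
    YAL002W  0.9      0.7      0.5      0.2
    YAL005C  1.1      0.8               0.4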
It is possible to have additional information in the input file. A maximal Cluster input file contains a number of optional rows and columns in addition to those in the minimal file. By default, TreeView uses the ID in column 1 as a label for each gene. The NAME column allows you to specify a label for each gene that is distinct from the ID in column 1. The other optional rows and columns will be described later in this text.

When Cluster 3.0 opens the data file, the number of columns in each row is checked. If a given row contains fewer or more columns than needed, an error message is displayed.

Demo data

A demo datafile, which will be used in all of the examples here, is available at http://rana.lbl.gov/downloads/data/demo.txt and is mirrored at http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/demo.txt. The datafile contains yeast gene expression data described in Eisen et al. (1998) [see references at end]. Download this data and load it into Cluster. Cluster will give you information about the loaded datafile.

2.2 Filtering Data

The Filter Data tab allows you to remove genes that do not have certain desired properties from your dataset. The currently available properties that can be used to filter data are
• % Present >= X. This removes all genes that have missing values in greater than (100 − X) percent of the columns.
• SD (Gene Vector) >= X. This removes all genes that have standard deviations of observed values less than X.
• At least X Observations with abs(Val) >= Y. This removes all genes that do not have at least X observations with absolute values greater than Y.
• MaxVal-MinVal >= X. This removes all genes whose maximum minus minimum values are less than X.

These are fairly self-explanatory. When you press Filter, the filters are not immediately applied to the dataset. You are first told how many genes would have passed the filter. If you want to accept the filter, you press Accept; otherwise no changes are made.

2.3 Adjusting Data

From the Adjust Data tab, you can perform a number of operations that alter the underlying data in the imported table. These operations are
• Log Transform Data: replace all data values x by log2(x).
• Center genes [mean or median]: Subtract the row-wise mean or median from the values in each row of data, so that the mean or median value of each row is 0.
• Center arrays [mean or median]: Subtract the column-wise mean or median from the values in each column of data, so that the mean or median value of each column is 0.
• Normalize genes: Multiply all values in each row of data by a scale factor S so that the sum of the squares of the values in each row is 1.0 (a separate S is computed for each row).
• Normalize arrays: Multiply all values in each column of data by a scale factor S so that the sum of the squares of the values in each column is 1.0 (a separate S is computed for each column).

These operations do not commute, so the order in which they are applied is very important, and you should consider it carefully before you apply these operations. The order of operations is (only checked operations are performed):
• Log transform all values.
• Center rows by subtracting the mean or median.
• Normalize rows.
• Center columns by subtracting the mean or median.
• Normalize columns.
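Since the order of these operations matters, it may help to see the sequence spelled out. The following is a minimal sketch of the same pipeline (an illustration, not Cluster's own code), assuming the data are held in a numpy array with NaN marking missing values; the function and parameter names are invented for this example:

    import numpy as np

    def center(data, axis, method="median"):
        # Subtract the row-wise (axis=1) or column-wise (axis=0) mean or median.
        stat = np.nanmedian if method == "median" else np.nanmean
        return data - stat(data, axis=axis, keepdims=True)

    def normalize(data, axis):
        # Scale rows (axis=1) or columns (axis=0) so their sums of squares are 1.0.
        return data / np.sqrt(np.nansum(data**2, axis=axis, keepdims=True))

    def adjust(data, log=True, center_genes=None, norm_genes=False,
               center_arrays=None, norm_arrays=False):
        # Apply the checked operations in the order listed above.
        if log:
            data = np.log2(data)
        if center_genes:                    # "mean" or "median"
            data = center(data, axis=1, method=center_genes)
        if norm_genes:
            data = normalize(data, axis=1)
        if center_arrays:                   # "mean" or "median"
            data = center(data, axis=0, method=center_arrays)
        if norm_arrays:
            data = normalize(data, axis=0)
        return data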
2.3.1 Log transformation

The results of many DNA microarray experiments are fluorescent ratios. Ratio measurements are most naturally processed in log space. Consider an experiment where you are looking at gene expression over time, and the results are relative expression levels compared to time 0. Assume that at timepoint 1 a gene is unchanged, at timepoint 2 it is up 2-fold, and at timepoint 3 it is down 2-fold relative to time 0. The raw ratio values are 1.0, 2.0 and 0.5. In most applications, you want to think of 2-fold up and 2-fold down as being the same magnitude of change, but in an opposite direction. In raw ratio space, however, the difference between timepoints 1 and 2 is +1.0, while between timepoints 1 and 3 it is -0.5. Thus mathematical operations that use the difference between values would treat the 2-fold up change as twice as significant as the 2-fold down change. Usually, you do not want this. In log space (we use log base 2 for simplicity) the data points become 0, 1.0, and -1.0. With these values, 2-fold up and 2-fold down are symmetric about 0. For most applications, we recommend you work in log space.

2.3.2 Mean/Median Centering

Consider a now common experimental design where you are looking at a large number of tumor samples, all compared to a common reference sample made from a collection of cell lines. For each gene, you have a series of ratio values that are relative to the expression level of that gene in the reference sample. Since the reference sample really has nothing to do with your experiment, you want your analysis to be independent of the amount of a gene present in the reference sample. This is achieved by adjusting the values of each gene to reflect their variation from some property of the series of observed values, such as the mean or median. This is what mean and/or median centering of genes does. Centering makes less sense in experiments where the reference sample is part of the experiment, as it is in many timecourses.

Centering the data for columns/arrays can also be used to remove certain types of biases. The results of many two-color fluorescent hybridization experiments are not corrected for systematic biases in ratios that are the result of differences in RNA amounts, labeling efficiency, and image acquisition parameters. Such biases have the effect of multiplying the ratios for all genes by a fixed scalar. Mean or median centering the data in log space has the effect of correcting this bias, although it should be noted that an assumption is being made in correcting this bias, which is that the average gene in a given experiment is expected to have a ratio of 1.0 (or log-ratio of 0). In general, I recommend the use of median rather than mean centering, as it is more robust against outliers.

2.3.3 Normalization

Normalization sets the magnitude (sum of the squares of the values) of a row/column vector to 1.0. Most of the distance metrics used by Cluster work with internally normalized data vectors, but the data are output as they were originally entered. If you want to output normalized vectors, you should select this option. A sample series of operations for raw data would be:
• Adjust Cycle 1) log transform
• Adjust Cycle 2) median center genes and arrays
• repeat (2) five to ten times
• Adjust Cycle 3) normalize genes and arrays
• repeat (3) five to ten times

This results in a log-transformed, median polished (i.e. all row-wise and column-wise median values are close to zero) and normalized (i.e. all row and column magnitudes are close to 1.0) dataset.
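The repeated centering and normalization cycles can be written as short loops. Here is a sketch of the recipe above (an illustration of the procedure, not Cluster's implementation), again assuming a numpy array with NaN for missing values:

    import numpy as np

    def polish(data, cycles=10):
        # Log transform, then alternately median center rows and columns,
        # then alternately normalize rows and columns.
        data = np.log2(data)
        for _ in range(cycles):
            data = data - np.nanmedian(data, axis=1, keepdims=True)  # center genes
            data = data - np.nanmedian(data, axis=0, keepdims=True)  # center arrays
        for _ in range(cycles):
            data = data / np.sqrt(np.nansum(data**2, axis=1, keepdims=True))  # normalize genes
            data = data / np.sqrt(np.nansum(data**2, axis=0, keepdims=True))  # normalize arrays
        return data

Because each centering step can disturb the previous one, the medians only converge towards zero over repeated cycles; the same holds for the alternating row and column normalization, which is why the resulting magnitudes are close to, but not exactly, 1.0.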
After performing these operations you should save the dataset.

3 Distance/Similarity measures

The first choice that must be made is how similarity (or alternatively, distance) between gene expression data is to be defined. There are many ways to compute how similar two series of numbers are. Cluster provides eight options.

3.1 Distance measures based on the Pearson correlation

The most commonly used similarity metrics are based on Pearson correlation. The Pearson correlation coefficient between any two series of numbers x = {x1, x2, . . . , xn} and y = {y1, y2, . . . , yn} is defined as

    r = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{\sigma_x} \right) \left( \frac{y_i - \bar{y}}{\sigma_y} \right),

where x̄ is the average of the values in x, and σx is the standard deviation of these values.

There are many ways of conceptualizing the correlation coefficient. If you were to make a scatterplot of the values of x against y (pairing x1 with y1, x2 with y2, etc.), then r reports how well you can fit a line to the values. The simplest way to think about the correlation coefficient is to plot x and y as curves, with r telling you how similar the shapes of the two curves are. The Pearson correlation coefficient is always between -1 and 1, with 1 meaning that the two series are identical, 0 meaning they are completely uncorrelated, and -1 meaning they are perfect opposites. The correlation coefficient is invariant under linear transformation of the data. That is, if you multiply all the values in y by 2, or add 7 to all the values in y, the correlation between x and y will be unchanged. Thus, two curves that have identical shape, but different magnitude, will still have a correlation of 1.

Cluster actually uses four different flavors of the Pearson correlation. The textbook Pearson correlation coefficient, given by the formula above, is used if you select Correlation (centered) in the Similarity Metric dialog box. Correlation (uncentered) uses the following modified equations:

    r = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i}{\sigma_x^{(0)}} \right) \left( \frac{y_i}{\sigma_y^{(0)}} \right),

in which

    \sigma_x^{(0)} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}; \qquad \sigma_y^{(0)} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} y_i^2}.

This is basically the same function, except that it assumes the mean is 0, even when it is not. The difference is that, if you have two vectors x and y with identical shape, but which are offset relative to each other by a fixed value, they will have a standard Pearson correlation (centered correlation) of 1 but will not have an uncentered correlation of 1. The uncentered correlation is equal to the cosine of the angle between two n-dimensional vectors x and y, each representing a vector in n-dimensional space that passes through the origin. Cluster provides two similarity metrics that are the absolute value of these two correlation functions; these consider two items to be similar if they have opposite expression patterns, whereas the standard correlation coefficients consider opposite genes to be very distant.
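To make the difference between the centered and uncentered flavors concrete, here is a short sketch (not taken from Cluster's source; the function names are invented for this example) that computes both for a pair of vectors with identical shape but a constant offset:

    import numpy as np

    def pearson(x, y):
        # Centered Pearson correlation, using population statistics (1/n).
        x, y = np.asarray(x, float), np.asarray(y, float)
        return np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())

    def uncentered_pearson(x, y):
        # Uncentered correlation: the cosine of the angle between x and y.
        x, y = np.asarray(x, float), np.asarray(y, float)
        sx = np.sqrt(np.mean(x**2))
        sy = np.sqrt(np.mean(y**2))
        return np.mean(x * y) / (sx * sy)

    x = [1.0, 2.0, 3.0]
    y = [8.0, 9.0, 10.0]                 # same shape as x, offset by 7
    print(pearson(x, y))                 # 1.0
    print(uncentered_pearson(x, y))      # about 0.96, not 1.0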
3.2 Non-parametric distance measures

The Spearman rank correlation and Kendall's τ are two additional metrics, which are non-parametric versions of the Pearson correlation coefficient. These methods are more robust against outliers. The Spearman rank correlation calculates the correlation between the ranks of the data values in the two vectors. For example, if we have two data vectors

    x = {2.3, 6.7, 4.5, 20.8}; y = {2.1, 5.9, 4.4, 4.2},

then we first replace them by their ranks:

    x = {1, 3, 2, 4}; y = {1, 4, 3, 2}.

Now we calculate the correlation coefficient in the usual manner from these data vectors, resulting in rSpearman = 0.4. In comparison, the regular Pearson correlation between these data is r = 0.2344. By replacing the data values by their ranks, we reduced the effect of the outlier 20.8 on the value of the correlation coefficient. The Spearman rank correlation can be used as a test statistic for independence between x and y. For more information, see Conover (1980).

Kendall's τ goes a step further by using only the relative ordering of x and y to calculate the correlation (Snedecor & Cochran). To calculate Kendall's τ, consider all pairs of data points (xi, yi) and (xj, yj). We call a pair concordant if
• xi < xj and yi < yj; or
• xi > xj and yi > yj,
and discordant if
• xi < xj and yi > yj; or
• xi > xj and yi < yj.

We can represent this by a table:

    −            (2.3, 2.1)   (6.7, 5.9)   (4.5, 4.4)   (20.8, 4.2)
    (2.3, 2.1)   −            <<           <<           <<
    (6.7, 5.9)   >>           −            >>           <>
    (4.5, 4.4)   >>           <<           −            <>
    (20.8, 4.2)  >>           ><           ><           −

From this table, we find that there are four concordant pairs and two discordant pairs: nc = 4; nd = 2. Kendall's τ is calculated as

    \tau = \frac{n_c - n_d}{N(N-1)/2},

which in this case evaluates to 0.33. In the C Clustering Library, the calculation of Kendall's τ is corrected for the possibility that two ranks are equal. As in the case of the Spearman rank correlation, we may use Kendall's τ to test for independence between x and y.

3.3 Distance measures related to the Euclidean distance

3.3.1 Euclidean distance

A newly added distance function is the Euclidean distance, which is defined as

    d(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2.

The Euclidean distance takes the difference between two gene expression levels directly. It should therefore only be used for expression data that are suitably normalized, for example by converting the measured gene expression levels to log-ratios. In the sum, we only include terms for which both xi and yi are present, and divide by n accordingly. Unlike the correlation-based distance measures, the Euclidean distance takes the magnitude of changes in the gene expression levels into account. An example of the Euclidean distance applied to k-means clustering can be found in De Hoon, Imoto, and Miyano (2002).

3.3.2 City-block distance

The city-block distance, alternatively known as the Manhattan distance, is related to the Euclidean distance. Whereas the Euclidean distance corresponds to the length of the shortest path between two points, the city-block distance is the sum of distances along each dimension:

    d(x, y) = \sum_{i=1}^{n} |x_i - y_i|.

This is equal to the distance you would have to walk between two points in a city, where you have to walk along city blocks. The city-block distance is a metric, as it satisfies the triangle inequality. Again we only include terms for which both xi and yi are present, and divide by n accordingly. As for the Euclidean distance, the expression data are subtracted directly from each other, and we should therefore make sure that they are properly normalized.
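Both measures can be sketched in a few lines, including the missing-value handling described in the next section (an illustration, not Cluster's source code; NaN marks missing values, and each sum is divided by the number of terms actually included):

    import numpy as np

    def euclidean(x, y):
        # Mean squared difference over the observations present in both vectors.
        x, y = np.asarray(x, float), np.asarray(y, float)
        mask = ~np.isnan(x) & ~np.isnan(y)
        return np.mean((x[mask] - y[mask])**2)

    def cityblock(x, y):
        # Mean absolute difference over the observations present in both vectors.
        x, y = np.asarray(x, float), np.asarray(y, float)
        mask = ~np.isnan(x) & ~np.isnan(y)
        return np.mean(np.abs(x[mask] - y[mask]))

    x = [1.0, 2.0, np.nan, 4.0]
    y = [1.5, np.nan, 3.0, 5.0]
    print(euclidean(x, y))   # only positions 1 and 4 are used: (0.5**2 + 1.0**2) / 2
    print(cityblock(x, y))   # (0.5 + 1.0) / 2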
3.4 Missing values

When either x or y has missing values, only observations present for both x and y are used in computing similarities.

3.5 Calculating the distance matrix

With any specified metric, the first step in the hierarchical clustering routines described below is to compute the distance (the opposite of similarity; for all correlation metrics, distance = 1.0 − correlation) between all pairs of items to be clustered (e.g. the set of genes in the current dataset). This can often be time consuming, and, except for pairwise single-linkage clustering, memory intensive (the maximum amount of memory required is 4 × N × N bytes, where N is the number of items being clustered). The algorithm for pairwise single-linkage hierarchical clustering is less memory-intensive (linear in N).

4 Clustering techniques

The Cluster program provides several clustering algorithms. Hierarchical clustering methods organize genes in a tree structure, based on their similarity. Four variants of hierarchical clustering are available in Cluster. In k-means clustering, genes are organized into k clusters, where the number of clusters k needs to be chosen in advance. Self-Organizing Maps create clusters of genes on a two-dimensional rectangular grid, where neighboring clusters are similar. For each of these methods, one of the eight different distance measures can be used. Finally, in Principal Component Analysis, clusters are organized based on the principal component axes of the distance matrix.

4.1 Hierarchical Clustering

The Hierarchical Clustering tab allows you to perform hierarchical clustering on your data. This is a powerful and useful method for analyzing all sorts of large genomic datasets. Many published applications of this analysis are given in the references section at the end.

Cluster currently performs four types of binary, agglomerative, hierarchical clustering. The basic idea is to assemble a set of items (genes or arrays) into a tree, where items are joined by very short branches if they are very similar to each other, and by increasingly longer branches as their similarity decreases.

The first step in hierarchical clustering is to calculate the distance matrix between the gene expression data. Once this matrix of distances is computed, the clustering begins. Agglomerative hierarchical processing consists of repeated cycles where the two closest remaining items (those with the smallest distance) are joined by a node/branch of a tree, with the length of the branch set to the distance between the joined items. The two joined items are removed from the list of items being processed and replaced by an item that represents the new branch. The distances between this new item and all other remaining items are computed, and the process is repeated until only one item remains.

Note that once clustering commences, we are working with items that are true items (e.g. a single gene) and items that are pseudo-items that contain a number of true items. There are a variety of ways to compute distances when we are dealing with pseudo-items, and Cluster currently provides four choices, which are called centroid linkage, single linkage, complete linkage, and average linkage. Note that in older versions of Cluster, centroid linkage was referred to as average linkage.
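The agglomeration cycle itself can be summarized in a short sketch (an illustration, not Cluster's implementation). Single, complete, and average linkage, described below, differ only in the function applied to the pairwise distances between the members of two pseudo-items; centroid linkage does not fit this form, since it averages the underlying data vectors instead:

    def hierarchical_cluster(dist, linkage=min):
        # dist: dict {(i, j): distance} for i < j over the initial items.
        # linkage: min (single), max (complete), or a mean such as
        # lambda ds: sum(ds) / len(ds) (average linkage).
        # Returns the list of merge steps as (members_a, members_b, distance).
        items = {frozenset([i]) for pair in dist for i in pair}
        def d(a, b):
            return linkage([dist[tuple(sorted((i, j)))] for i in a for j in b])
        merges = []
        while len(items) > 1:
            # Join the two closest remaining items, as described above.
            a, b = min(((a, b) for a in items for b in items if a != b),
                       key=lambda pair: d(*pair))
            merges.append((set(a), set(b), d(a, b)))
            items.discard(a)
            items.discard(b)
            items.add(a | b)
        return merges

    dist = {(0, 1): 0.1, (0, 2): 0.4, (0, 3): 0.9,
            (1, 2): 0.5, (1, 3): 0.8, (2, 3): 0.3}
    for a, b, height in hierarchical_cluster(dist, linkage=min):
        print(a, b, height)   # {0} {1} 0.1, then {2} {3} 0.3, then {0,1} {2,3} 0.4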
4.1.1 Centroid Linkage Clustering

If you click Centroid Linkage Clustering, a vector is assigned to each pseudo-item, and this vector is used to compute the distances between this pseudo-item and all remaining items or pseudo-items, using the same similarity metric as was used to calculate the initial similarity matrix. The vector is the average of the vectors of all actual items (e.g. genes) contained within the pseudo-item. Thus, when a new branch of the tree is formed joining together a branch with 5 items and an actual item, the new pseudo-item is assigned a vector that is the average of the 6 vectors it contains, and not the average of the two joined items (note that missing values are not used in the average, and a pseudo-item can have a missing value if all of the items it contains have missing values in the corresponding row/column).

Note that from a theoretical perspective, centroid linkage clustering is peculiar if it is used in combination with one of the distance measures that are based on the Pearson correlation. For these distance measures, the data vectors are implicitly normalized when calculating the distance (for example, by subtracting the mean and dividing by the standard deviation when calculating the Pearson correlation). However, when two genes are joined and their centroid is calculated by averaging their data vectors, no normalization is applied. This may lead to the surprising result that distances may decrease when we go up in the tree representing the hierarchical clustering result. For example, consider this data set:

           Exp1  Exp2  Exp3  Exp4
    Gene1  0.96  0.07  0.97  0.98
    Gene2  0.50  0.28  0.29  0.77
    Gene3  0.08  0.96  0.51  0.51
    Gene4  0.14  0.19  0.41  0.51

Performing pairwise centroid-linkage hierarchical clustering on this data set, using the Pearson distance as the distance measure, produces the clustering result
• Gene 1 joins Gene 2 at distance 0.47
• (Gene 1, Gene 2) joins Gene 4 at distance 0.46
• (Gene 1, Gene 2, Gene 4) joins Gene 3 at distance 1.62
(these distances are checked numerically in the sketch following Section 4.1.4 below). This may result in ill-formed dendrograms. For an example, see the Java TreeView manual. A solution is to use the Euclidean or the city-block distance, or to use one of the other hierarchical clustering routines, which don't suffer from this issue regardless of the distance measure being used.

4.1.2 Single Linkage Clustering

In single linkage clustering, the distance between two items x and y is the minimum of all pairwise distances between items contained in x and y. Unlike centroid linkage clustering, in single linkage clustering no further distances need to be calculated once the distance matrix is known. As of version 1.29 of Cluster 3.0, the implementation of single linkage clustering is based on the SLINK algorithm (see Sibson, 1973). Whereas this algorithm yields the exact same clustering result as conventional single-linkage hierarchical clustering, it is much faster and more memory-efficient: its memory requirements are linear in the number of items, compared to quadratic for the conventional algorithm. Hence, single-linkage hierarchical clustering can be used to cluster large gene expression data sets, for which centroid-, complete-, and average-linkage fail due to lack of memory.

4.1.3 Complete Linkage Clustering

In complete linkage clustering, the distance between two items x and y is the maximum of all pairwise distances between items contained in x and y. As in single linkage clustering, no other distances need to be calculated once the distance matrix is known.

4.1.4 Average Linkage Clustering

In average linkage clustering, the distance between two items x and y is the mean of all pairwise distances between items contained in x and y.
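The decreasing merge distances in the centroid linkage example of Section 4.1.1 are easy to check numerically. The following sketch (illustrative, not Cluster's code) forms centroids by averaging the raw data vectors and uses the Pearson distance 1 − r:

    import numpy as np

    def pearson_distance(x, y):
        # Pearson distance 1 - r, with population statistics (1/n).
        x, y = np.asarray(x, float), np.asarray(y, float)
        r = np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())
        return 1.0 - r

    gene1 = [0.96, 0.07, 0.97, 0.98]
    gene2 = [0.50, 0.28, 0.29, 0.77]
    gene3 = [0.08, 0.96, 0.51, 0.51]
    gene4 = [0.14, 0.19, 0.41, 0.51]

    c12 = np.mean([gene1, gene2], axis=0)          # centroid of genes 1 and 2
    c124 = np.mean([gene1, gene2, gene4], axis=0)  # centroid after joining gene 4

    print(round(pearson_distance(gene1, gene2), 2))  # 0.47
    print(round(pearson_distance(c12, gene4), 2))    # 0.46, smaller than 0.47
    print(round(pearson_distance(c124, gene3), 2))   # 1.62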
4.1.5 Weighting

By default, all of the observations for a given item are treated equally. In some cases you may want to give some observations more weight than others. For example, if you have duplicate copies of a gene on your array, you might want to downweight each individual copy when computing distances between arrays. You can specify weights using the ‘GWEIGHT’ (gene weight) and ‘EWEIGHT’ (experiment weight) parameters in the input file. By default all weights are set to 1.0. Thus, the actual formula, with weights included, for the Pearson correlation of x = {x1, x2, . . . , xn} and y = {y1, y2, . . . , yn} with observation weights w = {w1, w2, . . . , wn} is

    r = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \left( \frac{x_i - \bar{x}}{\sigma_x} \right) \left( \frac{y_i - \bar{y}}{\sigma_y} \right).

Note that when you are clustering rows (genes), you are using column (array) weights. It is also possible to have Cluster compute weights, based on a function that is not entirely well understood. If you want to compute weights for clustering genes, select the check box in the Genes panel of the Hierarchical Clustering tab. This will expose a Weight Options dialog box in the Arrays panel (I realize this placement is a bit counterintuitive, but it makes sense as you will see below). The idea behind the Calculate Weights option is to weight each row (the same idea applies to columns as well) based on the local density of row vectors in its vicinity, with a high density vicinity resulting in a low weight and a low density vicinity resulting in a higher weight. This is implemented by assigning a local density score L(i) to each row i:

    L(i) = \sum_{j \text{ with } d(i,j)