PV251 Visualization
Autumn 2024
Study material

Lecture 3: Data preprocessing

In this lecture, we will present different approaches to data preprocessing that prepare the input dataset for the subsequent visualization (e.g., by handling erroneous or missing data).

It is generally preferable to display the original data without any modification. The main reason is the fear of losing important information or of adding undesirable artifacts; medical visualization is a typical example. Viewing the raw data often reveals problem locations in data logs, such as missing data or significantly deviating records that may indicate faulty calculations. On the other hand, for some types of data certain preprocessing methods are necessary.

Data preprocessing techniques:
1. Metadata and statistics
2. Missing values and data "cleaning"
3. Normalization
4. Segmentation
5. Sampling and interpolation
6. Dimension reduction
7. Data aggregation
8. Smoothing and filtration
9. Conversion of raster data to vectors

1. Metadata and statistics

Information about data sets - metadata - and statistical analysis can provide invaluable input for data preprocessing. Metadata provides information that can help in interpreting the data (for example, the format of individual records). In addition, it may contain the reference point from which the data values are measured, the unit of measurement, the symbol or number used to indicate a missing value, and the resolution at which the data was acquired. Such information can be important for selecting the appropriate preprocessing method and setting its parameters.

Different methods of statistical analysis can provide a useful preview of the data:
• Outlier detection can identify records with erroneous data fields.
• Cluster analysis can help segment the data into groups according to their similarity.
• Correlation analysis makes it possible to remove redundant fields or to highlight associations between dimensions that may not be obvious at first sight.

2. Missing values and data "cleaning"

A frequent phenomenon in the analysis and visualization of real data sets is the absence of some data records or the presence of erroneous values. The reason for missing data can be, for example, an error in a data sensor or a blank check box in a questionnaire. Erroneous data is most often caused by the human factor and is difficult to detect. In both cases, however, when analyzing such data it is necessary to choose a strategy for dealing with these defects. Some of these strategies, especially those related to visualization, are discussed below; a short code sketch illustrating the first three follows the list.

1) Removing bad records. This seemingly drastic method of removing all records with missing or bad fields is in fact one of the most common. Of course, the quality of the displayed information becomes questionable. It can lead to a significant loss of information, because it is reported that in some domains up to 90% of records contain at least one bad field. Moreover, the records with missing data may be among the most interesting ones.

2) Assigning a defined value. For each variable in the dataset we define a constant (sentinel) value that is assigned whenever a disputed value is found in a record. For example, for a variable with a range of values between 0 and 100, we can choose -5 as the value marking a missing entry. When visualizing such data, we then see at a glance which records were problematic. Obviously, these artificially added values must be omitted from any subsequent statistical analysis.

3) Assigning the average value. A simple strategy for dealing with missing data is to replace it with the average value of the given variable or dimension. The advantage of this approach is that it minimizes the impact on the overall statistics computed for that variable. The disadvantage is that the selected value may not always be "correct". Another drawback is that we lose potentially interesting places in the data.

4) Assigning a value derived from the nearest neighbor. A better approximation is to find the most similar record. Similarity is based on the analysis of differences in all other variables. The principle is as follows: consider record A, which is missing the value of variable i. Take the record B that is closer to A than all other records in the dataset, taking into account all variables except i. We then copy the value of variable i from record B into record A. The problem with this solution is that variable i may depend only on a small subset of the other variables, so the "best neighbor" strategy may not always find the best substitute.

5) Imputation. A much more complex process, with long-term research behind it. The process of computing substitutes for missing or erroneous data is known as imputation (substitution).
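As an illustration, the following minimal Python (pandas) sketch shows strategies 1-3 on a small hypothetical table; the column names, the values and the sentinel -5 are illustrative assumptions, not part of the lecture material.

import pandas as pd
import numpy as np

# Hypothetical table with one missing value and one out-of-range entry.
df = pd.DataFrame({
    "height": [172.0, 180.0, np.nan, 168.0],
    "score":  [55.0, 240.0, 71.0, 63.0],   # valid range assumed to be 0-100
})

# Treat values outside the assumed valid range as missing as well.
df.loc[~df["score"].between(0, 100), "score"] = np.nan

# Strategy 1: remove every record that contains at least one bad field.
cleaned = df.dropna()

# Strategy 2: assign a defined (sentinel) value, e.g. -5, so that the
# problematic records stand out in the visualization.
flagged = df.fillna(-5)

# Strategy 3: replace missing values with the average of the variable.
imputed = df.fillna(df.mean())

print(cleaned, flagged, imputed, sep="\n\n")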
3. Normalization

Normalization is the process of transforming an input data set so that the transformed data satisfies predetermined statistical properties. A typical simple example of such a transformation is rescaling the range of the input values to the interval [0.0, 1.0]. Other forms of normalization transform the data so that each variable (dimension) has a given mean and standard deviation. Normalization is a very useful operation because it makes it possible to compare seemingly incomparable variables.

It is also very important in the visualization process: individual graphical attributes can only take certain values, so the input data has to be mapped to these attributes. This leads to converting the data range to the range of the graphical attributes. For example, if d_min and d_max are the minimum and maximum values of a variable, we can normalize all values into the range from 0.0 to 1.0 using the formula:

d_normalized = (d_original - d_min) / (d_max - d_min)

In some cases, we set the scale and offset so that the data matches intuitive minimum and maximum values, which simplifies its interpretation. An example is a dataset whose values fall between 40 and 90 percent; in this case it is more intuitive to scale the data to the 0-100% range. Normalization can also include clipping (trimming) at threshold values, where values beyond a threshold are clamped to it (they are assigned the threshold value). This is advantageous, for example, when mapping such data to graphical attributes in visualization. A short sketch of this transformation is shown below.
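A minimal NumPy sketch of min-max normalization with optional clipping, as described above; the sample values and the 0-100% range are illustrative assumptions.

import numpy as np

def normalize(values, lo=None, hi=None):
    """Min-max normalize values to [0.0, 1.0].

    If lo/hi are given, the values are first clipped (clamped) to that
    range and the range itself is used for scaling, so that outliers do
    not stretch the scale of the graphical attribute."""
    values = np.asarray(values, dtype=float)
    d_min = np.min(values) if lo is None else lo
    d_max = np.max(values) if hi is None else hi
    clipped = np.clip(values, d_min, d_max)
    return (clipped - d_min) / (d_max - d_min)

# Illustrative data: percentages that happen to fall between 40 and 90.
data = np.array([42.0, 55.0, 61.0, 89.5, 73.0])

print(normalize(data))               # scaled by the observed minimum/maximum
print(normalize(data, 0.0, 100.0))   # scaled to the intuitive 0-100% range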
4. Segmentation

In many cases, data may be divided into contiguous regions, where each region corresponds to a certain classification of the data. For example, magnetic resonance data originally has 256 possible values per data point, and each point can then be segmented into one of a set of categories, such as bone, muscle, fat, skin, etc. Simple segmentation can be done by directly mapping disjoint ranges of the data values to these categories. In most cases, however, such a mapping is ambiguous. It is then necessary to take into account the classification of adjacent points, which improves the quality of the overall classification.

Another possibility is probabilistic segmentation, where each point is assigned a probability of belonging to each category.

Typical segmentation problems are so-called under-segmentation (large regions that lump together areas that should be separate) and over-segmentation (a large number of small regions that actually belong together). The solution is iterative repetition of the split-and-merge segmentation process.

Split-and-merge
▪ similarThresh = defines the similarity of two regions with given characteristics
▪ homogeneousThresh = defines the region homogeneity (uniformity)

do {
  changeCount = 0;
  for each region {
    compare the region with its neighbors and find the most similar one;
    if the most similar one is within similarThresh of the current region {
      merge these two regions;
      changeCount++;
    }
    evaluate the homogeneity of the region;
    if the homogeneity of the region is smaller than homogeneousThresh {
      split the region into two parts;
      changeCount++;
    }
  }
} until changeCount == 0

The iterative split-and-merge algorithm works as follows. At the beginning, two thresholds are defined: similarThresh, defining the similarity of two regions, and homogeneousThresh, which defines the homogeneity within a region. The algorithm then iterates over each region in the input dataset. The region is compared with its neighboring regions and, according to the predefined similarity measure, the most similar one is found. If this neighbor lies within the tolerance defined by the similarThresh threshold, the two regions are joined together; in this way, regions are merged. The second operation that can occur is the splitting of a region that is not homogeneous enough (as defined by the homogeneousThresh threshold). Every change (merge or split) is recorded by incrementing the changeCount variable. If no change occurs during an iteration, the variable stays at zero and the algorithm ends.

The most difficult parts of the algorithm are the following (a short code sketch of the first two tests is given at the end of this section):

1) Determining the similarity of two regions. The easiest method is to compare the average values of the two regions.

2) Evaluating region homogeneity. One solution is to evaluate the histogram of values within the region and determine whether it is unimodal or multimodal.

3) Splitting a region. A typical algorithm creates two (or more) subregions starting at the points where the most different values are found. These points are then marked as seeds, and a region-growing (seed-fill) process is started in which the regions "grow" until every point of the input set is assigned to some region.

The algorithm may be prone to infinite loops, where the same region is repeatedly split and merged again. A simple remedy is to change the similarity or homogeneity threshold. More sophisticated algorithms take into account other region properties, such as smoothness of the boundaries or region size and shape.
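As a small sketch of the first two tests, the following Python functions implement region similarity as the difference of mean values and homogeneity as a crude unimodality check of the region's histogram (the peak-ratio heuristic plays the role of homogeneousThresh here). The thresholds and the generated regions are illustrative assumptions, not the lecture's prescribed method.

import numpy as np

def regions_similar(region_a, region_b, similar_thresh):
    """Similarity test: compare the average values of the two regions."""
    return abs(np.mean(region_a) - np.mean(region_b)) <= similar_thresh

def region_homogeneous(region, peak_ratio=0.5, bins=8):
    """Crude homogeneity test based on the region's histogram.

    The region is treated as homogeneous when its histogram looks unimodal:
    there is no second local peak comparable in height to the highest one.
    This is only one possible heuristic."""
    hist, _ = np.histogram(region, bins=bins)
    peaks = sorted(
        (hist[i] for i in range(1, len(hist) - 1)
         if hist[i] > hist[i - 1] and hist[i] > hist[i + 1]),
        reverse=True,
    )
    return len(peaks) < 2 or peaks[1] / peaks[0] < peak_ratio

# Illustrative regions (values of the points inside each region).
rng = np.random.default_rng(0)
a = rng.normal(10.0, 1.0, 2000)                       # one compact population
b = rng.normal(10.5, 1.0, 2000)                       # similar to region a
c = np.concatenate([rng.normal(5.0, 1.0, 1000),
                    rng.normal(15.0, 1.0, 1000)])     # two mixed populations

print(regions_similar(a, b, similar_thresh=1.0))  # True  -> candidates for merging
print(region_homogeneous(a))                      # True  -> keep as one region
print(region_homogeneous(c))                      # False -> candidate for splitting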
5. Sampling and interpolation

Often, it is necessary to transform a data set defined on a certain spatial layout into another data set with a different resolution. For example, consider an image that we want to reduce or enlarge. Another example is an image where only a few point samples are known and we want to compute values for the points lying between these samples. In both cases, we assume that the input data represents discrete samples of a continuous space, so we can estimate values at other points by examining the closest known points. Computing such values is referred to as interpolation. It is a commonly used method in many areas, including visualization.

The most common interpolation techniques are the following:
• linear
• bilinear
• non-linear

Linear interpolation

Suppose we know the value of a variable d at two places, A and B. Using linear interpolation, we estimate the value of this variable at a point C lying between A and B. First, we calculate the relative position (percentage) of C between A and B. This is then combined with the magnitude of the change of the variable between A and B to obtain the value of d at C. Assuming that all three points lie on the x-axis, the following equation applies:

(x_C - x_A) / (x_B - x_A) = (d_C - d_A) / (d_B - d_A)

or

d_C = d_A + (d_B - d_A) * (x_C - x_A) / (x_B - x_A)

This is similar to the normalization we discussed earlier. To remove the dependence on the x-axis, we can use parametric equations that define the change of position and the change of value between A and B. From them we compute the value of the parameter for the point C, and this number is then used to calculate the value at C. The parametric form of the equations is:

P(t) = P_A + V*t, where V = P_B - P_A

Note that we have never specified the number of dimensions of the space in which we are working; these calculations work in a space of any dimension. Substituting P_C into the left-hand side of the equation, we can solve for t and then use it to compute the change of value:

d(t) = d_A + U*t, where U = d_B - d_A

Bilinear interpolation

We can extend the previous concept to two or more dimensions by repeating the procedure for each dimension. For example, a common task in 2D space is to calculate the value of d at a position (x, y) in a uniform grid (the point spacing is uniform along both axes), which is the case for 2D images. If the position (x, y) for which we seek a value coincides with a grid point, then the sought value is exactly the value of that grid point. However, if the position lies between grid points, the value must be calculated. To do this, we find the four adjacent grid points that surround the point (x, y). Assuming that the grid point positions are represented by integer values, that the distance between neighboring points is 1.0, and that (x, y) has fractional parts in both coordinates, the cell containing the point (x, y) is formed by the grid points (i, j), (i + 1, j), (i, j + 1) and (i + 1, j + 1), where i is the largest integer smaller than x and j is the largest integer smaller than y.

Now we interpolate within this cell: first horizontally, then vertically. Using the linear interpolation described above, we calculate the fractional position of x between i and i + 1; let us denote this value s. With it, we can compute the values of d at positions (x, j) and (x, j + 1) from the values at the four corner points. Similarly, we calculate the fractional position of y between j and j + 1 and denote it t. Finally, we calculate the value at position (x, y) by interpolating between the two values obtained from the horizontal interpolation, using t:

d_x,y = d_x,j + t * (d_x,j+1 - d_x,j)

d_x,y is thus the weighted average of the four corner values with respect to the position of the point (x, y). A short code sketch of this procedure follows.
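A minimal NumPy sketch of the linear and bilinear interpolation described above; the 2x2 grid of values is an illustrative assumption.

import numpy as np

def lerp(d_a, d_b, frac):
    """Linear interpolation: d_C = d_A + (d_B - d_A) * frac."""
    return d_a + (d_b - d_a) * frac

def bilinear(grid, x, y):
    """Bilinear interpolation of the value at (x, y) in a uniform grid.

    grid[j][i] holds the value at integer position (i, j); the spacing is
    1.0 and (x, y) is assumed to lie strictly inside the grid."""
    i, j = int(np.floor(x)), int(np.floor(y))
    s, t = x - i, y - j                                     # fractional positions
    d_xj  = lerp(grid[j][i],     grid[j][i + 1],     s)     # horizontal pass, row j
    d_xj1 = lerp(grid[j + 1][i], grid[j + 1][i + 1], s)     # horizontal pass, row j + 1
    return lerp(d_xj, d_xj1, t)                             # vertical pass

# Illustrative 2x2 grid of corner values.
grid = np.array([[0.0, 10.0],
                 [20.0, 30.0]])

print(bilinear(grid, 0.5, 0.5))    # -> 15.0, the average of the four corners
print(bilinear(grid, 0.25, 0.75))  # -> 17.5, weighted toward row j + 1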
Non-linear interpolation

One of the main problems of linear interpolation is that, while value changes within a cell are smooth, the rate of change on the opposite sides of a grid point can differ significantly. In fact, the continuity at the grid points is only C0.

This can be improved by using a different type of interpolation, namely higher-order polynomials. These include quadratic and cubic curves - splines. Their main purpose is smooth interpolation through the given points, using the control points of the curve and a blending function. The slides show an example of a smooth connection of quadratic and cubic Bezier curves.

As a reminder:
• Continuity C0 = the endpoint of the first segment is the starting point of the second segment.
• Continuity C1 = the tangent vector at the endpoint of a segment is equal to the tangent vector at the starting point of the next segment.
• Continuity C2 = equality of both the first and second derivative vectors is required.
• etc.

Catmull-Rom spline

Let us have 4 control points (p0, p1, p2, p3). Then the cubic Catmull-Rom spline passing through these points can be defined as:

q(t) = 0.5 * ((2*p1) + (-p0 + p2)*t + (2*p0 - 5*p1 + 4*p2 - p3)*t^2 + (-p0 + 3*p1 - 3*p2 + p3)*t^3)

The effect of interpolation using Catmull-Rom curves can be shown on the following example. The task is to magnify an original 24x24-pixel image; the result compares a cubic B-spline filter with Catmull-Rom interpolation.

[Figure: original image (24x24 pixels), enlargement with a cubic B-spline filter, enlargement with Catmull-Rom interpolation]

Resampling methods

Depending on the required data density, two operations are possible: reducing the size of the input data (for example, shrinking an input image) or replicating the data to enlarge the input data set (as in the case shown on the slide). A short sketch of simple subsampling strategies follows this section.

Subsampling - the first case - can have a very simple solution: for example, we keep only every n-th record of the input dataset. However, it is clear that some important aspects of the input data may be lost in this way. As an example, suppose we keep only every fourth point of a map; there is no guarantee that important information (e.g., a path) is not contained in the three removed points. Better approaches include averaging over a neighborhood, median selection, or random selection within a subregion defined around the sampled position.

Another frequently used resampling method is "data subsetting". This method is mainly used for processing large datasets, where visualizing the whole set can lead to unpleasant visual artifacts (or, e.g., to memory overload). The user can specify a set of queries that are used to filter the input set. An example is a filter that returns only the data acquired within a certain time period; the threshold is thus set on a particular attribute. Subsetting can be used not only during the preprocessing phase but also during the visualization itself, when the user interacts with the displayed data (highlighting, brushing/selection, etc.). Interactive subsetting is generally more effective than query-based subsetting because the user can directly control and decide on the filtering of the input data. On the other hand, the advantage of the query-based approach is that we do not have to load the entire input set into the program.
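A minimal NumPy sketch contrasting two of the subsampling strategies mentioned above - keeping every n-th sample versus averaging each block of n samples; the signal with a single spike is an illustrative assumption.

import numpy as np

def every_nth(values, n):
    """Naive subsampling: keep only every n-th record."""
    return values[::n]

def block_average(values, n):
    """Subsampling by neighborhood averaging: average each block of n values."""
    usable = len(values) - len(values) % n      # drop an incomplete tail block
    return values[:usable].reshape(-1, n).mean(axis=1)

# Illustrative signal: mostly flat, with one narrow but important spike.
signal = np.zeros(16)
signal[5] = 10.0

print(every_nth(signal, 4))      # the spike is lost entirely
print(block_average(signal, 4))  # the spike still leaves a trace in its block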
6. Dimension reduction

If the dimensionality of the input data exceeds the capabilities of the visualization technique, it is necessary to find a way to reduce this dimensionality while keeping as much information of the input set as possible. Approaches to this problem can be divided into two groups:

1) Manual approach, where the user selects the dimensions that are indispensable for the given task.

2) Automatic computational techniques such as PCA (principal component analysis), MDS (multidimensional scaling) and SOMs (Kohonen self-organizing maps). All of these methods are capable of preserving most of the essential properties of the input set, such as clusters, patterns, or contours. However, their output strongly depends on the input configuration and the calculation parameters, so different settings can produce different results.

The figure on the slides shows the result of reducing a four-dimensional data set to 2D space using the PCA method. Individual records are shown using a pictogram representing the 4 variables (one for each dimension); the magnitude of each variable is given by the length of the line segment from the center of the pictogram to the corresponding vertex.

PCA (Principal Component Analysis)

PCA creates new attributes that are linear combinations of the original data variables. These new attributes define a subspace of variables that minimizes the average error of the lost information. PCA has the following steps (a short code sketch of these steps is given after the PCA example below):

1) Assume the input data has m dimensions/attributes. From each item of each record, the average of the corresponding dimension is subtracted. The result is a dataset with zero mean.
2) We calculate the covariance matrix (a square matrix describing the mutual dependence of a set of random variables).
3) We calculate the eigenvectors and the eigenvalues of this covariance matrix.
4) We sort the eigenvectors by their eigenvalues, from the largest to the smallest.
5) We select the first m_r eigenvectors, where m_r is the number of dimensions to which we want to reduce the input data.
6) We create a matrix of these eigenvectors, where the vectors form the matrix rows and the first eigenvector represents the first row of this matrix.
7) For each data record we create a vector of its values, transpose it and multiply it with this matrix. The result is the transformed record - each record is now converted into the reduced space.

PCA more intuitively:

1) We select a line in the space visualizing the n-dimensional data. This line is the one that best fits the input data items (covers most of their variation) and is called the first principal component (PC).
2) We select a second line perpendicular to the first PC; this forms the second PC.
3) We repeat this until we process all dimensions or until we reach the desired number of principal components.

Here is an example of PCA usage. Figure 1 shows a hypothetical example of the results of measurements on three different animal species X1, X2 and X3. In this representation, it is difficult to see that X1 and X2 are in some relationship, and it is almost impossible to evaluate any relationship between X3 and the remaining two species. In the first phase (Figure 2), we transform the data set by subtracting the mean and dividing by the standard deviation; this is called standardization. The center of gravity of the entire dataset is thereby moved to zero. The standardized axes are denoted S1, S2 and S3. The relative positions of the individual data points remain the same. From this representation, it is already possible to observe a gradient going from the lower front corner to the upper rear one, and it can be seen that along this gradient the values for species X1 and X2 increase.

The PCA method selects the first PCA line (axis) as the line passing through the center of gravity that at the same time minimizes the sum of squared distances of the points to that line. In other words, this line is, overall, as close to all of the data as a line can be. The second PCA axis must also pass through the center of gravity and meet the same condition, subject to a constraint on its position relative to the first PCA axis (for example, it must be perpendicular to it). If we now rotate the coordinate system defined by PCA1 and PCA2 so that PCA1 corresponds to the x-axis and PCA2 to the y-axis, we obtain the diagram in Figure 4.
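A minimal NumPy sketch of the PCA steps listed above (centering, covariance, eigendecomposition, projection); the random 4-dimensional input is an illustrative assumption.

import numpy as np

def pca_reduce(data, m_r):
    """Project data (records in rows, m attributes in columns) to m_r dimensions."""
    # 1) Subtract the mean of each dimension -> zero-mean dataset.
    centered = data - data.mean(axis=0)
    # 2) Covariance matrix of the attributes.
    cov = np.cov(centered, rowvar=False)
    # 3) Eigenvectors and eigenvalues (eigh, since the covariance matrix is symmetric).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4) Sort the eigenvectors by eigenvalue, largest first.
    order = np.argsort(eigvals)[::-1]
    # 5) + 6) Matrix whose rows are the first m_r eigenvectors.
    components = eigvecs[:, order[:m_r]].T
    # 7) Multiply each (centered) record by this matrix -> record in the reduced space.
    return centered @ components.T

# Illustrative 4-dimensional data set with 100 records.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 4))
reduced = pca_reduce(data, 2)
print(reduced.shape)   # (100, 2)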
MDS (Multidimensional Scaling)

Another method for reducing the dimension of the input data is multidimensional scaling (MDS), which is commonly implemented using a gradient-descent approach. MDS tries to find a representation of the input data in a lower dimension that best preserves the distances between the individual data points of the original set. In other words, for each pair of points (i, j) the goal is to minimize the difference between d_i,j (the distance between i and j in the original n-dimensional space) and delta_i,j (the distance between these points in the reduced space). This difference in distances is referred to as stress.

The algorithm works as follows (a simplified code sketch is given at the end of this section):

1) We calculate the distances between all pairs of data points in the original space. For n input points, this step requires n(n - 1)/2 operations.
2) We place all input data points into the reduced-dimension space (often randomly).
3) We calculate the stress, i.e., the differences between point distances in the original and in the reduced space. This can be done using different measures.
4) If the average or cumulative stress value is smaller than a user-defined threshold, the algorithm ends and returns the result.
5) If the stress value is higher than the threshold, for each point we calculate a direction vector pointing in the direction the point should be shifted in order to reduce the stress between this point and the other points. It is determined as the weighted average of the vectors between this point and its neighbors, where the weights are derived from the stress values calculated for the individual pairs. A positive stress value repels the points, a negative one attracts them; the higher the absolute value of the stress, the bigger the movement of the point.
6) Based on these calculations, we move the data points in the reduced-dimension space according to the computed vectors, and return to step 3 of the algorithm.

Potential problems of this algorithm are infinite loops or getting stuck in a local optimum. This can be avoided by repeating the algorithm with different input parameters or by allowing a "non-standard" movement in a direction other than the calculated direction vector.

Many visualization and statistical graphics packages use both methods - PCA and MDS. Some even use a combination of both, where PCA is first used to calculate the initial point positions to which the MDS method is then applied. This greatly reduces the number of iterations needed to reach the lowest-stress configuration.

A practical example of using the MDS algorithm is reorganizing a photo album by similarity - in this case, similarity defined by color. The image on the slides is the output of Yorg (http://lear.inrialpes.fr/src/yorg/doc/index.html).
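Below is a very simplified NumPy sketch in the spirit of steps 1-6 above: points in the reduced space are moved along the average of pairwise direction vectors weighted by the distance mismatch (stress). The step size, stopping rule and random initialization are illustrative assumptions, not the exact formulation from the lecture.

import numpy as np

def simple_mds(data, out_dim=2, step=0.05, iters=500, tol=1e-3, seed=0):
    """Iteratively move points in the reduced space so that their pairwise
    distances approach the distances in the original space."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # 1) Pairwise distances in the original space.
    d_orig = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    # 2) Random initial positions in the reduced space.
    pos = rng.normal(size=(n, out_dim))
    for _ in range(iters):
        diff = pos[:, None, :] - pos[None, :, :]    # vectors from point j to point i
        d_red = np.linalg.norm(diff, axis=-1)
        np.fill_diagonal(d_red, 1.0)                # avoid division by zero
        # 3) Stress: positive -> the points are too close in the reduced space
        #    and should be pushed apart; negative -> they should be pulled together.
        stress = d_orig - d_red
        np.fill_diagonal(stress, 0.0)
        # 4) Stop when the average stress falls below the threshold.
        if np.mean(np.abs(stress)) < tol:
            break
        # 5) Weighted average of the pairwise direction vectors for each point.
        move = (stress[:, :, None] * diff / d_red[:, :, None]).mean(axis=1)
        # 6) Shift the points and repeat from step 3.
        pos += step * move
    return pos

# Illustrative 5-dimensional data set with 50 records.
rng = np.random.default_rng(1)
data = rng.normal(size=(50, 5))
embedding = simple_mds(data)
print(embedding.shape)   # (50, 2)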
Mapping of nominal dimensions to numbers

There are many areas where one or more dimensions of a dataset contain nominal values. To process such data, one can choose from several strategies, depending on the characteristics of the data: the number of such dimensions, how many distinct values they can take, and whether the values can be ordered. The key is to find a mapping of the data to graphical entities or attributes that does not imply relationships which are not present in the original data.

Take the example of a car database. A data set containing information about individual cars contains the nominal values "manufacturer" and "car model". How should we map these values to a chart? One possibility is to assign an integer to each nominal value - alphabetically, for example. This may, however, suggest non-existent relationships: Honda would appear closer to Ford than to Toyota in the chart, although no such relationship exists in reality.

If there is a single nominal variable in the input dataset, there are several techniques we can use. The simplest is to use this variable as the label of the displayed graphical element. This method is suitable for small datasets; as the dataset grows, it quickly becomes unusable. Modern techniques are able to localize this labelling, which means, for example, displaying labels only for items near the cursor and the like.

When searching for a method of mapping a nominal variable to a number, we can examine the similarity of the numerical variables associated with the nominal values across the data records. Records with similar properties should then have their nominal values mapped to similar numerical values. If we have similarities between all pairs of data records, we can use, for example, MDS and map the nominal values to positions in one dimension. This is a simplified form of a technique called correspondence analysis, which is used in statistics. It can also be used in cases where all dimensions of the input data set are nominal; the technique is then called multiple correspondence analysis.

7. Data aggregation

When processing large input data sets, it is very useful to aggregate the data into clusters according to the similarity of their values and/or positions. These groups are then represented by a much smaller amount of data. This can be achieved by simple averaging of values, but the new records may also include more descriptive data, such as the number of members of each group or the range of their positions or values. This process is called aggregation (clustering). An aggregation method therefore has two components: the method of grouping the points and the method of displaying the resulting groups.

Clustering can be done in a variety of ways, such as merging neighboring points, space partitioning, or iterative split-and-merge methods. An important aspect of all methods is the calculation of the distance between data points and of the quality of the clustering, including a definition of the separation of individual clusters. When visualizing clusters, it is necessary to provide the user with enough information to decide whether or not a particular cluster needs to be examined in detail. Displaying just one representative of a cluster is usually not enough to understand the variability of the data inside it.

8. Smoothing and filtration

Smoothing of input data is a common operation in signal processing. The goal is to reduce noise and blur sharp discontinuities. Typically, smoothing is performed using a so-called convolution. For our purposes, it is sufficient to view the convolution as a weighted average of the neighbors surrounding a given point. In one-dimensional space, the convolution can be applied, for example, using the following formula, where each p_i corresponds to a processed point:

p'_i = (p_{i-1} + 2*p_i + p_{i+1}) / 4

After applying this operation, a point whose original value differs significantly from its neighbors is replaced by a value much more similar to those of its neighbors. Changing the weights or the shape and size of the neighborhood changes the resulting image. A short code sketch of this smoothing is shown below.
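A minimal NumPy sketch of the 1D smoothing convolution above, using the (1, 2, 1)/4 kernel; the noisy signal is an illustrative assumption.

import numpy as np

def smooth(signal):
    """Apply the weighted average p'_i = (p_{i-1} + 2*p_i + p_{i+1}) / 4."""
    kernel = np.array([1.0, 2.0, 1.0]) / 4.0
    # mode="same" keeps the original length; the border points are padded with zeros.
    return np.convolve(signal, kernel, mode="same")

# Illustrative signal with one sharp spike.
signal = np.array([1.0, 1.2, 0.9, 5.0, 1.1, 1.0, 0.8])
print(smooth(signal))   # the spike at index 3 is blurred into its neighbors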
9. Conversion of raster data to vectors

In computer graphics, objects are typically represented by sets of polygons formed by vertices and edges. The usual task is to create a raster representation of such objects at the pixel level, define their surface properties, and define their interaction with light and other objects. In certain cases, however, it is advantageous to go the other way and extract linear structures from a raster data set (e.g., from an image). The reasons for such a conversion can be the following:

- Content compression, e.g., for transmission. A list of vertices and edges is almost always a more compact representation than a raster one.
- Comparing the content of two or more images. It is easier to compare higher-order attributes than individual pixels.
- Data transformation. Affine transformations, such as rotation and scaling, are easier to apply to a vector representation than to a raster one.
- Data segmentation. Isolating regions by highlighting their borders is an effective tool for interactive evaluation and modeling.

Image processing and computer vision are areas where a number of techniques have been developed for converting raster images into vector form. Let us name some of them:

• Thresholding. The principle is the definition of one or more limit values by which the input set is divided into several regions. Once the borders are determined, individual edges and vertices can be generated. The threshold values can be defined by the user or calculated from an analysis of the image histogram. Adaptive thresholding allows different thresholds to be applied to different areas of the image.

• Region growing. We start from so-called seeds in the image, which are defined by the user or computed by scanning the data. Using seed filling, we merge pixels into clusters according to their similarity. The main problem is finding a suitable definition of pixel similarity.

• Boundary detection. This is done by convolving the input image with an appropriate matrix - the convolution kernel. Each pixel and its neighbors are multiplied by the corresponding values of the convolution kernel, according to their position relative to the processed pixel at the kernel's center. These products are then summed and the resulting value is assigned to the processed pixel. Different forms of the convolution kernel allow for different image transformations. For boundary detection, we use a kernel that highlights horizontal, vertical, or diagonal boundary lines and, conversely, suppresses pixels whose values are very similar to those of their neighbors.

• Thinning. Again a convolution-based process; the goal is to reduce wide linear structures, such as arteries, to lines one pixel wide. The resulting pixels form the central (medial) axis of the thinned regions.

All the techniques presented in this lecture serve to increase the effectiveness of data visualization and may lead to the discovery of new facts hidden in the data. However, it should be noted that the target user should be informed that the data has been modified in this way. Understanding the transformations that have been applied to the data can significantly help in interpreting it.