VizRank Method Details

Here we describe the details of the VizRank method.

Introduction

VizRank is an approach that can be applied to class-labeled data to automatically find the most useful data projections, that is, projections in which clusters with different class values are well separated and non-overlapping. It can be used with any visualization method that maps attribute values to points in a two-dimensional space. The interestingness of a specific projection is estimated by inducing a k-nearest neighbor classifier and evaluating its classification accuracy on a data set that consists of the x and y positions of the projected points and their class information. This score provides a good estimate of projection usefulness: projections with well-separated classes will be associated with high classification accuracy, while projections with overlapping classes will score lower. VizRank can assess the usefulness of all possible projections and present to the user a short list of the best-scored projections, which will hopefully reveal biologically relevant patterns.
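
In outline, VizRank amounts to a simple loop over candidate projections. The sketch below is illustrative only: project_to_2d and projection_score are hypothetical stand-ins for the projection and scoring steps detailed in the sections that follow.

    def vizrank(candidate_projections, data, top=20):
        # Score every candidate projection and report the best ones.
        scored = []
        for projection in candidate_projections:
            # project_to_2d is a hypothetical placeholder: it maps each
            # example to an (x, y) position under the given projection.
            points, labels = project_to_2d(data, projection)
            scored.append((projection_score(points, labels), projection))
        scored.sort(key=lambda item: item[0], reverse=True)  # highest score first
        return scored[:top]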

k-Nearest Neighbor Algorithm

The k-nearest neighbor algorithm (k-NN) is a well-known machine learning method that predicts the class value of a new, unlabeled example by analyzing its k neighboring examples. Each of the k neighbors votes for its class value, and the votes are weighted according to the neighbor's distance from the example. The result of the voting is a probabilistic class prediction, and the example is usually labeled with the most probable class value. To measure the distance between examples we used the Euclidean metric. As for the parameter k, the number of neighbors used in class prediction, we want k to be large enough to obtain reliable predictions, yet small enough that only nearby examples take part in the vote. In our experiments we used k = sqrt(N), where N is the number of instances in the data set, which proved to be a generally useful rule of thumb.

The figure above demonstrates how a 5-NN classifier would compute class probabilities for the example marked with a question mark ("?"). Since four of its nearest neighbors belong to the red class and only one to the blue, the classifier would predict that the example belongs to the red class with a probability of 80% (if the neighbors are not weighted according to their distance from the example; otherwise the value would be slightly different).
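
To make the voting scheme concrete, here is a minimal Python sketch of probabilistic k-NN prediction with the Euclidean metric. It is an illustration of the scheme described above, not the actual VizRank implementation; the inverse-distance weighting is one common choice.

    import numpy as np

    def knn_class_probabilities(points, labels, query, k, weighted=True):
        # Euclidean distances from the query example to all labeled examples.
        dists = np.linalg.norm(np.asarray(points, float) - np.asarray(query, float), axis=1)
        nearest = np.argsort(dists)[:k]              # indices of the k nearest neighbors
        if weighted:
            votes = 1.0 / (dists[nearest] + 1e-12)   # closer neighbors vote more strongly
        else:
            votes = np.ones(len(nearest))            # plain majority voting
        probs = {}
        for idx, vote in zip(nearest, votes):
            probs[labels[idx]] = probs.get(labels[idx], 0.0) + vote
        total = sum(probs.values())
        return {c: v / total for c, v in probs.items()}

With four red and one blue example among the five nearest neighbors and weighted=False, the function returns {'red': 0.8, 'blue': 0.2}, matching the 80% figure in the example above.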

Evaluating Usefulness of a Projection

There are many scoring functions that measure the performance of a classifier. A measure often used in machine learning is classification accuracy, defined as the proportion of cases in which the classifier correctly predicts the class value. However, classification accuracy only measures whether the prediction was correct, and is therefore not very sensitive in the case of probabilistic classification: for classification accuracy it does not matter whether the classifier predicted the correct class with a probability of 1 or with a probability of, say, 0.51. Since our main goal is not to evaluate the classifier but to evaluate the projection, it makes more sense to work with the predicted probabilities. For a probabilistic classifier such as k-NN, we therefore used a more informative measure, defined as the average of the predicted probabilities assigned to the correct class:

\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_f(y_i \mid x_i),

where N is the number of examples, y_i is the correct class value for data example x_i, and P_f(y_i | x_i) is the probability assigned to the correct class value y_i by the classifier f. This is the measure we used in all our experiments. A similar measure that can also be used is the Brier score (Brier, 1950), but we found its value more difficult to interpret.
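
The score for a given projection could then be computed as in the following sketch, which evaluates the distance-weighted k-NN classifier with leave-one-out testing; the text above does not pin down the exact evaluation protocol, so leave-one-out is an assumption here.

    import numpy as np

    def projection_score(points, labels):
        # Average probability assigned to the correct class, estimated
        # with leave-one-out, distance-weighted k-NN.
        points, labels = np.asarray(points, float), np.asarray(labels)
        n = len(labels)
        k = max(1, round(np.sqrt(n)))                 # rule of thumb: k = sqrt(N)
        total = 0.0
        for i in range(n):
            mask = np.arange(n) != i                  # leave example i out
            dists = np.linalg.norm(points[mask] - points[i], axis=1)
            nearest = np.argsort(dists)[:k]
            weights = 1.0 / (dists[nearest] + 1e-12)  # inverse-distance votes
            correct = labels[mask][nearest] == labels[i]
            total += weights[correct].sum() / weights.sum()  # P_f(y_i | x_i)
        return total / n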

Search Heuristic

VizRank can in principle evaluate all possible projections of a given data set. However, with the high dimensionality of data sets in functional genomics, the computational complexity of such an algorithm becomes problematic. With the scatterplot method this problem might not be obvious, since there are "only" m*(m-1)/2 possible projections for a data set with m attributes. The number of possible projections is much larger, however, with the radviz method, which can concurrently visualize more than two attributes. When we wish to evaluate radviz projections that show l of m attributes, the l attributes can be selected in (m choose l) different ways, and each selection of l attributes can produce (l-1)!/2 different radviz projections. For illustration, the budding yeast data set, which contains 79 attributes, allows 79,079 different radviz projections with 3 attributes and 4,507,503 projections with 4 attributes.

Since it would be unreasonable to evaluate all possible projections, we developed a simple and fast heuristic that roughly estimates the usefulness of all projections and then selects only a small subset of potentially most interesting projections for subsequent assessment with VizRank. The heuristic first evaluates the "quality" of each attribute in the data set using the ReliefF measure (Kononenko, 1994), and then estimates the usefulness of each projection as the sum of the ReliefF values of the attributes that participate in it. Possible projections can then be ranked by this estimate, and only a reasonable subset of the best-ranked projections is assessed with VizRank. Our experiments with this heuristic on different functional genomics data sets indicate that to find the top-rated projections, it suffices for VizRank to inspect only a fraction (typically 1-3%) of the possible projections.
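
Both the projection counts and the heuristic ranking are easy to express in code. The sketch below assumes the ReliefF values have already been computed and are given as a dict mapping attribute names to scores (relieff is a hypothetical input, not a particular library's API).

    from itertools import combinations
    from math import comb, factorial

    def num_radviz_projections(m, l):
        # C(m, l) attribute subsets, each yielding (l-1)!/2 distinct
        # circular orderings (the formula used above, valid for l >= 3).
        return comb(m, l) * factorial(l - 1) // 2

    # Reproduces the counts quoted above for the 79-attribute yeast data:
    #   num_radviz_projections(79, 3) -> 79079
    #   num_radviz_projections(79, 4) -> 4507503

    def promising_attribute_subsets(relieff, l, fraction=0.02):
        # Rank all l-attribute subsets by the sum of their ReliefF values
        # and keep only a top fraction for exact assessment with VizRank.
        subsets = sorted(combinations(relieff, l),
                         key=lambda s: sum(relieff[a] for a in s),
                         reverse=True)
        return subsets[:max(1, int(len(subsets) * fraction))]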


References:

Brier, G.W. (1950) Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78, 1–3.

Kononenko, I. (1994) Estimating attributes: analysis and extensions of RELIEF. Proc. European Conference on Machine Learning (ECML), 171–182.