Comparison with other methods

Of the approaches that enable us to find interesting plots of multidimensional data, principal component analysis and discriminant analysis are perhaps the best known and most widely used. On this page we first give a short description of the two methods and show how they can be used to find interesting data projections on our test data sets. These examples also help us illustrate the strengths and weaknesses of these approaches, which are discussed in comparison with VizRank at the end of the page.

We will use the following notation:

  • d will denote the dimensionality of the data (number of features),
  • N will be the number of data instances (examples) in the data set, and
  • c will represent the number of classes which are used for labelling data instances.

Principal component analysis

Principal component analysis (PCA) is a well-known statistical method that can be used to reduce the dimensionality of the data by finding a subset of the most interesting directions (principal components) in the data set. A direction is interesting to PCA if the variance of the data along it is high. PCA can be computed by finding the eigenvectors and eigenvalues of the covariance matrix of the centered data (data from which the mean value has been subtracted). The eigenvectors with the largest eigenvalues are the most important, as they capture most of the data variance. To find the most interesting k-dimensional subspace, we simply select the k eigenvectors with the largest eigenvalues and project the data onto this k-dimensional subspace. To show the highest variance in the data using a scatterplot, we would therefore project the data onto the first and second principal components.
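As a minimal sketch of this computation (not the implementation used for the plots on this page), the following Python/numpy function centers the data, diagonalizes its covariance matrix and projects onto the k eigenvectors with the largest eigenvalues; the names X and pca_project are illustrative only.

    import numpy as np

    def pca_project(X, k=2):
        """Project data X (N x d) onto its first k principal components."""
        Xc = X - X.mean(axis=0)                 # center the data
        cov = np.cov(Xc, rowvar=False)          # d x d covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: the covariance matrix is symmetric
        order = np.argsort(eigvals)[::-1]       # sort directions by variance, largest first
        components = eigvecs[:, order[:k]]      # k eigenvectors with the largest eigenvalues
        return Xc @ components                  # N x k projection; scatterplot axes for k = 2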

We cannot expect PCA to find the projections of the data set that are best for discrimination between different classes, as PCA does not take the class information into account. However, when the classes are well separated in the original d-dimensional space and the examples within each class are not too scattered, the first few principal components can often provide a projection with well separated classes.

For comparison with VizRank, we computed PCA on both data sets used in our paper. The following scatterplots show the projection onto the first and second principal components for each of our example data sets:

[Scatterplots of the first two principal components: S. cerevisiae metabolic example (left) and S. cerevisiae cell-cycle example (right)]

In the left scatterplot (metabolic example) we can see that although the ribosomal functional group appears well separated, the other two groups largely overlap. In the right scatterplot (cell-cycle example), the first factor (Factor 1) separates the classes very well, while the second factor (Factor 2) appears to be useless for discrimination.

Discriminant analysis

In a similar way to PCA, discriminant analysis finds interesting directions in the data. Unlike PCA, where the most useful directions are those with the highest variance, discriminant analysis searches for the directions that are most important for discrimination between different classes. There are several types of discriminant analysis, including linear, quadratic, and multiple discriminant analysis. Without going into details, we can say that they all find directions in the original d-dimensional space that maximize the distance between the class means of the projected data while at the same time minimizing the scatter of examples within each class.

Unlike PCA, which can project into up to d dimensions (the dimensionality of the original space), discriminant analysis can project only into a space of at most c-1 dimensions. In our cell-cycle example, where we have only two classes, the projection is one-dimensional, so we visualized the projected data with a histogram.
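As an illustration of this criterion, here is a minimal numpy sketch of multi-class (Fisher) linear discriminant analysis, not the exact implementation used for the figures below: it builds the within-class and between-class scatter matrices and keeps at most c-1 directions. The name lda_project is illustrative only.

    import numpy as np

    def lda_project(X, y, k=None):
        """Project X (N x d) onto at most c-1 discriminant directions."""
        classes = np.unique(y)
        k = k or len(classes) - 1               # at most c-1 useful directions
        d = X.shape[1]
        mean_total = X.mean(axis=0)
        Sw = np.zeros((d, d))                   # within-class scatter
        Sb = np.zeros((d, d))                   # between-class scatter
        for cls in classes:
            Xc = X[y == cls]
            mean_c = Xc.mean(axis=0)
            Sw += (Xc - mean_c).T @ (Xc - mean_c)
            diff = (mean_c - mean_total).reshape(-1, 1)
            Sb += len(Xc) * (diff @ diff.T)
        # Maximize between-class relative to within-class scatter:
        # eigenvectors of inv(Sw) Sb (requires Sw to be invertible; see the
        # discussion of singular covariance matrices below).
        eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
        order = np.argsort(eigvals.real)[::-1]
        return X @ eigvecs[:, order[:k]].real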

[Projections found by linear discriminant analysis: S. cerevisiae metabolic example (left, scatterplot) and S. cerevisiae cell-cycle example (right, histogram)]

The results of linear discriminant analysis on our data sets are very good. In the metabolic example the classes are perfectly separated, while in the cell-cycle example there is a small overlap, similar to the overlap seen in the PCA projection.

Comparison: Strengths and Weaknesses

Both described methods are very useful when we have data with a large number of features and wish to project it onto some "interesting" low-dimensional subspace. Discriminant analysis will find a projection with the best separation of the classes. The computational cost of finding these projections with either method is low: there is no evaluation of candidate projections, just some matrix manipulation.

PCA and discriminant analysis also have some drawbacks. In comparison with VizRank, a major one is interpretability. Each factor onto which we project the original data is a linear combination of the original features. For instance, the two factors computed using discriminant analysis for the metabolic example were:

Factor 1 = 0.339 * alpha 0 - 0.176 * alpha 7 - 0.165 * alpha 14 - 0.089 * alpha 21 + 0.173 * alpha 28 - 0.105 * alpha 35 - 0.262 * alpha 42 - 0.006 * alpha 49 - 0.055 * alpha 56 + 0.131 * alpha 63 - 0.015 * alpha 70 + 0.012 * alpha 77 + 0.064 * alpha 84 - 0.121 * alpha 91 + 0.064 * alpha 98 - 0.020 * alpha 105 + 0.138 * alpha 112 - 0.010 * alpha 119 - 0.161 * Elu 0 + 0.065 * Elu 30 + 0.049 * Elu 60 + 0.056 * Elu 90 + 0.214 * Elu 120 + 0.066 * Elu 150 + 0.011 * Elu 180 + 0.092 * Elu 210 + 0.098 * Elu 240 - 0.017 * Elu 270 + 0.242 * Elu 300 - 0.306 * Elu 330 - 0.009 * Elu 360 - 0.127 * Elu 390 - 0.012 * cdc15 10 + 0.226 * cdc15 30 - 0.122 * cdc15 50 - 0.043 * cdc15 70 - 0.086 * cdc15 90 + 0.045 * cdc15 110 + 0.086 * cdc15 130 - 0.005 * cdc15 150 - 0.032 * cdc15 170 + 0.058 * cdc15 190 + 0.054 * cdc15 210 + 0.044 * cdc15 230 - 0.057 * cdc15 250 + 0.060 * cdc15 270 + 0.038 * cdc15 290 + 0.064 * spo 0 + 0.004 * spo 2 + 0.020 * spo 5 - 0.191 * spo 7 - 0.055 * spo 9 + 0.279 * spo 11 - 0.028 * spo5 2 - 0.076 * spo5 7 + 0.001 * spo5 11 + 0.014 * spo- early - 0.140 * spo- mid + 0.027 * heat 0 + 0.029 * heat 10 - 0.068 * heat 20 - 0.086 * heat 40 + 0.042 * heat 80 - 0.031 * heat 160 - 0.020 * dtt 15 - 0.167 * dtt 30 - 0.025 * dtt 60 - 0.087 * dtt 120 - 0.017 * cold 0 - 0.129 * cold 20 + 0.017 * cold 40 - 0.052 * cold 160 + 0.068 * diau a - 0.072 * diau b + 0.060 * diau c - 0.094 * diau d + 0.033 * diau e + 0.004 * diau f - 0.164 * diau g
Factor 2 = - 0.089 * alpha 0 + 0.002 * alpha 7 - 0.095 * alpha 14 + 0.147 * alpha 21 - 0.062 * alpha 28 + 0.017 * alpha 35 + 0.257 * alpha 42 - 0.023 * alpha 49 + 0.271 * alpha 56 + 0.075 * alpha 63 + 0.064 * alpha 70 + 0.037 * alpha 77 + 0.100 * alpha 84 + 0.049 * alpha 91 - 0.074 * alpha 98 - 0.138 * alpha 105 + 0.012 * alpha 112 + 0.022 * alpha 119 + 0.167 * Elu 0 - 0.025 * Elu 30 + 0.033 * Elu 60 + 0.080 * Elu 90 - 0.294 * Elu 120 - 0.298 * Elu 150 + 0.003 * Elu 180 + 0.229 * Elu 210 + 0.054 * Elu 240 - 0.102 * Elu 270 - 0.012 * Elu 300 + 0.029 * Elu 330 - 0.109 * Elu 360 + 0.037 * Elu 390 - 0.006 * cdc15 10 + 0.026 * cdc15 30 - 0.035 * cdc15 50 + 0.139 * cdc15 70 - 0.010 * cdc15 90 - 0.137 * cdc15 110 - 0.086 * cdc15 130 - 0.157 * cdc15 150 + 0.018 * cdc15 170 - 0.132 * cdc15 190 + 0.176 * cdc15 210 - 0.015 * cdc15 230 - 0.033 * cdc15 250 - 0.026 * cdc15 270 + 0.030 * cdc15 290 + 0.056 * spo 0 + 0.177 * spo 2 + 0.120 * spo 5 - 0.213 * spo 7 - 0.124 * spo 9 + 0.212 * spo 11 - 0.083 * spo5 2 - 0.052 * spo5 7 + 0.074 * spo5 11 - 0.057 * spo- early + 0.132 * spo- mid - 0.035 * heat 0 - 0.062 * heat 10 - 0.137 * heat 20 - 0.039 * heat 40 + 0.194 * heat 80 + 0.036 * heat 160 - 0.022 * dtt 15 + 0.052 * dtt 30 + 0.071 * dtt 60 - 0.134 * dtt 120 + 0.009 * cold 0 + 0.002 * cold 20 + 0.002 * cold 40 + 0.055 * cold 160 - 0.065 * diau a + 0.164 * diau b - 0.134 * diau c + 0.012 * diau d - 0.059 * diau e - 0.043 * diau f - 0.129 * diau g

By observing the above coefficients we were unable to draw any conclusions about the importance of specific features for discrimination between classes. For instance, we were unable to see that the diauxic shift experiments can characterize two of the three functional groups, cytoplasmic ribosomes and respiration. Using the projection generated by discriminant analysis, we could only determine whether the classes are linearly separable; although such a finding is important, the list of coefficients did not help us gain additional insight.

Missing values, often present in expression data sets, also pose problems for the two methods. For instance, the metabolic example data set, which has measurements for 186 genes, includes only 70 genes without missing values. Since the factors generated by PCA or discriminant analysis require values of all attributes, we have two choices: either impute the missing data or remove the examples with missing values (regression-tree based imputation was used in our analysis above).
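The two options might look as follows in a short pandas sketch; the file name expression.csv is hypothetical, and the column-mean fill stands in only as a placeholder for the regression-tree based imputation actually used above.

    import pandas as pd

    data = pd.read_csv("expression.csv")    # hypothetical expression matrix with missing entries

    # Option 1: keep only the genes (rows) with complete measurements
    complete = data.dropna()

    # Option 2: impute missing values; a simple column-mean fill is shown here
    # in place of the regression-tree based imputation used in the analysis above
    imputed = data.fillna(data.mean(numeric_only=True))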

A problem also arises with discriminant analysis when the dimensionality of the data d is larger than the number of examples N. In such cases the inverse of the covariance matrix does not exist (the covariance matrix is singular) and the factors therefore cannot be computed.
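This is easy to verify numerically; the following small numpy check (with arbitrary made-up dimensions) shows that with fewer examples than features the covariance matrix is rank-deficient and hence not invertible.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 50, 80                             # fewer examples than features (d > N)
    X = rng.normal(size=(N, d))

    cov = np.cov(X, rowvar=False)             # d x d covariance matrix
    print(np.linalg.matrix_rank(cov))         # at most N - 1 = 49 < d, so cov is singular
    # Inverting cov, as discriminant analysis requires, fails or is numerically meaningless.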

Another disadvantage of discriminant analysis is its assumptions about the data distribution. Discriminant analysis assumes that the data in each class is unimodal, follows a Gaussian distribution, and that all classes share the same covariance matrix. When this does not hold (as is the case for most real-life data sets), discriminant analysis can often find projections that are poor for class discrimination.

Conclusion

VizRank, PCA and discriminant analysis all have their strengths and weaknesses. The main strength of VizRank compared with the other two techniques is the easy interpretation of its visualizations, as they use the original set of features rather than a transformation of them. Another advantage is the ability to obtain a ranked list of projections. VizRank also makes no assumptions about the data distribution. On the other hand, methods like PCA and discriminant analysis can be very fast at finding a single projection with the best separation.

Despite these comparative differences, one can always benefit from using different analysis tools on the same data. Using discriminant analysis first may, for instance, indicate that the classes are separable, while continuing the analysis with VizRank and a selected visualization (e.g. scatterplot or radviz) can provide additional insight and identify interesting features and patterns.
