Comparison of Canonical Correlation and Partial Least Squares analyses of simulated and empirical data
In this paper, we compared the general forms of CCA and PLS on three simulated and two empirical datasets, all having large sample sizes. We took successively smaller subsamples of these data to evaluate sensitivity, reliability, and reproducibility. In null data having no correlation within or between blocks, both methods showed equivalent false positive rates across sample sizes. Both methods also showed equivalent detection in data with weak but reliable effects until sample sizes dropped below n=50. In the case of strong effects, both methods showed similar performance unless the correlations of items within one data block were high. For PLS, the results were reproducible across sample sizes for strong effects, except at the smallest sample sizes. In contrast, the reproducibility of CCA declined when the within-block correlations were high. This was ameliorated if a principal components analysis (PCA) was performed and the component scores were used to calculate the cross-block matrix. Our examination yields three messages. First, for data with reasonable within- and between-block structure, CCA and PLS give comparable results. Second, high correlations within either block can compromise the reliability of CCA results. This known issue of CCA can be remedied by performing PCA before the cross-block calculation. This, however, assumes that the PCA structure is stable for a given sample. Third, null hypothesis testing does not guarantee that the results are reproducible, even with large sample sizes. This final outcome suggests that both statistical significance and reproducibility be assessed for any data.
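The relationship between the two methods, and the PCA step mentioned above, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that PLS is computed as the singular value decomposition of the cross-block correlation matrix and that CCA additionally whitens that matrix by the within-block correlations. The function names, the toy data, and the choice to keep only the leading k components are all hypothetical.

```python
import numpy as np


def zscore(a):
    """Center each column and scale it to unit sample variance."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)


def inv_sqrt(m):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, v = np.linalg.eigh(m)
    return v @ np.diag(w ** -0.5) @ v.T


def pls_singular_values(x, y):
    """PLS (assumed form): SVD of the cross-block correlation matrix."""
    x, y = zscore(x), zscore(y)
    r_xy = x.T @ y / (x.shape[0] - 1)
    return np.linalg.svd(r_xy, compute_uv=False)


def cca_canonical_correlations(x, y):
    """CCA (assumed form): SVD of the cross-block matrix after whitening
    by the within-block correlation matrices."""
    x, y = zscore(x), zscore(y)
    n = x.shape[0]
    r_xx = x.T @ x / (n - 1)
    r_yy = y.T @ y / (n - 1)
    r_xy = x.T @ y / (n - 1)
    whitened = inv_sqrt(r_xx) @ r_xy @ inv_sqrt(r_yy)
    return np.linalg.svd(whitened, compute_uv=False)


def pca_scores(x, k):
    """Scores on the first k principal components of the z-scored block.
    Keeping only k components is an assumption of this sketch."""
    x = zscore(x)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :k] * s[:k]


# Toy data (hypothetical): one shared latent variable, with an X block
# whose four items are almost copies of each other.
rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 1))
x = latent + 0.05 * rng.normal(size=(n, 4))   # high within-block correlation
y = 0.5 * latent + rng.normal(size=(n, 3))

pls_sv = pls_singular_values(x, y)
cca_raw = cca_canonical_correlations(x, y)
cca_pca = cca_canonical_correlations(pca_scores(x, 1), y)
```

Because CCA is invariant to invertible linear transformations of a block, retaining every component would leave the canonical correlations unchanged. In this sketch any stabilizing effect therefore comes from discarding the low-variance components, which is why the stability of the PCA structure matters.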