## Mathematics & Statistics Theses & Dissertations

Spring 2007

Dissertation

#### Degree Name

Doctor of Philosophy (PhD)

#### Department

Mathematics and Statistics

#### Program/Concentration

Computational and Applied Mathematics

Dayanand N. Naik

N. Rao Chaganty

Larry Lee

Edward Markowski

#### Abstract

Assessing the relationship between two sets of multivariate vectors is an important problem in statistics. Canonical correlation coefficients are used to study these relationships. Canonical correlation analysis (CCA) is a general multivariate method that is mainly used to study relationships when both sets of variables are quantitative. When the variables are qualitative (categorical), a technique called correspondence analysis (CA) is used. Canonical correspondence analysis (CCPA) is used when one set of variables is categorical and the other set is quantitative. By exploiting the interrelationships between these three techniques, we first provide a theoretical basis for CCPA.
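As a minimal illustration of the quantitative case, the sketch below computes sample canonical correlations with NumPy. It uses the standard QR-based formulation (the singular values of the product of orthonormal bases of the two centered data matrices) and is not taken from the dissertation; the function name `canonical_correlations` and the simulated data are assumptions made for the example.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Sample canonical correlations between the columns of X and Y.

    Assumes X and Y have the same number of rows (observations) and
    full column rank after centering.
    """
    Xc = X - X.mean(axis=0)          # center each variable
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)         # orthonormal basis for span(Xc)
    Qy, _ = np.linalg.qr(Yc)         # orthonormal basis for span(Yc)
    # Singular values of Qx^T Qy are the cosines of the principal angles
    # between the two column spaces, i.e. the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)      # guard against rounding slightly above 1

# Illustrative simulated data: Y is an exact linear function of X,
# so every canonical correlation should be (numerically) 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
A = np.array([[1.0, 0.5], [0.0, 2.0], [-1.0, 1.0]])
rho = canonical_correlations(X, X @ A)
```

The number of canonical correlations returned equals the smaller of the two column counts, here two.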

Next, in this dissertation, we generalize each of these three techniques to analyze the relationships between two sets of repeatedly or longitudinally observed data. When the two vectors are quantitative, we use a block Kronecker product matrix to model the dependency of the variables over time. We then apply canonical correlation analysis to this matrix to obtain canonical correlations and canonical variables. When the variables are qualitative, the data are summarized in the form of a contingency table. It is generally not straightforward to model the dependency of contingency tables over time. However, we propose fitting correlated linear models to the summary statistics obtained by performing the usual correspondence analysis at each time period. We show that the most useful summary measure for this purpose is the first singular value of the correspondence matrix, which is essentially the matrix of relative frequencies obtained from the given contingency table. Our method is a reasonable approach to analyzing repeated contingency table data. Finally, for the case when one set of variables is categorical and the other set is quantitative, we propose combining the approaches for quantitative and qualitative variables. We illustrate and study the performance of our methods by implementing them on simulated data sets.
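The per-period summary statistic described above can be sketched as follows. This is an illustration only: it assumes the usual correspondence analysis convention in which the singular values are taken from the standardized residuals of the correspondence matrix, which may differ in detail from the dissertation's definition. The function name `ca_singular_values` and the three contingency tables are hypothetical.

```python
import numpy as np

def ca_singular_values(table):
    """Singular values from a simple correspondence analysis of a two-way table.

    Assumes every row and column total is positive.
    """
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                       # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    expected = np.outer(r, c)             # proportions expected under independence
    S = (P - expected) / np.sqrt(expected)  # standardized residuals
    return np.linalg.svd(S, compute_uv=False)

# Hypothetical 3x3 contingency tables observed at three time periods
# (made-up counts, for illustration only).
tables = [
    np.array([[20, 10, 5], [10, 20, 10], [5, 10, 30]]),
    np.array([[22, 9, 6], [11, 18, 12], [6, 9, 28]]),
    np.array([[25, 8, 4], [9, 22, 9], [4, 11, 31]]),
]

# One summary value per time period; a series like this is what a
# correlated linear model would then be fitted to.
first_sv = [ca_singular_values(t)[0] for t in tables]
```

Under this convention the squared singular values sum to the total inertia, which equals the Pearson chi-square statistic of the table divided by its grand total.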

High-dimensional data are now common due to the Internet, genomics, proteomics, and the like. Although correspondence analysis and the other methods considered in this dissertation are general techniques for analyzing multivariate data, their usefulness for analyzing very high-dimensional data has not been compared with that of more modern machine learning methods. In the last chapter of this dissertation, we provide a brief introduction to latent semantic analysis (LSA), a machine learning method used to analyze very high-dimensional and sparse contingency table data from the field of language processing or information retrieval. We then propose certain criteria to compare the performance of LSA with that of correspondence analysis. Based on these criteria, we find that under certain situations correspondence analysis performs better.
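For orientation, LSA in its common textbook form amounts to a truncated singular value decomposition of a term-document count matrix. The sketch below shows that step only; it does not implement the comparison criteria proposed in the dissertation, which are not stated in the abstract. The function name `lsa_rank_k` and the small count matrix are hypothetical.

```python
import numpy as np

def lsa_rank_k(A, k):
    """Rank-k approximation of a term-document matrix via truncated SVD.

    Returns the approximation and the full vector of singular values.
    """
    A = np.asarray(A, dtype=float)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the k largest singular triplets.
    A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return A_k, s

# Hypothetical counts: 5 terms (rows) by 4 documents (columns).
A = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
    [1, 1, 1, 1],
])

k = 2
A_k, s = lsa_rank_k(A, k)
```

Unlike the correspondence analysis sketch, this decomposition is applied to the raw counts, with no centering or scaling by row and column masses; that difference in preprocessing is the main structural contrast between the two methods in their standard forms.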

#### DOI

10.25777/3yg2-7c87

#### ISBN

9780549069508
