A Framework for the Incremental Update of the MCA Solution
Iodice D'Enza, Alfonso
2018-01-01
Abstract
In the era of data deluge, a major challenge is to handle large amounts of data which are produced at a high rate and are characterized by association structures changing over time. As modern applications become increasingly scalable, efficient approaches are needed for dealing with high-dimensional categorical data. Multiple Correspondence Analysis (MCA) is a popular method for reducing the dimensionality of categorical data while preserving the most essential information. MCA is typically implemented via the eigenvalue decomposition (EVD) or the singular value decomposition (SVD) of a suitably transformed matrix. Because of the high computational and memory requirements of ordinary EVD and SVD, MCA is essentially unfeasible with massive data sets or data streams that change rapidly and have to be processed on the fly. We distinguish two main families of methods that can be efficiently used to incrementally compute the dominant eigenvalues and eigenvectors of a covariance matrix: (i) stochastic approximation and (ii) heuristic incremental EVD/SVD. A general algorithmic framework is presented to embed these methods in the MCA context and provide incremental dimension reduction of categorical data. The methods are compared on artificial data, in terms of the similarity between ordinary and incremental MCA configurations. Results do not clearly support the superiority of one method over another. However, methods that allow for block-based updates outperform vector-based approaches. The method of choice may be decided on the basis of the most desirable balance of speed and accuracy.
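To make the two ingredients of the abstract concrete, the sketch below shows (a) the "suitably transformed matrix" whose SVD yields the ordinary MCA solution, and (b) one representative of the heuristic incremental EVD/SVD family: a block-based (Zha–Simon style) update that appends new rows to an existing truncated SVD without re-decomposing the full matrix. This is an illustrative NumPy sketch, not the paper's actual framework; the function names are hypothetical, and for simplicity the update ignores the fact that in streaming MCA the row and column masses themselves change as new observations arrive (a complication the paper's framework addresses).

```python
import numpy as np

def mca_residuals(Z):
    """Standardized residual matrix whose SVD yields the MCA solution.

    Z: (n, J) indicator (one-hot) matrix of the categorical data.
    """
    P = Z / Z.sum()                          # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)      # row and column masses
    return (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

def svd_append_rows(U, s, Vt, B, k):
    """Rank-k SVD update after appending a block of rows B.

    Given X ~= U @ diag(s) @ Vt, returns the truncated SVD of the
    stacked matrix [X; B] (a Zha-Simon style block update).
    """
    V = Vt.T
    C = B @ V                                # projection onto current right basis
    Q, R = np.linalg.qr((B - C @ Vt).T)      # new directions orthogonal to span(V)
    p = B.shape[0]
    # small (k+p) x (k+p) core matrix carrying the update
    K = np.block([[np.diag(s), np.zeros((len(s), p))],
                  [C,          R.T]])
    Uk, sk, Vkt = np.linalg.svd(K, full_matrices=False)
    U_ext = np.block([[U,                          np.zeros((U.shape[0], p))],
                      [np.zeros((p, U.shape[1])),  np.eye(p)]])
    return (U_ext @ Uk)[:, :k], sk[:k], (np.hstack([V, Q]) @ Vkt.T)[:, :k].T

# usage: decompose an initial batch, then fold in a new block of rows
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 5))              # initial data block
B = rng.standard_normal((3, 5))              # newly arrived block
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U2, s2, Vt2 = svd_append_rows(U, s, Vt, B, k=5)
```

When no truncation occurs (as above, where k equals the number of columns) the update is exact; with k smaller than the true rank it becomes the kind of heuristic approximation the abstract refers to, trading accuracy for speed and a bounded memory footprint. In an incremental MCA pipeline, each incoming block of observations would be recoded to indicator form, transformed as in `mca_residuals`, and fed to such an update.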