Huge multidimensional data visualization: back to the virtue of principal coordinates and dendrograms in the new computer age
VISTOCCO, Domenico;
2008-01-01
Abstract
Many papers refer to Tukey’s (1977) treatise on exploratory data analysis as the contribution that transformed statistical thinking. In fact, the new ideas Tukey introduced prompted many statisticians to give a more prominent role to data visualization and, more generally, to data. Tukey had already begun this daring provocation in 1962, when he delivered a talk entitled “The Future of Data Analysis” at the annual meeting of the Institute of Mathematical Statistics (Tukey, 1962). At about the same time, the French statistician J.P. Benzécri brought his paradigm to the attention of the international scientific community in “L’Analyse des Données” (Benzécri, 1976). Like Tukey’s ideas, Benzécri’s approach appeared totally revolutionary with respect to “classical” statistical approaches, for two reasons: (i) the absence of any a priori model and (ii) the prominent role of graphical visualization in the analysis of output. Unfortunately, most of Benzécri’s papers were written in French. Michael Greenacre, in the preface to his well-known book Theory and Applications of Correspondence Analysis (Greenacre, 1984), wrote: “In 1980 I was invited to give a paper on correspondence analysis at an international conference on multidimensional graphical methods called ‘Looking at Multivariate Data’ in Sheffield, England. [There] . . . I realized the tremendous communication gap between Benzécri’s group and the Anglo-American statistical school.” These simultaneous and independent stimuli toward statistical analysis based mainly on visualization did not occur by chance but as a consequence of extraordinary developments in information technology. In particular, technological innovations in computer architecture permitted the storage of ever larger volumes of data and allowed ever higher-quality graphical visualization (on screen and on paper). These two elements contributed to giving data visualization a prominent role. The growth of data volume, on the other hand, created the need for preliminary (exploratory) analyses, and graphical methods quickly proved their potential in this kind of analysis. The performance of graphics cards made more detailed visualization possible, and developments in dynamic and 3-D graphics have opened new frontiers. A posteriori, we can state that at that time statisticians became conscious of the potential of graphical visualization and of the need for exploratory analysis. However, it appears quite strange that these two giants of the statistics world, Tukey and Benzécri, are so rarely mentioned together in data analysis papers. Their common starting point was the central role of data in statistical analysis; both strongly believed that the amount of available data would increase tremendously in the future, although the current abundance of data might be more than even they expected! In light of this historical background, the title of the present contribution should appear clearer to the reader. Our idea is to present visualization in the modern computer age following the precepts of the data analysis theorists. Moreover, the basic principles of data analysis are inspired by elementary notions of geometry. A key element in the success of data analysis is the strong contribution of visualization: it exploits the human capability to perceive 3-D space. The geometric approach in mathematics, for its part, has a centuries-old history.
Consider that many theorems were first enunciated in geometric terms and were formalized mathematically only much later, sometimes centuries later; the Pythagorean theorem is a well-known example. Our perception of the real world is the result of a geometric space characterized by orthogonal axes, the concept of distance, and the effects of light. The combination of the first two elements defines a metric space. Cartesian spaces permit one to visualize the positions of a set of dimensionless points. Exploiting the capabilities of current graphics cards, these points can be enriched with markers of different sizes, shapes, and colors that add information, helping the user interpret results more easily and quickly. Our mathematics, based on the decimal system, is clearly the result of our having ten fingers; similarly, our geometry, Euclidean geometry, is based on a system of orthogonal axes owing to our perception of the horizon line. Just as the binary and hexadecimal number systems are alternatives to the decimal system, so there exist geometries based on nonorthogonal systems in which parallel lines converge in a finite space. Yet even though alternative geometries exist, Euclidean geometry remains the only geometry that we apply to the solution of real-world problems. The concepts of far and close are native: it is not necessary to be a mathematician to understand them. Distance is the measure of closeness in space. This contribution introduces the concepts of factorial spaces and dendrograms and intends to furnish guidelines for the correct representation of displayed data. It also shows how enhanced representations can be obtained in which, thanks to modern graphics cards, millions of colors, transparencies, and man–machine interactions are available.
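To make concrete the abstract's claim that orthogonal axes plus a notion of distance define a metric space, the following is a standard textbook formulation of the metric axioms together with the Euclidean distance; it is background material, not notation taken from the contribution itself.

```latex
% A metric on a set $X$ is a map $d : X \times X \to \mathbb{R}$ such that,
% for all $x, y, z \in X$:
\begin{align}
  d(x, y) &\ge 0, \qquad d(x, y) = 0 \iff x = y && \text{(identity)} \\
  d(x, y) &= d(y, x)                            && \text{(symmetry)} \\
  d(x, z) &\le d(x, y) + d(y, z)                && \text{(triangle inequality)}
\end{align}
% In a Cartesian space with $p$ orthogonal axes, the Euclidean metric
\[
  d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}
\]
% makes the native notions of "far" and "close" precise:
% closeness is small $d$.
```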
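The remark that dimensionless points can be enriched with markers of different sizes, shapes, and colors can be illustrated with a minimal matplotlib sketch; the variable names and the random data are hypothetical, chosen only for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 100

# Two coordinates place each point in the Cartesian plane...
x, y = rng.normal(size=n), rng.normal(size=n)
# ...while two further (hypothetical) variables are encoded visually:
magnitude = rng.uniform(10, 200, size=n)   # -> marker size
group = rng.integers(0, 3, size=n)         # -> marker color

# Size, color, and transparency each carry one extra dimension,
# so a 2-D scatter plot conveys four variables at once.
plt.scatter(x, y, s=magnitude, c=group, cmap="viridis", alpha=0.6)
plt.xlabel("axis 1")
plt.ylabel("axis 2")
plt.colorbar(label="group")
plt.show()
```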
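Since the title promises principal coordinates and dendrograms, a minimal sketch of both may help fix ideas. It uses scikit-learn's PCA and scipy's hierarchical clustering on synthetic data; the principal-component projection stands in for the principal coordinates of the title (the two coincide for Euclidean distances), and none of it is the authors' own procedure.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic data: 30 units described by 10 variables.
X = rng.normal(size=(30, 10))

# Factorial space: project the units onto the first two principal
# axes, the planar view with maximum retained inertia (variance).
coords = PCA(n_components=2).fit_transform(X)

# Dendrogram: Ward's hierarchical clustering on Euclidean distances.
Z = linkage(X, method="ward")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(coords[:, 0], coords[:, 1])
ax1.set_xlabel("first principal axis")
ax1.set_ylabel("second principal axis")
dendrogram(Z, ax=ax2)
ax2.set_xlabel("units")
ax2.set_ylabel("aggregation distance")
plt.tight_layout()
plt.show()
```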