Using the NAG Toolbox for MATLAB - Part 3 - The G03 chapter - Multivariate Methods

The G03 chapter - Multivariate Methods

A multivariate data set contains several variables measured for a number of objects. Examples of such data sets arise in all branches of science, and the routines in the G03 chapter of the NAG Library can be used to study them. For example, environmental scientists who want to find out how many species of water vole (genus Arvicola) there are in the UK have made observations of 300 vole skulls, looking at the presence or absence of 13 characteristic measurements. Each observation was made in one of 14 geographical regions, distributed between the UK and continental Europe. The data from Europe is already classified into two species (A. terrestis and A. sapidus); and the scientist's task is to determine to which species the UK data belongs.

The treatment of the data starts by averaging the measurements within each region, giving 14 observations, each of 13 variables. This can be thought of as 14 points in 13-dimensional space, and a standard way of analysing such a data set is principal components analysis (as offered by, for example, the g03aa routine). This is a method to reduce the dimensionality of the data set to some smaller value; the structure of the derived points within this lower dimensional space can be considered in place of the original set of points, as long as they are adequately represented by the derived set. For the vole data set however, consideration of the first three principal components only explains 65% of the variance in the original data, which is insufficient.

An alternative technique which can be used in such circumstances is known as metric scaling. Here, the first step is to construct a 14 by 14 dissimilarity matrix whose elements are the distances between each pair of points in the original 13-dimensional space. The g03ea routine calculates the dissimilarity matrix. Here's the code:

   
% These are the region labels.
label = char( 'Surrey', 'Shropshire', 'Yorkshire', ...
'Perthshire', 'Aberdeen', 'Eilean Gamhna', 'Alps', ...
'Yugoslavia', 'Germany', 'Norway', 'Pyrenees I', ...
'Pyrenees II', 'North Spain', 'South Spain' );

% This is the vole data. 14 rows (one for each region),
% 13 columns (one for each variable).
data0 = [
[48.5 89.2 7.9 42.1 92.1 100.0 100.0 ...
35.3 11.4 100.0 71.9 31.6 2.8];
[67.7 67.0 2.0 23.0 93.0 100.0 86.0 ...
44.0 14.0 99.0 97.0 31.0 17.0];
[51.7 84.8 0.0 31.3 88.2 100.0 94.1 ...
18.8 25.0 100.0 83.3 33.3 5.9];
[42.9 50.0 0.0 50.0 77.3 100.0 90.9 ...
36.4 59.1 100.0 100.0 38.9 0.0];
[18.1 79.6 4.1 44.9 79.6 100.0 77.6 ...
16.7 37.1 100.0 90.4 9.8 0.0];
[65.0 81.8 9.1 31.8 81.8 100.0 59.1 ...
20.0 30.0 100.0 100.0 5.0 9.1];
[57.1 76.2 21.4 38.1 66.7 97.6 14.3 ...
23.5 9.5 100.0 91.4 11.8 17.5];
[26.7 53.1 23.5 38.2 44.1 94.1 11.8 ...
11.8 18.2 100.0 94.9 12.5 5.9];
[38.5 67.9 17.9 21.4 82.1 100.0 60.7 ...
35.7 24.0 100.0 91.7 37.5 0.0];
[33.3 83.3 27.8 29.4 86.1 100.0 63.9 ...
53.8 18.8 100.0 83.3 8.3 34.3];
[47.6 92.9 26.7 10.0 36.7 100.0 50.0 ...
14.3 7.4 100.0 86.4 90.9 3.3];
[60.0 90.9 13.6 68.2 40.9 100.0 18.2 ...
100.0 5.0 80.0 90.0 50.0 0.0];
[53.8 88.1 7.1 33.3 88.1 100.0 19.0 ...
85.7 9.8 73.8 72.2 73.7 2.4];
[29.2 74.0 16.0 46.0 86.0 100.0 18.0 ...
88.0 16.3 72.0 80.4 69.6 4.0];
];

% Invoke this standard transform to remove the effect
% of unequal variances between ranges.
data1 = asin(1.0 - 0.02*data0);

% Initialze D to zero before adding distances to D.
update = 'I';

% Distances are squared Euclidean.
dist = 'S';

% No scaling for the variables.
scal = 'U';

% Number of observations (regions in which data is gathered).
nobs = int32(14);

% Number of variables (measurements in each region).
nvar = int32(13);

% Which variables are to be included in the distance computation?
isx = ones(1, nvar, 'int32');

% Scaling parameter for each variable (not used, but must
% be supplied).
s = ones(1, nvar, 'double');

% Input distance matrix (not used, but must be supplied).
d = zeros(nobs*(nobs-1)/2, 1, 'double');

% Call the NAG routine to calculate the dissimilarity matrix.
% The string at the end of the parameter list signals an optional
% parameter of this name (m), whose value (nvar) follows it.
[sOut, dOut, ifail] = g03ea(update, dist, scal, nobs, data1, ...
isx, s, d, 'm', nvar);

The second step is to find a new matrix which represents the distances amongst a new set of points in a space of lower dimensionality (say, three); the projection of the old matrix onto the new one is done using metric scaling. Here's the continuation of the code, in which the g03fa routine is invoked to perform the metric scaling:

   
% Standardize the matrix by the number of variables.
dOut = dOut ./ double(nvar);

% Now correct for the bias caused by the different number of
% observations in each region.
ns = [19 50 17 11 49 11 21 17 14 18 16 11 21 25];

jend = nobs-1;
for j = 1:jend
istart = j+1;
for i = istart:nobs
index = (i-1)*(i-2)/2 + j;
dOut(index) = abs(dOut(index) - (1.0/ns(i) + 1.0/ns(j)));
end
end

% Now prepare for the metric scaling from nvar to ndim dimensions.
roots = 'L';
ndim = int32(3);
[x, eval, ifail] = g03fa(roots, nobs, dOut, ndim);

Finally, the new set of points in 3D can be displayed in a scatter plot:

   
% Do the scatter plot of the 3D data.
scatter3(x(1:nobs), x(nobs+1:2*nobs), x(2*nobs+1:3*nobs), ...
40.0, colours, 'filled');

% Add the labels, coloured according to species.
dims = size(label);
for i = 1 : nobs
text(x(i), x(nobs+i), x(2*nobs+i), label(i,1:dims(2)), ...
'Color', [colours(i), colours(nobs+i), colours(2*nobs+i)], ...
'FontSize', 13);
end

The resulting display is shown in Figure 3, from which it can be seen that the UK data (black points) are closer to the blue points (A. terrestis) than the red (A. sapidus), implying that those voles belong to that species.


Figure 3: Scatter plot from metric scaling of vole species data set.

The MATLAB script for this demo is available as the file NAGToolboxDemos/Multivariate_methods/g03fa_demo.m, distributed in this archive.