Distances from Data

This procedure calculates a distance matrix from one or more columns of a rectangular data matrix. The result is an n × n distance matrix, where n is the number of data points (equal to the number of rows in the input matrix). Each value in the output matrix, $d_{ij}$, is a measure of the distance between the ith and jth rows.
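As a rough illustration of the n × n result described above, the following Python sketch builds a distance matrix from the rows of a small data set. It is not PASSaGE's code: the names `distance_matrix` and `euclidean` are hypothetical, and the sketch assumes a symmetric measure with a zero diagonal.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two rows of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(rows, metric=euclidean):
    """Return an n x n list of lists holding metric(row i, row j).

    Assumes the metric is symmetric and gives 0 for identical rows,
    so only the upper triangle is computed and then mirrored.
    """
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = metric(rows[i], rows[j])
    return d

# Three data points (rows) measured on two variables (columns).
data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = distance_matrix(data)
```

Here `D` has three rows and three columns, one for each row of `data`, and any other two-argument distance function can be passed as `metric`.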

Menu:	Create→Distances→Distances for Data
Button:
Batch:	DataDistances

The Make Distances from Data window.

PASSaGE allows you to choose from among many distance measures. In all of the following, i and j refer to rows of the data matrix, m refers to the number of columns (variables) included in the distance measure, and $x_{ik}$ refers to the value in the kth of those columns for row i.

Euclidean distance

The straight-line distance through multivariate space between two data points.
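In the notation defined above, the standard textbook form of this measure can be written as follows (a reconstruction from the definition, not a formula quoted from PASSaGE):

```latex
d_{ij} = \sqrt{\sum_{k=1}^{m} \left( x_{ik} - x_{jk} \right)^{2}}
```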

Squared Euclidean distance

The square of the straight-line distance through multivariate space between two data points.
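Written out in the same notation, and again as a reconstruction from the definition rather than a quoted formula, this is the sum of squared differences without the final square root:

```latex
d_{ij} = \sum_{k=1}^{m} \left( x_{ik} - x_{jk} \right)^{2}
```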

Scaled Euclidean distance

Similar to the Euclidean distance, except that each variable is scaled by its variance, $s^2$.
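One common reading of this description divides each squared difference by the variance of its column before summing. The sketch below follows that reading; the function names are hypothetical, and the use of the sample variance (n - 1 denominator) is an assumption, since the text does not say which variance estimate PASSaGE uses.

```python
import math

def column_variances(rows):
    """Variance of each column, using the n - 1 denominator (an assumption)."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(m)]
    return [sum((r[k] - means[k]) ** 2 for r in rows) / (n - 1)
            for k in range(m)]

def scaled_euclidean(a, b, variances):
    """Euclidean distance with each squared difference divided by s^2."""
    return math.sqrt(sum((x - y) ** 2 / v
                         for x, y, v in zip(a, b, variances)))

# The second variable is on a much larger scale than the first.
data = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
s2 = column_variances(data)
d01 = scaled_euclidean(data[0], data[1], s2)
```

Without scaling, the second column would dominate the distance between the first two rows; after scaling, its contribution is of the same order as that of the first column.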

Manhattan/City Block distance

The sum of the absolute differences between the points in each individual dimension.
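In the notation defined above, this description corresponds to the following standard form (a reconstruction, not a formula quoted from PASSaGE):

```latex
d_{ij} = \sum_{k=1}^{m} \left| x_{ik} - x_{jk} \right|
```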

Minkowski distance

A general distance measure, which includes an additional parameter λ. Increasing λ gives large differences in individual variables more weight relative to small differences.

Euclidean and Manhattan/City Block distances are special cases of the Minkowski distance when λ = 2 and 1, respectively.
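The special cases above can be checked with a short sketch. It assumes the usual Minkowski form, in which the absolute differences are raised to the power λ, summed, and the sum is raised to the power 1/λ; the function name `minkowski` and the parameter name `lam` are hypothetical.

```python
def minkowski(a, b, lam):
    """Minkowski distance of order lam between two rows (usual form assumed)."""
    return sum(abs(x - y) ** lam for x, y in zip(a, b)) ** (1.0 / lam)

p, q = [0.0, 0.0], [3.0, 4.0]
d1 = minkowski(p, q, 1)  # same as Manhattan/City Block: 3 + 4
d2 = minkowski(p, q, 2)  # same as Euclidean: sqrt(9 + 16)
d4 = minkowski(p, q, 4)  # larger lam: the biggest difference dominates
```

As λ grows, the result moves toward the largest single difference between the two rows (4 in this example), which is the sense in which dissimilar values are exaggerated.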

Mahalanobis distance

A common multivariate distance measure where

$d_{ij} = \sqrt{(x_i - x_j)^{T} S^{-1} (x_i - x_j)}$

In this case $x_i$ and $x_j$ are the column vectors representing points i and j, and S is the sample variance-covariance matrix.
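The following sketch shows one way to compute a distance of this kind from the vectors and covariance matrix just described. It assumes the standard square-root form of the Mahalanobis distance and a sample covariance matrix with an n - 1 denominator; all function names are hypothetical. Rather than inverting S explicitly, it solves S z = (x_i - x_j) by Gaussian elimination.

```python
import math

def covariance_matrix(rows):
    """Sample variance-covariance matrix of the columns (n - 1 denominator)."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(m)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(m)] for a in range(m)]

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * m
    for r in range(m - 1, -1, -1):  # back substitution
        z[r] = (M[r][m] - sum(M[r][k] * z[k] for k in range(r + 1, m))) / M[r][r]
    return z

def mahalanobis(a, b, S):
    """sqrt((a - b)' S^-1 (a - b)), assuming S is non-singular."""
    diff = [x - y for x, y in zip(a, b)]
    z = solve(S, diff)
    return math.sqrt(sum(d * w for d, w in zip(diff, z)))

data = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
S = covariance_matrix(data)
d01 = mahalanobis(data[0], data[1], S)
```

When S is the identity matrix the result reduces to the Euclidean distance, and when S is diagonal it reduces to a variance-scaled Euclidean distance.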

Canberra distance
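No description is given here, so the sketch below uses the commonly cited form of the Canberra distance: the sum over variables of |x_ik - x_jk| / (|x_ik| + |x_jk|). Whether PASSaGE uses exactly this form (for example, whether it rescales by the number of variables) is not stated in this section, and skipping terms where both values are zero is an assumption. The function name is hypothetical.

```python
def canberra(a, b):
    """Commonly cited Canberra form: sum of |x - y| / (|x| + |y|)."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom > 0:  # skip 0/0 terms (an assumption)
            total += abs(x - y) / denom
    return total

d = canberra([1.0, 0.0, 3.0], [3.0, 0.0, 1.0])
```

Because each term is a ratio, a difference between two small values counts as much as the same relative difference between two large values.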

Czekanowski distance
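No description is given here. A commonly cited form of the Czekanowski distance is 1 - 2 Σ min(x_ik, x_jk) / Σ (x_ik + x_jk), which for non-negative data equals Σ |x_ik - x_jk| / Σ (x_ik + x_jk). The sketch below uses that form as an assumption; it is not confirmed as the formula PASSaGE implements, and the function name is hypothetical.

```python
def czekanowski(a, b):
    """1 - 2 * sum(min) / sum(x + y); assumes non-negative values."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    total = sum(x + y for x, y in zip(a, b))
    if total == 0:
        return 0.0  # two all-zero rows treated as identical (an assumption)
    return 1.0 - 2.0 * shared / total

d = czekanowski([1.0, 2.0, 0.0], [3.0, 2.0, 4.0])
```

In this form the distance is 0 for identical rows and 1 for rows that share nothing, which is why it is often applied to abundance or count data.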

Cosine distance
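No description is given here. The sketch below assumes the common definition of cosine distance as 1 minus the cosine of the angle between the two rows treated as vectors; this section does not confirm that PASSaGE uses this exact form, and the function name is hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - (a . b) / (|a| |b|); undefined if either row is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
```

Under this definition only the direction of each row matters, not its magnitude, so a row and any positive multiple of it are at distance 0.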

Correlation distance

This is 1 – the correlation of the variables, or $d_{ij} = 1 - r_{ij}$, where $r_{ij}$ is the correlation between rows i and j.

It is useful when a correlation coefficient of –1 represents the maximum disagreement between the two variables.
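A minimal sketch of this measure follows, assuming the correlation in question is the Pearson correlation computed across the m columns of rows i and j. The function names are hypothetical, and rows with no variation would cause a division by zero.

```python
import math

def pearson(a, b):
    """Pearson correlation between two rows of equal length."""
    m = len(a)
    mean_a, mean_b = sum(a) / m, sum(b) / m
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)

def correlation_distance(a, b):
    """1 minus the correlation, so the result ranges from 0 to 2."""
    return 1.0 - pearson(a, b)

agree = correlation_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
oppose = correlation_distance([1.0, 2.0, 3.0], [6.0, 4.0, 2.0])
```

Perfectly correlated rows are at distance 0 and perfectly anticorrelated rows are at the maximum distance of 2, matching the interpretation given above.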

Squared Correlation distance

This is 1 – the squared correlation of the variables, or $d_{ij} = 1 - r_{ij}^2$, where $r_{ij}$ is the correlation between rows i and j.

It is useful when the sign of the correlation is unimportant and correlation coefficients of +1 and –1 are treated as logically identical, showing maximum agreement between the variables.
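A short worked example, using illustrative correlation values that are not taken from any data set, shows how the sign drops out. For correlations of –0.8 and +0.8, the correlation distance differs while the squared correlation distance is the same:

```latex
\begin{aligned}
r_{ij} = -0.8 &: \quad 1 - r_{ij} = 1.8, \qquad 1 - r_{ij}^{2} = 1 - 0.64 = 0.36 \\
r_{ij} = +0.8 &: \quad 1 - r_{ij} = 0.2, \qquad 1 - r_{ij}^{2} = 1 - 0.64 = 0.36
\end{aligned}
```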

Hamming distance

The proportion of values that differ between the two points. Useful for categorical data.
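The definition above translates directly into a few lines of Python. The function name `hamming` and the example values are hypothetical.

```python
def hamming(a, b):
    """Proportion of positions at which the two rows differ."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)

# Two data points described by four categorical variables.
d = hamming(["red", "tall", "yes", "A"], ["red", "short", "yes", "B"])
```

Two of the four variables differ in this example, so the distance is one half. Because only equality is tested, the values need not be numeric.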

Jaccard distance

Similar to the Hamming distance, but only cases where $x_{ik}$ and $x_{jk}$ are not both zero are counted.
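Read literally, this description gives the proportion of differing values among the variables for which the two rows are not both zero. The sketch below follows that reading; the function name is hypothetical, and returning 0 when every variable is zero in both rows is an assumption.

```python
def jaccard(a, b):
    """Proportion of differing values, ignoring positions where both are 0."""
    counted = [(x, y) for x, y in zip(a, b) if not (x == 0 and y == 0)]
    if not counted:
        return 0.0  # nothing to compare (an assumption)
    return sum(1 for x, y in counted if x != y) / len(counted)

# Presence/absence data for five variables.
a = [1, 0, 0, 1, 0]
b = [1, 1, 0, 0, 0]
d = jaccard(a, b)
```

In this example two of the five variables are zero in both rows and are ignored, so two mismatches out of three counted variables give a distance of two thirds, compared with two fifths for the Hamming distance on the same rows.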