


Many data mining and pattern recognition tasks involve calculating abstract "distances" between items or collections of items. Some modeling algorithms, such as k-nearest neighbors or radial basis function neural networks, make direct use of multivariate distances. One very useful distance measure, the Mahalanobis distance, will be explained and implemented here.

The Euclidean distance is the geometric distance we are all familiar with in 3 spatial dimensions. It is simple to calculate: square the difference in each dimension (variable), and take the square root of the sum of these squared differences.

% Euclidean distance between vectors 'A' and 'B', original recipe
EuclideanDistance = sqrt(sum( (A - B) .^ 2 ));

% Euclidean distance between vectors 'A' and 'B', linear algebra style
EuclideanDistance = sqrt( (A - B) * (A - B)' );

This distance measure has a straightforward geometric interpretation, is easy to code and is fast to calculate, but it has two basic drawbacks.

First, the Euclidean distance is extremely sensitive to the scales of the variables involved. In geometric situations, all variables are measured in the same units of length. With other data, though, this is likely not the case: modeling problems often deal with variables which have very different scales, such as age, height, weight, etc., whose values are not comparable.

Second, the Euclidean distance is blind to correlated variables. Consider a hypothetical data set containing 5 variables, where one variable is an exact duplicate of one of the others. The duplicate and its original each contribute to the sum of squared differences, so that underlying variable is effectively counted twice.
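The Euclidean distance calculation described above can be mirrored outside MATLAB as well; here is a minimal Python sketch (NumPy assumed available, and the vectors A and B are made-up sample data):

```python
import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([4.0, 6.0, 3.0])

# "Original recipe": square the per-dimension differences,
# sum them, and take the square root.
d1 = np.sqrt(np.sum((A - B) ** 2))

# "Linear algebra style": the same value via an inner product.
d2 = np.sqrt((A - B) @ (A - B))

print(d1, d2)  # both are 5.0 for these vectors (sqrt(9 + 16 + 0))
```

Both forms compute the same quantity; the inner-product form simply lets the linear-algebra library do the squaring and summing in one step.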

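The Mahalanobis distance promised above addresses both drawbacks by measuring distance through the inverse covariance matrix of the data: d(x, y) = sqrt((x − y)ᵀ S⁻¹ (x − y)). A minimal NumPy sketch follows, using made-up random data; it demonstrates the fix for the first drawback by showing that rescaling a variable (e.g. changing its units) leaves the Mahalanobis distance unchanged, since the covariance matrix absorbs the change of scale:

```python
import numpy as np

def mahalanobis(x, y, S_inv):
    """Mahalanobis distance: sqrt((x - y)^T S^-1 (x - y))."""
    diff = x - y
    return float(np.sqrt(diff @ S_inv @ diff))

def dist_to_mean(data, i):
    """Mahalanobis distance of row i from the sample mean of 'data'."""
    S_inv = np.linalg.inv(np.cov(data, rowvar=False))
    return mahalanobis(data[i], data.mean(axis=0), S_inv)

# Hypothetical data set: rows are observations, columns are variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

# Rescale one variable by a factor of 100 (e.g. metres -> centimetres).
# The Euclidean distance would change drastically; the Mahalanobis
# distance does not, because the covariance rescales along with the data.
X_scaled = X.copy()
X_scaled[:, 0] *= 100.0

d_raw = dist_to_mean(X, 0)
d_scaled = dist_to_mean(X_scaled, 0)
print(d_raw, d_scaled)  # identical up to floating-point error
```

Note that an *exactly* duplicated variable, as in the 5-variable thought experiment above, makes the covariance matrix singular, so in practice a pseudo-inverse (`np.linalg.pinv`) or removal of the redundant column is needed in that degenerate case.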