Erlang Central

# String Similarity (Levenshtein)

## Problem

You need to compare two strings and get an index of how similar they are.

## Solution

This module implements the Levenshtein edit distance algorithm (described more here). In short, it calculates the number of edit steps that are needed to transform the source string to the target string. The lesser the more similiar.

```%%%=============================================================================
%%% @author Adam Lindberg <adam@erlang-consulting.com>
%%%			Fredrik Svensson
%%% @doc This module contains string metric operations.
%%%
%%% Currently only the function {@link levenshtein/2}.
%%% @end
%%%=============================================================================
-module(string_metrics).

-export([levenshtein/2]).

%%------------------------------------------------------------------------------
%% @spec levenshtein(StringA, StringB) -> integer()
%%		StringA = StringB = string()
%% @doc Calculates the Levenshtein distance between two strings
%% @end
%%------------------------------------------------------------------------------
levenshtein(Samestring, Samestring) -> 0;
levenshtein(String, []) -> length(String);
levenshtein([], String) -> length(String);
levenshtein(Source, Target) ->
levenshtein_rec(Source, Target, lists:seq(0, length(Target)), 1).

%% Recurses over every character in the source string and calculates a list of distances
levenshtein_rec([SrcHead|SrcTail], Target, DistList, Step) ->
levenshtein_rec(SrcTail, Target, levenshtein_distlist(Target, DistList, SrcHead, [Step], Step), Step + 1);
levenshtein_rec([], _, DistList, _) ->
lists:last(DistList).

%% Generates a distance list with distance values for every character in the target string
levenshtein_distlist([TargetHead|TargetTail], [DLH|DLT], SourceChar, NewDistList, LastDist) when length(DLT) > 0 ->
Min = lists:min([LastDist + 1, hd(DLT) + 1, DLH + dif(TargetHead, SourceChar)]),
levenshtein_distlist(TargetTail, DLT, SourceChar, NewDistList ++ [Min], Min);
levenshtein_distlist([], _, _, NewDistList, _) ->
NewDistList.

% Calculates the difference between two characters or other values
dif(C, C) -> 0;
dif(_, _) -> 1.
```

You can use the levenshtein function to compare two strings.

```2> string_metrics:levenshtein("Aloha!", "Alhoa!").
2
3> string_metrics:levenshtein("adam", "Adam").
1
4> string_metrics:levenshtein("adam", "Assam").
3
5> string_metrics:levenshtein("teh", "the").
2
6> string_metrics:levenshtein("the", "the").
0
7> string_metrics:levenshtein("the", "").
3
```

Note that the function is not case insensitive (and the algorithm isn't either), though you can always use the httpd_util library to_lower or to_upper functions to put the two strings on an equal footing:

```8> string_metrics:levenshtein(httpd_util:to_lower("Adam"), "adam").
0
```