Erlang Central

# String Similarity (Levenshtein)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

## Problem

You need to compare two strings and get an index of how similar they are.

## Solution

This module implements the Levenshtein edit distance algorithm (described more here).

```%%%=============================================================================
%%%			Fredrik Svensson
%%% @doc This module contains string metric operations.
%%%
%%% Currently only the function {@link levenshtein/2}.
%%% @end
%%%=============================================================================
-module(string_metrics).

-export([levenshtein/2]).

%%------------------------------------------------------------------------------
%% @spec levenshtein(StringA, StringB) -> integer()
%%		StringA = StringB = string()
%% @doc Calculates the Levenshtein distance between two strings
%% @end
%%------------------------------------------------------------------------------
levenshtein(Samestring, Samestring) -> 0;
levenshtein(String, []) -> length(String);
levenshtein([], String) -> length(String);
levenshtein(Source, Target) ->
levenshtein_rec(Source, Target, lists:seq(0, length(Target)), 1).

%% Recurses over every character in the source string and calculates a list of distances
levenshtein_rec(SrcTail, Target, levenshtein_distlist(Target, DistList, SrcHead, [Step], Step), Step + 1);
levenshtein_rec([], _, DistList, _) ->
lists:last(DistList).

%% Generates a distance list with distance values for every character in the target string
levenshtein_distlist([TargetHead|TargetTail], [DLH|DLT], SourceChar, NewDistList, LastDist) when length(DLT) > 0 ->
Min = lists:min([LastDist + 1, hd(DLT) + 1, DLH + dif(TargetHead, SourceChar)]),
levenshtein_distlist(TargetTail, DLT, SourceChar, NewDistList ++ [Min], Min);
levenshtein_distlist([], _, _, NewDistList, _) ->
NewDistList.

% Calculates the difference between two characters or other values
dif(C, C) -> 0;
dif(_, _) -> 1.
```

You can use the levenshtein function to compare two strings.

```2> string_metrics:levenshtein("Aloha!", "Alhoa!").
2
1
3
5> string_metrics:levenshtein("teh", "the").
2
6> string_metrics:levenshtein("the", "the").
0
7> string_metrics:levenshtein("the", "").
3
```

Note that the function is not case insensitive (and the algorithm isn't either), though you can always use the httpd_util library to_lower or to_upper functions to put the two strings on an equal footing:

```8> string_metrics:levenshtein(httpd_util:to_lower("Adam"), "adam").
0
```