Erlang Central

String Similarity (Levenshtein)

Revision as of 16:14, 27 October 2006 by 213.171.204.166 (Talk)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Problem

You need to compare two strings and get an index of how similar they are.

Solution

This module implements the Levenshtein edit distance algorithm (described more here).

%%%=============================================================================
%%% @author Adam Lindberg <adam@erlang-consulting.com>
%%%			Fredrik Svensson
%%% @doc This module contains string metric operations.
%%%
%%% Currently only the function {@link levenshtein/2}.
%%% @end
%%%=============================================================================
-module(string_metrics).

-export([levenshtein/2]).

%%------------------------------------------------------------------------------
%% @spec levenshtein(StringA, StringB) -> integer()
%%		StringA = StringB = string()
%% @doc Calculates the Levenshtein distance between two strings
%% @end
%%------------------------------------------------------------------------------
levenshtein(Samestring, Samestring) -> 0;
levenshtein(String, []) -> length(String);
levenshtein([], String) -> length(String);
levenshtein(Source, Target) ->
	levenshtein_rec(Source, Target, lists:seq(0, length(Target)), 1).

%% Recurses over every character in the source string and calculates a list of distances
levenshtein_rec([SrcHead|SrcTail], Target, DistList, Step) ->
	levenshtein_rec(SrcTail, Target, levenshtein_distlist(Target, DistList, SrcHead, [Step], Step), Step + 1);
levenshtein_rec([], _, DistList, _) ->
	lists:last(DistList).

%% Generates a distance list with distance values for every character in the target string
levenshtein_distlist([TargetHead|TargetTail], [DLH|DLT], SourceChar, NewDistList, LastDist) when length(DLT) > 0 ->
	Min = lists:min([LastDist + 1, hd(DLT) + 1, DLH + dif(TargetHead, SourceChar)]),
	levenshtein_distlist(TargetTail, DLT, SourceChar, NewDistList ++ [Min], Min);
levenshtein_distlist([], _, _, NewDistList, _) ->
	NewDistList.

% Calculates the difference between two characters or other values
dif(C, C) -> 0;
dif(_, _) -> 1.

You can use the levenshtein function to compare two strings.

2> string_metrics:levenshtein("Aloha!", "Alhoa!").
2
3> string_metrics:levenshtein("adam", "Adam").
1
4> string_metrics:levenshtein("adam", "Assam").
3
5> string_metrics:levenshtein("teh", "the").
2
6> string_metrics:levenshtein("the", "the").
0
7> string_metrics:levenshtein("the", "").
3

Note that the function is not case insensitive (and the algorithm isn't either), though you can always use the httpd_util library to_lower or to_upper functions to put the two strings on an equal footing:

8> string_metrics:levenshtein(httpd_util:to_lower("Adam"), "adam").
0