Erlang Central

Difference between revisions of "How to write an RSS aggregator"

From ErlangCentral Wiki

m (Reverted edits by DydFkp (Talk); changed back to last version by Vladdu)
Revision as of 09:12, 18 April 2007

Author

Tobbe

How to write an RSS aggregator

Introduction

The article "What is RSS" describes the various RSS formats and builds a simple RSS aggregator in Python. Inspired by it, I decided to do the same in Erlang.

For this example I used the OTP R10B-3 release and the Jungerl start script, jerl. The Jungerl start script automatically puts www_tools in my code path. It should also make the example work in older Erlang releases, since Jungerl contains xmerl from before it was added to OTP.

Getting information from an RSS feed.

Let us use the RSS feed at Slashdot in this example. The RSS info can be retrieved from the URL http://rss.slashdot.org/Slashdot/slashdot. We use a function in the ibrowse package to retrieve the file.

Code listing 1.1: Getting the RSS info

1> {ok,_StatusCode,_Headers,B} = ibrowse:send_req("http://rss.slashdot.org/Slashdot/slashdot", [], get).
{ok,"200",
    [{"Age","2"},
     {"Transfer-Encoding","chunked"},
     {"Date","Thu, 07 Sep 2006 13:07:50 GMT"},
     {"Content-Type","text/xml;charset=utf-8"},
     {"Server",
      "Apache/2.0.54 (Debian GNU/Linux) mod_fastcgi/2.4.2 mod_jk/1.2.15"},
     {"Last-Modified","Thu, 07 Sep 2006 12:54:15 GMT"},
     {"ETag","MiaYBqfDcpuUu6jqri59Oyhorvc"},
     {"P3P","CP=\"ALL DSP COR NID CUR OUR NOR\""}],
    "<?xml version=\"1.0\" encoding..."}
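
The shell session above matches directly on a successful reply. In a module it may be safer to handle the other outcomes too. The following is a minimal sketch, assuming ibrowse:send_req/3 returns either {ok, Status, Headers, Body} with the status as a string, or {error, Reason}; the module name feed_fetch and the function body_of/1 are hypothetical names introduced only for illustration.

```erlang
-module(feed_fetch).
-export([body_of/1]).

%% Hypothetical helper: reduce the result of ibrowse:send_req/3
%% to {ok, Body} or {error, Reason}.
body_of({ok, "200", _Headers, Body}) ->
    {ok, Body};
body_of({ok, Status, _Headers, _Body}) ->
    %% Any other HTTP status is reported as an error.
    {error, {http_status, Status}};
body_of({error, Reason}) ->
    {error, Reason}.
```

It could be used as feed_fetch:body_of(ibrowse:send_req(Url, [], get)), which keeps the HTTP call itself unchanged.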

Parsing the XML content.

We continue by parsing the XML content of the retrieved file. This time we make use of xmerl.

Code listing 1.2: Parsing the XML content

2> {Doc,Misc} = xmerl_scan:string(B).
{#xmlElement{name = 'rdf:RDF',
             parents = [],
             pos = 1,
             attributes = [#xmlAttribute{name = 'xmlns:rdf',
                            parents = [],
                            pos = 1,
                            language = [],
                            expanded_name = [],
                            nsinfo = [],
                            namespace = {"xmlns","rdf"},
                            value = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"},
              #xmlAttribute{name = xmlns,
                            parents = [],
                            .....

Note: We used the shell command rr/1 (as in rr(xmerl_scan)) before issuing the call to xmerl. This gives us the output in a nice record format.
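
The rr/1 command only helps in the shell. Inside a module, the record definitions are usually obtained by including xmerl's header file instead. A minimal sketch, assuming the xmerl application is available in the code path; the module name feed_parse and the function root_name/1 are hypothetical:

```erlang
-module(feed_parse).
-export([root_name/1]).

%% Brings #xmlElement{}, #xmlText{} and friends into scope.
-include_lib("xmerl/include/xmerl.hrl").

%% Parse an XML string and return the name of its root element.
root_name(Xml) ->
    {Doc, _Rest} = xmerl_scan:string(Xml),
    Doc#xmlElement.name.
```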

Printing out the RSS information.

Now we have to extract the information we need from the parse tree. We will write some simple code to do this. But first, let us look at what the result will be.

Code listing 1.3: Printing out the RSS info

3> myxml:printItems(myxml:getElementsByTagName(Doc, item)).
title: United Kingdom Leads the World in TV Downloads
link: http://slashdot.org/article.pl?sid=05/02/18/0324238from=rss
description: SumDog writes "The UK is known for many things, great food, 
a wonderful climate and beautiful women. However, according to a story on the 
Guardian, a new study puts the UK ahead in one more category: it leads the 
world in TV piracy, accounting for 38.4% of the world's TV downloads, with 
Australia coming in second at 15.6% and the US in third at a pitiful 7.3%"
date: 2005-02-18T09:31:00+00:00
author: CowboyNeal

title: Skype-Ready Phones From Motorola
link: http://slashdot.org/article.pl?sid=05/02/18/0314225from=rss
description: Hack Jandy writes "Seamlessly integrating VoIP and GSM might 
not be a fantasy after all, as Motorola announced their decision to build cell 
phones and handsets that have Skype Internet Telephony integrated into the devices. 
Obviously, one could use Skype for outgoing calls near wi-fi hotspots (essentially 
free) but default on GSM for outgoing calls in areas that lack coverage."
date: 2005-02-18T06:09:00+00:00
author: CowboyNeal

title: London Nuke Plant Loses 30 Kilos of Plutonium
link: http://science.slashdot.org/article.pl?sid=05/02/18/0027246from=rss
...........

Our first function will extract all item elements. To do this we create a function getElementsByTagName/2, which takes the XML parse tree and the tag that we want to find.

Code listing 1.4: getElementsByTagName/2

%% Collect every element named Item from an xmerl parse tree.
getElementsByTagName([H|T], Item) when H#xmlElement.name == Item ->
    [H | getElementsByTagName(T, Item)];
getElementsByTagName([H|T], Item) when is_record(H, xmlElement) ->
    getElementsByTagName(H#xmlElement.content, Item) ++
      getElementsByTagName(T, Item);
%% A single element (such as the document root): search its children.
getElementsByTagName(X, Item) when is_record(X, xmlElement) ->
    getElementsByTagName(X#xmlElement.content, Item);
%% Skip text nodes and anything else that is not an element.
getElementsByTagName([_|T], Item) ->
    getElementsByTagName(T, Item);
getElementsByTagName([], _) ->
    [].
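
As an alternative to the hand-written tree walk, xmerl also ships an XPath module. The sketch below swaps in xmerl_xpath:string/2 for the same job; it is not the approach used in this article, and the module name feed_xpath and the function items/1 are hypothetical. It assumes the document was parsed with the default xmerl_scan options, so that element names are plain atoms such as item.

```erlang
-module(feed_xpath).
-export([items/1]).

%% Return all item elements anywhere below the document root.
items(Doc) ->
    xmerl_xpath:string("//item", Doc).
```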

Next we want to print each entry. The function printItems/1 walks through the items, extracting and printing the info we are interested in.

Code listing 1.5: printItems/1

printItems(Items) ->
    F = fun(Item) -> printItem(Item) end,
    lists:foreach(F, Items).

printItem(Item) ->
    io:format("title: ~s~n", [textOf(first(Item, title))]),
    io:format("link: ~s~n", [textOf(first(Item, link))]),
    io:format("description: ~s~n", [textOf(first(Item, description))]),
    io:format("date: ~s~n", [textOf(first(Item, 'dc:date'))]),
    io:format("author: ~s~n", [textOf(first(Item, 'dc:creator'))]),
    io:nl().

The last two functions to implement are first/2 and textOf/1.

Code listing 1.6: first/2 and textOf/1

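%% First child element of Item named Tag (fails if there is none).
first(Item, Tag) ->
    hd([X || X <- Item#xmlElement.content,
             X#xmlElement.name == Tag]).

%% Concatenated text of Item's direct text children.
textOf(Item) ->
    lists:flatten([X#xmlText.value || X <- Item#xmlElement.content,
                                      element(1,X) == xmlText]).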
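
Note that first/2 calls hd/1, which fails on an empty list, so an item that lacks one of the expected tags (a feed without dc:creator, say) would crash printItem/1. The following is a minimal sketch of a more forgiving variant that returns a caller-supplied default instead; the module name feed_safe and the function text_of_first/3 are hypothetical and not part of the original code.

```erlang
-module(feed_safe).
-export([text_of_first/3]).

-include_lib("xmerl/include/xmerl.hrl").

%% Text of the first child element named Tag, or Default if
%% Item has no such child.
text_of_first(Item, Tag, Default) ->
    case [X || X <- Item#xmlElement.content,
               is_record(X, xmlElement),
               X#xmlElement.name == Tag] of
        [First | _] ->
            lists:flatten([T#xmlText.value
                           || T <- First#xmlElement.content,
                              is_record(T, xmlText)]);
        [] ->
            Default
    end.
```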

The RSS aggregator.

Going from here to an RSS aggregator is easy. You just have to extend the code above to retrieve info from several RSS feeds. You may also want to present the info in some other format, e.g. HTML via a Yaws page. This, however, is left as an exercise for the reader.
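
One possible starting point for that exercise is sketched below. It is only an illustration under stated assumptions: the module name feed_aggregate and the function titles/2 are hypothetical, the fetching step is passed in as a fun returning {ok, XmlString} or {error, Reason} so that any HTTP client can be plugged in, and titles are extracted with xmerl_xpath rather than with the functions shown earlier.

```erlang
-module(feed_aggregate).
-export([titles/2]).

-include_lib("xmerl/include/xmerl.hrl").

%% Fetch is a fun(Url) -> {ok, XmlString} | {error, Reason}.
%% Returns a list of {Url, Title} pairs across all feeds;
%% feeds that could not be fetched contribute nothing.
titles(Urls, Fetch) ->
    lists:flatmap(fun(Url) -> feed_titles(Url, Fetch(Url)) end, Urls).

feed_titles(Url, {ok, Xml}) ->
    {Doc, _Rest} = xmerl_scan:string(Xml),
    [{Url, T#xmlText.value}
     || T <- xmerl_xpath:string("//item/title/text()", Doc)];
feed_titles(_Url, {error, _Reason}) ->
    [].
```

Keeping the fetch step as a parameter also makes the function easy to try out in the shell with canned XML strings before pointing it at real feeds.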

Download xml

howto_rss_aggregator.xml (http://wiki.trapexit.erlang-consulting.com/upload/howto/howto_rss_aggregator.xml)