How to write an RSS aggregator

Introduction

In the article What is RSS, they describe the various RSS formats and create a simple RSS aggregator written in Python. Inspired by this, I decided to do the same in Erlang.

In this example I have been using the OTP-R10B-3 release and the jerl start script from Jungerl. By using the Jungerl start script I automatically get www_tools in my path. This will probably also make it possible to run the example on older Erlang releases, since Jungerl contains xmerl from before it was added to OTP.

Getting information from an RSS feed.

Let us use the RSS feed at Slashdot in this example. The RSS info can be retrieved from the URL http://slashdot.org/index.rss. We make use of a function in the www_tools package to retrieve the file.

Code listing 1.1: Getting the RSS info

1> {ok,B} = url:raw_get_url("http://slashdot.org/index.rss", 5000).
{ok,<<60,63,120,...>>}

Parsing the XML content.

We continue by parsing the XML content of the retrieved file. This time we make use of xmerl.

Code listing 1.2: Parsing the XML content

2> {Doc,Misc} = xmerl_scan:string(binary_to_list(B)).
{#xmlElement{name = 'rdf:RDF',
             parents = [],
             pos = 1,
             attributes = [#xmlAttribute{name = 'xmlns:rdf',
                            parents = [],
                            pos = 1,
                            language = [],
                            expanded_name = [],
                            nsinfo = [],
                            namespace = {"xmlns","rdf"},
                            value = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"},
              #xmlAttribute{name = xmlns,
                            parents = [],
                            .....

Note: We have made use of the fantastic shell command rr/1 (as in rr(xmerl_scan)) before issuing the call to xmerl. This gives us the output in a nice record format.

Printing out the RSS information.

Now we have to extract the information we need from the parse tree. We will write some simple code to do this. But first, let us see what the result looks like.

Code listing 1.3: Printing out the RSS info

3> myxml:printItems(myxml:getElementsByTagName(Doc, item)).
title: United Kingdom Leads the World in TV Downloads
link: http://slashdot.org/article.pl?sid=05/02/18/0324238from=rss
description: SumDog writes "The UK is known for many things, great food, 
a wonderful climate and beautiful women. However, according to a story on the 
Guardian, a new study puts the UK ahead in one more category: it leads the 
world in TV piracy, accounting for 38.4% of the world's TV downloads, with 
Australia coming in second at 15.6% and the US in third at a pitiful 7.3%"
date: 2005-02-18T09:31:00+00:00
author: CowboyNeal

title: Skype-Ready Phones From Motorola
link: http://slashdot.org/article.pl?sid=05/02/18/0314225from=rss
description: Hack Jandy writes "Seamlessly integrating VoIP and GSM might 
not be a fantasy after all, as Motorola announced their decision to build cell 
phones and handsets that have Skype Internet Telephony integrated into the devices. 
Obviously, one could use Skype for outgoing calls near wi-fi hotspots (essentially 
free) but default on GSM for outgoing calls in areas that lack coverage."
date: 2005-02-18T06:09:00+00:00
author: CowboyNeal

title: London Nuke Plant Loses 30 Kilos of Plutonium
link: http://science.slashdot.org/article.pl?sid=05/02/18/0027246from=rss
...........

Our first function will extract all item elements. To do this we create a function getElementsByTagName/2 which takes the XML parse tree and the Tag that we want to find.

Code listing 1.4: getElementsByTagName/2

-include_lib("xmerl/include/xmerl.hrl").   %% record definitions for #xmlElement{} etc.

getElementsByTagName([H|T], Item) when H#xmlElement.name == Item ->
    [H | getElementsByTagName(T, Item)];
getElementsByTagName([H|T], Item) when is_record(H, xmlElement) ->
    getElementsByTagName(H#xmlElement.content, Item) ++
        getElementsByTagName(T, Item);
getElementsByTagName(X, Item) when is_record(X, xmlElement) ->
    getElementsByTagName(X#xmlElement.content, Item);
getElementsByTagName([_|T], Item) ->
    getElementsByTagName(T, Item);
getElementsByTagName([], _) ->
    [].
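
Nothing in this function is specific to item elements; it simply walks the whole parse tree and collects every element with the given name. As a hypothetical usage, the same call can pick out every title element in the feed, channel title included:

4> myxml:getElementsByTagName(Doc, title).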

Next we want to print each entry. The function printItems/1 walks through each item, extracts and prints the info we are interested in.

Code listing 1.5: printItems/1

printItems(Items) ->
    F = fun(Item) -> printItem(Item) end,
    lists:foreach(F, Items).

printItem(Item) ->
    io:format("title: ~s~n", [textOf(first(Item, title))]),
    io:format("link: ~s~n", [textOf(first(Item, link))]),
    io:format("description: ~s~n", [textOf(first(Item, description))]),
    io:format("date: ~s~n", [textOf(first(Item, 'dc:date'))]),
    io:format("author: ~s~n", [textOf(first(Item, 'dc:creator'))]),
    io:nl().

The last two functions to implement are first/2 and textOf/1.

Code listing 1.6: first/2 and textOf/1

first(Item, Tag) ->
    hd([X || X <- Item#xmlElement.content,
             X#xmlElement.name == Tag]).

textOf(Item) ->
    lists:flatten([X#xmlText.value || X <- Item#xmlElement.content,
                                      element(1,X) == xmlText]).
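
For completeness, here is how the pieces above could be collected into the myxml module that code listing 1.3 calls. This is just a minimal sketch: the module name matches the calls shown earlier, the export list is simply a guess at what you want to call from the shell, and the xmerl include provides the #xmlElement{} and #xmlText{} record definitions.

%% myxml.erl -- minimal module skeleton for the functions above.
-module(myxml).
-export([getElementsByTagName/2, printItems/1]).

%% Record definitions for #xmlElement{}, #xmlText{} and friends.
-include_lib("xmerl/include/xmerl.hrl").

%% getElementsByTagName/2, printItems/1, printItem/1,
%% first/2 and textOf/1 as given in code listings 1.4 - 1.6.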

The RSS aggregator.

To go from here to an RSS aggregator is easy. You just have to extend the code above with the functionality to retrieve info from several RSS feeds. You may also want to present the info in some other format, e.g. HTML via a Yaws page. This, however, is left as an exercise for the reader; a small sketch of the fetching part is given below as a starting point.
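
The sketch below only illustrates the fetching loop: the list of feed URLs is made up, it reuses url:raw_get_url/2 from www_tools and the myxml functions above, and error handling is deliberately crude.

%% Hypothetical aggregate/0: fetch a few feeds and print their items.
aggregate() ->
    Feeds = ["http://slashdot.org/index.rss",
             "http://www.erlang.org/rss.xml"],      % example URLs only
    lists:foreach(fun aggregate_one/1, Feeds).

aggregate_one(Url) ->
    case url:raw_get_url(Url, 5000) of
        {ok, B} ->
            {Doc, _Misc} = xmerl_scan:string(binary_to_list(B)),
            myxml:printItems(myxml:getElementsByTagName(Doc, item));
        Error ->
            io:format("failed to fetch ~s: ~p~n", [Url, Error])
    end.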

Download xml

howto_rss_aggregator.xml