Hacking RSS: Filtering & Processing Obscene Amounts of Information

Loose notes from SXSW 2011 session Hacking RSS: Filtering & Processing Obscene Amounts of Information

Information overload is less about having too much information and more about not having the right tools and techniques to filter and process information to find the pieces that are most relevant for you. This presentation will focus on showing you a variety of tips and techniques to get you started down the path of looking at RSS feeds in a completely different light. The default RSS feeds generated by your favorite blog or website are just a starting point waiting to be hacked and manipulated to serve your needs. Most people read RSS feeds, but few people take the time to go one step further to hack on those RSS feeds to find only the most interesting posts. I combine tools like Yahoo Pipes, BackTweets, PostRank and more with some simple API calls to be able to find what I need while automatically discarding the rest. You start with one or more RSS feeds and then feed those results into other services to gather more information that can be used to further filter or process the results. This process is easier than it sounds once you learn a few simple tools and techniques, and no “real” programming experience is required to get started. This session will show you some tips and tricks to get you started down the path of hacking your RSS feeds.

Dawn Foster
MeeGo Community Mgr

Slides at http://fastwonderblog.com/

295 Exabytes of data in the world in 2007. Amount doubles every 3 yrs 4 mos.

600+ exabytes today (1 exabyte = 1,073,741,824 gigabytes)
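
A quick sanity check of those numbers, as a rough sketch. It assumes steady doubling every 40 months and about four years between the 2007 figure and the session (both simplifications that are not from the talk), and it reads the gigabyte conversion as the binary one.

```python
DOUBLING_MONTHS = 40   # 3 yrs 4 mos, from the notes
START_EXABYTES = 295   # 2007 figure, from the notes


def projected_exabytes(months_elapsed):
    """Exponential growth: start * 2 ** (elapsed / doubling period)."""
    return START_EXABYTES * 2 ** (months_elapsed / DOUBLING_MONTHS)


# Assumption: roughly 48 months between the 2007 figure and early 2011.
estimate = projected_exabytes(48)
print(round(estimate))

# The conversion quoted in the notes is 2 ** 30 gigabytes per exabyte.
GIGABYTES_PER_EXABYTE = 2 ** 30
print(f"{GIGABYTES_PER_EXABYTE:,}")
```

Under those assumptions the projection lands in the same range as the "600+" figure.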

A lot of that data is crap / irrelevant to you. The key is to find that little bit of data that you care about – find that needle in the haystack.

RSS is great, but even within your chosen feeds, most of it you don’t care about.
– Do you care about everything in each feed?
– What about the feeds you aren’t subscribed to?
– Can you keep up with what you have?

Prioritize your reader
– Put things you care about at the top
– Categorize
– Don’t try to read everything

Outsource / Crowdsource new sources


tweetedtimes.com (used to be twittertimes) takes the links posted by people you follow and displays them in a newspaper-like format.

TechMeme also excellent.
Also: google news, stumbleupon, reddit

The real magic is in filtering RSS. From broad to narrow:
– Complete crap
– Interesting
– Maybe relevant
– Yay!

Filter for analyst research on whatever you’re working on.

Yahoo Pipes can filter on any data – keyword, geolocation, url, title, description, author, etc. Downside: Can be a bit flaky. Takes time to learn.

Other options: FeedRinse. Readers like FeedDemon with their own filtering built in. Or you can code your own. Some of the smaller ones have gone out of business – this is bandwidth-intensive, resource-intensive stuff.

Dawn has several Yahoo Pipes videos on her blog – good way to learn.

Working with Pipes:

Input (e.g.):
– WebWorker Daily
– ReadWriteWeb

Filter by content (e.g.):
– Collaborate
– Collaboration
– Collaborative

Output:
– 1 RSS feed
– Matching three keywords
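
The same keyword filter can be sketched outside Pipes. This is a minimal illustration using only Python's standard library, not what the Pipes filter module does internally. The sample feed is made up, and a real version would download the feeds first.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed standing in for the merged input feeds.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>Five tools for remote collaboration</title>
<description>Wikis and chat.</description></item>
<item><title>New phone review</title>
<description>Battery life is fine.</description></item>
<item><title>Working together</title>
<description>A collaborative editing roundup.</description></item>
</channel></rss>"""

KEYWORDS = ("collaborate", "collaboration", "collaborative")


def filter_items(feed_xml, keywords=KEYWORDS):
    """Return titles of items whose title or description contains a keyword."""
    root = ET.fromstring(feed_xml)
    matches = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        text = (title + " " + description).lower()
        if any(keyword in text for keyword in keywords):
            matches.append(title)
    return matches


print(filter_items(SAMPLE_FEED))
```

Matching is a plain case-insensitive substring test, so "collaborat" alone would cover all three keywords.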

PostRank:
– Finds best posts in a feed
– Ranked on engagement (linked to, sharing, comments)
– Can get output as RSS feed
– Feed includes postrank number as a field

What’s in a Feed? Content in feeds varies wildly depending on site.
– Common: Title, author, pubdate, link, content, description.
– Site specific: Postrank, lat/long, image links, username, twitter source (most RSS readers don’t show all this data – you have to inspect it)
– APIs usually have additional data and can be output as RSS
– If it’s in the feed you can use it (even if not displayed)
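
One way to inspect what a feed really carries is to dump every child element of each item, including the namespaced ones that readers tend to hide. A small sketch with a made-up item; the `geo` and `dc` namespaces are common feed extensions, but which fields a given site includes will vary.

```python
import xml.etree.ElementTree as ET

# Made-up item carrying both common and site-specific fields.
SAMPLE = """<rss version="2.0"
  xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel><item>
<title>Coffee shop review</title>
<link>http://example.com/coffee</link>
<dc:creator>alice</dc:creator>
<geo:lat>45.52</geo:lat>
<geo:long>-122.68</geo:long>
</item></channel></rss>"""


def item_fields(feed_xml):
    """Return one dict per item, mapping every child tag to its text."""
    root = ET.fromstring(feed_xml)
    return [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root.iter("item")
    ]


for fields in item_fields(SAMPLE):
    for tag, value in fields.items():
        print(tag, "=", value)
```

Namespaced tags print in ElementTree's `{namespace-uri}name` form, which is also the key to use when filtering on them.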

When building a Yahoo Pipe to digest RSS feeds, try sorting it by its PostRank rating, to get the critical stuff at the top of the feed.
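
Sorting on a score field is simple to reproduce in code. In this sketch the items are already parsed into dicts, and the `postrank` key name is an assumption for illustration; check the actual feed for the real field name.

```python
def sort_by_rank(items, field="postrank"):
    """Return items sorted by a numeric field, highest first.

    Items with a missing or non-numeric value are treated as rank 0.
    """
    def rank(item):
        try:
            return float(item.get(field, 0))
        except (TypeError, ValueError):
            return 0.0

    return sorted(items, key=rank, reverse=True)


# Made-up items; "postrank" is a hypothetical key name.
posts = [
    {"title": "Quiet post", "postrank": "2.1"},
    {"title": "Hot post", "postrank": "9.7"},
    {"title": "Unscored post"},
]

for post in sort_by_rank(posts):
    print(post["title"])
```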

Don’t be satisfied with the default RSS feed formats!
If you do a Twitter search, you get the author link, author pic, etc. But the RSS feed for the same search results lays that data out differently. So you can build a Yahoo Pipe to reformat a Twitter search feed however you like.

You could put the number of followers of each contributor right in front of the title of each line of output.
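
As a sketch of that kind of reformatting, here is a title builder that puts the follower count first. The dict keys (`followers`, `user`, `text`) are hypothetical names for data already pulled out of the feed or API response.

```python
def title_with_followers(tweet):
    """Build a feed title with the author's follower count in front."""
    return "[{:,}] @{}: {}".format(
        tweet["followers"], tweet["user"], tweet["text"]
    )


# Made-up tweet data for illustration.
print(title_with_followers(
    {"followers": 2500, "user": "alice", "text": "Nice writeup"}
))
```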

API calls are basically URLs, and you can construct the URLs to include the data you want. See the API docs for how. Watch out for rate limiting: a pipe that makes a lot of API calls may take a while to fill in, but most of the time you can just sit back and relax, check back later and you’ll have a fully populated, customized feed.
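
A minimal sketch of both ideas: building an API URL from parameters, and spacing out requests to stay under a rate limit. The endpoint and parameter names are placeholders rather than any real service's API, and the fetch function is passed in so the pacing logic can be tried without network access.

```python
import time
from urllib.parse import urlencode


def build_api_url(base, **params):
    """Append URL-encoded query parameters (sorted by name) to a base URL."""
    return base + "?" + urlencode(sorted(params.items()))


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between calls.

    The pause is a crude stand-in for respecting a rate limit.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Placeholder endpoint and parameter names, for illustration only.
url = build_api_url("http://api.example.com/search.rss", q="yahoo pipes", rpp=50)
print(url)
```

The right delay depends on the service; the API docs usually state the limit.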

BackType API / BackTweets
Gives you data about any link posted to Twitter regardless of the shortening service used. No RSS feeds, but you can use their API to build your own with Yahoo Pipes. If you want to see what people are saying about your blog, you want to find any tweet mentioning your URL. So we’ll use both the Twitter API and the BackTweets API. The BackType API provides the tweet ID from Twitter (not useful to a human on its own), so a second call to the Twitter API is needed. To do:

– Take WebWorker Daily author feed
– Use WWD URLs to build URLs for BackType API call
– Fetch data from BackType URLs to get Tweet ID
– Use BackType tweet ID to build URL for Twitter API
– Fetch data about Tweet and user from twitter API
– Output
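
The chain of steps above can be sketched as two nested lookups. Everything service-specific here is a stand-in: the URL templates and the response field names are hypothetical, not the real BackType or Twitter APIs, and the fetch function is injected so the example runs against canned data.

```python
from urllib.parse import quote

# Hypothetical URL templates; the real endpoints looked different.
LINK_API = "http://links.example.com/tweets.json?url={url}"
TWEET_API = "http://tweets.example.com/show/{tweet_id}.json"


def tweets_about(post_urls, fetch_json):
    """For each post URL, look up tweet IDs, then look up each tweet."""
    rows = []
    for post_url in post_urls:
        lookup = LINK_API.format(url=quote(post_url, safe=""))
        for mention in fetch_json(lookup):
            tweet = fetch_json(TWEET_API.format(tweet_id=mention["tweet_id"]))
            rows.append({
                "post": post_url,
                "user": tweet["user"],
                "followers": tweet["followers"],
                "text": tweet["text"],
            })
    return rows


# Canned responses standing in for the two services.
POST = "http://example.com/post-1"
FAKE_RESPONSES = {
    LINK_API.format(url=quote(POST, safe="")): [{"tweet_id": 101}],
    TWEET_API.format(tweet_id=101): {
        "user": "alice", "followers": 2500, "text": "Great post",
    },
}

rows = tweets_about([POST], FAKE_RESPONSES.__getitem__)
print(rows)
```

Each post costs one call plus one call per tweet, which is where rate limits start to matter.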

You could also add an additional filter to only show tweets from people with more than 1,000 followers. This should increase relevancy.
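
In code, that extra filter is a one-liner. The `followers` key is a hypothetical field name for data already fetched from the API.

```python
def min_followers(tweets, threshold=1000):
    """Keep only tweets whose author has more than `threshold` followers."""
    return [t for t in tweets if t.get("followers", 0) > threshold]


# Made-up tweet data for illustration.
sample = [
    {"user": "alice", "followers": 2500},
    {"user": "bob", "followers": 40},
    {"user": "carol", "followers": 1000},
]

print(min_followers(sample))
```

Note the comparison is strict, so exactly 1,000 followers does not pass.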

How Should / Shouldn’t You Use All of This?

Should:
– Personal productivity
– Play around and understand the possibilities
– Create prototypes for something you might want to build

Shouldn’t:
– Use in critical or production environments – Yahoo Pipes goes DOWN

Instead, all of this stuff can be done in any programming language – DIY, with cached results.
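
A rough sketch of what "DIY, with cached results" could look like: a small in-memory cache that reuses recent responses and falls back to a stale copy when the source is down. The class name, the 15-minute default, and the injected `fetch` and `clock` functions are all choices made for this illustration.

```python
import time


class CachedFetcher:
    """Wrap a fetch function with a time-limited in-memory cache."""

    def __init__(self, fetch, max_age_seconds=900, clock=time.time):
        self._fetch = fetch
        self._max_age = max_age_seconds
        self._clock = clock
        self._cache = {}  # url -> (fetched_at, body)

    def get(self, url):
        now = self._clock()
        hit = self._cache.get(url)
        if hit is not None and now - hit[0] < self._max_age:
            return hit[1]  # fresh enough, skip the network
        try:
            body = self._fetch(url)
        except Exception:
            if hit is not None:
                return hit[1]  # source is down: serve the stale copy
            raise
        self._cache[url] = (now, body)
        return body


# Demo with a fake fetch function and a controllable clock.
calls = []
fake_time = [0]


def fake_fetch(url):
    calls.append(url)
    return "body of " + url


fetcher = CachedFetcher(fake_fetch, max_age_seconds=10, clock=lambda: fake_time[0])
fetcher.get("http://example.com/feed")
fetcher.get("http://example.com/feed")  # served from cache
print(len(calls))
```

A production version would likely persist the cache to disk, but the idea is the same.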

YQL – Yahoo Query Language – Very powerful custom query language for the web.

You can upload a CSV file full of keywords and use that as search input to Yahoo Pipes.
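
For a DIY version, reading keywords from a CSV is a few lines with the standard library. This sketch reads from a string to stay self-contained; the sample keywords are made up, and a real script would open a file instead.

```python
import csv
import io


def load_keywords(csv_text):
    """Read keywords from CSV text: lowercased, trimmed, duplicates dropped."""
    keywords = []
    for row in csv.reader(io.StringIO(csv_text)):
        for cell in row:
            word = cell.strip().lower()
            if word and word not in keywords:
                keywords.append(word)
    return keywords


# Made-up CSV content for illustration.
print(load_keywords("Collaborate,Collaboration\ncollaborative, collaborate\n"))
```

The resulting list can be fed straight into a keyword filter.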
