
Dart Crawler Example

At the Dart hackathon I got a few questions about writing applications for the server. The best answer was to give the hackers a code sample… It is deliberately very simple code, but I’m sure that you can take it to the next level without any problem.


#import('dart:io');
#import('dart:uri');
#import('dart:json');

// Dart Hackathon TLV 2012
//
// A simple example that fetches an RSS/JSON feed and parses it on the server side.
// This is a good start for a crawler that fetches info and parses it.
//
// Author: Ido Green | greenido.wordpress.com
// Date: 28/4/2012
//
class Crawler {
  String _urlToFetch = "http://feeds.feedburner.com/html5rocks";
  String _dataFileName = "webPageData.json";
  HttpClient _client;
  var rssItems;

  // Ctor.
  Crawler() {
    _client = new HttpClient();
  }

  // Fetch the page and save the data locally
  // in a file so we can process it later.
  fetchWebPage() {
    // Get all the updates of html5rocks (h5r)
    Uri pipeUrl = new Uri.fromString(_urlToFetch);
    // Open a GET connection to fetch this data
    var conn = _client.getUrl(pipeUrl);
    conn.onRequest = (HttpClientRequest request) {
      // A GET has no body, so close the request right away
      request.outputStream.close();
    };
    conn.onResponse = (HttpClientResponse response) {
      // statusCode is an int, so interpolate it instead of using '+'
      print("status code: ${response.statusCode}");
      var output = new File(_dataFileName).openOutputStream();
      response.inputStream.pipe(output);
      // In case you want to print the data to your console:
      // response.inputStream.pipe(stdout);
    };
  }

  // Read the saved file and pass its content to parsePage().
  readFile() {
    File file = new File(_dataFileName);
    if (!file.existsSync()) {
      print("Err: Could not find: " + _dataFileName);
      return;
    }
    InputStream file_stream = file.openInputStream();
    StringInputStream lines = new StringInputStream(file_stream);
    String data = "";
    lines.onLine = () {
      String line;
      while ((line = lines.readLine()) != null) {
        // print("== " + line);
        data += line;
      }
    };
    lines.onClosed = () {
      print("Got to the end of: " + _dataFileName);
      print("This is our file content:\n" + data);
      parsePage(data);
    };
  }

  //
  // Basic (real basic) parsing
  //
  parsePage(data) {
    // Cut out the interesting part of the feed
    int start = data.indexOf("<title>");
    int end = data.lastIndexOf("</channel>");
    // indexOf/lastIndexOf return -1 when the marker is missing
    if (start == -1 || end == -1 || end < start) {
      print("Err: Could not find the feed content");
      return;
    }
    var feed = data.substring(start, end);
    // Put the items in an array
    rssItems = feed.split("<title>");
    for (var item in rssItems) {
      print("\n** Item: " + item);
    }
  }
} // End of class

//
// Start the party
//
void main() {
  Crawler crawler = new Crawler();
  // Note: fetchWebPage() is asynchronous, so readFile() may run before the
  // download has finished. On a first run the file may not exist yet.
  crawler.fetchWebPage();
  crawler.readFile();
}
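
One caveat with the listing above: fetchWebPage() returns before the download has finished, so readFile() can run against a missing or partial file. The 2012 dart:io API used there has also changed since. As a rough sketch only, assuming a current null-safe Dart SDK, the same fetch, save, read, parse flow could be sequenced with async/await. The helper names fetchToFile and extractTitles are my own, not part of the original code, and the regular expression is just a stand-in for a real XML parser.

```dart
import 'dart:io';

/// Downloads [url] into [target] and completes once the file is written.
/// (Hypothetical helper, not part of the original Crawler class.)
Future<int> fetchToFile(HttpClient client, Uri url, File target) async {
  final request = await client.getUrl(url);
  final response = await request.close();
  // pipe() completes only after the file sink has been closed.
  await response.pipe(target.openWrite());
  return response.statusCode;
}

/// Pulls the text of every <title> element out of the raw feed.
/// (Hypothetical helper; a regex is not a substitute for an XML parser.)
List<String> extractTitles(String feed) {
  final titlePattern = RegExp(r'<title[^>]*>(.*?)</title>', dotAll: true);
  return [
    for (final match in titlePattern.allMatches(feed)) match.group(1)!.trim(),
  ];
}

Future<void> main() async {
  final url = Uri.parse('http://feeds.feedburner.com/html5rocks');
  final dataFile = File('webPageData.json');
  final client = HttpClient();
  try {
    // Each step waits for the previous one, so the file is complete
    // before it is read back.
    final status = await fetchToFile(client, url, dataFile);
    print('status code: $status');
    final data = await dataFile.readAsString();
    for (final title in extractTitles(data)) {
      print('** Item: $title');
    }
  } finally {
    client.close();
  }
}
```

The important part is the order of the awaits in main(): the read starts only after the write has completed.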

The Crawler class above could be considered version 0.01 of a real crawler. A real first version would need features like:

  • Discovery – Be able to get links from the current page and jump into them. This is much harder than it sounds, as you want to make sure it won’t continue forever.
  • Parsing – Parse the information on the page. Try to extract the metadata and add it to the ‘real’ content (which depends on your goals for the crawler).
  • Analysis – Normalize the information from the page and put it in storage (a DB, a file, a cloud solution, etc.).
  • Logging & Monitoring – As this server-side process will run while you are sleeping, it’s best to have a good ‘watchdog’ on it. Start with some simple logging and analysis of the logs. The second step is to use a tool to monitor the action.
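
To make the ‘won’t continue forever’ part of discovery concrete, here is a minimal sketch in current Dart. It assumes two simple guards: a set of visited URLs and limits on depth and page count. The names extractLinks and discover, the fetch callback, and the example.com pages are all hypothetical; a real crawler would plug in an HTTP fetch and a proper HTML parser.

```dart
import 'dart:collection';

/// Extracts http(s) links from href attributes, resolved against [base].
List<Uri> extractLinks(Uri base, String html) {
  final hrefPattern =
      RegExp(r'''href\s*=\s*["']([^"'#]+)''', caseSensitive: false);
  final links = <Uri>[];
  for (final match in hrefPattern.allMatches(html)) {
    final parsed = Uri.tryParse(match.group(1)!);
    if (parsed == null) continue; // skip malformed references
    final resolved = base.resolveUri(parsed);
    if (resolved.scheme == 'http' || resolved.scheme == 'https') {
      links.add(resolved);
    }
  }
  return links;
}

/// Breadth-first discovery bounded by [maxDepth] and [maxPages].
/// [fetch] is a hypothetical loader that returns the page HTML or null.
Future<List<Uri>> discover(
  Uri start,
  Future<String?> Function(Uri) fetch, {
  int maxDepth = 2,
  int maxPages = 50,
}) async {
  final visited = <Uri>{start};
  final queue = Queue<MapEntry<Uri, int>>()..add(MapEntry(start, 0));
  final fetched = <Uri>[];
  while (queue.isNotEmpty && fetched.length < maxPages) {
    final entry = queue.removeFirst();
    final html = await fetch(entry.key);
    if (html == null) continue;
    fetched.add(entry.key);
    if (entry.value >= maxDepth) continue; // do not go deeper
    for (final link in extractLinks(entry.key, html)) {
      // Set.add returns false for URLs that were already queued.
      if (visited.add(link)) queue.add(MapEntry(link, entry.value + 1));
    }
  }
  return fetched;
}

Future<void> main() async {
  // In-memory stand-in for a small site whose pages link in a cycle.
  final site = <String, String>{
    'http://example.com/': '<a href="/a">A</a> <a href="/b">B</a>',
    'http://example.com/a': '<a href="/">home</a> <a href="/b">B</a>',
    'http://example.com/b': '<a href="/a">A</a>',
  };
  final pages = await discover(
    Uri.parse('http://example.com/'),
    (uri) async => site[uri.toString()],
  );
  print(pages);
}
```

Even though the three pages link to each other in a loop, each one is fetched only once, and the depth and page limits cap the work on a site that never runs out of new links.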

Key lessons:

  • There is a real need for libraries that will make the parsing better: XPath, DOM to Map (or Array), etc.
  • The debugging in the editor could be improved… As a first step you might want to use a logging library that will give you a lot of information for each step.
  • The editor makes the development phase very pleasant, with warnings on (almost) every mistake that you might make. I found it very productive to be back in the good hands of an ‘IDE’.
  • I guess that in the near future we will see some good examples that use the Dart VM on the server. It’s going to be interesting to profile their performance and see where we stand vis-à-vis other modern languages like Scala.
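
As an illustration of the ‘DOM to Map’ idea from the first lesson, here is a small sketch in current Dart that turns each feed item into a Map of tag name to text. It is regex-based and makes simplifying assumptions (flat child elements, no namespaced tags, no CDATA sections), so treat it as a placeholder for a real XML library. The function name itemsToMaps and the sample feed are hypothetical.

```dart
/// Turns each <item>...</item> block into a map of child tag name -> text.
/// Handles only simple, flat child elements such as <title> or <link>.
List<Map<String, String>> itemsToMaps(String feed) {
  final itemPattern = RegExp(r'<item\b[^>]*>(.*?)</item>', dotAll: true);
  // \1 makes the closing tag match the opening tag's name.
  final fieldPattern = RegExp(r'<(\w+)\b[^>]*>(.*?)</\1>', dotAll: true);
  final items = <Map<String, String>>[];
  for (final item in itemPattern.allMatches(feed)) {
    final fields = <String, String>{};
    for (final field in fieldPattern.allMatches(item.group(1)!)) {
      fields[field.group(1)!] = field.group(2)!.trim();
    }
    items.add(fields);
  }
  return items;
}

void main() {
  // Made-up sample feed, used only for illustration.
  const feed = '''
<channel>
  <title>Sample feed</title>
  <item><title>First post</title><link>http://example.com/1</link></item>
  <item><title>Second post</title><link>http://example.com/2</link></item>
</channel>''';
  for (final item in itemsToMaps(feed)) {
    print('${item['title']} -> ${item['link']}');
  }
}
```

Once the items are plain maps, storing or normalizing them becomes ordinary collection code rather than string slicing.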
