Thursday, 31 May 2012

Screen scraping with node.js and node.io

I'm integrating with a third party for one of my side projects. Funnily enough, they have API calls that return the information I want, provided I know the business ID. Of course, they don't offer the business ID via their API, so I have to go through the front-end interface to grab it.

I wanted something I could use with jQuery selectors, and I came across a very cool Node.js module called node.io. It is a screen scraper that fetches the page source and lets you query it using jQuery-like selectors.

Let's have a look at the code, shall we?

File: server.js

    var http = require('http');
    var URL = require('url');
    var nodeio = require('node.io');

    var port = 8081;

    // Options for the node.io job.
    var gigParkSearchJobOptions = {timeout : 10};

    var gigParkSearchJob = {
        input: false,
        run: function () {
            // The phone number is passed in as the first job argument.
            var phone = this.options.args[0];
            var searchUrl = "http://www.gigpark.com/?search=" + encodeURIComponent(phone) + "&commit=Search&city";
            this.getHtml(searchUrl, function (err, $) {
                // Stop the job on request or parsing errors.
                if (err) {
                    this.exit(err);
                    return;
                }
                // The bold text in the search-results heading holds the result count.
                var results = $('h3 .search-results b').text.toLowerCase();
                this.emit('Results found: ' + results);
            });
        }
    };

    /**
     * Main http serve method.
     */
    function serve() {
        var serv = http.createServer(function (req, res) {
            var url = URL.parse(req.url);
            if (req.method === 'GET' && url.pathname === '/gigpark_search') {
                // The query looks like this: 'q=1234567890', so we split it on '='.
                var phoneNum = (url.query || '').split('=')[1];
                if (!phoneNum) {
                    res.statusCode = 400;
                    res.end("Sorry, no phone number given.");
                    return;
                }
                console.log('query is: ' + phoneNum);
                nodeio.start(new nodeio.Job(gigParkSearchJobOptions, gigParkSearchJob), [phoneNum], function (err, output) {
                    if (err) {
                        res.statusCode = 500;
                        res.end("Sorry, the search failed.");
                        return;
                    }
                    console.log('result is: ' + output);
                    res.statusCode = 200;
                    res.end(JSON.stringify(output));
                }, true);
            } else {
                res.statusCode = 404;
                res.end("Sorry, page not found.");
            }
        });
        serv.listen(port);
    }

    console.log("Starting server...");
    serve();


Basically, this code creates an HTTP server with a path of /gigpark_search that takes a query parameter. We parse the parameter value and pass it as a parameter to a node.io job. The job then fetches the search page, parses the HTML, and gives me the value of $('h3 .search-results b').text, which is the number of results found.
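Splitting the query string on '=' is fine while q is the only parameter, but it gets fragile as soon as a second one shows up. As a rough sketch of an alternative, Node's url module can decode the query string into an object when you pass true as the second argument to parse. The extractPhone helper below is a hypothetical name of mine, not part of server.js.

```javascript
var URL = require('url');

// Hypothetical helper (not in the original server.js): pull the "q"
// parameter out of a request URL, or return null when it is missing.
function extractPhone(requestUrl) {
    // Passing true makes url.parse decode the query string into an object.
    var parsed = URL.parse(requestUrl, true);
    var q = parsed.query.q;
    // A repeated parameter (?q=1&q=2) arrives as an array; take the first.
    if (Array.isArray(q)) {
        q = q[0];
    }
    return (typeof q === 'string' && q.length > 0) ? q : null;
}

console.log(extractPhone('/gigpark_search?q=7788911331')); // prints 7788911331
```

With something like this, the request handler could answer with an error whenever the helper returns null, instead of searching for the string "undefined".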

So if one makes a query with this url: http://localhost:8081/gigpark_search?q=7788911331

One would see a response like this: ["Results found: 1"].

Of course, my next step is to turn it into a properly structured JSON response, but for illustration I just wanted to return some result.
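Here is a minimal sketch of what that next step might look like, assuming the job output keeps the ["Results found: N"] shape shown above. The toResultObject helper and the phone/count field names are my own placeholders, not anything node.io or GigPark defines.

```javascript
// Hypothetical helper: turn the job output, e.g. ["Results found: 1"],
// into a plain object that can be passed to JSON.stringify.
function toResultObject(phone, output) {
    // Use the first emitted string, or an empty string if nothing came back.
    var first = (output && output.length > 0) ? String(output[0]) : '';
    // Pick the number that follows the "Results found:" label, if any.
    var match = /results found:\s*(\d+)/i.exec(first);
    return {
        phone: phone,
        count: match ? parseInt(match[1], 10) : null
    };
}

console.log(JSON.stringify(toResultObject('7788911331', ['Results found: 1'])));
// prints {"phone":"7788911331","count":1}
```

The count comes back as null when the text can't be parsed, so a caller can tell "no number found" apart from a genuine zero.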
