In this post we’ll look at scraping websites on a regular schedule using Azure WebJobs and Node.js (with the jQuery-like cheerio library) and returning the data as JSON.
In the example below we’ll grab, once a week, the apps that are on sale as part of the “Red Stripe Deals” in the Windows Store. The resulting app details will be written to a JSON file that can be consumed, for example, by a mobile app.
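To give an idea of the end result, a single entry in the generated JSON could look roughly like this (the field names match the parser we’ll build below; the values are just placeholders):

{
    "title": "Example App",
    "image": "http://example.com/logo.png",
    "description": "A short description of the app.",
    "price": "$1.49",
    "ratingValue": "4.5",
    "ratingCount": "1234",
    "packageSize": "12.3 MB",
    "publisher": "Example Publisher"
}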
Crawling & Scraping: be responsible
There’s already enough content out there about the ethics of scraping data from the web, so I won’t get into that here. However, we really do not want to break the site we’re getting the data from. Especially when using powerful tools like Azure we should make sure not to DoS the server.
That’s why we use Async.js to throttle our requests and the zlib library (which is part of Node.js) to request gzip-compressed pages, keeping both bandwidth usage and the load on the target server low.
Tools
First things first
To create a new WebJob we first need an Azure Web App for it to run in. The free F1 tier will do just fine, especially for testing purposes.
Create a Node.js project
To create a new project we’ll create an empty folder that will eventually contain all the necessary files and which we’ll later upload to Azure as a .zip archive.
Inside the new folder open the command prompt (Shift + right click inside the folder) and install the required Node.js packages through npm.
npm install request
request is an HTTP request client we use to download the web pages.
npm install cheerio
cheerio implements a subset of jQuery for Node.js. Using it we can easily extract data from the DOM just as we would in regular jQuery.
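As a quick illustration (the markup here is made up), cheerio loads an HTML string and then lets us query it with familiar jQuery selectors:

var cheerio = require('cheerio');

// Load some (made-up) markup and query it like jQuery.
var $ = cheerio.load('<ul><li class="app">Foo</li><li class="app">Bar</li></ul>');
$('li.app').each(function () {
    console.log($(this).text()); // "Foo", "Bar"
});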
npm install async
async will help us throttle requests by using its queue function.
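Here’s a minimal sketch of how such a queue behaves: with a concurrency of 1 the worker function handles one task at a time, and drain fires once the queue is empty (drain is assigned as a callback here, matching the async version used throughout this post; the URLs are placeholders):

var async = require('async');

// The worker processes one task at a time because concurrency is set to 1.
var q = async.queue(function (task, callback) {
    console.log('processing ' + task);
    setTimeout(callback, 1000); // stand-in for a slow HTTP request
}, 1);

q.drain = function () { console.log('all tasks done'); };

q.push(['http://example.com/a', 'http://example.com/b']);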
Additionally we’ll make use of the following two modules, which are already part of Node.js:
zlib lets us request and decompress gzip-encoded pages, which dramatically reduces the overall amount of data transferred.
Using fs we can eventually save our data as a .json document to disk.
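The two built-in modules are just as easy to use. A tiny sketch (the data is made up) of decompressing a gzip buffer and writing the result to disk:

var zlib = require('zlib');
var fs = require('fs');

// Create a gzip buffer as a stand-in for a compressed HTTP response body.
var gzipped = zlib.gzipSync(new Buffer('{"hello":"world"}'));

zlib.gunzip(gzipped, function (err, dezipped) {
    if (err) throw err;
    fs.writeFile('example.json', dezipped.toString(), function (err) {
        if (err) throw err;
        console.log('saved example.json');
    });
});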
Now let’s create an empty app.js file in our folder in which we load all modules:
var request = require('request');
var cheerio = require('cheerio');
var fs = require('fs');
var async = require('async');
var zlib = require('zlib');
Additionally we’ll declare some configuration variables and an output array that we’ll push our app info into.
var dealsUri = "https://www.microsoft.com/en-us/store/collections/redstripedeals/pc";
var baseUri = "http://microsoft.com";
var output = [];
var outputFilename = "redstripedeals.json";
//var outputFilename = "d:\\home\\site\\wwwroot\\redstripedeals2.json";
dealsUri is our entry page that contains links to all the apps we want to get.
baseUri is required for building the absolute paths to the app details pages.
outputFilename sets the location of our JSON output. The location varies depending on whether we test locally or whether the script runs inside an Azure WebJob, since a WebJob is copied into a temporary directory before being executed.
The Crawler
We want our script to do the following steps:
- Get the main Red-Stripe-Deals page with all the links to the individual apps using gzip:
var loadDeals = function () {
    var options = {
        url: dealsUri,
        port: 443,
        headers: {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko',
            'Accept-Language': 'en-us',
            'Content-Language': 'en-us',
            'Accept-Encoding': 'gzip' // ask the server for a compressed response
        },
        timeout: 0,
        encoding: null // keep the body as a Buffer so zlib can decompress it
    };

    request(options, function (err, resp, body) {
        if (err) {
            console.log("error loading page");
            return;
        }
        // Decompress the body if the server actually sent it gzipped.
        if (resp.headers['content-encoding'] == 'gzip') {
            zlib.gunzip(body, function (err, dezipped) {
                redstripedealsParser(dezipped.toString());
            });
        } else {
            redstripedealsParser(body);
        }
    });
}
- Extract all app links using cheerio (jQuery) and add them to the queue qDetails:
var redstripedealsParser = function (body) {
    $ = cheerio.load(body);
    // Each deal on the overview page links to its app details page.
    $('figure h4 a').each(function () {
        var appUri = baseUri + $(this).attr("href");
        qDetails.push(appUri);
    });
}
- Process the queue one by one getting the app details pages and handing them to our parser.
We can choose how many tasks should run at the same time by changing the last parameter of async.queue() (1 in this case):

var qDetails = async.queue(function (task, callback) {
    console.log("get details: " + task);
    var options = {
        url: task,
        headers: {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko',
            'Accept-Language': 'en-us',
            'Content-Language': 'en-us',
            'Accept-Encoding': 'gzip'
        },
        timeout: 0,
        encoding: null
    };

    request(options, function (err, resp, body) {
        if (err) {
            console.log("error: " + task);
            throw err;
        }
        if (resp.headers['content-encoding'] == 'gzip') {
            zlib.gunzip(body, function (err, dezipped) {
                GetDetails(dezipped.toString(), task, callback);
            });
        } else {
            GetDetails(body, task, callback);
        }
    });
}, 1); // concurrency of 1: fetch one details page at a time
- Extract all relevant app details, again using cheerio. We then push the app into the output array.
Don’t forget the callback() telling the async queue to proceed:

var GetDetails = function (body, task, callback) {
    console.log("parse details: " + task);
    $ = cheerio.load(body);

    var app = {};
    app["title"] = $('#page-title').text();
    app["image"] = "http:" + $('.ph-logo > img').first().attr('src').toString();
    app["description"] = $('.showmore > p').first().text();
    app["price"] = $('.srv_price > span').first().text();
    app["ratingValue"] = $('.srv_ratingsScore.win-rating-average').first().text();
    app["ratingCount"] = $('.win-rating-total').first().text().replace(/\D/g, '');
    app["packageSize"] = $('.metadata-list-content > div').eq(2).text().replace("\r\n", "").trim();
    app["publisher"] = $('.metadata-list-content > div').eq(0).text().replace("\r\n", "").trim();

    output.push(app);
    callback(); // signal the queue that this task is done
}
- When the queue is empty, save the output array as a JSON document in our output location:
qDetails.drain = function () {
    fs.writeFile(outputFilename, JSON.stringify(output, null), function (err) {
        if (err) {
            console.log(err);
        } else {
            console.log("JSON saved to " + outputFilename);
        }
    });
}
- Finally, we need to kick the script off once by calling loadDeals():
loadDeals();
You can also find the complete code as a GitHub Gist.
Now we can test the script from the command line:
node app.js
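If everything works, the console output should look roughly like this (the URLs are shortened here; you’ll see the full Store links):

get details: http://microsoft.com/...
parse details: http://microsoft.com/...
...
JSON saved to redstripedeals.json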
Scheduling the WebJob
After successfully running the script locally, we change the output location for Azure and then compress the whole folder into a regular .zip file.
In order to create a WebJob that actually runs on a schedule we need to sign into the old Azure portal. There it’s possible to create a new WebJob inside our previously created web app. Just enter a new name, upload the zip file and select a region that you want the job to run in.
For the second step we choose when the script should be executed. In our case it’s sufficient to run the job just once a week, as the deals won’t change more often than that.
As soon as the job is created we can try running it for the first time. If we’ve done everything right we should then get a “Success” in the results column.
Editing a WebJob Script
It is actually fairly easy to edit an active WebJob script (which we would never do in a production environment!). Microsoft WebMatrix lets you sign into your Microsoft / Azure account and remotely edit your web apps.
Retrieving the JSON result
Finally we want to consume our JSON document, for example inside a mobile app. However, there’s a tiny issue: Azure Websites won’t serve static JSON by default. This can be fixed by placing the following code in a Web.config file in the root folder of the site:
<?xml version="1.0"?>
<configuration>
  <system.webServer>
    <staticContent>
      <mimeMap fileExtension=".json" mimeType="application/json" />
    </staticContent>
  </system.webServer>
</configuration>
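With the MIME type in place, the file can be requested like any other static resource. As a quick check, here’s a small sketch that downloads and parses the JSON with the same request module we used above (the URL is a placeholder for your own web app address):

var request = require('request');

// Placeholder URL - substitute your own Azure web app address.
var jsonUri = 'https://yourwebapp.azurewebsites.net/redstripedeals.json';

request(jsonUri, function (err, resp, body) {
    if (err) throw err;
    var deals = JSON.parse(body);
    console.log(deals.length + " deals found");
});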
Feel free to ask questions and share your success stories! 🙂