In this post we’ll look at scraping websites on a regular schedule using Azure WebJobs and Node.js (with cheerio, a jQuery-like library) and returning the data as JSON.

In the example below we’ll grab, once a week, the apps that are on sale as part of the “Red Stripe Deals” in the Windows Store. The resulting app details are written to a JSON file that can be consumed by, for example, a mobile app.

Crawling & Scraping: be responsible

There’s already enough content out there about the ethics of scraping data from the web, so I won’t get into that at this point. However, we really do not want to break the site we’re getting the data from. Especially when using powerful tools like Azure, we should make sure not to DoS the server.

That’s why we’re using Async.js to throttle our requests and the zlib library (which is part of Node.js) to reduce bandwidth usage.

Tools

First things first

To create a new WebJob we first need an Azure Web App for it to run in. The free F1 tier will do just fine, especially for testing purposes.


Create a Node.js project

To create a new project we’ll create an empty folder that will eventually contain all the necessary files and which we’ll be able to upload to Azure as a .zip archive later on.

Inside the new folder, open the command prompt (Shift + right click inside the folder) and install the required Node.js packages through npm (npm install request cheerio async).

request is an HTTP request client we use to download the web pages.

cheerio implements a subset of jQuery for Node.js. Using it, we can extract data from the DOM just as we would in regular jQuery.

async will help us throttle requests by using its queue function.

Additionally we’ll make use of the following two packages, which are already part of Node.js:

zlib lets us decompress gzip-encoded responses, so pages can be transferred compressed, which dramatically reduces the overall amount of data transferred.
fs lets us eventually save our data as a .json document to disk.
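
To get a feeling for how much gzip can save, here is a small sketch that compresses and restores a string with zlib. It runs entirely offline: the markup is a made-up sample (not taken from the Windows Store), and the actual savings on a real page will differ.

```javascript
// Sketch: round-trip a repetitive HTML-like string through gzip
// using only Node's built-in zlib module.
const zlib = require('zlib');

// Hypothetical sample markup, repeated to mimic the redundancy of a real page.
const html = '<li class="app"><a href="/store/app">Example app</a></li>\n'.repeat(200);

// Compress, then decompress again (what we would do with a gzip-encoded response).
const compressed = zlib.gzipSync(Buffer.from(html, 'utf8'));
const restored = zlib.gunzipSync(compressed).toString('utf8');

console.log('original bytes:  ', Buffer.byteLength(html));
console.log('compressed bytes:', compressed.length);
console.log('round trip ok:   ', restored === html);
```

In the crawler itself the compressed bytes would come from the server instead of gzipSync, and only the decompression step would be needed.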

Now let’s create an app.js file in our folder and load all modules at the top of it.

Additionally we’ll declare some configuration variables and an output array that we’ll push our app details into.

dealsUri is our entry page, which contains links to all the apps we want to get.
baseUri is needed to build absolute paths to the app details pages.
outputFilename sets the location for our JSON output. The location depends on whether we test locally or the script runs inside an Azure WebJob, because a WebJob is copied into a temporary directory before being executed.

The Crawler

We want our script to do the following steps:

  1. Get the main Red Stripe Deals page, which links to the individual apps, requesting it gzip-compressed.
  2. Extract all app links using cheerio (jQuery syntax) and add them to the queue qDetails.
  3. Process the queue one by one, fetching each app details page and handing it to our parser.
    We can choose how many tasks run at the same time by changing the last parameter of async.queue() (1 in this case).
  4. Extract all relevant app details, again using cheerio, and push the app into the output array.
    Don’t forget the callback() that tells the async queue to proceed.
  5. When the queue is empty, save the output array as a JSON document in our output location.
  6. Finally, we need to start the script once by calling the first step.
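
The queue-related steps (3 to 5) can be sketched without any network access. Note that this is not the code from the Gist: createQueue is a small hand-written stand-in for async.queue so the example runs without installing anything, processApp fakes the download and parsing with a timer instead of using request and cheerio, and the links and the temp-folder output path are made up.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Minimal stand-in for async.queue(worker, concurrency): runs at most
// `concurrency` tasks at a time and calls onDrain once all tasks are done.
function createQueue(worker, concurrency, onDrain) {
  const pending = [];
  let running = 0;

  function next() {
    while (running < concurrency && pending.length > 0) {
      const task = pending.shift();
      running += 1;
      worker(task, function done() {
        running -= 1;
        if (pending.length === 0 && running === 0) {
          onDrain();
        } else {
          next();
        }
      });
    }
  }

  return {
    push: function (task) {
      pending.push(task);
      // Defer the start so several synchronous push() calls are queued first.
      setImmediate(next);
    },
  };
}

const output = [];
const outputFilename = path.join(os.tmpdir(), 'deals-sketch.json'); // hypothetical location

// Placeholder for "download the details page and parse it with cheerio".
function processApp(link, callback) {
  setTimeout(function () {
    output.push({ link: link, title: 'App at ' + link });
    callback(); // tell the queue to proceed with the next task
  }, 10);
}

// Concurrency 1: one details page at a time, to be gentle on the server.
const qDetails = createQueue(processApp, 1, function saveOutput() {
  fs.writeFileSync(outputFilename, JSON.stringify(output, null, 2));
  console.log('Saved ' + output.length + ' apps to ' + outputFilename);
});

// Made-up links standing in for what step 2 would extract from the deals page.
['/app/one', '/app/two', '/app/three'].forEach(function (link) {
  qDetails.push(link);
});
```

With a concurrency of 1 the apps land in the output array in the order they were queued. Raising the number speeds things up but also increases the load on the server we scrape.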

The complete code is also available as a GitHub Gist.

Now we can test the script from the command line by running node app.js.

Scheduling the WebJob

After successfully running the script locally we change the output location for Azure and afterwards compress the whole folder into a regular .zip file.

In order to create a WebJob that actually runs on a schedule we need to sign in to the old Azure portal. There it’s possible to create a new WebJob inside our previously created web app. Just enter a name, upload the zip file and select the region that you want the job to run in.

For the second step we choose when the script should be executed. In our case it’s sufficient to run the job just once a week as the deals won’t change more often than that.


As soon as the job is created we can try running it for the first time. If we’ve done everything right we should then get a “Success” in the results column.

Editing a WebJob Script

It is actually fairly easy to edit an active WebJob script (which we would never do in a production environment!). Microsoft WebMatrix lets you sign in to your Microsoft / Azure account and edit your web apps remotely.


Retrieving the JSON result

Finally, we want to consume our JSON document, for example inside a mobile app. However, there’s a small issue: Azure Websites won’t serve static JSON files by default. This can be fixed by registering a MIME type for the .json extension in the Web.config inside the root folder of the site.
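
A minimal Web.config for this could look like the sketch below. It assumes the site has no Web.config yet. If one already exists, only the remove and mimeMap elements need to be merged into its staticContent section.

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <system.webServer>
    <staticContent>
      <!-- Drop any inherited mapping first to avoid a duplicate entry error -->
      <remove fileExtension=".json" />
      <!-- Serve .json files as static content with the JSON MIME type -->
      <mimeMap fileExtension=".json" mimeType="application/json" />
    </staticContent>
  </system.webServer>
</configuration>
```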

via Microsoft Developer

Feel free to ask questions and share your success stories! 🙂
