Demo javascript code here

This scrape was meant to be a simple one, where I could show that all you need is split() to scrape a page. Then they blocked my IP. I was kind of surprised. I really hadn't been hitting the site that fast or hard, but as soon as I used a VPN with a different IP, I was able to get at the data again.

So, if you are using this and suddenly you can't access the site anymore...sorry. Use TunnelBear like I did. The block is temporary and a few days later I was able to access allrecipes normally.

The mission: Get the cookies.

The tools:

// package.json

{
  "name": "all-recipes",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "start": "node index.js"
  },
  "author": "Jordan Hansen",
  "license": "ISC",
  "dependencies": {
    "request": "^2.88.0"
  }
}

There were really two parts to this. I wanted to create a JSON file with the ingredients and the instructions. I shaped each recipe like this:

{
    title: string;
    ingredients: string[];
    instructions: string[];
}
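Writing that shape out to a JSON file could look roughly like the sketch below. The saveRecipes helper, the recipes.json file name, and the sample recipe are all made up for illustration; none of them come from the original code.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: serialize a list of recipes to a JSON file.
function saveRecipes(recipes, filePath) {
    // Two-space indentation keeps the file readable by hand.
    fs.writeFileSync(filePath, JSON.stringify(recipes, null, 2));
}

// Made-up sample data in the { title, ingredients, instructions } shape.
const exampleRecipes = [
    {
        title: 'Example Sugar Cookies',
        ingredients: ['1 cup sugar', '2 cups flour'],
        instructions: ['Mix everything.', 'Bake until golden.']
    }
];

// Written to the temp directory here so the example doesn't litter the project.
const outputPath = path.join(os.tmpdir(), 'recipes.json');
saveRecipes(exampleRecipes, outputPath);
```

In the real scraper you would pass in listOfRecipes once the loop has finished.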

The first thing I normally do when I'm searching a site is check whether there is a direct url, so I don't have to submit a search query every time. You can replace this url with your own if you don't like cookies (monster). const searchForCookiesUrl = 'https://www.allrecipes.com/search/results/?wt=cookies&sort=re'; starts us off then.

That is our baseline, and we make our first request from it. That gives us a big list of cookies. I found a regex that finds all the urls on a page ( const allUrls = html.match(/\bhttps?:\/\/\S+/gi); ) and then just loop through the matches.

I found that all recipes contained /recipe/ specifically in the url so I made sure to only keep the urls that included this. Because there was often more than one link to the same recipe I created an array of recipes and once I found a valid one I put it in there. I then just checked this array to make sure I wasn't duplicating.

    for (let i = 0; i < allUrls.length; i++) {

        // The regex match can carry a stray quote, so strip it before comparing.
        // Otherwise the duplicate check compares an uncleaned url against cleaned ones.
        const recipeUrl = allUrls[i].replace('"', '');

        // Specific recipes contain '/recipe/' in the url.
        // Let's make sure we don't put a duplicate into our array.
        if (recipeUrl.includes('/recipe/') && !recipeUrls.includes(recipeUrl)) {
            recipeUrls.push(recipeUrl);

            try {
                listOfRecipes.push(await getRecipeDetails(recipeUrl));
            }
            catch (e) {
                console.log('error: ', e);
            }
        }
    }
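To see the regex and the /recipe/ filter on their own, without any network calls, here is a small sketch run against a made-up snippet of HTML. The extractRecipeUrls name and the example.com urls are mine, not from the original code. It also cuts each match at the first quote with split('"')[0] instead of replace('"', ''), which is a variation on the loop above: \S+ keeps matching until the next whitespace, so anything glued on after the closing quote would otherwise tag along.

```javascript
// Hypothetical helper: pull unique recipe urls out of a page of HTML.
function extractRecipeUrls(html) {
    // match() returns null when nothing matches, so fall back to an empty array.
    const allUrls = html.match(/\bhttps?:\/\/\S+/gi) || [];
    const recipeUrls = [];

    for (const rawUrl of allUrls) {
        // Keep only the part before the attribute's closing quote.
        const url = rawUrl.split('"')[0];

        // Specific recipes contain '/recipe/'; skip anything already collected.
        if (url.includes('/recipe/') && !recipeUrls.includes(url)) {
            recipeUrls.push(url);
        }
    }

    return recipeUrls;
}

// Made-up HTML: one recipe linked twice, one category page, one more recipe.
const sampleHtml = [
    '<a href="https://example.com/recipe/1/sugar-cookies/" class="card">',
    '<a href="https://example.com/recipe/1/sugar-cookies/">Sugar Cookies</a>',
    '<a href="https://example.com/recipes/2/desserts/">Desserts</a>',
    '<a href="https://example.com/recipe/3/oatmeal-cookies/">Oatmeal</a>'
].join('\n');

const foundUrls = extractRecipeUrls(sampleHtml);
console.log(foundUrls);
```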

From there I'd make another request to each of those recipe urls (honestly, this is probably why they blocked me. I'd get like 50 recipes and then make a request call to those recipes much faster than any mere human could.) I grabbed the title and to make my getRecipeDetails() not such a monster, I split off functions to get the list of ingredients and the instructions. Also, I know there is a request promise library that, you guessed it, handles promises with request. I'll need to check that out.

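const request = require('request');

function getRecipeDetails(url) {

    return new Promise((resolve, reject) => {

        request.get(url, (err, res, html) => {
            if (err) {
                // Return here so we don't fall through and also try to resolve.
                return reject(new Error('err in getting details: ' + err.message));
            }

            if (!html) {
                return resolve(null);
            }

            try {
                let recipeDetails = {
                    title: html.split('recipe-main-content" class="recipe-summary__h1" itemprop="name">')[1].split('</h1>')[0],
                    ingredients: [],
                    instructions: []
                };

                recipeDetails = addIngredientList(html, recipeDetails);
                recipeDetails = addInstructions(html, recipeDetails);

                resolve(recipeDetails);
            }
            catch (e) {
                // If a marker is missing, split()[1] is undefined and the next split() throws.
                // Reject so the caller's try/catch sees it instead of the process crashing.
                reject(e);
            }

        });
    });

}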
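Since the speed of those follow-up requests is my best guess for why I got blocked, one option is to space them out. This is only a sketch: delay, fetchSequentially and the pause length are names and numbers made up here, and I have no evidence that any particular pause keeps allrecipes.com from blocking you.

```javascript
// Resolve after the given number of milliseconds.
function delay(ms) {
    return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical helper: call fetchOne for each url, one at a time,
// waiting pauseMs between calls instead of firing them back to back.
async function fetchSequentially(urls, fetchOne, pauseMs) {
    const results = [];

    for (let i = 0; i < urls.length; i++) {
        // No need to wait before the very first request.
        if (i > 0) {
            await delay(pauseMs);
        }
        results.push(await fetchOne(urls[i]));
    }

    return results;
}

// Usage with the scraper would be something like:
// fetchSequentially(recipeUrls, getRecipeDetails, 2000).then((recipes) => console.log(recipes));
```

Passing the fetch function in as a parameter keeps the pacing logic separate from the scraping, so it is easy to try with a fake function first.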

The add ingredient list function was pretty simple. We were lucky enough to find a string that only the ingredients had: ingredientsSection.split('itemprop="recipeIngredient">');. From there we could just loop through the list. I probably could have dug into why there was occasionally an undefined in my array.

function addIngredientList(html, recipeDetails) {
    const ingredientsSection = html.split('polaris-app">')[1].split('<label class="checkList__item" id="btn-addtolist">')[0];
    const listOfIngredients = ingredientsSection.split('itemprop="recipeIngredient">');

    // Declare i with let so it doesn't leak out as an implicit global.
    for (let i = 0; i < listOfIngredients.length; i++) {
        const splitIngredient = listOfIngredients[i].split('title="')[1];

        // splitIngredient is undefined for any chunk with no 'title="' in it, so let's check it
        if (splitIngredient) {
            recipeDetails.ingredients.push(splitIngredient.split('">\r\n')[0]);
        }
    }

    return recipeDetails;
}
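The split(...)[1].split(...)[0] pattern used throughout throws as soon as the first marker is missing, because [1] comes back undefined. A small helper can make that case explicit. This textBetween function is hypothetical and not part of the original code. It also behaves a little differently from the split chain: when the end marker is missing it returns null rather than the rest of the string.

```javascript
// Hypothetical helper: return the text between two markers,
// or null when either marker can't be found.
function textBetween(source, startMarker, endMarker) {
    const start = source.indexOf(startMarker);
    if (start === -1) {
        return null;
    }

    // Begin searching for the end marker right after the start marker.
    const contentStart = start + startMarker.length;
    const end = source.indexOf(endMarker, contentStart);
    if (end === -1) {
        return null;
    }

    return source.slice(contentStart, end);
}

// Made-up markup, loosely modeled on the title lookup.
const sampleTitleHtml = '<h1 itemprop="name">Example Sugar Cookies</h1>';

console.log(textBetween(sampleTitleHtml, 'itemprop="name">', '</h1>'));
console.log(textBetween(sampleTitleHtml, 'itemprop="missing">', '</h1>'));
```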

Adding the instructions was similarly fortunate in that each instruction had a string we could split and loop on: instructionsSection.split('recipe-directions__list--item">'); . Thanks allrecipes.com engineers! It is also worth noting that there is often whitespace around the strings you want. Using trim() takes care of that.

function addInstructions(html, recipeDetails) {
    const instructionsSection = html.split('itemprop="recipeInstructions">')[1].split("</ol>")[0];
    const listOfInstructions = instructionsSection.split('recipe-directions__list--item">');

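    // Declare i with let so it doesn't leak out as an implicit global.
    for (let i = 0; i < listOfInstructions.length; i++) {
        const instruction = listOfInstructions[i].split('\n')[0].trim();

        // Some chunks are only whitespace and trim down to an empty string, so skip those
        if (instruction) {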
            recipeDetails.instructions.push(instruction);
        }
    }

    return recipeDetails;

}
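Here is the same split-then-trim idea run on a made-up fragment, to show why the empty check matters. The markup below is invented for the example (only the recipe-directions__list--item marker comes from the code above), and splitInstructions is a hypothetical name.

```javascript
// Hypothetical helper: split a section on the marker, keep the first line
// of each chunk, trim it, and drop anything that ends up empty.
function splitInstructions(section) {
    return section
        .split('recipe-directions__list--item">')
        .map((chunk) => chunk.split('\n')[0].trim())
        .filter(Boolean);
}

// Made-up fragment: starts with a blank line, has padded text,
// and ends with an empty instruction.
const sampleSection = [
    '',
    '    <li class="step"><span class="recipe-directions__list--item">Preheat the oven.',
    '    </span></li>',
    '    <li class="step"><span class="recipe-directions__list--item">   Mix everything.   ',
    '    </span></li>',
    '    <li class="step"><span class="recipe-directions__list--item">',
    '    </span></li>'
].join('\n');

const steps = splitInstructions(sampleSection);
console.log(steps);
```

The chunk before the first marker and the empty last instruction both trim down to an empty string, which is what the if (instruction) check filters out.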

BAM. Done. Now if you've used this code, you've been blocked from allrecipes.com and your wife is mad at you because she can't make her favorite cookie recipe.

Difficulty level: 6/10

It's at that level mostly because of the IP blocking. I have hit a lot of other sites a lot harder and they didn't block me. I was in denial that they were blocking me until I used TunnelBear and saw that it worked fine with a different IP.
