Demo javascript code here (this code will probably be updated to add sections for sending the data to my database)

I really like what this bit of code does. I think of it as kind of an interactive scraper. This is also the first time I'm demoing some of what Puppeteer can do.

I LOVE audiobooks and I think Audible is just an amazing company. Their customer service has always been great. If you have any interest in Audible, you can almost always try it risk free. They normally give away one free credit, which is worth one book.

I'll admit that I have taken advantage of this free book more than once and so ended up with multiple accounts with books spread across all of them. It can be frustrating because sometimes I don't know which account has what or if I already own a book.

But...they don't have an API! This code helps solve that by signing into each account automatically, scraping all the books, switching to the next account, scraping those books, and so on. In the end you can dump them all into a database. Then you can easily see which books you have and which account each one is under.

This code is all in TypeScript, which I really love, but if you hate that, it's pretty easy to understand what is happening and just do it in JavaScript. In the repository under the src folder I have a sample-credentials.ts file that is just an exported array of the accounts that you will be signing in to. Just rename it to credentials.ts and it's good to go.

// sample-credentials.ts

export const credentials = [
    {
        email: 'example@example1.com',
        pass: 'superSecurePassword',
        owner: 'Jordan',
        twoFactor: true
    },
    {
        email: 'example2@example2.com',
        pass: 'superSecurePassword',
        owner: 'Jon'
    },
    {
        email: 'example3@example3.com',
        pass: 'superSecurePassword',
        owner: 'Ashli'
    }

];
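Since the project is in TypeScript, it can help to give the array an explicit type so a misspelled field is caught at compile time. This is only a minimal sketch: the interface name ICredential and the sampleCredentials variable are hypothetical (the repository may type this differently or not at all), and twoFactor is optional because only some accounts set it.

```typescript
// Hypothetical type for one entry of the credentials array.
// The name ICredential is an assumption, not something taken from the repository.
interface ICredential {
    email: string;
    pass: string;
    owner: string;
    twoFactor?: boolean; // only set on accounts that use two-factor auth
}

const sampleCredentials: ICredential[] = [
    { email: 'example@example1.com', pass: 'superSecurePassword', owner: 'Jordan', twoFactor: true },
    { email: 'example2@example2.com', pass: 'superSecurePassword', owner: 'Jon' }
];

// Accounts without the flag are treated as not using two-factor auth.
const twoFactorOwners = sampleCredentials
    .filter((credential) => credential.twoFactor === true)
    .map((credential) => credential.owner);

console.log(twoFactorOwners);
```

With a type like this, leaving out owner or misspelling pass fails the build instead of failing halfway through a scrape.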

Puppeteer is different from request in that it actually opens a Chrome browser and acts like a user. You can click, scroll up and down, and interact with the page pretty much exactly like a user would. A lot of pages can detect purely programmatic scraping with something like request, and since Puppeteer behaves like a real user, it gets blocked a lot less often.

Puppeteer is entirely promise based and I love async/await, so I wrap the whole script in a self-invoking async function, like this:

(async () => {
  // awesome code here

})();

Check this for more explanation on async/await.

I also run a lot of my Puppeteer scrapers on a DigitalOcean (which I love) Ubuntu box, so I have a block at the top that makes Puppeteer work on Ubuntu. In this block I also have a runTwoFactor flag that I'll explain later.

        let browser: Browser;
        const ubuntu = false; // set to true when running on the Ubuntu box
        const headless = false; // only used when not running on Ubuntu
        const runTwoFactor = true;
        const windowSize = '--window-size=1800,1200';
        if (ubuntu) {
            // the two sandbox flags are what make Puppeteer work on the Ubuntu box
            browser = await puppeteer.launch({ headless: true, args: [windowSize, '--no-sandbox', '--disable-setuid-sandbox'] });
        }
        else {
            browser = await puppeteer.launch({ headless: headless, args: [windowSize] });
        }
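If the if/else starts to feel repetitive, the same decision can be pulled into a small pure function. This is just a sketch under assumptions: buildLaunchOptions and ILaunchOptions are hypothetical names, and the interface only covers the two options this post actually uses.

```typescript
// Only the subset of launch options used in this post.
interface ILaunchOptions {
    headless: boolean;
    args: string[];
}

// Hypothetical helper: builds the object that would be passed to puppeteer.launch().
function buildLaunchOptions(ubuntu: boolean, headless: boolean): ILaunchOptions {
    const args = ['--window-size=1800,1200'];
    if (ubuntu) {
        // Same flags as the Ubuntu branch above; that branch also always runs headless.
        args.push('--no-sandbox', '--disable-setuid-sandbox');
    }
    return { headless: ubuntu ? true : headless, args };
}

const serverOptions = buildLaunchOptions(true, false);
const laptopOptions = buildLaunchOptions(false, false);
console.log(serverOptions, laptopOptions);
```

Inside the async wrapper, the launch would then be a single line: browser = await puppeteer.launch(buildLaunchOptions(ubuntu, headless));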

We start by looping through our accounts and signing in. There are a couple of interesting things to note about Audible (and I think Amazon in general):

  1. They don't have a link that works just for signing in. It seems they create a unique context id when you click a sign in button. This may be just to track the flow of the user or it could be to slow down scraping. So you have to start from audible.com and then click sign in. You can't navigate directly to sign in.
  2. Apparently having images downloaded is required? With Puppeteer you can disable the download of images. This saves bandwidth when you really don't care about them, especially when automating something that may be happening daily. If you have them disabled when trying to sign in, it will not let you sign in.
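For context on the second point, image downloads in Puppeteer are usually disabled with request interception. Here is a rough sketch of the routing decision, written against a tiny stand-in interface so it runs without a browser. IRequestLike, routeRequest, and fakeRequest are hypothetical names; resourceType(), abort(), and continue() mirror the methods on Puppeteer's request object.

```typescript
// Minimal stand-in for the parts of Puppeteer's request object used here,
// so the routing logic can be exercised without launching a browser.
interface IRequestLike {
    resourceType(): string;
    abort(): Promise<void>;
    continue(): Promise<void>;
}

// Hypothetical helper: abort image requests only when blockImages is true.
function routeRequest(request: IRequestLike, blockImages: boolean): Promise<void> {
    if (blockImages && request.resourceType() === 'image') {
        return request.abort();
    }
    return request.continue();
}

// Fake requests that record what happened to them.
const decisions: string[] = [];
function fakeRequest(type: string): IRequestLike {
    return {
        resourceType: () => type,
        abort: async () => { decisions.push(`abort ${type}`); },
        continue: async () => { decisions.push(`continue ${type}`); }
    };
}

routeRequest(fakeRequest('image'), true);
routeRequest(fakeRequest('document'), true);
routeRequest(fakeRequest('image'), false); // what the Audible sign-in needs
console.log(decisions);
```

With a real page this would be wired up with await page.setRequestInterception(true) followed by page.on('request', ...) calling the helper. For the Audible sign-in, blockImages would have to stay false.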

Noting those two things, we enter the loop with our credentials and then sign in:

// Sign into audible by clicking their sign in button from their homepage

const url = 'http://www.audible.com';
const page = await browser.newPage();
await page.goto(url);
await page.click('.ui-it-sign-in-link');
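// page.waitFor() has been removed from newer Puppeteer releases, so wait for the email field instead
await page.waitForSelector('#ap_email');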

await page.type('#ap_email', credentials[i].email);
await page.type('#ap_password', credentials[i].pass);

await page.click('#signInSubmit');

Currently, the library section of Audible can display at most 50 titles per page. This is kind of a bummer because it's always easier if you can get everything displayed on one page. The nice thing is that the filters are all query params, so we can navigate directly to the library with the maximum of 50 books showing, dating from all time. The url looks like this: ${url}/lib?purchaseDateFilter=all&programFilter=all&sortBy=PURCHASE_DATE.dsc&pageSize=50&page=${pageNumber}.
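As a small sketch, that url can be built by a helper so the page number is the only thing that changes between requests. The function name buildLibraryUrl is hypothetical; the query params are exactly the ones listed above.

```typescript
// Hypothetical helper: builds the library url for a given page number.
function buildLibraryUrl(baseUrl: string, pageNumber: number): string {
    const params = [
        'purchaseDateFilter=all',
        'programFilter=all',
        'sortBy=PURCHASE_DATE.dsc',
        'pageSize=50',
        `page=${pageNumber}`
    ];
    return `${baseUrl}/lib?${params.join('&')}`;
}

const firstPageUrl = buildLibraryUrl('http://www.audible.com', 1);
console.log(firstPageUrl);
```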

I paginate through these by incrementing the page param until I hit a unique sad face emoji. This means there are no more books and I'm done. So I make a while loop that only stops when libraryHasBooks is set to false, which is triggered by finding the sad emoji. If the emoji isn't found, the code carries on through our scraping and gets the books.

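// If you want to be real ninja, you could try to wait for a specific selector that
// only shows when it's completed loading. (page.waitFor() has been removed from newer
// Puppeteer releases, so this fixed delay uses a plain timer instead.)
await new Promise((resolve) => setTimeout(resolve, 1500));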
let libraryHasBooks = true;
let pageNumber = 1;
const libraryEmptyHandle = await page.$('.bc-text img[src*="empty_lib_emoji"]');
if (libraryEmptyHandle) {
	libraryHasBooks = false;
}
else {

Sad emoji wants more books :|
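The excerpt only shows the sad emoji check, so here is a rough sketch of how the whole pagination loop could hang together. It is not the exact code from the repository: PageLoader, collectAllPages, and the fake library are hypothetical, and an empty array stands in for "the sad emoji was found" so the loop can run without a browser.

```typescript
// Hypothetical stand-in for "go to library page N and return the book titles on it".
// An empty array plays the role of the sad emoji page.
type PageLoader = (pageNumber: number) => Promise<string[]>;

async function collectAllPages(loadPage: PageLoader): Promise<string[]> {
    const allTitles: string[] = [];
    let libraryHasBooks = true;
    let pageNumber = 1;
    while (libraryHasBooks) {
        const titles = await loadPage(pageNumber);
        if (titles.length === 0) {
            // Nothing on this page, so we are past the last page of the library.
            libraryHasBooks = false;
        }
        else {
            allTitles.push(...titles);
            pageNumber++;
        }
    }
    return allTitles;
}

// Fake library with two pages of books, then nothing.
const fakeLibrary: string[][] = [['Book A', 'Book B'], ['Book C']];
const fakeLoader: PageLoader = async (pageNumber) => fakeLibrary[pageNumber - 1] || [];

collectAllPages(fakeLoader).then((titles) => console.log(titles));
```

In the real scraper, loadPage would navigate to the library url for that page number, check for the emoji selector, and scrape the rows if it is not there.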

Now we dig into the real meat of what happens when it finds books. As a small note on Puppeteer, $ returns one ElementHandle and $$ returns an array of them. So I find a selector that is unique for each row of books and start a for loop. It's worthwhile to note that a for...of loop with await inside it finishes every iteration before continuing with the code after the loop, which forEach with an async callback does not. See more here.

const contentRows = await page.$$('tr[id^="adbl-library-content-row-"]');
for (let content of contentRows) {
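To make the for loop versus forEach point concrete, here is a tiny standalone comparison. The names delay, withForOf, and withForEach are made up for the demo, and the 5 ms timer just stands in for an awaited Puppeteer call.

```typescript
const delay = (ms: number): Promise<void> => new Promise<void>((resolve) => setTimeout(resolve, ms));

// for...of: each awaited iteration finishes before the code after the loop runs.
async function withForOf(items: number[]): Promise<string[]> {
    const log: string[] = [];
    for (const item of items) {
        await delay(5);
        log.push(`item ${item}`);
    }
    log.push('after loop');
    return log;
}

// forEach: the async callbacks are started but never awaited.
async function withForEach(items: number[]): Promise<string[]> {
    const log: string[] = [];
    items.forEach(async (item) => {
        await delay(5);
        log.push(`item ${item}`);
    });
    log.push('after loop'); // runs before any callback has finished
    return log;
}

withForOf([1, 2]).then((log) => console.log('for...of:', log));
withForEach([1, 2]).then((log) => console.log('forEach: ', log));
```

In the for...of version 'after loop' is the last entry. In the forEach version it is the first, because the function moves on while the callbacks are still waiting.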

And inside the loop we start grabbing the data we want with unique identifiers and adding it to our book object and then push it into our array of books.

const book: IBook = {
    imageUrl: '',
    title: '',
    url: '',
    author: '',
    asin: '',
    owner: credentials[i].owner
};
book.imageUrl = await getPropertyBySelector(content, '.bc-pub-block.bc-lazy-load.bc-image-inset-border', 'src');
book.title = await getPropertyBySelector(content, '.bc-list-item a[href^="/pd/"].bc-link.bc-color-link', 'innerHTML');

// There were occasional times when there must have been a phantom row and a title
// could not be found so I added this check
if (book.title) {
	// the regex removes every newline; a plain string pattern would only remove the first one
	book.title = book.title.replace(/\n/g, '').trim();
}
book.url = await getPropertyBySelector(content, 'a[href^="/pd/"]', 'href');

// Similar to the title check
if (book.url) {
	// index 5 assumes an absolute url shaped like http://www.audible.com/pd/<title>/<asin>?...
	book.asin = book.url.split('/')[5].split('?')[0];
}
book.author = await getPropertyBySelector(content, 'a[href^="/author/"].bc-link.bc-color-link', 'innerHTML');
console.log('book', book);

if (book.title) {
	books.push(book);
}
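The getPropertyBySelector helper is used above but not shown, so here is a guess at what something like it could look like. Treat it as a sketch, not the actual helper from the repository: the two interfaces are hypothetical stand-ins shaped like the Puppeteer methods it relies on ($, getProperty, and jsonValue), and returning an empty string for a missing element is an assumption that fits the if (book.title) checks.

```typescript
// Stand-ins shaped like the Puppeteer methods the helper relies on.
interface IJsHandleLike {
    jsonValue(): Promise<unknown>;
}
interface IElementLike {
    $(selector: string): Promise<IElementLike | null>;
    getProperty(propertyName: string): Promise<IJsHandleLike>;
}

// Hypothetical reconstruction: find a child by selector and read one of its properties.
async function getPropertyBySelector(handleOrPage: IElementLike, selector: string, property: string): Promise<string> {
    const handle = await handleOrPage.$(selector);
    if (!handle) {
        return ''; // phantom row: nothing matched the selector
    }
    const value = await (await handle.getProperty(property)).jsonValue();
    return typeof value === 'string' ? value : '';
}

// Fake row with a single matching link, just to exercise the helper.
const fakeLink: IElementLike = {
    $: async () => null,
    getProperty: async (propertyName) => ({
        jsonValue: async () => (propertyName === 'innerHTML' ? '\n  Some Title  ' : undefined)
    })
};
const fakeRow: IElementLike = {
    $: async (selector) => (selector === 'a.title' ? fakeLink : null),
    getProperty: async () => ({ jsonValue: async () => undefined })
};

getPropertyBySelector(fakeRow, 'a.title', 'innerHTML').then((title) => console.log(JSON.stringify(title.trim())));
```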

Then I increment the page number and keep paginating until I hit the sad face. Then we are done with that user! We log them out and close the tab.

await page.goto(`${url}/signout`);
await page.close();

Now, here's the fun part. Amazon can use two-factor authentication! And I use it on my main account. Sadly (and happily? for security reasons) there is no way to automate this. If you have an account with SMS two-factor authentication, I added a section in here for this.

We set the flag we mentioned at the beginning (const runTwoFactor = true;) and then it'll do the following when it hits a credential with the twoFactor flag set. It first checks whether the credential has the flag set and whether we are supposed to run two factor. If the account uses two factor but we aren't running it, it'll just skip the rest of the code for that account with the nice JavaScript continue keyword. This is handy if you are going to be running it on an automated basis and know that you won't be there to enter the code.

if (credentials[i].twoFactor && !runTwoFactor) {
	continue;
}

Assuming we are running two factor, it'll keep going and then hit the following, which waits for user interaction. It waits 30 seconds to see if the selector which signifies we are inside Audible appears. So you have 30 seconds to get out your phone, get the code, enter it into the input field that will appear, and hit continue.

if (credentials[i].twoFactor && runTwoFactor) {
    try {
    	await page.waitForSelector('img.ui-it-header-logo', { timeout: 30000 });
    }
    catch (e) {
    	console.log('timed out waiting for 2fa');
    	await page.close();
    	continue;
    }
}
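The two if checks boil down to three outcomes per account, which is easy to see when they are written as one small function. This is only an illustration of the logic above: TwoFactorAction and decideTwoFactorAction are hypothetical names and are not in the repository.

```typescript
type TwoFactorAction = 'proceed' | 'skip' | 'waitForCode';

// Hypothetical helper combining the twoFactor flag on a credential with the runTwoFactor switch.
function decideTwoFactorAction(accountUsesTwoFactor: boolean | undefined, runTwoFactor: boolean): TwoFactorAction {
    if (!accountUsesTwoFactor) {
        return 'proceed'; // normal account, nothing special to do
    }
    // Two-factor account: either wait for a human to enter the code or skip the account.
    return runTwoFactor ? 'waitForCode' : 'skip';
}

console.log(decideTwoFactorAction(true, true));
console.log(decideTwoFactorAction(true, false));
console.log(decideTwoFactorAction(undefined, false));
```

For an unattended run on a server, setting runTwoFactor to false means every two-factor account lands in the 'skip' case, so nothing sits waiting for a code that will never come.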

Pretty cool, right? Interactive scraping! It'd be cool if you could fully automate it but there will always be things like this that will need user interaction. Still, if I have 500 books and one user with two factor auth, I can get them all with just like 15 seconds of work. It's amazing.