Image downloader with puppeteer and the fetch API

In this tutorial, we are going to build a webpage image downloader. Say you visit a webpage, like the images on it, and want your own copies without saving them one by one; this simple tool we will build is going to be a lifesaver for you. This little project is also a good way to practice and hone your web scraping skills.

We will create a new directory called image-downloader and navigate into it. Pop open your terminal window and type in the following command.

mkdir image-downloader && cd image-downloader

I will assume that you have Node.js and npm installed on your machine. We will then initialize this directory with the standard package.json file by running npm init -y, and then install two dependencies, namely puppeteer and node-fetch. Run the following command to get them installed.

npm install --save puppeteer node-fetch --verbose

You may have just noticed a new npm flag, --verbose. When installing puppeteer, what happens behind the scenes is that its install script also downloads a Chromium build, because puppeteer depends on it. This download is usually large, and we are using the --verbose flag to see the progress of the installation; nothing fancy, but let's just use it because we can.

One more thing to do before getting our hands dirty with code is to create a directory where we want all our images to be downloaded. Let's name that directory images. We will also create an index.js file where all the app's logic will go.

mkdir images && touch index.js

It's good practice to clearly outline our thought process before writing a single line of code:

  1. Get all image tags from the page and extract the src property from each of these image tags
  2. Make requests to those src URLs and store the responses in the images directory (saving images to disk)

Step 1: Getting all image tags and their src property

'use strict';

const puppeteer = require('puppeteer');
const fetch = require('node-fetch');
const fs = require('fs')

// Extract all imageLinks from the page
async function extractImageLinks(){
    const browser = await puppeteer.launch({
        headless: false
    })

    const page = await browser.newPage()

    // Get the page url from the user
    let baseURL = process.argv[2] ? process.argv[2] : "https://stocksnap.io"

    try {
        await page.goto(baseURL, {waitUntil: 'networkidle0'})
        await page.waitForSelector('body')

        let imageBank = await page.evaluate(() => {
            let imgTags = Array.from(document.querySelectorAll('img'))

            let imageArray = []

            imgTags.forEach((image) => {
                let src = image.src

                let srcArray = src.split('/')
                let pos = srcArray.length - 1
                let filename = srcArray[pos]

                imageArray.push({
                    src,
                    filename
                })
            })

            return imageArray
        })

        await browser.close()
        return imageBank

    } catch (err) {
        console.log(err)
    }
}

Now let me explain what is happening here. First, we created an async function called extractImageLinks. In that function, we created an instance of a browser page using puppeteer and stored it in the page constant. Think of this page as the new tab you get after launching your Chrome browser; we can now headlessly control it from our code. We then get the URL of the page we want to download images from via the user's command-line argument (with a default fallback) and store it in a variable named baseURL. We then navigate to that URL using the page.goto() function. The {waitUntil: 'networkidle0'} object passed as the second argument ensures that we wait for the network requests to complete before we proceed with parsing the page. page.waitForSelector('body') tells puppeteer to wait for the html body tag to render before we start extracting anything from the page.

The page.evaluate() function allows us to run JavaScript code in that page instance as if we were in our Google Chrome DevTools. To get all image tags from the page, we call document.querySelectorAll("img"). However, this function returns a NodeList, not an array, so to convert it to an array, we wrapped the call in the Array.from() method. Now we have an array to work with.
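To see Array.from in isolation: it accepts any array-like or iterable object, which is exactly why it can turn a NodeList into a real array. Here is a tiny sketch with a plain array-like object standing in for a NodeList:

```javascript
// Array.from converts any array-like (an object with a length and
// indexed entries) into a real array, unlocking map/forEach on it.
const nodeListLike = { length: 2, 0: 'img-a', 1: 'img-b' }
const arr = Array.from(nodeListLike)
console.log(arr) // [ 'img-a', 'img-b' ]
```

In the browser context inside page.evaluate(), the NodeList returned by document.querySelectorAll('img') plays the role of nodeListLike here.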

We then store all the image tags in the imgTags variable and initialize the imageArray variable as a placeholder for all the src values. Since imgTags has been converted into an array, we loop over every tag in it and extract the src property from each image tag.

Now for a little trick: we want to download each image while keeping its original filename as it appears on the webpage. For instance, given this image src, https://cdn.stocksnap.io/img-thumbs/960w/green-leaf_BVKZ4QW8LS.jpg, we want to get green-leaf_BVKZ4QW8LS.jpg from that URL. One way to do this is to split the string using the "/" delimiter. We then end up with something like this:

let src = `https://cdn.stocksnap.io/img-thumbs/960w/green-leaf_BVKZ4QW8LS.jpg`.split("/")

// Output
["https:", "", "cdn.stocksnap.io", "img-thumbs", "960w", "green-leaf_BVKZ4QW8LS.jpg"]

The last element of the array returned by the split method contains the image's name and its extension as well, awesome!

Note: to get the last item from any array, we subtract 1 from the length of that array, like so:

let arr = [40,61,12] 
let lastItemIndex = arr.length - 1 // This is the index of the last item

console.log(lastItemIndex)
// Output
2

console.log(arr[lastItemIndex])
// Output
12

So we store the index of the last item in the pos variable and the name of the file in the filename variable. Now that we have the source and the filename of the current image in the loop, we push these values as an object into the imageArray variable. After the loop is done, we return imageArray, which by now has been populated, and it lands in the imageBank variable. Finally, extractImageLinks returns imageBank, which contains the image links (sources) and the filenames.
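The split-and-last-element dance can be folded into one small function. This is a sketch for illustration; the helper name extractFilename is my own and not part of the script above:

```javascript
// Sketch: derive a filename from an image URL by taking the last
// path segment. extractFilename is a hypothetical helper name.
function extractFilename(src) {
    const srcArray = src.split('/')
    return srcArray[srcArray.length - 1]
}

console.log(extractFilename('https://cdn.stocksnap.io/img-thumbs/960w/green-leaf_BVKZ4QW8LS.jpg'))
// green-leaf_BVKZ4QW8LS.jpg
```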

Saving images to disk

// Fetch an image URL and stream the response straight to disk
function saveImageToDisk(url, filename){
    fetch(url)
    .then(res => {
        const dest = fs.createWriteStream(filename);
        res.body.pipe(dest)
    })
    .catch((err) => {
        console.log(err)
    })
}


// Run the script on auto-pilot
(async function(){
    let imageLinks = await extractImageLinks()
    console.log(imageLinks)

    imageLinks.forEach((image) => {
        let filename = `./images/${image.filename}`
        saveImageToDisk(image.src, filename)
    })
})()

Now let's decipher this little piece. In the anonymous IIFE, we run extractImageLinks to get the array containing the src and filename pairs. Since the function returns an array, we loop over it and pass the required parameters (url and filename) to saveImageToDisk. There, we use node-fetch to make a GET request to the url, and as the response comes down the wire, we concurrently pipe it into the filename destination, in this case a writable stream on our filesystem. This is very efficient because we are not waiting for the image to be fully loaded into memory before saving it to disk; instead, we save every chunk we get from the response directly.

Let's run the code, cross our fingers, and check out our images directory.

node index.js https://stocksnap.io

We should see some cool images in there. Wooo! You can add this to your portfolio. There are so many improvements that can be made to this little piece of software, such as allowing the user to specify the directory to download the images into, handling data URI images, proper error handling, code refactoring, and creating a standalone CLI utility for it (hint: use the commander npm package for that). You can go ahead and extend this app, and I'll be glad to see what improvements you make to it.
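As a first pass at one of those improvements, the output directory can be taken from a second command-line argument with plain process.argv handling, before reaching for commander. parseArgs here is a hypothetical helper, not part of the tutorial code:

```javascript
// Sketch: minimal argument handling so users can pass both a page
// URL and an output directory, e.g. `node index.js <url> <dir>`.
// parseArgs is a hypothetical helper name.
function parseArgs(argv) {
    return {
        baseURL: argv[2] || 'https://stocksnap.io',
        outDir: argv[3] || './images'
    }
}

const { baseURL, outDir } = parseArgs(process.argv)
console.log(`Downloading from ${baseURL} into ${outDir}`)
```

extractImageLinks and the IIFE would then read baseURL and outDir from this one place instead of touching process.argv directly.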

Full code

'use strict';

const puppeteer = require('puppeteer');
const fetch = require('node-fetch');
const fs = require('fs')

// Browser and page instance
async function instance(){
    const browser = await puppeteer.launch({
        headless: false
    })

    const page = await browser.newPage()
    return {page, browser}
}

// Extract all imageLinks from the page
async function extractImageLinks(){
    const {page, browser} = await instance()

    // Get the page url from the user
    let baseURL = process.argv[2] ? process.argv[2] : "https://stocksnap.io"

    try {
        await page.goto(baseURL, {waitUntil: 'networkidle0'})
        await page.waitForSelector('body')

        let imageLinks = await page.evaluate(() => {
            let imgTags = Array.from(document.querySelectorAll('img'))

            let imageArray = []

            imgTags.forEach((image) => {
                let src = image.src

                let srcArray = src.split('/')
                let pos = srcArray.length - 1
                let filename = srcArray[pos]

                imageArray.push({
                    src,
                    filename
                })
            })

            return imageArray
        })

        await browser.close()
        return imageLinks

    } catch (err) {
        console.log(err)
    }
}

(async function(){
    console.log("Downloading images...")

    let imageLinks = await extractImageLinks()

    imageLinks.forEach((image) => {
        let filename = `./images/${image.filename}`
        saveImageToDisk(image.src, filename)
    })

    console.log("Download complete, check the images folder")
})()

// Fetch an image URL and stream the response straight to disk
function saveImageToDisk(url, filename){
    fetch(url)
    .then(res => {
        const dest = fs.createWriteStream(filename);
        res.body.pipe(dest)
    })
    .catch((err) => {
        console.log(err)
    })
}

Shameless plug :blush:

If you enjoyed this article and are feeling super pumped, I run :link: webscrapingzone.com, where I teach advanced web scraping techniques by building real-world projects, and how you can monetize your web scraping skills instantly without even being hired. It's still in its beta stage, but you can join the waiting list and get :boom: 50% :boom: off when the course is released.

You can follow me on twitter - @microworlds

Thank you for your time :+1: