
Scrape domain URLs and write results to BigQuery


My love affair with the Google Cloud Platform continues, and I've never felt more inspired to write content and try things out. After starting with the Snowplow Analytics setup guide and continuing with the Lighthouse audit automation tutorial, I'll show you another cool thing you can do with GCP.

In this guide, I'll show you how to use an open-source web crawler running in a Google Compute Engine (GCE) virtual machine instance to scrape all the internal and external links of a given domain, and write the results into a BigQuery table. With this setup, you can audit and monitor the links on any website, look for broken status codes or missing pages, and fix them to improve your site's link structure.



How it works

The idea is fairly simple. You use a Google Compute Engine VM instance to run the crawler script. The point is that you can scale the instance up as much as you want (and can afford), to get extra power you might not have with your local machine.

Compute Engine instance

The crawler runs across the pages of the domain you specify in a configuration file, and writes the results into a BigQuery table.

There are only a few moving parts here. Whenever you want to run the crawl again, all you have to do is start the instance again. You won't be charged while the instance is stopped (the script will stop the instance automatically once the crawl has finished), so you can simply leave the instance in the stopped state until you need to re-crawl.

You can also create a Google Cloud Function that starts the instance with a trigger (an HTTP request or a Pub/Sub message, for example). There are many ways to skin this cat, too!
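For example, restarting the stopped instance from the command line is a one-liner (this uses the instance name and zone chosen later in this guide); a Cloud Function would effectively do the same thing through the Compute Engine API:

$ gcloud compute instances start web-scraper-gcp --zone=europe-north1-a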

The configuration also has a setting for using a Redis cache, e.g. via GCP Memorystore. The cache is useful if you have a huuuuuuge domain to crawl and want to be able to pause/resume the crawl, or even use more than one VM instance to perform the crawl.
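If you do decide to use Memorystore, creating a small Redis instance and looking up its IP address could look roughly like this (the instance name, size, and region here are just example values, and the Memorystore API needs to be enabled for the project):

$ gcloud redis instances create web-scraper-cache --size=1 --region=europe-north1
$ gcloud redis instances describe web-scraper-cache --region=europe-north1 --format="value(host)"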

The cost of running this setup really depends on how much you crawl and how much power you allocate to the VM instance.

On my website, crawling around 7,500 links and images takes about 10 minutes on a 16-CPU, 60 GB VM instance (without Redis). This translates to about 50 cents per crawl. I could downsize the instance for less, and I'm sure there are other ways to optimize it, too.

Preparations

The preparations are almost the same as in my previous articles, but with some simplifications.

Installing command line tools

Start by installing the following CLI tools:

  1. Google Cloud SDK

  2. Git

To check that these tools are up and running, run the following commands in your terminal:

$ gcloud -v
Google Cloud SDK 228.0.0

$ git --version
git version 2.19.2

Set up a new Google Cloud Platform project with billing

Follow the steps here, and be sure to write down the project ID, because you'll need it in a number of places. I will use my example ID, web-scraper-gcp, in this guide.

new project
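If you prefer the command line, the project can also be created and set as the default with gcloud (billing still needs to be linked in the console); the project ID below is simply the example used in this guide:

$ gcloud projects create web-scraper-gcp
$ gcloud config set project web-scraper-gcp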

Clone the GitHub repo and edit the configuration

Before you can get things up and running in GCP, you'll need to create the configuration file first.

The easiest way to access the necessary files is to clone the Github repo of this project.

  1. Browse to a local directory where you want to write the contents of the repo to.

  2. Run the following command to write the files into a new folder named web-scraper-gcp/:

$ git clone https://github.com/sahava/web-scraper-gcp.git

web-scraper-gcp directory

After that, run the command mv config.json.sample config.json while in the web-scraper-gcp/ directory.

Finally, open the config.json file for editing in your favorite text editor. Here's what the sample file looks like:


  "domain": "www.gtmtools.com",
  "startUrl": "https://www.gtmtools.com/",
  "projectId": "web-scraper-gcp",
  "bigQuery": 
    "datasetId": "web_scraper_gcp",
    "tableId": "crawl_results"
  ,
  "redis": 
    "active": false,
    "host": "10.0.0.3",
    "port": 6379
  

Below is an explanation of what the fields mean and what you need to do with them.

"domain" (e.g. "gtmtools.com"): Used to determine which URLs are internal and which are external. The check is a pattern match, so if a crawled URL includes this string, it is treated as an internal URL.

"startUrl" (e.g. "https://www.gtmtools.com/"): A fully qualified URL that is the entry point of the crawl.

"projectId" (e.g. "web-scraper-gcp"): The Google Cloud Platform project ID.

"bigQuery.datasetId" (e.g. "web_scraper_gcp"): The ID of the BigQuery dataset the script will try to create. It must follow the dataset naming rules.

"bigQuery.tableId" (e.g. "crawl_results"): The ID of the BigQuery table the script will try to create. It must follow the table naming rules.

"redis.active" (e.g. false): Set to true if you want to use a Redis cache for crawl queue persistence.

"redis.host" (e.g. "10.0.0.3"): Set to the IP address through which the script can connect to the Redis instance.

"redis.port" (e.g. 6379): Set to the port number of the Redis instance (usually 6379).

Once you have edited the configuration, you will need to upload it to a Google Cloud Storage bucket.

Upload configuration to GCS

Browse to https://console.cloud.google.com/storage/browser and make sure the correct project is selected.

Correct GCS project

Next, create a new bucket in a region near you, giving it a name that is easy to remember.

New GCS Bucket

Once done, enter the bucket, choose Upload files, and select the config.json file from your local computer to upload it into the bucket.

Upload file to GCS
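Alternatively, a gsutil copy from the command line should achieve the same thing, assuming your bucket is named web-scraper-config as in the installation script below:

$ gsutil cp ./config.json gs://web-scraper-config/config.json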

Edit the installation script

The Git repository you cloned comes with a file named gce-install.sh. This script will be used to start the virtual machine instance with the correct settings (and it will start the crawl on instance startup). However, you will need to edit the file so that it knows where to fetch your configuration file from. So, open the gce-install.sh file for editing.

Edit the following line:

bucket='gs://web-scraper-config/config.json'

Change the web-scraper-config part to the name of the bucket you just created. So if you named the bucket my-configuration-bucket, you would change the line to this:

bucket='gs://my-configuration-bucket/config.json'

Ensure that the required services are enabled in the Google Cloud Platform

The final preparatory step is to ensure that the required services are enabled in your Google Cloud Platform project. You can do this in the console as described below, or from the command line with the command shown after the list.

  1. Browse here, and make sure the Compute Engine API is enabled.

  2. Browse here, and make sure the BigQuery API is enabled.

  3. Browse here, and make sure the Google Cloud Storage API is enabled.
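If you prefer the command line, all three APIs can be enabled with a single command (these are the standard service identifiers):

$ gcloud services enable compute.googleapis.com bigquery.googleapis.com storage.googleapis.com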

Create the GCE VM instance

You are now ready to create the Google Compute Engine instance and run it using the installation script. Here's what will happen when you do that:

  1. Once the instance is created, the gce-install.sh script will be run. In fact, this script will be run every time you start the instance again.

  2. The script will install all the dependencies required to run the web crawler. There are quite a few of them, because running a headless Chrome browser in a virtual machine is not the simplest of processes.

  3. The penultimate step of the installation script is to run the Node application that contains the code performing the crawl job.

  4. The Node app will get the startUrl and the BigQuery details from the configuration file (downloaded from the GCS bucket), and it will crawl the domain, writing the results into BigQuery.

  5. Once the crawl is complete, the VM instance will shut itself down.

To create the instance, you will need to run this command:

$ gcloud compute instances create web-scraper-gcp \
      --metadata-from-file=startup-script=./gce-install.sh \
      --scopes=bigquery,cloud-platform \
      --machine-type=n1-standard-16 \
      --zone=europe-north1-a

Edit the machine-type and zone parameters if you want the instance to run on a different CPU/memory profile, and/or if you want it to run in a different region. You can find a list of machine types here, and a list of regions and zones here.

Once you’re done, you’ll see something like this:

GCE creation command

Check to see if it works

First, go to the list of instances, and make sure you see the running instance (you’ll see a green checkmark next to it):

Running a VM instance

Of course, the fact that it runs doesn’t tell you much, just yet.

Next, go to BigQuery. You should see your project in the navigator, so click it to open it. Within the project, you should see a dataset and a table.

dataset and table

If you see them, the next step is to run a simple query in the query editor. Click the name of the table in the navigator, then click the Query Table link. The query editor should now be pre-populated with a query against the table; between the SELECT and FROM keywords, type count(*). This is what the query should eventually look like:

select count

Finally, click the Run button. This will run the query against the BigQuery table. The crawl may still be running, but thanks to streaming inserts, rows are constantly being added to the table. The query should return a result showing you how many rows the table currently has:

Test the query
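If you'd rather check from the command line, the same count can be run with the bq tool, using the example project, dataset, and table IDs from the configuration above:

$ bq query --use_legacy_sql=false 'SELECT count(*) FROM `web-scraper-gcp.web_scraper_gcp.crawl_results`'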

If you see a result, then the whole thing works! Continue to monitor the size of the table. Once the crawl is over, the virtual machine instance will shut down, and you will see it in its stopped state.

stopped instance
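You can also check the instance state from the command line:

$ gcloud compute instances list --filter="name=web-scraper-gcp"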

Final thoughts

First of all, this was an exercise more than anything else. I'm all too familiar with cool crawling tools like Screaming Frog, which you can use to achieve the same thing. However, this setup has some great features:

  1. You can modify the crawler with additional options, and you can pass flags to the Puppeteer instance running in the background.

  2. Since this crawler uses a headless browser, it works better on dynamically generated sites than a regular HTTP request crawler. It actually renders the JavaScript and crawls dynamically generated links as well.

  3. Since it writes the data to BigQuery, you can monitor the status codes and link integrity of your website in tools like Google Data Studio (see the example query after this list).
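As a sketch of the kind of monitoring query you could build on, something along these lines would list crawled URLs that returned an error status. Note that the column names url and status are assumptions for illustration; check the actual schema of the crawl_results table that the crawler creates.

# NOTE: column names (url, status) are illustrative assumptions –
# check the schema of the crawl_results table for the real field names.
$ bq query --use_legacy_sql=false 'SELECT url, status FROM `web-scraper-gcp.web_scraper_gcp.crawl_results` WHERE status >= 400'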

Anyway, I didn't set out to create a tool that replaces things that already exist. Instead, I wanted to show you how easy it is to run scripts and perform tasks in the Google Cloud.

Let me know in the comments if you're having trouble with this setup! I'd be glad to help figure out where the problem lies.


