By Shaii

In this tutorial, we will learn how to build a web scraper with Go and Colly, and how to save the scraped data to a JSON file. Not every site offers an API. When one doesn’t, you can write a small web scraper to get the data you need.

We’re going to be working with Go and the Colly package. The Colly package will allow us to crawl, scrape and traverse the DOM.

Prerequisites

To follow along, you will need to have Go installed.

Setting up project directory

Let’s get started. First, change into the directory where your projects are stored. In my case this is the “Sites” folder; it may be different for you. Here we will create our project folder, called rhino-scraper.

cd Sites
mkdir rhino-scraper
cd rhino-scraper

In our rhino-scraper project folder, we’ll create our main.go file. This will be the entry point of our app.

touch main.go

Initialising Go modules

We will be using Go modules to handle dependencies in our project.

Running the following command will create a go.mod file.

go mod init example.com/rhino-scraper

We’re going to be using the Colly package to build our web scraper, so let’s install that now by running:

go get github.com/gocolly/colly

You will notice that running the above command created a go.sum file. This file lists the checksums and versions of our direct and indirect dependencies. Go uses it to validate each dependency and confirm that none of them have been modified.

In the main.go file we created earlier, let’s set up a basic package main and func main().

package main

func main() {}

Analysing the target page structure

For this tutorial we will be scraping some rhino facts from FactRetriever.com.

Below is a screenshot taken from the target page. We can see that each fact has a simple structure consisting of an id and a description.

Image: FactRetriever.com Rhino Facts

Creating the fact struct

In our main.go file, we can write a Fact struct type to represent the structure of a rhino fact. A fact will have:

  • an ID that will be of type int, and
  • a description that will be of type string.

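The Fact struct type, the ID field and the Description field are all capitalised so that they are exported, that is, visible outside of package main. This matters for the fields in particular: the encoding/json package, which we will use later to save our data, only encodes exported fields. The json tags set the key names used in the JSON output.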

package main

type Fact struct {
	ID          int    `json:"id"`
	Description string `json:"description"`
}

func main() {}
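
To see what the capitalised fields and the json tags do in practice, here is a minimal sketch that uses only the standard library. The unexportedFact type, the toJSON helper and the sample values are hypothetical and exist only for this comparison.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Fact mirrors the struct from the tutorial: exported fields with JSON tags.
type Fact struct {
	ID          int    `json:"id"`
	Description string `json:"description"`
}

// unexportedFact is a hypothetical counter-example. Its lowercase fields
// are not exported, so encoding/json leaves them out of the output.
type unexportedFact struct {
	id          int
	description string
}

// toJSON marshals a value and returns the JSON as a string.
func toJSON(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// prints {"id":1,"description":"sample text"}
	fmt.Println(toJSON(Fact{ID: 1, Description: "sample text"}))

	// prints {}
	fmt.Println(toJSON(unexportedFact{id: 1, description: "sample text"}))
}
```

The first line of output uses the lowercase key names from the tags, while the second is an empty object because none of its fields are exported.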

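Inside func main, we will create an empty slice to hold our facts. We will initialise it with length zero and append to it as we go. This slice will only be able to hold Facts. Note that Go refuses to compile a program that declares a variable or imports a package without using it, so the intermediate listings from here on will only build once we reach the complete version in “Begin crawling and scraping”.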

package main

type Fact struct {
	ID          int    `json:"id"`
	Description string `json:"description"`
}

func main() {
	allFacts := make([]Fact, 0)
}
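
As a small illustration of the make-then-append pattern, the sketch below builds a slice of Facts from a list of strings. The collect helper and the sample descriptions are hypothetical; in the tutorial the facts will come from the scraped page instead.

```go
package main

import "fmt"

// Fact has the same shape as the tutorial's struct (tags omitted here).
type Fact struct {
	ID          int
	Description string
}

// collect is a hypothetical helper that appends one Fact per description,
// numbering them from 1.
func collect(descriptions []string) []Fact {
	facts := make([]Fact, 0) // empty but non-nil; append grows it as needed
	for i, d := range descriptions {
		facts = append(facts, Fact{ID: i + 1, Description: d})
	}
	return facts
}

func main() {
	facts := collect([]string{"first sample", "second sample"})
	fmt.Println(len(facts), facts[1].ID) // prints 2 2
}
```

Starting from make([]Fact, 0) rather than a nil slice means that an empty result is later encoded as [] instead of null in JSON.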

Using the Colly package

We will be importing a package called colly to provide us with the methods and functionality we’ll need to build our web scraper.

package main

import "github.com/gocolly/colly"

type Fact struct {
	ID          int    `json:"id"`
	Description string `json:"description"`
}

func main() {
	allFacts := make([]Fact, 0)
}

Using the colly package, let’s create a new collector and set its allowed domains to factretriever.com and www.factretriever.com.

package main

import "github.com/gocolly/colly"

type Fact struct {
	ID          int    `json:"id"`
	Description string `json:"description"`
}

func main() {
	allFacts := make([]Fact, 0)

	collector := colly.NewCollector(
		colly.AllowedDomains("factretriever.com", "www.factretriever.com"),
	)
}
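
To make the idea of an allow-list concrete, here is a minimal sketch that uses only the standard library. The hostAllowed helper is hypothetical and is not Colly’s implementation. It assumes the check is an exact hostname comparison, which would explain why both the bare domain and the www domain are listed above.

```go
package main

import (
	"fmt"
	"net/url"
)

// hostAllowed reports whether the host of rawURL appears in allowed.
// Illustration only: an assumed exact-match check, not Colly's own code.
func hostAllowed(rawURL string, allowed []string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	for _, domain := range allowed {
		if u.Hostname() == domain {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{"factretriever.com", "www.factretriever.com"}

	fmt.Println(hostAllowed("https://www.factretriever.com/rhino-facts", allowed)) // prints true
	fmt.Println(hostAllowed("https://example.com/", allowed))                      // prints false
}
```

With a restriction like this in place, a scraper that follows links cannot wander off to other sites.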

HTML structure of a list of facts

If we inspect the HTML structure, we will see that the facts are list items inside an unordered list that has the class of factsList. Each fact list item has been assigned an id. We will use this id later.

Image: HTML structure of Facts list

Now that we know what the HTML structure is like, we can write some code to traverse the DOM. The colly package makes use of a library called goquery to interact with the DOM. goquery is like jQuery, but for Go.

Below is the code so far. We will go over the new lines step by step.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
package main

import (
	"fmt"
	"log"
	"strconv"

	"github.com/gocolly/colly"
)

type Fact struct {
	ID          int    `json:"id"`
	Description string `json:"description"`
}

func main() {
	allFacts := make([]Fact, 0)

	collector := colly.NewCollector(
		colly.AllowedDomains("factretriever.com", "www.factretriever.com"),
	)

	collector.OnHTML(".factsList li", func(element *colly.HTMLElement) {
		factId, err := strconv.Atoi(element.Attr("id"))
		if err != nil {
			log.Println("Could not get id")
		}
		
		factDesc := element.Text

		fact := Fact{
			ID:          factId,
			Description: factDesc,
		}

		allFacts = append(allFacts, fact)
	})
}
  • Line 3-6
    • We import the fmt, log and strconv packages (fmt is first used in the next section)
  • Line 23
    • We are using the OnHTML method. It takes two arguments. The first argument is a target selector and the second argument is a callback function that is called every time an element matching the selector is found
  • Line 24
    • In the body of the callback, we create a variable to store the ID of the current element
    • The id attribute is a string, so we use strconv.Atoi to convert it to type int
  • Line 25-27
    • The strconv.Atoi function returns an error as its second return value, so we do some basic error handling
  • Line 29
    • We create a variable called factDesc to store the description text of each fact. Based on the Fact struct type we established earlier, we are expecting the fact description to be of type string.
  • Line 31-34
    • Here, we create a new Fact struct for every list item we iterate over
  • Line 36
    • Then we append the Fact struct to the allFacts slice
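
One thing to be aware of in the callback above: when strconv.Atoi fails, it returns 0, so the fact is still appended with an ID of 0. The sketch below shows one possible way to skip such items instead. The newFact helper and the sample inputs are hypothetical, and trimming the text with strings.TrimSpace is an extra step that the tutorial code does not perform.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Fact has the same shape as the tutorial's struct.
type Fact struct {
	ID          int    `json:"id"`
	Description string `json:"description"`
}

// newFact is a hypothetical helper that builds a Fact from an id attribute
// and the element text. It returns an error when the id is not a number.
func newFact(idAttr, text string) (Fact, error) {
	id, err := strconv.Atoi(idAttr)
	if err != nil {
		return Fact{}, fmt.Errorf("invalid fact id %q: %w", idAttr, err)
	}
	return Fact{ID: id, Description: strings.TrimSpace(text)}, nil
}

func main() {
	// Sample inputs: one valid id, one empty and one non-numeric.
	inputs := [][2]string{
		{"12", "  sample text \n"},
		{"", "no id attribute"},
		{"fact-3", "non-numeric id"},
	}

	facts := make([]Fact, 0)
	for _, in := range inputs {
		fact, err := newFact(in[0], in[1])
		if err != nil {
			fmt.Println("skipping:", err)
			continue // do not append a fact with a bogus ID
		}
		facts = append(facts, fact)
	}
	fmt.Println("kept", len(facts), "fact(s)")
}
```

Inside an OnHTML callback the equivalent of continue would be an early return.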

Begin crawling and scraping

We want to have some visual feedback to let us know that our scraper is actually visiting the page. Let’s do that now.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
package main

import (
	"fmt"
	"log"
	"strconv"

	"github.com/gocolly/colly"
)

type Fact struct {
	ID          int    `json:"id"`
	Description string `json:"description"`
}

func main() {
	allFacts := make([]Fact, 0)

	collector := colly.NewCollector(
		colly.AllowedDomains("factretriever.com", "www.factretriever.com"),
	)

	collector.OnHTML(".factsList li", func(element *colly.HTMLElement) {
		factId, err := strconv.Atoi(element.Attr("id"))
		if err != nil {
			log.Println("Could not get id")
		}
		
		factDesc := element.Text

		fact := Fact{
			ID:          factId,
			Description: factDesc,
		}

		allFacts = append(allFacts, fact)
	})
	
	collector.OnRequest(func(request *colly.Request) {
		fmt.Println("Visiting", request.URL.String())
	})

	collector.Visit("https://www.factretriever.com/rhino-facts")
}
  • Line 39-41
    • We use fmt.Println to output a Visiting message whenever we request a URL
  • Line 43
    • We use the Visit() method to give our program a starting point

Now we can run our program in the terminal with the command

go run main.go

The output tells us that our collector visited the rhino facts page on FactRetriever.com.

Saving our data to JSON

We may want to use our scraped data in another place. So let’s save it to a JSON file.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
	"strconv"

	"github.com/gocolly/colly"
)

type Fact struct {
	ID          int    `json:"id"`
	Description string `json:"description"`
}

func main() {
	allFacts := make([]Fact, 0)

	collector := colly.NewCollector(
		colly.AllowedDomains("factretriever.com", "www.factretriever.com"),
	)

	collector.OnHTML(".factsList li", func(element *colly.HTMLElement) {
		factId, err := strconv.Atoi(element.Attr("id"))
		if err != nil {
			log.Println("Could not get id")
		}
		factDesc := element.Text

		fact := Fact{
			ID:          factId,
			Description: factDesc,
		}

		allFacts = append(allFacts, fact)
	})

	collector.OnRequest(func(request *colly.Request) {
		fmt.Println("Visiting", request.URL.String())
	})

	collector.Visit("https://www.factretriever.com/rhino-facts")

	writeJSON(allFacts)
}

func writeJSON(data []Fact) {
	file, err := json.MarshalIndent(data, "", " ")
	if err != nil {
		log.Println("Unable to create json file")
		return
	}

	_ = os.WriteFile("rhinofacts.json", file, 0644)
}
  • Line 4
    • We import the encoding/json package, which provides MarshalIndent
  • Line 7
    • We import the os package
    • The os package provides an interface to operating system functionality, including writing files
  • Line 49
    • We create a function called writeJSON that takes one parameter of type slice of Fact
  • Line 50
    • Inside the function body, we use json.MarshalIndent to marshal the data we pass in
    • MarshalIndent returns the JSON encoding of data and an error
  • Line 51-54
    • Some error handling. If we get an error here, we just print a log message saying we were unable to create a JSON file
  • Line 56
    • We then use os.WriteFile (available since Go 1.16; older versions offer the same function as ioutil.WriteFile) to write our JSON-encoded data to a file called "rhinofacts.json"
    • If the file does not exist yet, WriteFile creates it with the permissions code of 0644; if it does, its contents are replaced. Note that the error WriteFile returns is discarded here.

Our writeJSON function is ready to use. We call it on Line 46 and pass allFacts to it.

Now if we go back to the terminal and run the command go run main.go, all our scraped rhino facts will be saved in a JSON file called "rhinofacts.json".
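
If you would rather not discard the error from writing the file, one possible variant is sketched below. It uses only the standard library (os.WriteFile, os.ReadFile and os.MkdirTemp need Go 1.16 or later). The saveFacts, loadFacts and roundTrip helpers, the temporary directory and the sample fact are all hypothetical; they are there so the example can run without scraping anything.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Fact has the same shape as the tutorial's struct.
type Fact struct {
	ID          int    `json:"id"`
	Description string `json:"description"`
}

// saveFacts is a hypothetical variant of writeJSON that returns errors
// to the caller instead of logging or discarding them.
func saveFacts(path string, facts []Fact) error {
	data, err := json.MarshalIndent(facts, "", " ")
	if err != nil {
		return fmt.Errorf("encoding facts: %w", err)
	}
	if err := os.WriteFile(path, data, 0644); err != nil {
		return fmt.Errorf("writing %s: %w", path, err)
	}
	return nil
}

// loadFacts reads a JSON file written by saveFacts back into a slice.
func loadFacts(path string) ([]Fact, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var facts []Fact
	if err := json.Unmarshal(data, &facts); err != nil {
		return nil, err
	}
	return facts, nil
}

// roundTrip saves facts to a file in a temporary directory, loads them
// back, and removes the directory again.
func roundTrip(facts []Fact) ([]Fact, error) {
	dir, err := os.MkdirTemp("", "rhino-facts")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "rhinofacts.json")
	if err := saveFacts(path, facts); err != nil {
		return nil, err
	}
	return loadFacts(path)
}

func main() {
	facts, err := roundTrip([]Fact{{ID: 1, Description: "sample text"}})
	if err != nil {
		fmt.Println("round trip failed:", err)
		return
	}
	fmt.Println(len(facts), facts[0].Description) // prints 1 sample text
}
```

Reading the file back is also a quick way to confirm that the JSON keys match the struct tags.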

Conclusion

In this tutorial, you learnt how to build a web scraper with Go and the Colly package. If you enjoyed this article and you’d like more, consider following Div Rhino on YouTube.

Congratulations, you did great. Keep learning and keep coding!

Resources