A Simple Golang Crawler for the Douban Movie Top250

  Data analysis, golang, Web crawler

Crawling the Douban Movie Top250

Writing a crawler is a standard exercise, and the data it collects is fun to look at. Let's start with the simplest, most basic crawler.

Project address: https://github.com/go-crawler …

Target

Our target site is the Douban Movie Top250, which most readers are probably already familiar with.

This time we select eight fields for a simple summary analysis. The fields are as follows:

[Image: the eight selected fields]

A quick analysis of the target source

  • Each page lists 25 items.
  • The list is paginated (10 pages in total), and the pagination follows a regular pattern.
  • The fields of each item appear in a consistent, fixed order.
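Because the pagination is regular, the ten page addresses could also be generated rather than discovered. The sketch below assumes that the list lives at https://movie.douban.com/top250 and that each page is selected by a `start` offset in the query string. Both the URL pattern and the `PageURLs` helper are assumptions for illustration and are not taken from the project, which reads the links from the paginator instead.

```go
package main

import "fmt"

// Assumed values for illustration; check them against the live site.
const (
	baseURL  = "https://movie.douban.com/top250" // assumed list address
	pageSize = 25                                // items per page
	numPages = 10                                // pages in total
)

// PageURLs builds one URL per page using a start offset (0, 25, 50, ...).
func PageURLs(base string, pages, size int) []string {
	urls := make([]string, 0, pages)
	for p := 0; p < pages; p++ {
		urls = append(urls, fmt.Sprintf("%s?start=%d", base, p*size))
	}
	return urls
}

func main() {
	for _, u := range PageURLs(baseURL, numPages, pageSize) {
		fmt.Println(u)
	}
}
```

Generating the URLs avoids one parsing step, but it breaks silently if the site changes its page size or page count, which is one reason to read the paginator instead.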

Start

Because the amount of data is small, our crawling steps are as follows:

  • Parse the first page to get the list of all pages
  • Loop through the pages and parse the movie information on each one
  • Save the crawled movie information to storage
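The three steps above can be sketched as a small driver function. This is only an outline under stated assumptions: the `Movie` type, the `Crawl` function, and its three function parameters are hypothetical placeholders, and the stubs in `main` stand in for real fetching, parsing, and storage so that the sketch runs without network access.

```go
package main

import "fmt"

// Movie is a hypothetical record; the real project defines its own fields.
type Movie struct {
	Title string
}

// Crawl runs the three steps: list pages, parse each page, store results.
// The three function parameters are placeholders for real implementations.
func Crawl(
	listPages func() ([]string, error),
	parsePage func(url string) ([]Movie, error),
	store func(movies []Movie) error,
) (int, error) {
	pages, err := listPages()
	if err != nil {
		return 0, fmt.Errorf("list pages: %w", err)
	}

	total := 0
	for _, url := range pages {
		movies, err := parsePage(url)
		if err != nil {
			return total, fmt.Errorf("parse %q: %w", url, err)
		}
		if err := store(movies); err != nil {
			return total, fmt.Errorf("store: %w", err)
		}
		total += len(movies)
	}
	return total, nil
}

func main() {
	// Stub implementations so the sketch runs without network access.
	listPages := func() ([]string, error) { return []string{"page-1", "page-2"}, nil }
	parsePage := func(url string) ([]Movie, error) {
		return []Movie{{Title: "movie from " + url}}, nil
	}
	var saved []Movie
	store := func(movies []Movie) error {
		saved = append(saved, movies...)
		return nil
	}

	n, err := Crawl(listPages, parsePage, store)
	fmt.Println(n, len(saved), err)
}
```

Passing the steps in as functions keeps the loop easy to test. With only ten pages, a sequential loop is enough and there is no need for concurrency.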

Installation

$ go get -u github.com/PuerkitoBio/goquery

Run

$ go run main.go

Code snippet

1. Get all pages

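// ParsePages collects the page links from the paginator on the first page.
func ParsePages(doc *goquery.Document) (pages []Page) {
    // The first page is the one already loaded, so it has no relative URL.
    pages = append(pages, Page{Page: 1, Url: ""})
    doc.Find("#content > div > div.article > div.paginator > a").Each(func(i int, s *goquery.Selection) {
        // Skip links whose text is not a page number.
        page, err := strconv.Atoi(s.Text())
        if err != nil {
            return
        }
        url, _ := s.Attr("href")

        pages = append(pages, Page{
            Page: page,
            Url:  url,
        })
    })

    return pages
}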
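The `href` values collected by ParsePages are likely to be relative, and the first page is stored with an empty URL, so each entry needs to be resolved against the list address before it can be fetched. The sketch below shows one way to do that with the standard `net/url` package. The `Page` struct mirrors the two fields used above but is redeclared here as an assumption, and `AbsoluteURL`, the base address, and the sample `href` are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// Page mirrors the fields used by ParsePages; the real type may differ.
type Page struct {
	Page int
	Url  string
}

// AbsoluteURL resolves a page's (possibly empty or relative) href against base.
func AbsoluteURL(base string, p Page) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(p.Url)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(ref).String(), nil
}

func main() {
	base := "https://movie.douban.com/top250" // assumed list address
	pages := []Page{
		{Page: 1, Url: ""},
		{Page: 2, Url: "?start=25&filter="}, // assumed href shape
	}
	for _, p := range pages {
		u, err := AbsoluteURL(base, p)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println(p.Page, u)
	}
}
```

An empty reference resolves to the base address itself, so the first page needs no special case.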

2. Parse the Douban movie information

func ParseMovies(doc *goquery.Document) (movies []Movie) {
    doc.Find("#content > div > div.article > ol > li").Each(func(i int, s *goquery.Selection) {
        title := s.Find(".hd a span").Eq(0).Text()

        ...

        // The description line has the form "year / area / tag".
        movieDesc := strings.Split(DescInfo[1], "/")
        year := strings.TrimSpace(movieDesc[0])
        area := strings.TrimSpace(movieDesc[1])
        tag := strings.TrimSpace(movieDesc[2])

        star := s.Find(".bd .star .rating_num").Text()

        // Keep only the digits of the rating-count text.
        comment := strings.TrimSpace(s.Find(".bd .star span").Eq(3).Text())
        compile := regexp.MustCompile("[0-9]")
        comment = strings.Join(compile.FindAllString(comment, -1), "")

        quote := s.Find(".quote .inq").Text()

        ...

        log.Printf("i: %d, movie: %v", i, movie)

        movies = append(movies, movie)
    })

    return movies
}

Data

[Images: the crawled data]

What do you make of this data? I'm really curious =)