A Golang Crawler for the Autohome (汽车之家) Used Car Product Library

  Data analysis, golang, python, Web crawler

Crawling the Autohome Used Car Product Library

Project address: https://github.com/go-crawler …

Target

Recently we keep hearing people mention Autohome (汽车之家), and we were curious about used car prices in China. So the target site this time is Autohome's used car product library.

image

Analyzing the target source:

  • Each page has 24 listings.
  • Pagination exists, but this old product library misbehaves after page 100, so we crawl only up to page 99.
  • The full list of cities can be obtained.
  • More than 190,000 records (19w+) can be crawled in total.

Start

Crawl steps

  • Get all the cities
  • Push every city URL into the queue
  • Parse the structure of the used car listing page
  • Push the next-page URL into the queue
  • Loop to pull the used car data from every page
  • Repeat for each city in the queue
  • Wait until no new URLs remain in the queue
  • Store the crawled used car data
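The steps above can be sketched as a small worker pool around a URL queue. This is a minimal sketch, not the project's actual code: the Page type, the Crawl function and the injected fetch callback are hypothetical names, and the fetch callback stands in for the real download-and-parse step so the example runs without network access.

```go
package main

import (
	"fmt"
	"sync"
)

// Page is a hypothetical result of fetching one listing page:
// the records found on it and the URL of the next page, if any.
type Page struct {
	Cars []string
	Next string // empty when there is no next page
}

// Crawl seeds the queue with one URL per city, lets a pool of workers
// drain it, and re-queues each page's next link until none remain.
func Crawl(seeds []string, workers int, fetch func(url string) Page) []string {
	if workers < 1 {
		workers = 1
	}
	// Each processed URL adds at most one new URL, so the queue never
	// holds more than len(seeds) items and sends below cannot block.
	queue := make(chan string, len(seeds))
	var pending sync.WaitGroup // URLs queued but not yet fully processed
	var mu sync.Mutex          // guards cars
	var cars []string

	for _, u := range seeds {
		pending.Add(1)
		queue <- u
	}
	for i := 0; i < workers; i++ {
		go func() {
			for u := range queue {
				p := fetch(u)
				mu.Lock()
				cars = append(cars, p.Cars...)
				mu.Unlock()
				if p.Next != "" {
					pending.Add(1) // register the new URL before finishing this one
					queue <- p.Next
				}
				pending.Done()
			}
		}()
	}
	pending.Wait() // returns only when no URL is queued or in flight
	close(queue)   // lets the workers exit
	return cars
}

func main() {
	// A made-up two-city site used only to exercise the loop.
	site := map[string]Page{
		"/hangzhou/p1": {Cars: []string{"a", "b"}, Next: "/hangzhou/p2"},
		"/hangzhou/p2": {Cars: []string{"c"}},
		"/beijing/p1":  {Cars: []string{"d"}},
	}
	cars := Crawl([]string{"/hangzhou/p1", "/beijing/p1"}, 2,
		func(u string) Page { return site[u] })
	fmt.Println(len(cars)) // 4
}
```

The WaitGroup counts outstanding URLs rather than goroutines, which is one way to implement the "wait until no new URLs remain" step; the order of the collected records is not deterministic.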

Getting the cities

image

Looking through the page, we can see that the full list of used car cities is available in the city filter area, but the source has to be read carefully: the list is loaded by JS, and all the cities are held in a single variable.

image

There are two ways to extract it:

  • Parse the JS variable and extract the values
  • Copy areaJson out directly and parse it as a variable

Here we simply copy and paste it, because this value rarely changes.
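For the first option, parsing the JS variable, a rough sketch follows. It assumes the script contains an assignment to areaJson whose value is strict JSON; the City struct and its field names (Name, Pinyin) are hypothetical, since the real shape of areaJson is not shown here, and ParseAreaJSON is an illustrative name.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// City is a hypothetical shape for one areaJson entry;
// the real field names on the site may differ.
type City struct {
	Name   string `json:"Name"`
	Pinyin string `json:"Pinyin"`
}

// ParseAreaJSON locates "areaJson = <value>" in a script body and decodes
// the value. json.Decoder reads exactly one JSON value, so the trailing
// ";" and the rest of the script are left untouched.
func ParseAreaJSON(js string) ([]City, error) {
	i := strings.Index(js, "areaJson")
	if i < 0 {
		return nil, errors.New("areaJson not found")
	}
	rest := js[i+len("areaJson"):]
	eq := strings.Index(rest, "=")
	if eq < 0 {
		return nil, errors.New("no assignment after areaJson")
	}
	var cities []City
	dec := json.NewDecoder(strings.NewReader(rest[eq+1:]))
	if err := dec.Decode(&cities); err != nil {
		return nil, err
	}
	return cities, nil
}

func main() {
	// Made-up sample script, not copied from the site.
	sample := `var areaJson = [{"Name":"Hangzhou","Pinyin":"hangzhou"},{"Name":"Beijing","Pinyin":"beijing"}]; var other = 1;`
	cities, err := ParseAreaJSON(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, c := range cities {
		fmt.Println(c.Name, c.Pinyin)
	}
}
```

If the variable uses single quotes or unquoted keys it is not valid JSON and this approach fails, which is one more reason copying the value out by hand can be the simpler choice.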

Handling pagination

image

Analyzing the page shows that the paging links follow a regular pattern, for example /2sc/hangzhou/a0_0msdgscncgpi1ltocsp2exb4/. The pattern is sp%d: the page number follows sp.

Ordinarily, all the paging links could be predicted and pushed into the queue, and goroutines could then pull them quickly in one pass.
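As an illustration of that prediction approach, the sketch below builds the paging links from the example path. It assumes that only the city segment and the number after sp change between pages, which is inferred from the single example link; PageURL, AllPageURLs and maxPage are hypothetical names, and the cap of 99 comes from the page limit noted earlier.

```go
package main

import "fmt"

// maxPage is the last page considered reliable in this old product library.
const maxPage = 99

// PageURL fills the city and page number into the path pattern taken
// from the example link (assumed to be stable across cities and pages).
func PageURL(city string, page int) string {
	return fmt.Sprintf("/2sc/%s/a0_0msdgscncgpi1ltocsp%dexb4/", city, page)
}

// AllPageURLs predicts every paging link for one city, pages 1 to maxPage.
func AllPageURLs(city string) []string {
	urls := make([]string, 0, maxPage)
	for p := 1; p <= maxPage; p++ {
		urls = append(urls, PageURL(city, p))
	}
	return urls
}

func main() {
	fmt.Println(PageURL("hangzhou", 2))       // /2sc/hangzhou/a0_0msdgscncgpi1ltocsp2exb4/
	fmt.Println(len(AllPageURLs("hangzhou"))) // 99
}
```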

But this old product library has a problem: past page 100, the next-page link always points to page 101.

image

Therefore we take the more traditional approach of reading the next-page link from each page and following it, which also adapts to possible changes in the paging links. The pagination display after page 100 is also quite odd, so we ignore it for now.

Get used car data

The structure of the page is fairly fixed, so ordinary HTML parsing and cleanup is sufficient.

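import (
    "strconv"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

// GetCars extracts every used car entry from one listing page.
// QcCar, GetCityName and compileNumber (a compiled regexp that matches
// numbers) are defined elsewhere in the project.
func GetCars(doc *goquery.Document) (cars []QcCar) {
    cityName := GetCityName(doc)
    // Each li in the picture list is one car; entries with class "line" are skipped.
    doc.Find(".piclist ul li:not(.line)").Each(func(i int, selection *goquery.Selection) {
        title := strings.TrimSpace(selection.Find(".title a").Text())
        price := strings.TrimSpace(selection.Find(".detail .detail-r").Find(".colf8").Text())
        kilometer := selection.Find(".detail .detail-l").Find("p").Eq(0).Text()
        year := selection.Find(".detail .detail-l").Find("p").Eq(1).Text()

        // Keep only the numeric parts of the mileage and year strings.
        kilometer = strings.Join(compileNumber.FindAllString(kilometer, -1), "")
        year = strings.Join(compileNumber.FindAllString(strings.TrimSpace(year), -1), "")

        // Parse errors are ignored; a field that fails to parse is left at zero.
        priceS, _ := strconv.ParseFloat(price, 64)
        kilometerS, _ := strconv.ParseFloat(kilometer, 64)
        yearS, _ := strconv.Atoi(year)

        cars = append(cars, QcCar{
            CityName:  cityName,
            Title:     title,
            Price:     priceS,
            Kilometer: kilometerS,
            Year:      yearS,
        })
    })

    return cars
}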
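The number cleanup can be tried in isolation. The sketch below is an assumption-laden stand-in, not the project's code: numberRe is a hypothetical pattern (the real compileNumber is not shown), FirstNumber is an illustrative helper, and the sample strings are made up to resemble mileage and year labels. Unlike the function above, it returns the parse error instead of discarding it.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// numberRe is a hypothetical stand-in for compileNumber: it matches an
// integer or a decimal number.
var numberRe = regexp.MustCompile(`[0-9]+(?:\.[0-9]+)?`)

// FirstNumber returns the first number found in s, or an error when the
// text contains no number at all.
func FirstNumber(s string) (float64, error) {
	m := numberRe.FindString(strings.TrimSpace(s))
	if m == "" {
		return 0, fmt.Errorf("no number in %q", s)
	}
	return strconv.ParseFloat(m, 64)
}

func main() {
	// Made-up label texts; the real page text may be formatted differently.
	for _, s := range []string{"3.5万公里", " 2015年 ", "面议"} {
		v, err := FirstNumber(s)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println(v)
	}
}
```

Surfacing the error makes it possible to skip or log a listing with a missing value rather than storing it as zero, which would otherwise pull averages down.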

Data

image

image

In the comparison of average prices by city, of the first-tier cities (Beijing, Shanghai, Guangzhou and Shenzhen), Beijing, Shanghai and Shenzhen all make the list, while Hangzhou, which has gained momentum in recent years, takes the top spot outright, with a clear gap to the cities behind it.

The remaining cities decline gradually from there. Used cars in first-tier cities do not appear to be cheap, though of course this is only the average price.
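The per-city average behind such a comparison is a simple group-and-divide. The sketch below shows one way to compute and rank it; the Car and CityAverage types are hypothetical trimmed-down stand-ins for the stored records, and the sample values are made up, not taken from the crawled data.

```go
package main

import (
	"fmt"
	"sort"
)

// Car is a hypothetical, trimmed-down record with only the fields
// needed for this calculation.
type Car struct {
	City  string
	Price float64
}

// CityAverage pairs a city with its mean listing price.
type CityAverage struct {
	City string
	Avg  float64
}

// AveragePriceByCity groups cars by city, averages the prices, and
// returns the cities ranked from highest to lowest average.
func AveragePriceByCity(cars []Car) []CityAverage {
	sum := make(map[string]float64)
	count := make(map[string]int)
	for _, c := range cars {
		sum[c.City] += c.Price
		count[c.City]++
	}
	out := make([]CityAverage, 0, len(sum))
	for city, s := range sum {
		out = append(out, CityAverage{City: city, Avg: s / float64(count[city])})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Avg != out[j].Avg {
			return out[i].Avg > out[j].Avg
		}
		return out[i].City < out[j].City // stable order for ties
	})
	return out
}

func main() {
	// Made-up sample values for illustration only.
	cars := []Car{{"A", 10}, {"A", 20}, {"B", 5}}
	for _, a := range AveragePriceByCity(cars) {
		fmt.Println(a.City, a.Avg)
	}
}
```

A mean is sensitive to a few very expensive listings, so a median per city would be a reasonable cross-check on the same data.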

image

This chart compares price with mileage (kilometers driven). The difference between the two is fairly large for Shanghai, Chengdu and Zhengzhou. If needed, price could be weighed against mileage in a further analysis.

image

This chart is a bit more interesting. We roughly totaled the kilometers for each city. None of the cities with higher average prices in the earlier charts appear here; instead Hohhot, Daqing, Zhongshan and others are at the top of the list.

Does this reflect that vehicles in first-tier cities are replaced more quickly, while those in lower-tier cities are replaced more slowly, with mileage otherwise roughly even?

image

Analyzing the titles shows that entries in the product library are generally named as brand + automatic/manual + XXXX + attributes, so the title alone gives a rough picture of the vehicle.
