70 Lines of Python Code to Bulk Download Wallpapers


Project address: https://github.com/jrainlau/w …

Preface

I haven't written an article in a while, because I've been settling into a new job and spending my spare time learning Python. This article summarizes a recent stage of that learning: I built a crawler that batch-downloads high-definition wallpapers from a wallpaper website.

Note: the project in this article is for Python learning only and must not be used for any other purpose!

Initializing the project

The project uses virtualenv to create a virtual environment, so as not to pollute the global environment. Install it directly with pip3:

pip3 install virtualenv

Then create a wallpaper-downloader directory in a suitable place, and use virtualenv to create a virtual environment named venv inside it:

virtualenv venv

. venv/bin/activate

Next, create the dependency file:

printf 'bs4\nlxml\nrequests\n' > requirements.txt

Finally, you can download and install the dependencies:

pip3 install -r requirements.txt

Analyzing the crawler's workflow

For simplicity, let's go straight to the wallpaper list page for the "aero" category: http://wallpaperswide.com/aer ….

[Screenshot: the "aero" wallpaper list page, showing the thumbnails]

As you can see, there are 10 wallpapers available for download on this page. However, what is displayed here are thumbnails, whose resolution is nowhere near enough for a wallpaper, so we need to enter the wallpaper detail page to find the high-definition download link. Clicking into the first wallpaper, you can see a new page:

[Screenshot: the wallpaper detail page]

Since my machine has a Retina screen, I plan to download the largest size directly to guarantee high definition (the size circled in red).

After understanding the concrete steps, we can use the developer tools to find the relevant DOM nodes and extract the corresponding URLs. That process won't be expanded here; readers can try it themselves, and we'll move on to the coding below.

Accessing pages

Create a new download.py file and import two libraries:

from bs4 import BeautifulSoup
import requests

Next, write a function dedicated to visiting a URL and returning the page HTML:

def visit_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    r.encoding = 'utf-8'
    return BeautifulSoup(r.text, 'lxml')

To avoid being blocked by the site's anti-crawling measures, we disguise the crawler as a normal browser by adding a UA to the headers, then specify utf-8 encoding, and finally return the parsed HTML as a BeautifulSoup object rather than a plain string.
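As a quick check, the function can be tried out like this (the URL just reuses the list-page pattern that appears later in the article; the printed title is whatever the site serves):

# Fetch the "aero" list page and print its <title> as a smoke test.
page = visit_page('http://wallpaperswide.com/aero-desktop-wallpapers/page/1')
print(page.title.get_text())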

Extract links

After obtaining the page's HTML, we need to extract the URLs of the wallpapers listed on that page:

def get_paper_link(page):
    links = page.select('#content > div > ul > li > div > div a')
    return [link.get('href') for link in links]

This function extracts the URLs of all wallpaper detail pages on the list page.
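Chained with visit_page, it gives a quick way to inspect what comes back (the exact hrefs depend on the site's current markup):

page = visit_page('http://wallpaperswide.com/aero-desktop-wallpapers/page/1')
links = get_paper_link(page)
print(str(len(links)) + ' wallpapers found on this page')  # the list pages show 10 each
for link in links:
    print(link)  # relative detail-page paths, joined with the domain later on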

Downloading wallpapers

Once we have the address of a detail page, we can go in and choose the appropriate size. Analyzing the page's DOM structure shows that each available size corresponds to a link:

[Screenshot: the list of available sizes on the detail page, each one a link]

So the first step is to extract the links for these sizes:

wallpaper_source = visit_page(link)
wallpaper_size_links = wallpaper_source.select('#wallpaper-resolutions > a')
size_list = [{
    'size': eval(link.get_text().replace('x', '*')),
    'name': link.get('href').replace('/download/', ''),
    'url': link.get('href')
} for link in wallpaper_size_links]

size_list is a collection of these links. To make it easy to pick the highest-definition (largest) wallpaper later, the size field uses eval: a label such as 5120x3200 is turned into the expression 5120*3200 and evaluated, and the product is used as the value of size.
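Incidentally, eval will execute whatever text the page happens to contain, so if you'd rather not trust the scraped string, a safer variant (my own sketch, not the original code) parses the resolution by hand:

def parse_size(text):
    # '5120x3200' -> 5120 * 3200; returns 0 for anything that isn't 'WxH'.
    try:
        width, height = text.strip().split('x')
        return int(width) * int(height)
    except ValueError:
        return 0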

Once you have the whole collection, you can use the max() method to select the highest-definition item:

biggest_one = max(size_list, key=lambda item: item['size'])
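max with a key function simply compares the computed products; a tiny standalone example of the idea:

sizes = [{'size': 1920 * 1080}, {'size': 5120 * 3200}, {'size': 2880 * 1800}]
print(max(sizes, key=lambda item: item['size']))  # -> {'size': 16384000}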

The url field of biggest_one is the download link for that size; after that, we only need the requests library to download the linked resource:

result = requests.get(PAGE_DOMAIN + biggest_one['url'])

if result.status_code == 200:
    open('wallpapers/' + biggest_one['name'], 'wb').write(result.content)

Note that you first need to create a wallpapers directory in the project root, otherwise the script will raise an error at runtime.
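If you prefer, the script can create the directory on its own; a small addition using os.makedirs (my own tweak, not part of the original script):

import os

os.makedirs('wallpapers', exist_ok=True)  # does nothing if the directory already exists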

Tidied up, the complete download_wallpaper function looks like this:

def download_wallpaper(link):
    wallpaper_source = visit_page(PAGE_DOMAIN + link)
    wallpaper_size_links = wallpaper_source.select('#wallpaper-resolutions > a')
    size_list = [{
        'size': eval(link.get_text().replace('x', '*')),
        'name': link.get('href').replace('/download/', ''),
        'url': link.get('href')
    } for link in wallpaper_size_links]

    biggest_one = max(size_list, key=lambda item: item['size'])
    result = requests.get(PAGE_DOMAIN + biggest_one['url'])

    if result.status_code == 200:
        open('wallpapers/' + biggest_one['name'], 'wb').write(result.content)
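With this in place, the pieces can already be chained together to grab a single wallpaper. A quick trial run, assuming PAGE_DOMAIN is defined as in the next section:

PAGE_DOMAIN = 'http://wallpaperswide.com'

page = visit_page(PAGE_DOMAIN + '/aero-desktop-wallpapers/page/1')
first_link = get_paper_link(page)[0]
download_wallpaper(first_link)  # saves the largest size into wallpapers/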

Batch operation

The steps above can only download the first wallpaper of the first list page. If we want to download all the wallpapers across multiple list pages, we need to call these methods repeatedly. First, let's define a few constants:

import sys

if len(sys.argv) != 4:
    print('3 arguments are required but only ' + str(len(sys.argv) - 1) + ' were found!')
    exit()

category = sys.argv[1]

try:
    page_start = [int(sys.argv[2])]
    page_end = int(sys.argv[3])
except ValueError:
    print('The second and third arguments must be numbers, not strings!')
    exit()

Here three values, category, page_start, and page_end, are taken from the command-line arguments; they correspond to the wallpaper category, the starting page number, and the ending page number. Note that page_start is wrapped in a single-element list: the start() function below needs to increment it, and mutating a list element is a simple way to rebind that state from inside a function.
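A minimal illustration of why the list wrapper matters (plain rebinding inside a function would not touch the outer name):

counter = [1]

def bump():
    counter[0] += 1  # mutates the shared list; 'counter += 1' on a bare int would raise UnboundLocalError here

bump()
print(counter[0])  # -> 2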

For convenience, define two more URL-related constants:

PAGE_DOMAIN = 'http://wallpaperswide.com'
PAGE_URL = 'http://wallpaperswide.com/' + category + '-desktop-wallpapers/page/'
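With category set to aero, for example, PAGE_URL + '1' yields http://wallpaperswide.com/aero-desktop-wallpapers/page/1, i.e. the list page we analyzed earlier.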

Next, we can happily run the batch job. Before that, let's define a start() function:

def start():
    if page_start[0] <= page_end:
        print('Preparing to download the ' + str(page_start[0]) + ' page of all the "' + category + '" wallpapers...')
        PAGE_SOURCE = visit_page(PAGE_URL + str(page_start[0]))
        WALLPAPER_LINKS = get_paper_link(PAGE_SOURCE)
        page_start[0] = page_start[0] + 1

        for index, link in enumerate(WALLPAPER_LINKS):
            download_wallpaper(link, index, len(WALLPAPER_LINKS), start)

Then rewrite the earlier download_wallpaper function so that it takes the index, the total count, and a callback; when the last wallpaper of a page finishes, it invokes the callback (start), which moves on to the next page until page_start[0] exceeds page_end:

def download_wallpaper(link, index, total, callback):
    wallpaper_source = visit_page(PAGE_DOMAIN + link)
    wallpaper_size_links = wallpaper_source.select('#wallpaper-resolutions > a')
    size_list = [{
        'size': eval(link.get_text().replace('x', '*')),
        'name': link.get('href').replace('/download/', ''),
        'url': link.get('href')
    } for link in wallpaper_size_links]

    biggest_one = max(size_list, key=lambda item: item['size'])
    print('Downloading the ' + str(index + 1) + '/' + str(total) + ' wallpaper: ' + biggest_one['name'])
    result = requests.get(PAGE_DOMAIN + biggest_one['url'])

    if result.status_code == 200:
        open('wallpapers/' + biggest_one['name'], 'wb').write(result.content)

    if index + 1 == total:
        print('Download completed!\n\n')
        callback()
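For comparison, the callback-based recursion could equally be written as a plain loop; a sketch using the same constants (not the author's version):

def start_iterative():
    # Same behavior as the callback-based start(), written as a straight loop.
    for page_number in range(page_start[0], page_end + 1):
        print('Preparing to download the ' + str(page_number) + ' page of all the "' + category + '" wallpapers...')
        links = get_paper_link(visit_page(PAGE_URL + str(page_number)))
        for index, link in enumerate(links):
            download_wallpaper(link, index, len(links), lambda: None)  # no-op callback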

Finally, specify the entry point:

if __name__ == '__main__':
    start()

Running the project

Enter the following command at the command line to test it:

python3 download.py aero 1 2

You can then see the following output:

[Screenshot: console output of the running script]

Capturing the traffic with Charles, you can see that the script runs smoothly:

[Screenshot: Charles capture of the script's requests]

At this point the download script is complete, and there's finally no need to worry about running out of wallpapers!