Project address: https://github.com/jrainlau/w …
Preface
I haven’t written an article for a long time, because I have recently been adapting to a new job and spending my spare time learning Python. This article is a summary of a recent stage of that learning: a crawler that batch-downloads high-definition wallpapers from a wallpaper website.
Note: the code in this article is for Python learning only and must not be used for any other purpose!
Initializing the project
The project uses virtualenv to create a virtual environment, so the global environment is not polluted. Install it directly with pip3:
pip3 install virtualenv
Then create a wallpaper-downloader directory in a suitable place, and use virtualenv to create a virtual environment named venv inside it:
virtualenv venv
. venv/bin/activate
Next, create the dependencies file:
printf 'bs4\nlxml\nrequests\n' > requirements.txt
Finally, you can download and install the dependencies:
pip3 install -r requirements.txt
Analyzing the crawler's workflow
For simplicity, let’s start directly from the wallpaper list page for the “aero” category: http://wallpaperswide.com/aer ….
As you can see, this page offers 10 wallpapers for download. However, what is shown here are thumbnails, whose resolution is far too low for a wallpaper, so we need to enter each wallpaper's detail page and find the high-definition download link there. Clicking into the first wallpaper, we see a new page:
Since my machine has a Retina screen, I plan to download the largest one directly to guarantee high definition (the size circled in red).
Once the concrete steps are clear, use the browser's developer tools to find the corresponding DOM nodes and extract the corresponding URLs. I won't expand on that process here; readers can try it themselves. Let's move on to the coding part below.
Accessing pages
Create a new download.py file, then import two libraries:
from bs4 import BeautifulSoup
import requests
Next, write a function dedicated to visiting a URL and returning the page HTML:
def visit_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
    }
    r = requests.get(url, headers = headers)
    r.encoding = 'utf-8'
    return BeautifulSoup(r.text, 'lxml')
To avoid being blocked by the site's anti-crawling mechanism, we disguise the crawler as a normal browser by adding a User-Agent to the headers, then specify utf-8 encoding, and finally return the parsed HTML as a BeautifulSoup object.
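As a quick sanity check, here is a minimal usage sketch of visit_page; it is my own snippet, not part of the original script, and simply fetches the site's domain (used later as PAGE_DOMAIN) and prints the page title:

page = visit_page('http://wallpaperswide.com')
# The returned object is a BeautifulSoup document, so it can be queried directly.
print(page.title.get_text())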
Extracting links
After obtaining the HTML of a list page, we need to extract the URLs of the wallpapers listed on it:
def get_paper_link(page):
    links = page.select('#content > div > ul > li > div > div a')
    return [link.get('href') for link in links]
This function extracts the detail-page URLs of all wallpapers on the list page.
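Assuming visit_page works as above, a hypothetical quick check of get_paper_link could look like this (the list-page URL follows the PAGE_URL pattern defined further below):

# Print how many detail-page links were found on page 1 of the "aero" category, and the first few of them.
links = get_paper_link(visit_page('http://wallpaperswide.com/aero-desktop-wallpapers/page/1'))
print(len(links))
print(links[:3])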
Downloading wallpapers
Once we have the address of a detail page, we can go in and choose the appropriate size. Analyzing the DOM structure of the page shows that every available size corresponds to a link:
So the first step is to extract these size links:
wallpaper_source = visit_page(link)
wallpaper_size_links = wallpaper_source.select('#wallpaper-resolutions > a')
size_list = [{
    'size': eval(link.get_text().replace('x', '*')),
    'name': link.get('href').replace('/download/', ''),
    'url': link.get('href')
} for link in wallpaper_size_links]
size_list is a collection of these links. To make it easy to select the highest-definition (largest) wallpaper next, I use eval() on the size text: a string like 5120x3200 is rewritten as 5120*3200 and evaluated directly, and the result is stored as the value of size.
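Note that eval() executes whatever text the page happens to provide, so a more defensive sketch would parse the resolution string by hand; the helper below is my own illustration, not part of the original script:

# Safer alternative to eval(): parse a string like "5120x3200" explicitly.
def parse_size(text):
    width, height = text.strip().split('x')
    return int(width) * int(height)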
Once we have the whole collection, we can use max() to pick out the highest-definition item:
biggest_one = max(size_list, key = lambda item: item['size'])
The url of this biggest_one is the download link for that size; after that, we only need to download the linked resource with the requests library:
result = requests.get(PAGE_DOMAIN + biggest_one['url'])
if result.status_code == 200:
    open('wallpapers/' + biggest_one['name'], 'wb').write(result.content)
Note that you first need to create a wallpapers directory in the project root, otherwise an error will be thrown at run time.
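If you prefer not to create the directory by hand, a small sketch using the standard library's os.makedirs near the top of the script would also work:

import os

# Create the output directory if it does not exist yet.
os.makedirs('wallpapers', exist_ok=True)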
Tidied up, the complete download_wallpaper function looks like this:
def download_wallpaper(link):
    wallpaper_source = visit_page(PAGE_DOMAIN + link)
    wallpaper_size_links = wallpaper_source.select('#wallpaper-resolutions > a')
    size_list = [{
        'size': eval(link.get_text().replace('x', '*')),
        'name': link.get('href').replace('/download/', ''),
        'url': link.get('href')
    } for link in wallpaper_size_links]
    biggest_one = max(size_list, key = lambda item: item['size'])
    result = requests.get(PAGE_DOMAIN + biggest_one['url'])
    if result.status_code == 200:
        open('wallpapers/' + biggest_one['name'], 'wb').write(result.content)
Batch operation
The steps above can only download a single wallpaper from the first list page. If we want to download all wallpapers across multiple list pages, we need to call these methods in a loop. First, define a few constants:
import sys

if len(sys.argv) != 4:
    print('3 arguments are required but only ' + str(len(sys.argv) - 1) + ' were found!')
    exit()

category = sys.argv[1]

try:
    page_start = [int(sys.argv[2])]
    page_end = int(sys.argv[3])
except:
    print('The second and third arguments must be numbers, not strings!')
    exit()
Here, three constants are taken from the command-line arguments: category, page_start and page_end, corresponding respectively to the wallpaper category, the starting page number and the ending page number. page_start is wrapped in a list so that its value can be updated in place inside the start() function defined later.
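As a side note, the same argument handling could be sketched with the standard library's argparse instead of raw sys.argv checks; the help texts below are my own, not from the original script:

import argparse

# Alternative argument parsing with argparse (not what the original script uses).
parser = argparse.ArgumentParser(description='Batch-download wallpapers from wallpaperswide.com')
parser.add_argument('category', help='wallpaper category, e.g. "aero"')
parser.add_argument('page_start', type=int, help='first list page to download')
parser.add_argument('page_end', type=int, help='last list page to download')
args = parser.parse_args()

category = args.category
page_start = [args.page_start]  # wrapped in a list so start() can update it in place, as in the original
page_end = args.page_end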
For convenience, define two more url-related constants:
PAGE_DOMAIN = 'http://wallpaperswide.com'
PAGE_URL = 'http://wallpaperswide.com/' + category + '-desktop-wallpapers/page/'
Next, we can happily run the batch downloads. Before that, let's define a start() function:
def start():
    if page_start[0] <= page_end:
        print('Preparing to download the ' + str(page_start[0]) + ' page of all the "' + category + '" wallpapers...')
        PAGE_SOURCE = visit_page(PAGE_URL + str(page_start[0]))
        WALLPAPER_LINKS = get_paper_link(PAGE_SOURCE)
        page_start[0] = page_start[0] + 1
        for index, link in enumerate(WALLPAPER_LINKS):
            download_wallpaper(link, index, len(WALLPAPER_LINKS), start)
Then rewrite the earlier download_wallpaper function:
def download_wallpaper(link, index, total, callback):
    wallpaper_source = visit_page(PAGE_DOMAIN + link)
    wallpaper_size_links = wallpaper_source.select('#wallpaper-resolutions > a')
    size_list = [{
        'size': eval(link.get_text().replace('x', '*')),
        'name': link.get('href').replace('/download/', ''),
        'url': link.get('href')
    } for link in wallpaper_size_links]
    biggest_one = max(size_list, key = lambda item: item['size'])
    print('Downloading the ' + str(index + 1) + '/' + str(total) + ' wallpaper: ' + biggest_one['name'])
    result = requests.get(PAGE_DOMAIN + biggest_one['url'])
    if result.status_code == 200:
        open('wallpapers/' + biggest_one['name'], 'wb').write(result.content)
    if index + 1 == total:
        print('Download completed!\n\n')
        callback()
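The callback passed into download_wallpaper is what lets start() move on to the next page once the last wallpaper of the current page is done. Purely as an illustration of the same flow, a non-recursive sketch (my own variant, assuming the functions and constants above) could iterate over the page range directly:

# Non-recursive variant of the batch loop; download_wallpaper gets a no-op callback.
def start_iterative():
    for page in range(page_start[0], page_end + 1):
        print('Preparing to download the ' + str(page) + ' page of all the "' + category + '" wallpapers...')
        wallpaper_links = get_paper_link(visit_page(PAGE_URL + str(page)))
        for index, link in enumerate(wallpaper_links):
            download_wallpaper(link, index, len(wallpaper_links), lambda: None)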
Finally, specify the entry point:
if __name__ == '__main__':
    start()
Running the project
Enter the following command at the command line to start a test run:
python3 download.py aero 1 2
You can then see the following output:
Capture the traffic with Charles, and you can see that the script is running smoothly:
At this point, the download script is complete, and we finally don't need to worry about running out of wallpapers!