This article is suitable for readers with or without any crawler or Node.js background ~
Requirements:
- Use Node.js to crawl web resources, with out-of-the-box configuration
- Output the crawled web page content in PDF format
If you are a technical person, you can read the rest of this article; otherwise, please go straight to my GitHub repository and just follow the documentation there.
Repository address: documentation and source code are attached, don't forget to give it a star~
The technology used for this requirement: Node.js and puppeteer
- puppeteer official website address: Puppeteer address
- Node.js official website address: Link description
- Puppeteer is Google's official Node library for controlling headless Chrome through the DevTools Protocol. The APIs puppeteer provides can drive Chrome directly to simulate most user operations, either for UI testing or for visiting pages as a crawler and collecting data (a tiny illustrative sketch follows right after the environment notes below).
- Environment and installation
- Puppeteer depends on Node 6.4 or above, but for the very convenient async/await syntax, Node 7.6 or above is recommended. In addition, headless Chrome itself requires fairly new versions of the libraries the server depends on; CentOS servers tend to ship conservative library versions, so using headless Chrome on CentOS 6 is difficult, and upgrading the dependency versions may cause all kinds of server problems (including but not limited to being unable to use ssh). It is better to use a newer-version server. (The latest version of Node.js is recommended.)
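To make the "simulate user operations" claim above concrete, here is a minimal, hypothetical sketch of mine (not from the original repository): it opens a page and takes a full-page screenshot, and the commented-out page.type / page.click lines show where form input and clicks would go (the selectors there are placeholders):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/'); // placeholder URL
  // user-interaction calls would go here, e.g. (selectors are hypothetical):
  // await page.type('#q', 'puppeteer');
  // await page.click('#search-btn');
  await page.screenshot({ path: 'example.png', fullPage: true }); // capture the rendered page
  await browser.close();
})();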
A first attempt: crawling resources from JD.com
const puppeteer = require('puppeteer'); // import the dependency
(async () => { // an async function keeps the asynchronous flow tidy
  const browser = await puppeteer.launch(); // launch a new browser
  const page = await browser.newPage(); // open a new page
  await page.goto('https://www.jd.com/'); // navigate to the page at 'url'
  const result = await page.evaluate(() => { // this result array will hold every image src address
    let arr = []; // the extraction logic lives inside this arrow function
    const imgs = document.querySelectorAll('img');
    imgs.forEach(function (item) {
      arr.push(item.src)
    })
    return arr
  });
  // result is now the crawled data; it can be saved with the 'fs' module
  await browser.close() // close the browser puppeteer opened for us
})()
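As the last comment says, `result` can be persisted with Node's built-in `fs` module. A minimal sketch (my addition; the file name jd-images.json is arbitrary) that could sit just before browser.close() in the script above:

const fs = require('fs');
// write the collected image addresses to disk as pretty-printed JSON
fs.writeFileSync('./jd-images.json', JSON.stringify(result, null, 2));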
Copy it over and run the command line command ` node filename ` to fetch the crawler data.
This puppeteer package actually opens another browser for us behind the scenes, opens the web pages again, and retrieves their data.
- The code above only crawls the image content of JD.com's home page. Suppose my requirement is extended further: for every `<a>` tag on JD.com's home page, I need the text of the title of the page it links to, with all of them finally collected into one array.
- Our async function above breaks down into five steps; only puppeteer.launch(), browser.newPage() and browser.close() are written in a fixed way.
- page.goto specifies which web page we go to for data; we can change the url passed to it, and we can also call this method many times.
- Inside page.evaluate sits the logic that processes the crawled web page into the data we want.
- Both page.goto and page.evaluate can be called multiple times inside the async function; that means we can go to the JD.com site first, process its logic, and then call page.goto again.
Note that all of the logic above runs inside another browser that the puppeteer package opened for us out of sight; it processes the logic there, so at the end we call the browser.close() method to close that browser.
Now let's optimize the previous code and crawl the corresponding resources.
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.jd.com/');
  const hrefArr = await page.evaluate(() => {
    let arr = [];
    const aNodes = document.querySelectorAll('.cate_menu_lk');
    aNodes.forEach(function (item) {
      arr.push(item.href)
    })
    return arr
  });
  let arr = [];
  for (let i = 0; i < hrefArr.length; i++) {
    const url = hrefArr[i];
    console.log(url) // printing works here
    await page.goto(url);
    const result = await page.evaluate(() => { // console.log has no effect inside this callback
      return $('title').text(); // return the title text of each page
    });
    arr.push(result) // push the corresponding value into the array on each loop iteration
  }
  console.log(arr) // the collected data; it can be saved locally with Node.js's fs module
  await browser.close()
})()
Pitfall: console.log inside the page.evaluate callback cannot print to our terminal, and the callback cannot read outer variables; data can only be passed back via return.
Any selector you use should first be tested in the console of the target page to confirm that the DOM can actually be selected. For example, querySelector did not work for us in the pages JD.com links to; since JD.com's pages load jQuery, we can use jQuery there instead. In short, whatever selectors their developers can use, we can use; otherwise we cannot.
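One hedged workaround for the "cannot read outer variables" pitfall: page.evaluate accepts extra arguments after the callback, and puppeteer serializes them into the page context. A small self-contained sketch of mine (the keyword value is just an example):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.jd.com/');
  const keyword = '京东'; // defined outside the callback
  // the second argument to evaluate is passed in as the callback's parameter
  const count = await page.evaluate((kw) => {
    return document.body.innerText.split(kw).length - 1; // occurrences of the keyword on the page
  }, keyword);
  console.log(count);
  await browser.close();
})();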
Next, let's crawl the Node.js official website's home page directly and then generate a PDF from it.
Whether or not you know anything about Node.js or puppeteer crawlers, you can do this; please read this document very carefully and execute every step in order.
What this project implements: given a web page address, crawl its content and output it as the PDF document we want. Note that the result is a high-quality PDF document.
- Step 1: install Node.js; it is recommended to download the installation package for your operating system from http://nodejs.cn/download/.
- Step 2: after downloading and installing Node.js, start the Windows command line tool (open the Windows search, type cmd and press Enter).
- Step 3: check whether the environment variable was configured automatically by typing node -v in the command line tool; if something like v10.*** appears, Node.js was installed successfully.
- Step 4: if typing node -v in step 3 still shows no version string, please restart the computer.
- Step 5: open the project folder, open a command line tool there (on Windows, type cmd in the file explorer's address bar to open one), and run npm i cnpm nodemon -g.
- Step 6: download the puppeteer crawler package; once step 5 is done, run cnpm i puppeteer --save to download it.
- Step 7: after step 6 finishes, open the project's url.js and replace the address with the web page you need the crawler to crawl (the default is http://nodejs.cn/); a sketch of what url.js presumably looks like follows after this list.
- Step 8: run nodemon index.js on the command line to crawl the corresponding content and output it automatically into the index.pdf file.
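The repository's url.js is not reproduced in this article; judging from how it is consumed in the index.js shown further below (`const url = require('./url')`), it is presumably a one-line module along these lines (my reconstruction, not the actual file):

// url.js - export the address of the page to crawl; replace it with your own target
module.exports = 'http://nodejs.cn/';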
TIPS
The design idea of this project is one web page, one PDF file, so after each single page is crawled, please copy index.pdf out, then replace the url address and crawl again to generate a new PDF file. For a page like JD.com's home page that has lazy image loading enabled, part of the crawled content will still be in its loading state; pages with anti-crawler mechanisms will also give the crawler trouble, but the vast majority of websites are fine.
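For the lazy-loading caveat above, one common workaround (my own sketch, not part of this project) is to scroll to the bottom of the page before generating the PDF, so lazily loaded images get a chance to load. The snippet below would go between page.goto and page.pdf in the index.js that follows:

// scroll down in steps inside the page so lazy-loaded images are triggered
await page.evaluate(async () => {
  await new Promise((resolve) => {
    let scrolled = 0;
    const step = 400; // pixels per tick
    const timer = setInterval(() => {
      window.scrollBy(0, step);
      scrolled += step;
      if (scrolled >= document.body.scrollHeight) {
        clearInterval(timer);
        resolve();
      }
    }, 100);
  });
});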
const puppeteer = require('puppeteer');
const url = require('./url');
(async () => {
  const browser = await puppeteer.launch({ headless: true })
  const page = await browser.newPage()
  // open the target web page
  await page.goto(url, { waitUntil: 'networkidle0' })
  // path of the output PDF; the crawled content is written here, overwriting any existing file
  let pdfFilePath = './index.pdf';
  // output options: A4 paper is chosen here, which is convenient for printing
  await page.pdf({
    path: pdfFilePath,
    format: 'A4',
    scale: 1,
    printBackground: true,
    landscape: false,
    displayHeaderFooter: false
  });
  await browser.close()
})()
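page.pdf supports more options than the ones used above; for example, margins can be added if content sits too close to the page edges (the values below are my own illustrative choices, not part of the original project):

// illustrative variant of the same call with explicit margins
await page.pdf({
  path: pdfFilePath,
  format: 'A4',
  printBackground: true,
  margin: { top: '20px', bottom: '20px', left: '20px', right: '20px' }
});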
File structure design
Data is very precious in this era. Following the design logic of a web page, you can pick out specific href addresses: you can fetch the corresponding resources directly, or use the page.goto method again to navigate there and then call page.evaluate() to process the logic, or output the corresponding PDF files.
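Putting the two halves of this article together, a rough sketch of "visit each selected href and output one PDF per page" might look like the following (the .cate_menu_lk selector and the file naming are carried over from the JD.com example purely for illustration):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.jd.com/');
  // collect the hrefs of the category menu links, as in the earlier example
  const hrefArr = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.cate_menu_lk')).map(a => a.href);
  });
  for (let i = 0; i < hrefArr.length; i++) {
    await page.goto(hrefArr[i], { waitUntil: 'networkidle0' });
    await page.pdf({ path: `./page-${i}.pdf`, format: 'A4' }); // one PDF per visited page
  }
  await browser.close();
})();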
We won't go into more detail here; after all, Node.js can reach the sky, and maybe in the future it really will be able to do anything. Please bookmark this short, high-quality tutorial or forward it to your friends, thank you.