Use Node.js to crawl any web page and output a high-quality PDF file locally ~

  css, html, html5, javascript, node.js


This article is suitable for readers with or without any background in crawlers or Node.js ~

Requirements:
  • Use Node.js to crawl web resources, with out-of-the-box configuration
  • Output the content of the crawled web page in PDF format
If you are a technical person, read the rest of this article; otherwise, please go straight to my GitHub repository and just follow the documentation there.

Repository address: documentation and source code are included, don't forget to give it a star~

Technologies used for this requirement: Node.js and Puppeteer

  • Puppeteer official website address: Puppeteer address
  • Node.js official website address: Link description
  • Puppeteer is a Node library from Google that controls headless Chrome through the DevTools Protocol. The API provided by Puppeteer can drive Chrome directly, simulating most user operations for UI testing, or visiting pages as a crawler to collect data.
  • Environment and installation
  • Puppeteer itself only needs Node 6.4 or above, but async/await makes the asynchronous code much easier, so Node 7.6 or above is recommended. In addition, headless Chrome itself requires fairly new versions of the libraries the server depends on; on the relatively old but stable CentOS 6 it is hard to get headless Chrome working, and upgrading the dependent libraries may cause all kinds of server problems (including but not limited to losing ssh access), so a newer server image is the better choice. (The latest version of Node.js is recommended.)
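If you want your script to fail fast on an old runtime, a tiny optional check (not part of the original project, just a sketch) can verify the Node version at startup:

// optional sanity check: async/await requires Node 7.6 or later
const [major, minor] = process.versions.node.split('.').map(Number);
if (major < 7 || (major === 7 && minor < 6)) {
    console.warn(`Node ${process.versions.node} detected; please upgrade to 7.6+ (latest version recommended)`);
}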

A first try: crawling resources from JD.com

const puppeteer = require('puppeteer'); // import the dependency
(async () => {   // an async function makes the asynchronous flow straightforward
    const browser = await puppeteer.launch();  // launch a new browser
    const page = await browser.newPage();   // open a new page
    await page.goto('https://www.jd.com/');  // navigate to the page at the given 'url'
    const result = await page.evaluate(() => {   // the result array will contain all image src addresses
        let arr = []; // the processing logic goes inside this arrow function
        const imgs = document.querySelectorAll('img');
        imgs.forEach(function (item) {
            arr.push(item.src)
        })
        return arr
    });
    // result now holds the crawled data and can be saved with the 'fs' module
    await browser.close(); // close the browser that puppeteer opened
})()
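If you want to persist result, a minimal sketch using the built-in fs module (the file name and helper name here are just examples) could look like this, called right after page.evaluate resolves:

const fs = require('fs');
// hypothetical helper: write the crawled src list to a local JSON file
function saveResult(result, filePath = './jd-imgs.json') {
    fs.writeFileSync(filePath, JSON.stringify(result, null, 2));
}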

Copy the code over and run it with ` node filename ` on the command line to get the crawled data.
What this puppeteer package actually does is open another browser for us, open the page again, and collect its data.
  • The code above only crawls the image content of JD.com's home page. Suppose the requirement expands: I now need to crawl every <a> tag on JD.com's home page, visit the page each link points to, and put the text content of every target page's <title> into an array.

  • Our async function above breaks down into five steps; only puppeteer.launch(), browser.newPage(), and browser.close() are written in a fixed way.

  • page.goto specifies which web page we go to for data; we can change the url inside it, and we can also call this method multiple times.

  • Inside page.evaluate goes the logic that extracts the data we want from the crawled web page.
  • Both page.goto and page.evaluate can be called multiple times inside the async function.

That means we can go to the JD.com site first, process its logic, and then call page.goto again.
Note that all of the logic above runs in another, invisible browser that the puppeteer package opens for us; it processes the logic there, which is why we finally call browser.close() to close that browser.

Now let's optimize the previous code and crawl the corresponding resources.

const puppeteer = require('puppeteer');
(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.jd.com/');
    const hrefArr = await page.evaluate(() => {
        let arr = [];
        const aNodes = document.querySelectorAll('.cate_menu_lk');
        aNodes.forEach(function (item) {
            arr.push(item.href)
        })
        return arr
    });
    let arr = [];
    for (let i = 0; i < hrefArr.length; i++) {
        const url = hrefArr[i];
        console.log(url) // printing works here
        await page.goto(url);
        const result = await page.evaluate(() => { // console.log has no effect inside this callback
            return $('title').text(); // return the title text of each page (relies on jQuery loaded by the page)
        });
        arr.push(result) // push the corresponding value into the array on each iteration
    }
    console.log(arr)  // the resulting data, which can be saved locally with Node.js's fs module
    await browser.close()
})()

Pitfall: console.log inside the page.evaluate callback does not print to the Node console, and the callback cannot access variables from the outer scope; values can only come back through return.
Before using a selector, test it first in the console of the target page to make sure it actually selects the DOM you want. For example, the $('title') call above works because JD.com's pages load jQuery; in short, we can only use selectors that the crawled page itself supports, otherwise it won't work.
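If you really do need an outer variable inside the callback, page.evaluate accepts extra arguments after the callback and serializes them into the browser context; a small sketch (to be placed inside an async function like the ones above; the selector is just an example):

const selector = 'title'; // defined in Node, outside the callback
const text = await page.evaluate((sel) => {
    const node = document.querySelector(sel); // `sel` arrives inside the browser context
    return node ? node.textContent : '';
}, selector); // extra arguments are passed after the callback function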

Next, let's crawl the Node.js website home page directly and generate a PDF from it.

Whether or not you know anything about Node.js or puppeteer crawlers, you can do this; please read this document very carefully and carry out every step in order.

Requirement of this project: given a web page address, crawl its content and output it as the PDF document we want. Note: a high-quality PDF document.

  • Step 1: install Node.js. http://nodejs.cn/download/ is recommended; download the Node.js package for your operating system.
  • Step 2: after downloading and installing Node.js, start the Windows command-line tool (open the Windows search, type cmd, and press Enter).
  • Step 3: check whether the environment variable was configured automatically: type node -v in the command-line tool; if a field like v10.*** appears, Node.js was installed successfully.
  • Step 4: if typing node -v in step 3 still shows no such field, restart the computer.
  • Step 5: open the project folder and open a command-line tool inside it (on Windows, type cmd in the file explorer's address bar to open one), then enter npm i cnpm nodemon -g.
  • Step 6: download the puppeteer crawler package; after step 5 is done, download it with the command cnpm i puppeteer --save.
  • Step 7: once step 6 finishes, open url.js in the project and replace the address with the web page you want the crawler to fetch (the default is http://nodejs.cn/); a sketch of what url.js might look like follows this list.
  • Step 8: enter nodemon index.js on the command line to crawl the corresponding content and automatically output it to the index.pdf file.
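The exact url.js ships with the repository, but judging by how index.js uses it, it presumably just exports the target address; a minimal sketch of what it might look like:

// url.js (sketch) — export the address of the web page the crawler should visit
module.exports = 'http://nodejs.cn/';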

TIPS: this project is designed around one web page per PDF file, so every time you crawl a new page, please copy index.pdf out first, then replace the url address and crawl again to generate a new PDF file. Of course, you can also crawl several pages in one run and generate several PDF documents through a loop or similar means; see the sketch below.
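As the tip says, several PDFs can also be generated in one run by looping over a list of addresses; a rough sketch (the URL list and file names are made up for illustration):

const puppeteer = require('puppeteer');
(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const urls = ['http://nodejs.cn/', 'https://www.jd.com/']; // example addresses
    for (let i = 0; i < urls.length; i++) {
        await page.goto(urls[i], { waitUntil: 'networkidle0' });
        await page.pdf({ path: `./page-${i}.pdf`, format: 'A4' }); // one PDF per page
    }
    await browser.close();
})()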

For a home page like JD.com's that uses lazy image loading, some of the crawled content will still be in its loading state; pages with anti-crawler mechanisms will also give the crawler trouble, but the vast majority of websites are fine (a workaround sketch for lazy loading follows the code below).

const puppeteer = require('puppeteer');
const url = require('./url');
(async () => {
    const browser = await puppeteer.launch({ headless: true })
    const page = await browser.newPage()
    // the web page to open
    await page.goto(url, { waitUntil: 'networkidle0' })
    // the path of the PDF file to output; the crawled content is written to this PDF, and any existing content at this path is overwritten
    let pdfFilePath = './index.pdf';
    // with these configuration options we output the PDF in A4 format, convenient for printing
    await page.pdf({
        path: pdfFilePath,
        format: 'A4',
        scale: 1,
        printBackground: true,
        landscape: false,
        displayHeaderFooter: false
    });
    await browser.close()
})()
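For pages that lazy-load images (the JD.com caveat mentioned above), one common workaround, not part of this project's default setup, is to scroll the page after page.goto and before page.pdf so the images get a chance to load; a rough sketch of such a helper:

// rough sketch: scroll to the bottom of the page step by step so lazy-loaded images get fetched
async function autoScroll(page) {
    await page.evaluate(async () => {
        await new Promise((resolve) => {
            let scrolled = 0;
            const step = 200; // pixels per scroll step
            const timer = setInterval(() => {
                window.scrollBy(0, step);
                scrolled += step;
                if (scrolled >= document.body.scrollHeight) {
                    clearInterval(timer);
                    resolve();
                }
            }, 100);
        });
    });
}
// usage: await autoScroll(page) after page.goto(url, ...) and before page.pdf(...)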

Project file structure


Data is precious in this era. Following the design logic of a web page, you can pick out specific href addresses: fetch those resources directly, or use page.goto again to navigate to them, then call page.evaluate() to process the logic, or output the corresponding PDF file; of course, you can also output several PDF documents in one go ~
We won't go further here; after all, Node.js can reach the sky, and maybe one day it really can do anything. Please bookmark this short, high-quality tutorial
or forward it to your friends, thank you.