Node.js in Heaven, Crawler Chapter: Crawling JD.com and Taobao Resources in 15 Lines of Code

  css, html, javascript, node.js, typescript

Can crawlers only be written in Python? No, our heavenly Node.js can do it too!

  • Packages to be prepared

    • Node.js: download the latest version from the official Node.js website.
    • npm: the package manager. The latest Node.js from the official website ships with its own npm.
    • puppeteer: a third-party npm package. Open a command-line tool in the folder of the corresponding js file and run npm i puppeteer -D to install it.

Crawlers may fail to obtain some protected web pages.

New to Jianghu-Free Land

const puppeteer = require('puppeteer'); // import the dependency
(async () => { // an async function lets us use await for every step
  const browser = await puppeteer.launch(); // launch a new browser
  const page = await browser.newPage(); // open a new page
  await page.goto('https://www.jd.com/'); // go to the web page at this url
  const result = await page.evaluate(() => { // result is an array with the src addresses of all pictures
    let arr = []; // the processing logic is written inside this arrow function
    const imgs = document.querySelectorAll('img');
    imgs.forEach(function (item) {
      arr.push(item.src)
    })
    return arr
  });
  // result now holds the crawled data, which can be saved through the fs module
  await browser.close(); // close the browser so the Node.js process can exit
})()
 
Copy the code into a file, then run node followed by the file name on the command line to get the crawled data.

The puppeteer package actually opens another browser for us, loads the web page in it again, and retrieves its data.
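The code comment above says the result can be saved with the fs module. Here is a minimal sketch of that step, assuming the data is a plain array of strings. The helper names saveResult and loadResult, the file name jd-img-src.json, and the sample addresses are made up for illustration and do not come from the original code.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: write the crawled array to a JSON file and return its path.
function saveResult(filePath, data) {
  fs.writeFileSync(filePath, JSON.stringify(data, null, 2), 'utf8');
  return filePath;
}

// Hypothetical helper: read the saved array back from disk.
function loadResult(filePath) {
  return JSON.parse(fs.readFileSync(filePath, 'utf8'));
}

// Stand-in data; in the crawler this would be the result array.
const sample = ['https://example.com/a.png', 'https://example.com/b.png'];
const outFile = saveResult(path.join(os.tmpdir(), 'jd-img-src.json'), sample);
console.log('saved', loadResult(outFile).length, 'entries to', outFile);
```

Inside the crawler, calling saveResult with the result array right before browser.close() would probably be enough. JSON is chosen here only because it round-trips an array of strings without extra parsing.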

Entering the World-Natural and at Ease

  • The code above only crawls the picture addresses on the JD.com home page. Suppose the requirement expands: for every <a> tag on the JD.com home page, open the page it links to, read the text of that page's title, and finally put all of the titles into one array.

  • The async function above breaks down into five steps, of which only puppeteer.launch(), browser.newPage(), and browser.close() are written in a fixed way.

  • page.goto specifies which web page we fetch data from. We can change the url inside it, and we can call this method many times.

  • page.evaluate holds the logic that processes data inside the web page we want to crawl.
  • page.goto and page.evaluate can both be called multiple times inside the async function. That means we can go to the JD.com site first, process the logic there, and then call page.goto again to move on to another page.

Note that all of the above logic runs in another browser that the puppeteer package opens for us out of sight, so we finish by calling browser.close() to close that browser.

  • With that in mind, we rework the previous code to crawl the resources we want.
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.jd.com/');
  const hrefArr = await page.evaluate(() => {
    let arr = [];
    const aNodes = document.querySelectorAll('.cate_menu_lk');
    aNodes.forEach(function (item) {
      arr.push(item.href)
    })
    return arr
  });
  let arr = [];
  for (let i = 0; i < hrefArr.length; i++) {
    const url = hrefArr[i];
    console.log(url); // printing works here, in Node.js
    await page.goto(url);
    const result = await page.evaluate(() => { // console.log inside this callback does not reach the Node.js terminal
      return $('title').text(); // return the title text of each page (relies on the page loading jQuery)
    });
    arr.push(result) // add the value to the array on every iteration
  }
  console.log(arr) // the collected data can be saved locally through the fs module of Node.js
  await browser.close()
})()
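As noted near the top, some protected pages may fail to load, and the browser must be closed at the end. The sketch below shows one way to harden the loop under those two assumptions. It is not the author's code. The function name crawlTitles is hypothetical, and it swaps the jQuery call for Puppeteer's page.title() so that it does not depend on what the target page loads.

```javascript
// Sketch: visit each URL, collect page titles, and always close the browser.
// `browser` is anything shaped like a Puppeteer Browser (newPage, close).
async function crawlTitles(browser, urls) {
  const titles = [];
  try {
    const page = await browser.newPage();
    for (const url of urls) {
      try {
        await page.goto(url);
        titles.push(await page.title()); // needs no jQuery on the page
      } catch (err) {
        // A protected or slow page should not abort the whole crawl.
        console.error('skipped', url, err.message);
        titles.push(null);
      }
    }
  } finally {
    await browser.close(); // runs whether or not an error occurred
  }
  return titles;
}
```

With puppeteer installed, it could be called as crawlTitles(await puppeteer.launch(), hrefArr) inside the async function. Pages that fail are recorded as null, so the output array stays aligned with the input URLs.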

Pitfalls: console.log inside the page.evaluate callback does not print to the Node.js terminal, because the callback runs in the browser page. The callback also cannot read variables from the outer scope; values go in as extra arguments to page.evaluate and come back only through return.
Any selector must first be tested in the console of the target page, to check that it really selects the DOM nodes, before it is used. The title lookup above uses jQuery's $ because JD.com's pages load jQuery. In short, we can use whatever the page's own developers have loaded and nothing else, apart from standard DOM methods such as document.querySelectorAll.
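Both pitfalls have common workarounds in Puppeteer: extra arguments to page.evaluate are handed to the callback, and the page's 'console' event can be forwarded to the terminal. The sketch below is a rough illustration. The function name collectHrefs and the '[page]' prefix are hypothetical.

```javascript
// Sketch: collect link addresses for a selector chosen outside the page.
// `page` is anything shaped like a Puppeteer Page (on, evaluate).
async function collectHrefs(page, selector) {
  // Forward the page's console messages to the Node.js terminal.
  // Note: each call adds another listener, so call this once per page.
  page.on('console', (msg) => console.log('[page]', msg.text()));

  // The callback runs in the browser, so `selector` is passed as an
  // argument instead of being captured from the enclosing scope.
  return page.evaluate((sel) => {
    const nodes = document.querySelectorAll(sel);
    console.log('matched ' + nodes.length + ' nodes'); // surfaces via the listener
    return Array.from(nodes, (node) => node.href);
  }, selector);
}
```

In the crawler above this could replace the first page.evaluate call, for example collectHrefs(page, '.cate_menu_lk'). Arguments must be serializable, so plain strings and numbers are the safest choice.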

Powerful Martial Arts-Fantasy and Mystery

Data is very precious in this era. The two examples above obtain specific resources by selecting specific href addresses according to how the web page is designed.
The corresponding resources can be fetched directly, or we can enter the page again with page.goto and then call page.evaluate() to process it.
We won't go into more detail here. After all, Node.js is heavenly, and maybe it really will be able to do anything in the future. Please bookmark short, high-quality tutorials like this one,
or forward it to your friends. Thank you.