Java crawler rapid development tool: uncs

Zero: Preface

Uncs is a Java tool for developing crawlers quickly. It is simple and convenient to use, has gone through many version iterations and been verified in production, and works with most websites. You are welcome to use it.

I. Basic Usage

  • Obtaining the development package
    Currently it is available only from the Maven server on the company intranet.
     <dependency>
        <groupId>com.cdc</groupId>
        <artifactId>uncs</artifactId>
        <version>3.0.0.6</version>
    </dependency>
  • Develop the individual process steps

Step 1

package com.cdc.uncs.service.parts;

import com.cdc.uncs.exception.UncsException;
import com.cdc.uncs.model.HttpCrawlInfo;
import com.cdc.uncs.model.TestRequest;
import com.cdc.uncs.model.TestResponse;
import com.cdc.uncs.service.NetCrawlPart;
import com.cdc.uncs.service.TransContext;

public class NetCrawlTestPart extends NetCrawlPart<TestRequest, TestResponse> {

    @Override
    public void beforeCrawl(TransContext<TestRequest, TestResponse> context, HttpCrawlInfo crawlInfo, String crawlId, String type) throws UncsException {
        String url = "http://www.baidu.com";
        crawlInfo.setUrl(url);
    }

    @Override
    public void afterCrawl(TransContext<TestRequest, TestResponse> context, HttpCrawlInfo crawlInfo, String crawlId, String type) throws UncsException {
        System.out.println(crawlInfo.getHttpCrawlResult());
    }
}

Step 2

package com.cdc.uncs.service.parts;

import com.cdc.uncs.exception.UncsException;
import com.cdc.uncs.model.HttpCrawlInfo;
import com.cdc.uncs.model.TestRequest;
import com.cdc.uncs.model.TestResponse;
import com.cdc.uncs.service.NetCrawlPart;
import com.cdc.uncs.service.TransContext;

public class NetCrawlTestPart2 extends NetCrawlPart<TestRequest, TestResponse> {

    @Override
    public void beforeCrawl(TransContext<TestRequest, TestResponse> context, HttpCrawlInfo crawlInfo, String crawlId, String type) throws UncsException {

        String url = "http://www.hao123.com";
        crawlInfo.setUrl(url);
    }

    @Override
    public void afterCrawl(TransContext<TestRequest, TestResponse> context, HttpCrawlInfo crawlInfo, String crawlId, String type) throws UncsException {
        System.out.println(crawlInfo.getHttpCrawlResult());
    }
}
  • Service configuration file

uncsTestApplicationContext.xml

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
       http://uncs.cdc.com/schema/uncs http://uncs.cdc.com/schema/uncs/springuncs.xsd"
       xmlns:uncs="http://uncs.cdc.com/schema/uncs"
       default-autowire="byName">
    <uncs:crawl id="testService" browser="Chrome51" poolSize="5" proxyType="no">
        <uncs:list>
            <uncs:netCrawlPart class="com.cdc.uncs.service.parts.NetCrawlTestPart" desc="Log in"/>
            <uncs:netCrawlPart class="com.cdc.uncs.service.parts.NetCrawlTestPart2" desc="Fetch"/>
        </uncs:list>
    </uncs:crawl>
</beans>
  • Demo sample
        // ---------------------- System startup ----------------------
        // User-defined service configuration file
        String xmlTest = "classpath*:uncsTestApplicationContext.xml";
        // Initialization parameters for uncs: Redis IP and port, SOCKS5 IP and port, project abbreviation, HTTP proxy, timeout for obtaining the HTTP proxy, proxy type
        InitParam param = new InitParam("127.0.0.1", 6379, "ss5.xx.com", 1080, "ct", "http://xxx", 3000, "no");
        // Start uncs. param: startup parameters; xmlTest...: service configuration files (there may be several)
        UncsLancher.startUp(param, xmlTest);

        // ---------------------- Calling the service ----------------------
        // Define the context, which is carried through the whole service
        TransContext<TestRequest, TestResponse> transContext = TransContext.newInstance(TestRequest.class, TestResponse.class);
        // crawlId: unique identifier of a single crawl transaction
        String crawlId = Long.toString(System.currentTimeMillis());
        // type: the transaction type, a user-defined auxiliary parameter that is carried through the whole transaction
        String type = "xxx";
        transContext.setCrawlId(crawlId);
        transContext.setType(type);
        // Service name; matches the id of the uncs:crawl tag in the configuration file
        String serverName = "testService";
        // Start executing the transaction
        TestResponse response = UncsService.startService(serverName, transContext);

II. Source Code Address

SVN address: for internal use only

III. Conventions

  • A crawlId is mandatory; it is carried through the whole service.
  • Every step implementation class in a process must extend the corresponding parent class.
  • For now the framework requires Redis. A stand-alone version that does not need Redis will be developed in the future.

IV. Design Ideas

Uncs is a process-based crawler development framework that is dynamically configurable and extensible. Users should not have to think about anything they do not need to care about, so every detail that can be hidden is hidden.
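
To make "process-based" concrete, here is a minimal sketch of the idea: a process is an ordered list of steps that all share one context. The `Step` interface and the `ProcessSketch` class are hypothetical stand-ins written for this illustration; they are not uncs classes, and the real framework builds its step list from the XML configuration rather than in code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; none of these types belong to uncs.
class ProcessSketch {

    /** One step of a process; it reads from and writes to the shared context. */
    interface Step {
        void work(List<String> context);
    }

    /** Runs the configured steps in order against a single shared context. */
    static List<String> run(List<Step> steps) {
        List<String> context = new ArrayList<>();
        for (Step step : steps) {
            step.work(context);
        }
        return context;
    }

    /** Builds a two-step process ("login", then "fetch") and runs it. */
    static List<String> demo() {
        List<Step> steps = new ArrayList<>();
        steps.add(ctx -> ctx.add("login"));
        steps.add(ctx -> ctx.add("fetch"));
        return run(steps);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the runner only sees the `Step` abstraction, steps can be reordered or swapped without touching the runner, which is the property the configuration-driven design relies on.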

V. Detailed Configuration Explanation

5.1 crawl Transaction Configuration

uncs:crawl tag

attr:

  • id: unique service name, e.g. testService
  • browser: browser type. Enumeration: Chrome51 (Chrome), IE9, IE8, FIREFOX, DEFAULT; the default is Chrome. When this attribute is set, the code does not need to set the HTTP user-agent header.
  • poolSize: size of the thread pool used while the service runs, i.e. the concurrency this service supports. If not set, the shared thread pool is used.
  • proxyType: proxy type (the two default proxies are set in the initialization parameter InitParam). Values:
      • no: no proxy is used
      • http: use an HTTP proxy
      • socks: use a SOCKS5 proxy
      • default-http: use the default HTTP proxy
      • default-socks: use the default SOCKS proxy

property:

uncs:list: the process list, i.e. the implementation classes configured for the service, in execution order. Every type of part is supported inside the list.

uncs:finalPart: a step that is executed after the process completes, regardless of success or failure.

uncs:proxyService: extended proxy service. Users can define their own bean to implement a custom proxy service. When the proxyType attribute of uncs:crawl is set to http or socks, the system loads the proxy service from this tag. See the chapter "Proxy Configuration and Use" for details.

5.2 part

The parent class of all step templates. It is an empty template and can be used freely.

Step: create a Java class -> extend com.cdc.uncs.service.Part -> override the work method -> add it to the configuration file

When a step may not need to be executed, override the isPassPart method and return true to skip it. All subclass templates support this.

Corresponding configuration tag: uncs:part. class: the implementation class; desc: the step name, which defaults to the short class name if omitted.
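
The skip mechanism can be sketched as follows. `SketchPart` is a hypothetical stand-in, not the real com.cdc.uncs.service.Part (whose exact method signatures are not shown in this document); the sketch only illustrates that the runner asks each step whether it wants to be skipped before calling its work method.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a skippable step; this is not the uncs Part class.
class SkippablePartSketch {

    abstract static class SketchPart {
        /** Return true to skip this step; by default the step runs. */
        boolean isPassPart(List<String> context) {
            return false;
        }

        abstract void work(List<String> context);
    }

    /** Runs each part in order unless the part asks to be skipped. */
    static List<String> run(List<SketchPart> parts) {
        List<String> context = new ArrayList<>();
        for (SketchPart part : parts) {
            if (!part.isPassPart(context)) {
                part.work(context);
            }
        }
        return context;
    }

    static List<String> demo() {
        List<SketchPart> parts = new ArrayList<>();
        parts.add(new SketchPart() {
            @Override
            void work(List<String> context) {
                context.add("login");
            }
        });
        parts.add(new SketchPart() {
            // Skip the captcha step when the context shows we are already logged in.
            @Override
            boolean isPassPart(List<String> context) {
                return context.contains("login");
            }

            @Override
            void work(List<String> context) {
                context.add("captcha");
            }
        });
        return run(parts);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```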

5.3 netCrawlPart

Network crawl step template. With this template the user does not need to care about how the HTTP client is used.

Step: create a Java class -> extend com.cdc.uncs.service.NetCrawlPart -> override the beforeCrawl and afterCrawl methods -> add it to the configuration file

beforeCrawl: assemble the HTTP request parameters before crawling

Set properties on the method's HttpCrawlInfo crawlInfo parameter to change the request content.

| Attribute name | Details |
|-|-|
| method | post/get |
| mineType | img: image; json; html: text. Other types are not supported |
| httpParamType | form: form parameters; string: plain string |
| url | URL to crawl |
| referer | Referring page |
| charset | Encoding; the default is utf-8 |
| isRelocal | Whether to follow redirects automatically |
| params | Form parameters |
| stringParam | Takes effect only when httpParamType is string |
| headerParam | HTTP header parameters. They can be set as a Map or one by one, and common headers have convenience setters. user-agent does not have to be set, and neither does cookie |
| httpCrawlResult/httpCrawlImgResult | Hold the returned results |
| cookies | Cookies can be set manually |
| relocalList | When isRelocal is true, the redirect chain is saved here |
| proxyService | A proxy can be set for a single step |
| isNetErrThrow | Whether to throw an exception on a network error; not thrown by default |
| isLogResponse | Whether to log the response |
| e | The actual exception object |
| timeOut | Timeout for a single step |
| isJdkSafe | Whether to support the JDK 1.7 security policy |
| tempInfo | Temporary parameters, used to set the part's HTTP interaction parameters |
| tempInfo.usedProxy | Proxy policy for a single step |
| tempInfo.responseCode | HTTP response code |
| tempInfo.proxyParam | Parameters for the proxy service |
| tempInfo.httpRetryMaxTimes | Maximum number of retries after failure |
| tempInfo.sslVersion | SSL protocol version to use |
| tempInfo.clearCookie | Whether to clear cookies |
| tempInfo.poolEntityAliveTime | HTTP pooling: lifetime of each connection |
| tempInfo.poolSize | HTTP pooling: pool size |

afterCrawl: parse the returned result after crawling

Call getHttpCrawlResult or getHttpCrawlImgResult on the HttpCrawlInfo crawlInfo parameter to get the returned result.

Corresponding configuration tag: uncs:netCrawlPart. class: the implementation class; desc: the step name, which defaults to the short class name if omitted.
Example:

<uncs:crawl id="testService" browser="Chrome51" poolSize="5" proxyType="no">
        <uncs:list>
            <uncs:netCrawlPart class="com.cdc.uncs.service.parts.NetCrawlTestPart" desc="Network crawl test step"/>
        </uncs:list>
    </uncs:crawl>

package xxx;

import xxx;

/**
 * Load the query page
 */
public class FlowQueryPart extends NetCrawlPart<FlowGetterRequest, FlowGetterResponse> {

    @Override
    public void beforeCrawl(TransContext<FlowGetterRequest, FlowGetterResponse> context, HttpCrawlInfo crawlInfo, String s, String s1) throws UncsException {
        String url = (String) context.getTempParamValue(ParamKey.XXX);
        String referer = (String) context.getTempParamValue(ParamKey.REFXXX);
        
        crawlInfo.addAccept("text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
        crawlInfo.addAcceptEncoding("gzip, deflate, sdch");
        crawlInfo.addAcceptLanguage("zh-CN,zh;q=0.8");
        crawlInfo.addConnection("keep-alive");
        crawlInfo.addHost("XXX.com");

        crawlInfo.addParam("query", "true");
        crawlInfo.addParam("q_from_date",  (String) context.getTempParamValue(ParamKey.BEGIN_DATE));
        crawlInfo.addParam("q_to_date", (String) context.getTempParamValue(ParamKey.END_DATE));

        crawlInfo.setCharset(CoreConstant.CHARSET);
        crawlInfo.setMineType(MineType.HTML);
        crawlInfo.setMethod(HttpMethod.GET);
        crawlInfo.setUrl(url);
        crawlInfo.setReferer(referer);

        context.addTempParam(ParamKey.NEXT_REFERER, url);

    }

    @Override
    public void afterCrawl(TransContext<FlowGetterRequest, FlowGetterResponse> context, HttpCrawlInfo crawlInfo, String crawlId, String bankCode) throws UncsException {
        String result = crawlInfo.getHttpCrawlResult();
        if(Strings.isNullOrEmpty(result)) {
            throw new SystemUncsException("Page failed to load", ErrorCode.XXX);
        }
    }
}
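
One way to picture why the user never touches the HTTP client is the template method pattern: the base class owns the request, and the subclass only fills in the two hooks. The sketch below is an assumption about the general shape, not the uncs implementation. `CrawlInfo` and `CrawlStep` are hypothetical stand-ins, and the transport is a stub function so the example runs without network access.

```java
import java.util.function.Function;

// Hypothetical sketch of the hook/transport/hook flow; not the uncs implementation.
class CrawlTemplateSketch {

    /** Minimal request/response holder, loosely modeled on HttpCrawlInfo. */
    static class CrawlInfo {
        String url;
        String result;
    }

    abstract static class CrawlStep {
        abstract void beforeCrawl(CrawlInfo info);

        abstract void afterCrawl(CrawlInfo info);

        /** Template method: prepare the request, send it, then parse the response. */
        final CrawlInfo execute(Function<String, String> transport) {
            CrawlInfo info = new CrawlInfo();
            beforeCrawl(info);
            info.result = transport.apply(info.url);
            afterCrawl(info);
            return info;
        }
    }

    static String demo() {
        CrawlStep step = new CrawlStep() {
            @Override
            void beforeCrawl(CrawlInfo info) {
                info.url = "http://example.com";
            }

            @Override
            void afterCrawl(CrawlInfo info) {
                info.result = info.result.trim();
            }
        };
        // Stub transport so the sketch runs without network access.
        return step.execute(url -> "  body of " + url + "  ").result;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```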

5.4 loopPart

The loop step template. It is deprecated and has been replaced by the complexLoopPart complex loop step template. It is no longer maintained but can still be used; it may contain bugs and only supports looping over a single step.

5.5 switchPart

Selection step template, similar to Java's switch: it takes different branches of steps depending on the scenario.
Scenario example: when crawling a website, different branches must be crawled depending on the province the account belongs to.
Sample configuration:

<uncs:crawl id="testService" browser="Chrome51" poolSize="5" proxyType="no">
        <uncs:list>
          <!-- choosePartClass must extend com.cdc.uncs.service.ChooseKeyPart -->
            <uncs:switchPart choosePartClass="com.cdc.uncs.service.parts.ChooseKeyTestPart" choosePartDesc="Key selection test step">
                <uncs:entity key="bj">
                    <uncs:list>
                        <uncs:netCrawlPart class="com.cdc.uncs.service.parts.NetCrawlTestPart" desc="Beijing"/>
                    </uncs:list>
                </uncs:entity>
                <uncs:entity key="sh">
                    <uncs:list>
                        <uncs:netCrawlPart class="com.cdc.uncs.service.parts.NetCrawlTestPart2" desc="Shanghai"/>
                    </uncs:list>
                </uncs:entity>
            </uncs:switchPart>
        </uncs:list>
    </uncs:crawl>

Code sample:

package xxxx;
import xxxx;

public class ChooseKeyTestPart extends ChooseKeyPart<TestRequest, TestResponse> {

    @Override
    public boolean isPassPart(TransContext<TestRequest, TestResponse> context) {
        // If no network request is needed to decide, choose the key here
        chooseKey("bj");
        return true;
    }

    @Override
    public void beforeCrawl(TransContext<TestRequest, TestResponse> context, HttpCrawlInfo crawlInfo, String crawlId, String mobileType) throws UncsException {
        // Implement this only when a network request is needed to decide the key
    }

    @Override
    public void afterCrawl(TransContext<TestRequest, TestResponse> context, HttpCrawlInfo crawlInfo, String crawlId, String mobileType) throws UncsException {
    }
}
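
The dispatch itself boils down to a lookup from the chosen key to that key's list of steps. The sketch below models this with a plain map; `SwitchSketch` is hypothetical, the step lists are reduced to class names, and failing on an unknown key is an assumption (the document does not say how uncs handles a key with no matching uncs:entity).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of key-based branch selection; not the uncs implementation.
class SwitchSketch {

    /** Mirrors the sample configuration: one list of step names per key. */
    static Map<String, List<String>> sampleBranches() {
        Map<String, List<String>> branches = new HashMap<>();
        branches.put("bj", Arrays.asList("NetCrawlTestPart"));
        branches.put("sh", Arrays.asList("NetCrawlTestPart2"));
        return branches;
    }

    /** Returns the steps registered under the chosen key. */
    static List<String> chooseBranch(Map<String, List<String>> branches, String key) {
        List<String> branch = branches.get(key);
        if (branch == null) {
            // Assumption: an unknown key is treated as a configuration error.
            throw new IllegalArgumentException("no branch for key: " + key);
        }
        return branch;
    }

    public static void main(String[] args) {
        System.out.println(chooseBranch(sampleBranches(), "bj"));
    }
}
```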

5.6 groupRetryPart

The group retry step retries a whole group of steps. The maximum number of retries is configurable; whether to retry is up to the user, who calls the retry method according to the actual scenario.
Scenario example: image captcha recognition is not 100% accurate, so when it fails the captcha must be recognized and verified again.
Sample configuration:

<uncs:crawl id="testService" browser="Chrome51" poolSize="5" proxyType="no">
        <uncs:list>
              <!-- betweenMillis: interval between retries (milliseconds); maxRetryTimes: maximum number of retries -->
            <uncs:groupRetryPart betweenMillis="10" maxRetryTimes="5">
                <uncs:list>
                    <uncs:netCrawlPart class="com.cdc.uncs.service.parts.NetCrawlTestPart" desc="Network crawl test step"/>
                </uncs:list>
            </uncs:groupRetryPart>
        </uncs:list>
    </uncs:crawl>

Note: once the maximum number of retries is exceeded, it is up to the user to decide whether to throw an exception. By default no exception is thrown and the process continues normally.
Code sample:

    @Override
    public void afterCrawl(TransContext<FlowGetterRequest, FlowGetterResponse> context, HttpCrawlInfo crawlInfo, String crawlId, String bankCode) throws UncsException {

        String result = crawlInfo.getHttpCrawlResult();
        if(Strings.isNullOrEmpty(result)) {
            throw new SystemUncsException("Page failed to load", ErrorCode.ERROR_3006);
        }
        // Validate the result
        try {
            CCBBaseUtil.validateResult(result, crawlId, bankCode, this.getName(), log);
        } catch (UncsException e) {
            String code = e.getCode();
            if(ErrorCode.ERROR_0000.equals(code)) {
                // Wrong captcha: retry
                // Check how many retries have been used
                if(this.getGroupRetryCurrent(context) < this.getGroupRetryMax(context)) {
                    // Retry
                    retry();
                }else{
                    log.log(crawlId, this.getName(), bankCode, "Image captcha exceeded the maximum number of retries");
                    throw new SystemUncsException("Image captcha failed too many times; please retry and check", ErrorCode.ERROR_2002);
                }
            } else {
                throw e;
            }
        }
    }
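
Stripped of the framework, the control flow of a group retry looks roughly like the sketch below. `GroupRetrySketch` and `runWithRetry` are hypothetical names; the parameters are named after the betweenMillis and maxRetryTimes attributes, but how uncs counts attempts internally is an assumption. Returning -1 instead of throwing mirrors the note above that the caller decides what to do once retries are exhausted.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch of group retry; not the uncs implementation.
class GroupRetrySketch {

    /**
     * Runs the group until it succeeds or maxRetryTimes retries have been used.
     *
     * @param group receives the retry count so far (0 on the first attempt)
     *              and returns true on success
     * @return the number of attempts made, or -1 if every attempt failed
     */
    static int runWithRetry(IntPredicate group, int maxRetryTimes, long betweenMillis) {
        for (int retry = 0; retry <= maxRetryTimes; retry++) {
            if (group.test(retry)) {
                return retry + 1;
            }
            if (retry < maxRetryTimes) {
                try {
                    Thread.sleep(betweenMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The simulated captcha recognition fails twice and succeeds on the third attempt.
        System.out.println(runWithRetry(retry -> retry == 2, 5, 10));
    }
}
```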

5.7 complexLoopPart

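Complex loop step template, similar to Java loops. It supports both for and while loops (for is the default), and any template can be used inside it.
New: loop iterations can now be run concurrently.
Scenario examples:

for loop: when crawling a website's data, looping over months is the first-level loop and paging through each month's records is the second-level loop.
while loop: same as the for loop, except that the bank's pages only offer "next page", so the total number of pages is unknown.

Sample configuration: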

<!-- loopType: loop type, for/while, defaults to for if omitted. preClass: pre-processing class, must extend com.cdc.uncs.service.LoopPrePart; usually used to query and set the maximum number of iterations, which can also be set in a step inside the loop. isAsyn: whether to run concurrently. asynThreadCount: maximum number of concurrent threads -->
<uncs:complexLoopPart loopType="for" preClass="" preDesc="" isAsyn="false" asynThreadCount="5">
                <uncs:list>
                    <uncs:netCrawlPart class="com.cdc.uncs.service.parts.NetCrawlTestPart" desc="Complex loop step"/>
                    <uncs:netCrawlPart class="com.cdc.uncs.service.parts.NetCrawlTestPart2" desc="Complex loop step 2"/>
                </uncs:list>
</uncs:complexLoopPart>

Code sample:

    @Override
    public void beforeCrawl(TransContext<TestRequest, TestResponse> context, HttpCrawlInfo crawlInfo, String crawlId, String mobileType) throws UncsException {

        // Get the current iteration count
        this.getComplexLoopCurrent(context);
        getCookieValue(crawlId, "BAIDUID");
        String url = "http://www.baidu.com";
        crawlInfo.setUrl(url);
    }

    @Override
    public void afterCrawl(TransContext<TestRequest, TestResponse> context, HttpCrawlInfo crawlInfo, String crawlId, String mobileType) throws UncsException {

        // Set the maximum number of iterations
        this.setComplexLoopMax(context, 5);
        System.out.println(crawlInfo.getHttpCrawlResult());
    }

Note: any template can be used inside the loop, but only steps inside a loop can operate on that loop's attributes (maximum number of pages, current page number). Steps cannot operate on loop attributes across nesting levels.
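
The difference between the two loop types can be sketched in plain Java. In the for style, a step inside the loop learns the total and raises the maximum (as setComplexLoopMax does in the sample above); in the while style, each iteration only reports whether another page exists. `ComplexLoopSketch` is hypothetical, and starting the counter at 0 is an assumption about how uncs numbers iterations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

// Hypothetical sketch of the two loop styles; not the uncs implementation.
class ComplexLoopSketch {

    /** for style: the body receives the current index and returns the (possibly updated) maximum. */
    static List<Integer> forLoop(IntUnaryOperator bodyReturningMax) {
        List<Integer> visited = new ArrayList<>();
        int max = 1; // run at least once so the body can discover the real maximum
        for (int current = 0; current < max; current++) {
            visited.add(current);
            max = bodyReturningMax.applyAsInt(current);
        }
        return visited;
    }

    /** while style: the body receives the current index and reports whether a next page exists. */
    static List<Integer> whileLoop(IntPredicate bodyHasNext) {
        List<Integer> visited = new ArrayList<>();
        int current = 0;
        boolean hasNext = true;
        while (hasNext) {
            visited.add(current);
            hasNext = bodyHasNext.test(current);
            current++;
        }
        return visited;
    }

    public static void main(String[] args) {
        // The first response says there are 5 pages in total.
        System.out.println(forLoop(current -> 5));
        // Each response only says whether a "next page" link is present.
        System.out.println(whileLoop(current -> current < 2));
    }
}
```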

5.8 finalPart

A step that the service always executes at the end, regardless of success or failure.
Scenario example: after crawling a website, you need to log out when the crawl ends so that the site does not act on a login state left behind.
Sample configuration:

<uncs:crawl id="testService" browser="Chrome51" poolSize="5" proxyType="no">
        <uncs:list>
            <uncs:complexLoopPart loopType="for" preClass="" preDesc="">
                <uncs:list>
                    <uncs:netCrawlPart class="com.cdc.uncs.service.parts.NetCrawlTestPart" desc="Complex loop step"/>
                    <uncs:netCrawlPart class="com.cdc.uncs.service.parts.NetCrawlTestPart2" desc="Complex loop step 2"/>
                </uncs:list>
            </uncs:complexLoopPart>
        </uncs:list>
        <uncs:finalPart>
            <uncs:part class="com.cdc.uncs.service.parts.NetCrawlTestPart" desc="Log out"/>
        </uncs:finalPart>
    </uncs:crawl>
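
Conceptually a final step behaves like a Java finally block around the main process. The sketch below is a hypothetical illustration of that guarantee, not the uncs implementation: the "logout" entry is recorded whether or not the crawl fails.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an always-executed final step; not the uncs implementation.
class FinalPartSketch {

    /** Runs a tiny process and returns the log of what was executed. */
    static List<String> run(boolean failInCrawl) {
        List<String> log = new ArrayList<>();
        try {
            log.add("login");
            if (failInCrawl) {
                throw new IllegalStateException("crawl failed");
            }
            log.add("crawl");
        } catch (IllegalStateException e) {
            log.add("error: " + e.getMessage());
        } finally {
            // The final step: always log out, on success and on failure.
            log.add("logout");
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(false));
        System.out.println(run(true));
    }
}
```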

VI. Breakpoints

Uncs supports program breakpoints, i.e. temporarily interrupting a running service. When the required condition is met, the service can be restarted and continues from the step where it was interrupted.
For example, some websites require the user to enter an SMS verification code during the crawl. This needs human participation, so the program must pause and can continue only after the user has entered the code.
Code example:
Interrupt code

@Override
    public void afterCrawl(TransContext<TestRequest, TestResponse> context, HttpCrawlInfo crawlInfo, String crawlId, String mobileType) throws UncsException {
        // Pause here
        this.pauseNeedMsg();
        System.out.println(crawlInfo.getHttpCrawlResult());
    }

Restart service

        // Restart the service
        String msgCode = "123456";
        TestResponse response1 = UncsService.restartService(crawlId, msgCode, null, TestResponse.class);
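
A breakpoint needs two things: somewhere to remember, per crawlId, where the process stopped, and a restart call that picks up from there with the user's input. The sketch below shows that bookkeeping with an in-memory map. Everything in it is hypothetical: the class and step names are invented, the map stands in for whatever storage uncs really uses, and resuming at the step after the pause is an assumption about the framework's behavior.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of pause and resume; not the uncs implementation.
class BreakpointSketch {

    private static final String[] STEPS = {"login", "sendSms", "fetch"};

    // In-memory stand-in for wherever the paused state is persisted.
    private final Map<String, Integer> pausedAt = new HashMap<>();
    private final List<String> log = new ArrayList<>();

    /** Runs steps starting at index from; returns true when the process has finished. */
    private boolean runFrom(String crawlId, int from, String msgCode) {
        for (int i = from; i < STEPS.length; i++) {
            String step = STEPS[i];
            log.add(step.equals("fetch") ? step + ":" + msgCode : step);
            if (step.equals("sendSms")) {
                // Pause: remember the next step and wait for the user's code.
                pausedAt.put(crawlId, i + 1);
                return false;
            }
        }
        return true;
    }

    boolean start(String crawlId) {
        return runFrom(crawlId, 0, null);
    }

    boolean restart(String crawlId, String msgCode) {
        Integer next = pausedAt.remove(crawlId);
        if (next == null) {
            throw new IllegalStateException("not paused: " + crawlId);
        }
        return runFrom(crawlId, next, msgCode);
    }

    List<String> log() {
        return log;
    }

    public static void main(String[] args) {
        BreakpointSketch service = new BreakpointSketch();
        service.start("c1");             // stops after sendSms
        service.restart("c1", "123456"); // continues with the user's code
        System.out.println(service.log());
    }
}
```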

VII. Proxy Configuration and Use

Uncs supports HTTP and SOCKS5 proxies. Users can define their own way of obtaining a proxy or use the proxy methods provided by the system, which makes it highly extensible.
Proxy configuration method 1:

<!-- Default HTTP proxy -->
<uncs:crawl id="testService" browser="Chrome51" poolSize="5" proxyType="default-http">
<!-- Default SOCKS proxy -->
<uncs:crawl id="testService" browser="Chrome51" poolSize="5" proxyType="default-socks">
<!-- At system startup, set the two default proxies and the global proxy type:
InitParam param = new InitParam("127.0.0.1", 6379, "ss5.xxx", 1080, "ct", "http://xxxx", 3000, "no");-->

Proxy configuration method 2:

<uncs:crawl id="testService" browser="Chrome51" poolSize="5" proxyType="http">
    <uncs:list>
        <uncs:netCrawlPart class="com.cdc.uncs.service.parts.NetCrawlTestPart" desc="Step"/>
    </uncs:list>
    <!-- The implementation class must extend the corresponding class: com.cdc.uncs.http.IHttpProxy or com.cdc.uncs.http.ISocksProxy -->
    <uncs:proxyService class="com.cdc.uncs.http.impl.TestHttpProxy">
        <uncs:property name="ip" value="127.0.0.1"/>
        <uncs:property name="testxxxxx" value=""/>
    </uncs:proxyService>
</uncs:crawl>

package com.cdc.uncs.http.impl;

import com.cdc.uncs.http.IHttpProxy;
import org.apache.http.HttpHost;

/**
 * Default HTTP proxy
 */
public class TestHttpProxy extends IHttpProxy {
    private String ip;
    private Object testxxxxx;
    public TestHttpProxy() {
    }

    /**
     * Returns the proxy to use
     *
     * @param cid           unique identifier
     * @param type          type
     * @param logServerName log identifier
     * @return the proxy host
     */
    @Override
    public HttpHost getProxy(String cid, String type, String logServerName) {
        return new HttpHost(ip, 8888);
    }
    public String getIp() {
        return ip;
    }
    public void setIp(String ip) {
        this.ip = ip;
    }
    public Object getTestxxxxx() {
        return testxxxxx;
    }
    public void setTestxxxxx(Object testxxxxx) {
        this.testxxxxx = testxxxxx;
    }
}
Proxy configuration method 3: a proxy can be used for an individual step; see the proxyService attribute in section 5.3 netCrawlPart.
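
As a rough mental model of the five proxyType values, the sketch below maps each one to a standard java.net.Proxy. This is a hypothetical illustration, not the uncs implementation (which returns an org.apache.http.HttpHost from the proxy service): `custom` stands for the address a user-defined proxyService would supply, and the two default addresses stand for the ones passed in InitParam.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

// Hypothetical sketch of proxyType resolution; not the uncs implementation.
class ProxyTypeSketch {

    /**
     * Maps a proxyType value to a java.net.Proxy.
     *
     * @param custom       address supplied by a user-defined proxy service (http / socks)
     * @param defaultHttp  default HTTP proxy address (default-http)
     * @param defaultSocks default SOCKS proxy address (default-socks)
     */
    static Proxy resolve(String proxyType, InetSocketAddress custom,
                         InetSocketAddress defaultHttp, InetSocketAddress defaultSocks) {
        switch (proxyType) {
            case "no":
                return Proxy.NO_PROXY;
            case "http":
                return new Proxy(Proxy.Type.HTTP, custom);
            case "socks":
                return new Proxy(Proxy.Type.SOCKS, custom);
            case "default-http":
                return new Proxy(Proxy.Type.HTTP, defaultHttp);
            case "default-socks":
                return new Proxy(Proxy.Type.SOCKS, defaultSocks);
            default:
                throw new IllegalArgumentException("unknown proxyType: " + proxyType);
        }
    }

    public static void main(String[] args) {
        // Placeholder addresses; createUnresolved avoids any DNS lookup.
        InetSocketAddress custom = InetSocketAddress.createUnresolved("127.0.0.1", 8888);
        InetSocketAddress http = InetSocketAddress.createUnresolved("proxy.example.com", 3128);
        InetSocketAddress socks = InetSocketAddress.createUnresolved("socks.example.com", 1080);
        System.out.println(resolve("default-socks", custom, http, socks));
    }
}
```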

VIII. Log Configuration and Use

Uncs uses fixed logger names and writes two logs: a standard log and the uncs log. The standard log is output in the format required by the log search system and can be ignored. The uncs log is the business log.
Logback configuration:

<property name="log.base" value="D:/log/uncslog" />
<appender name="uncslog" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <fileNamePattern>${log.base}/uncs%d{yyyy-MM-dd}.log</fileNamePattern>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
        <FileNamePattern>
            ${log.base}/uncs.log.%d{yyyy-MM-dd}.log
        </FileNamePattern>
    </rollingPolicy>
    <layout class="ch.qos.logback.classic.PatternLayout">
        <pattern>[%level] %date [%thread] - %msg%n</pattern>
    </layout>
</appender>
<appender name="standardlog" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <fileNamePattern>${log.base}/flow_standard%d{yyyy-MM-dd}.log</fileNamePattern>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
        <FileNamePattern>
            ${log.base}/flow_standard%d{yyyy-MM-dd}.log
        </FileNamePattern>
    </rollingPolicy>
    <layout class="ch.qos.logback.classic.PatternLayout">
        <pattern>%date{"yyyy-MM-dd,HH:mm:ss,SSS"}||%msg%n</pattern>
    </layout>
</appender>
<!-- Standard log -->
<logger name="standard" additivity="false">
    <level value="DEBUG" />
    <appender-ref ref="standardlog" />
</logger>
<logger name="com.cdc.uncs" additivity="false">
    <level value="DEBUG" />
    <appender-ref ref="uncslog" />
</logger>

IX. Asynchronous Services

Uncs can also run services asynchronously.

// Start the service asynchronously
UncsService.ayncStartService
// Get the current state of the service
UncsService.getResponse
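
The signatures of these two methods are not shown here, so the sketch below only illustrates the start-then-poll pattern they suggest, using the JDK's CompletableFuture. `AsyncServiceSketch` and its methods are hypothetical names, and returning null while the service is still running is an assumption, not documented uncs behavior.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of start-then-poll; not the uncs implementation.
class AsyncServiceSketch {

    private final Map<String, CompletableFuture<String>> running = new ConcurrentHashMap<>();

    /** Starts the service in the background and returns immediately. */
    void asyncStart(String crawlId, Supplier<String> service) {
        running.put(crawlId, CompletableFuture.supplyAsync(service));
    }

    /** Returns the response if the service has finished, or null while it is still running. */
    String getResponse(String crawlId) {
        CompletableFuture<String> future = running.get(crawlId);
        if (future == null) {
            throw new IllegalArgumentException("unknown crawlId: " + crawlId);
        }
        return future.getNow(null);
    }

    /** Blocks until the service has finished; used here only to make the demo deterministic. */
    String await(String crawlId) {
        return running.get(crawlId).join();
    }

    public static void main(String[] args) {
        AsyncServiceSketch service = new AsyncServiceSketch();
        service.asyncStart("c1", () -> "done");
        service.await("c1");
        System.out.println(service.getResponse("c1"));
    }
}
```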

X. Version History

See "uncs Submission History.md" for details.
The latest version is 3.0.0.6.

XI. Future Plans

  • Improve code quality: refine the HTTP initialization code (optimized) and the cookie handling code (done)
  • Let a part hold the context so that some methods no longer need a context parameter (done)
  • Provide tools for quickly generating code
  • Provide visualization tools for viewing the state of a crawlId at any time
  • Integrate other excellent crawler frameworks as corresponding templates
  • Provide a stand-alone mode that uses local storage instead of Redis
  • Provide concurrent step templates to improve speed (done)

XII. Using Fiddler

1. Upgrade to version 2.3.0.1-SNAPSHOT or later

2. Add the VM parameter -Duncs.useFidder=1

3. In Fiddler: Tools -> Fiddler Options -> HTTPS -> Actions -> Export Root Certificate to …

4. <JDK_Home>\bin\keytool.exe -import -file C:\Users\<Username>\Desktop\FiddlerRoot.cer -keystore FiddlerKeystore -alias Fiddler
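
For reference, this is how a plain JVM HTTP client is usually pointed at Fiddler once the certificate has been imported. Whether uncs needs any of this in addition to its own flag is not stated in this document, so treat the sketch as background on the standard JVM properties only. `FiddlerJvmSketch` is a hypothetical helper; 127.0.0.1:8888 is Fiddler's default listening address, and the keystore path and password are whatever was chosen in the keytool step.

```java
import java.util.Properties;

// Hypothetical helper showing the standard JVM proxy and trust store properties.
class FiddlerJvmSketch {

    /** Builds the JVM properties that route HTTP(S) traffic through a local Fiddler. */
    static Properties fiddlerProperties(String keystorePath, String keystorePassword) {
        Properties p = new Properties();
        p.setProperty("http.proxyHost", "127.0.0.1");
        p.setProperty("http.proxyPort", "8888");
        p.setProperty("https.proxyHost", "127.0.0.1");
        p.setProperty("https.proxyPort", "8888");
        // Trust the keystore that holds Fiddler's exported root certificate.
        p.setProperty("javax.net.ssl.trustStore", keystorePath);
        p.setProperty("javax.net.ssl.trustStorePassword", keystorePassword);
        return p;
    }

    /** Copies the properties into the running JVM; call this before any connection is opened. */
    static void apply(Properties p) {
        for (String name : p.stringPropertyNames()) {
            System.setProperty(name, p.getProperty(name));
        }
    }

    public static void main(String[] args) {
        // Placeholder path and password for illustration.
        Properties p = fiddlerProperties("FiddlerKeystore", "changeit");
        p.list(System.out);
    }
}
```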

Author: Liu Pengfei, Yixin Institute of Technology