Java, Web crawler



HttpClient is one of the most commonly used tools in Java development; people typically use it to call remote sites through its relatively basic APIs. Only after developing crawlers for a long time do you come into contact with the less commonly used corners of the HttpClient API, and run into all kinds of pits along the way. The pits encountered over these years are summarized below.

Pit, pit, pit

I. Received fatal alert: handshake_failure

  • Resolution process

While developing a crawler for a provincial mobile site, loading the home-page title kept failing. All kinds of approaches were tried with no luck; eventually the problem turned out to be related to the JDK version.
After a week-long search through the source code, the error was traced to an encryption algorithm that is unsupported during the TLS handshake:
256-bit cipher suites (TLS_DHE_RSA_WITH_AES_256_CBC_SHA) are not supported by default in JDK 1.8 and below.
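To see whether the failing suite is even available, the JDK's own SSLContext can list the cipher suites it supports; a quick stdlib check (the suite name is taken from the error above):

```java
import javax.net.ssl.SSLContext;
import java.util.Arrays;

public class CipherCheck {
    public static void main(String[] args) throws Exception {
        // list the cipher suites the current JDK actually supports
        SSLContext ctx = SSLContext.getDefault();
        String[] suites = ctx.getSupportedSSLParameters().getCipherSuites();
        boolean has256 = Arrays.asList(suites)
                .contains("TLS_DHE_RSA_WITH_AES_256_CBC_SHA");
        System.out.println("AES-256 DHE suite available: " + has256);
    }
}
```

If the suite is missing, the handshake with a server that insists on it fails exactly as described above.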

  • Solution

1. Download the JCE unlimited-strength policy extension package …

2. Replace the two jars (local_policy.jar and US_export_policy.jar) in <JDK_Home>/jre/lib/security/ with the ones from the package.

3. If the same error is still reported after overwriting them, the downloaded package does not match your JDK; download the one for the corresponding JDK version.
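After installing the policy files, the effect can be verified from code: javax.crypto.Cipher reports the maximum key length the active policy allows. A small sanity check:

```java
import javax.crypto.Cipher;

public class JceCheck {
    public static void main(String[] args) throws Exception {
        // With the unlimited-strength policy installed this returns
        // Integer.MAX_VALUE; the old restricted policy caps AES at 128 bits
        int max = Cipher.getMaxAllowedKeyLength("AES");
        System.out.println("max AES key length: " + max);
        System.out.println(max >= 256 ? "unlimited policy active"
                : "restricted policy - install the JCE jars");
    }
}
```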

II. Certificates

  • Resolution process

When packaging with mvn, an error is reported:

java.security.cert.CertificateException: Certificates does not conform to algorithm constraints

The reason is that since Java 1.6, the JRE's java.security configuration file treats the MD2 digest as too weak to be allowed, and it also rejects RSA keys shorter than 1024 bits. The fix is to comment out the line:

#jdk.certpath.disabledAlgorithms=MD2, RSA keySize < 1024

However, doing it this way means changing every machine, and if a new machine is forgotten, the problem resurfaces. We need a way to solve this purely at the code level.
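As an aside, the same jdk.certpath.disabledAlgorithms property can also be read and overridden at runtime through java.security.Security. A hedged sketch, with two caveats: it must run before the certificate-path classes first read the property, and clearing the constraints weakens security:

```java
import java.security.Security;

public class DisabledAlgorithms {
    public static void main(String[] args) {
        // read the current constraint (may be null if not configured)
        String before = Security.getProperty("jdk.certpath.disabledAlgorithms");
        System.out.println("before: " + before);
        // override at runtime -- only effective if it happens before the
        // JVM's certificate-path validation code caches the property
        Security.setProperty("jdk.certpath.disabledAlgorithms", "");
        System.out.println("after: "
                + Security.getProperty("jdk.certpath.disabledAlgorithms"));
    }
}
```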

  • Solution

After checking the source code, we found the location that triggers the problem. It can be worked around by inheriting from SSLContextBuilder and forcing the private keymanagers and trustmanagers fields to be null.

static class MySSLContextBuilder extends SSLContextBuilder {
   static final String TLS = "TLS";
   static final String SSL = "SSL";
   private String protocol;
   private Set<KeyManager> keymanagers;
   private Set<TrustManager> trustmanagers;
   private SecureRandom secureRandom;
   public MySSLContextBuilder() {
      this.keymanagers = new HashSet<KeyManager>();
      this.trustmanagers = new HashSet<TrustManager>();
   }
   // build with null key/trust managers so the JVM defaults apply and
   // the algorithm-constraint check on custom managers is never triggered
   public SSLContext build() throws NoSuchAlgorithmException, KeyManagementException {
      final SSLContext sslcontext = SSLContext.getInstance(
            this.protocol != null ? this.protocol : TLS);
      sslcontext.init(null, null, this.secureRandom);
      return sslcontext;
   }
}
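The trick rests on how SSLContext.init treats null arguments: passing null key and trust managers makes the JVM fall back to its default implementations. A minimal stdlib illustration:

```java
import javax.net.ssl.SSLContext;

public class DefaultContext {
    public static void main(String[] args) throws Exception {
        // null key managers and null trust managers mean
        // "use the JVM's installed defaults"
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, null, null);
        System.out.println("protocol: " + ctx.getProtocol());
    }
}
```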

III. The timeout setting does not take effect

  • Resolution process

The redirect flag was being set through the old parameter API, but it never took effect:

    httpGet.getParams().setParameter(ClientPNames.HANDLE_REDIRECTS, !isAutoRelocal);

  • Solution

request.getParams().setParameter() is a deprecated method, and every parameter it accepts has a counterpart in RequestConfig, so the old parameters can be traversed and migrated once:

boolean isRedirect = true;
if (request != null) {
    HttpParams params = request.getParams();
    if (params instanceof HttpParamsNames) {
        // only this params type is supported for now
        isRedirect = params.getBooleanParameter(ClientPNames.HANDLE_REDIRECTS, true);
    }
    // clear the deprecated params on the request
    request.setParams(new BasicHttpParams());
}
RequestConfig.Builder builder;
if (timeOut > 0) {
    builder = RequestConfig.custom().setConnectionRequestTimeout(timeOut)
            .setConnectTimeout(timeOut).setSocketTimeout(timeOut)
            .setRedirectsEnabled(isRedirect).setCookieSpec(CookieSpecs.BEST_MATCH);
} else {
    builder = RequestConfig.custom().setConnectionRequestTimeout(connectionTimeout)
            .setConnectTimeout(connectionTimeout).setRedirectsEnabled(isRedirect)
            .setSocketTimeout(socketTimeout).setCookieSpec(CookieSpecs.BEST_MATCH);
}
RequestConfig requestConfig = builder.build();
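For comparison, the JDK's built-in HttpURLConnection exposes two of these timeouts directly (it has no equivalent of the pool-level connection-request timeout); a small stdlib sketch with an illustrative URL:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutDemo {
    public static void main(String[] args) throws Exception {
        // illustrative URL; openConnection() does not touch the network yet
        URL url = new URL("http://example.com/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // connect timeout: time allowed to establish the TCP connection
        conn.setConnectTimeout(5000);
        // read timeout: time to wait for data once connected
        conn.setReadTimeout(10000);
        System.out.println("connect=" + conn.getConnectTimeout()
                + " read=" + conn.getReadTimeout());
    }
}
```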

IV. Fiddler monitoring

  • Problem

Fiddler is often used to monitor network requests when developing crawlers, but getting HttpClient traffic to show up in it is surprisingly hard, and the various methods found online often do not work.
The following approach solves the problem and makes monitoring with Fiddler painless.

  • Solution
// client builder
HttpClientBuilder builder = HttpClients.custom();
if (useFidder) {
    // Fiddler proxy hard-coded by default (Fiddler listens on port 8888)
    builder.setProxy(new HttpHost("", 8888));
}

For HTTPS, additionally export Fiddler's root certificate and import it into a keystore for the JVM:

tools -> fiddler options -> https -> actions -> export root certificate to ...
<JDK_Home>\bin\keytool.exe -import -file C:\Users\<Username>\Desktop\FiddlerRoot.cer -keystore FiddlerKeystore -alias Fiddler
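Alternatively, for code that uses the JDK's own HTTP stack, traffic can be routed through Fiddler via the standard proxy system properties; a sketch assuming Fiddler's default listener of 127.0.0.1:8888:

```java
public class FiddlerProxy {
    public static void main(String[] args) {
        // Route JDK HTTP(S) traffic through a local proxy such as Fiddler.
        // 127.0.0.1:8888 is Fiddler's default listening address, an
        // assumption here -- adjust if Fiddler is configured differently.
        System.setProperty("http.proxyHost", "127.0.0.1");
        System.setProperty("http.proxyPort", "8888");
        System.setProperty("https.proxyHost", "127.0.0.1");
        System.setProperty("https.proxyPort", "8888");
        System.out.println("proxy=" + System.getProperty("https.proxyHost")
                + ":" + System.getProperty("https.proxyPort"));
    }
}
```

These properties must be set before the first connection is opened; they do not affect HttpClient, which takes its proxy from setProxy as shown above.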

V. gzip support

  • Problems and Solutions

Some websites return gzip-compressed responses: the returned content is the compressed bytes and needs to be decompressed before use.
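The decompression itself is plain gzip, which the JDK handles with GZIPInputStream; a self-contained round-trip showing what the wrapping entity does on read:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    public static void main(String[] args) throws Exception {
        // compress, standing in for the server side
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write("hello crawler".getBytes(StandardCharsets.UTF_8));
        }
        byte[] compressed = bos.toByteArray();
        // decompress, which is what the interceptor's wrapper entity
        // does transparently when the response body is read
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in =
                new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) > 0) out.write(buf, 0, n);
        }
        System.out.println(out.toString("UTF-8")); // prints "hello crawler"
    }
}
```

With HttpClient itself, the cleaner route is the response interceptor below, which wraps the entity so callers never see compressed bytes.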

HttpClient wrappedHttpClient = builder.setUserAgent(requestUA)
        .addInterceptorLast(new HttpResponseInterceptor() {
            public void process(HttpResponse httpResponse, HttpContext httpContext)
                    throws HttpException, IOException {
                HttpEntity httpEntity = httpResponse.getEntity();
                Header header = httpEntity.getContentEncoding();
                if (header != null) {
                    for (HeaderElement element : header.getElements()) {
                        // wrap the entity so it is decompressed on read
                        if ("gzip".equalsIgnoreCase(element.getName())) {
                            httpResponse.setEntity(
                                    new GzipDecompressingEntity(httpResponse.getEntity()));
                        }
                    }
                }
            }
        })
        .build();



Author: Liu Pengfei, Yixin Institute of Technology