HttpClient

  Java, Web crawler

Preface

HttpClient is one of the most commonly used tools in Java development; it is usually what people reach for when calling remote sites. Developers who write crawlers for long enough move past the relatively basic API, come into contact with HttpClient's less commonly used corners, and step into all kinds of pits along the way. The pits I have encountered over these years are summarized below.

Pits, pits, pits

I. Received alert: handshake_failure

  • Resolution process

While developing a crawler for a provincial mobile-carrier site, loading the home page kept throwing an error. All kinds of attempts went nowhere; eventually it turned out that things worked fine under JDK 1.8.
After a week of digging through source code, the root cause was found: a cipher suite the JDK did not support during the TLS handshake.
The 256-bit suite TLS_DHE_RSA_WITH_AES_256_CBC_SHA is not supported out of the box in JDK 1.8 and below.

  • Solution

1. Download the JCE unlimited-strength extension package: http://www.oracle.com/technet …

2. Overwrite the two JARs (local_policy.jar and US_export_policy.jar) in JAVA_HOME/jre/lib/security/

3. If an error is still reported after overwriting them, the downloaded package does not match your JDK; download the one for the corresponding JDK version. A quick programmatic check is sketched below.
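To verify that the unlimited-strength policy actually took effect, a minimal sketch like the following can help (the class name is made up; Cipher.getMaxAllowedKeyLength is a standard JCE call):

import javax.crypto.Cipher;

public class JcePolicyCheck {
    public static void main(String[] args) throws Exception {
        // Prints 128 under the default (limited) policy;
        // prints 2147483647 (Integer.MAX_VALUE) once the unlimited-strength
        // JARs are in place, so 256-bit suites like
        // TLS_DHE_RSA_WITH_AES_256_CBC_SHA can be negotiated.
        System.out.println(Cipher.getMaxAllowedKeyLength("AES"));
    }
}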

II. Certificate errors

  • Resolution process

When packaging with mvn, an error is reported:
java.security.cert.CertificateException: Certificates does not conform to algorithm constraints
The reason is that since Java 1.6, the security configuration file treats MD2 as too weak, so certificates signed with MD2 are rejected, and so are RSA keys shorter than 1024 bits.
The fix is to edit
JAVA_HOME/jre/lib/security/java.security
and comment out this line:
#jdk.certpath.disabledAlgorithms=MD2, RSA keySize < 1024

However, this means changing every machine, and if someone forgets on a new machine, the problem comes back. We need a way to solve it at the code level only.
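As an aside, a commonly cited code-level workaround is to relax the constraint via a security property; this is a sketch under the assumption that it runs before any certificate-path code is touched, not the approach this article takes (which follows below), and it weakens certificate checking globally:

import java.security.Security;

// Must execute before any certificate-path validation is triggered;
// an empty value re-enables MD2 signatures and short RSA keys.
Security.setProperty("jdk.certpath.disabledAlgorithms", "");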

  • Solution

After reading the source code, we found the spot that triggers the check. We can get around it by subclassing SSLContextBuilder and forcing the private keymanagers and trustmanagers fields to stay empty.

import java.security.SecureRandom;
import java.util.HashSet;
import java.util.Set;
import javax.net.ssl.KeyManager;
import javax.net.ssl.TrustManager;
import org.apache.http.conn.ssl.SSLContextBuilder;

static class MySSLContextBuilder extends SSLContextBuilder {
   static final String TLS   = "TLS";
   static final String SSL   = "SSL";
   private String protocol;
   // Shadow the parent's private fields so the key/trust managers this
   // builder tracks stay empty and never reach the constraint check.
   private Set<KeyManager> keymanagers;
   private Set<TrustManager> trustmanagers;
   private SecureRandom secureRandom;

   public MySSLContextBuilder() {
      super();
      this.keymanagers = new HashSet<KeyManager>();
      this.trustmanagers = new HashSet<TrustManager>();
   }
}
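A hypothetical wiring sketch, assuming HttpClient 4.3-era APIs (the ALLOW_ALL hostname verifier is for illustration only and further weakens checking):

// Build an SSLContext from the custom builder and plug it into a client.
SSLContext sslContext = new MySSLContextBuilder().useTLS().build();
SSLConnectionSocketFactory socketFactory = new SSLConnectionSocketFactory(
        sslContext, SSLConnectionSocketFactory.ALLOW_ALL_HOSTNAME_VERIFIER);
CloseableHttpClient httpClient = HttpClients.custom()
        .setSSLSocketFactory(socketFactory)
        .build();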

III. The timeout setting does not take effect

  • Resolution process
Many people go looking for examples online when using HttpClient, and the examples often contain a setting like this:
    httpGet.getParams().setParameter(ClientPNames.HANDLE_REDIRECTS, !isAutoRelocal);

With the above in place, when HttpClient reads its configuration while sending the request, it finds that getParams() is not empty, discards every parameter configured before, and uses only what is set there, so the timeout settings are lost.
  • Solution

request.getParams().setParameter(...) is a deprecated API, and each parameter has a counterpart in RequestConfig, so we can walk the old parameters once and carry them over:

boolean isRedirect = true;
if (request != null) {
    HttpParams params = request.getParams();
    if (params instanceof HttpParamsNames) {
        // Only this type is supported for now
        isRedirect = params.getBooleanParameter(
                ClientPNames.HANDLE_REDIRECTS, true);
    }
    // Clear the legacy params so they cannot override RequestConfig
    request.setParams(new BasicHttpParams());
}
if (timeOut > 0) {
    builder = RequestConfig.custom()
            .setConnectionRequestTimeout(timeOut)
            .setConnectTimeout(timeOut)
            .setSocketTimeout(timeOut)
            .setRedirectsEnabled(isRedirect)
            .setCookieSpec(CookieSpecs.BEST_MATCH);
} else {
    builder = RequestConfig.custom()
            .setConnectionRequestTimeout(connectionTimeout)
            .setConnectTimeout(connectionTimeout)
            .setRedirectsEnabled(isRedirect)
            .setSocketTimeout(socketTimeout)
            .setCookieSpec(CookieSpecs.BEST_MATCH);
}
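The built RequestConfig still has to be attached to the request for the timeouts to apply; a small sketch, assuming the request is an HttpRequestBase (as HttpGet and HttpPost are):

// Attach the rebuilt config so the timeouts actually take effect.
RequestConfig config = builder.build();
if (request instanceof HttpRequestBase) {
    ((HttpRequestBase) request).setConfig(config);
}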

IV. Fiddler monitoring

  • Problem

Fiddler is often used to monitor network requests when developing crawlers, but getting it to work with HttpClient is surprisingly hard, and the various methods found online rarely pan out.
Let me step on this pit for everyone: the approach below solves the problem nicely and makes monitoring with Fiddler painless.

  • Solution
First, on the Java side:
// client builder
HttpClientBuilder builder = HttpClients.custom();
if (useFidder) {
    // Fiddler's default address, hard-coded here
    builder.setProxy(new HttpHost("127.0.0.1", 8888));
}
Then, on the Fiddler side:
Tools -> Fiddler Options -> HTTPS -> Actions -> Export Root Certificate to ...
<JDK_Home>\bin\keytool.exe -import -file C:\Users\<Username>\Desktop\FiddlerRoot.cer -keystore FiddlerKeystore -alias Fiddler
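The JVM then has to be told to trust that keystore; a minimal sketch, assuming the FiddlerKeystore file produced above ("changeit" is a placeholder for whatever password you chose during the keytool import):

// Point the JVM at the keystore that holds Fiddler's root certificate.
System.setProperty("javax.net.ssl.trustStore", "FiddlerKeystore");
System.setProperty("javax.net.ssl.trustStorePassword", "changeit"); // placeholder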

V. gzip support

  • Problems and Solutions

Some websites return gzip-compressed responses: the body arrives compressed and must be decompressed before use. A response interceptor can handle this transparently:

HttpClient wrappedHttpClient = builder.setUserAgent(requestUA)
        .addInterceptorLast(new HttpResponseInterceptor() {
            @Override
            public void process(HttpResponse httpResponse, HttpContext httpContext) throws HttpException, IOException {
                HttpEntity httpEntity = httpResponse.getEntity();
                if (httpEntity == null) {
                    return; // e.g. HEAD requests or 204 responses carry no entity
                }
                Header header = httpEntity.getContentEncoding();
                if (header != null) {
                    for (HeaderElement element : header.getElements()) {
                        if ("gzip".equalsIgnoreCase(element.getName())) {
                            // Wrap the entity so callers read decompressed content
                            httpResponse.setEntity(new GzipDecompressingEntity(httpResponse.getEntity()));
                        }
                    }
                }
            }
        })
        .build();
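A hypothetical usage sketch (the URL is made up; some servers only send gzip when Accept-Encoding asks for it):

HttpGet get = new HttpGet("http://example.com/"); // hypothetical URL
get.setHeader("Accept-Encoding", "gzip");         // invite a gzip response
HttpResponse response = wrappedHttpClient.execute(get);
// The interceptor has already swapped in a GzipDecompressingEntity,
// so this reads plain, decompressed text:
String body = EntityUtils.toString(response.getEntity(), "UTF-8");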

Summary

The above are the pits I can recall; there are plenty more out there, and you are welcome to come discuss them.
A quick plug: if you want a simple way to develop crawlers, you are welcome to try uncs.

Author: Liu Pengfei, Yixin Institute of Technology