The Pain of Expanding a ZooKeeper Cluster

  extend, Migration, zookeeper

I. Background

Because of a hard company requirement, the production VM servers have to be migrated to ZStack virtualization servers. After checking the servers used by our project, the ZooKeeper cluster turned out to be one of the services that needs to be migrated.

II. Migration Plan

So that the migration does not affect the business, it is best to expand the cluster first and then shrink it.


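Notes:
1. The original production cluster is a 3-node ZK cluster made up of VM-1, VM-2, and VM-3;
2. Expand the cluster to 6 nodes (adding ZS-1, ZS-2, and ZS-3) and let data synchronization complete;
3. Shrink the cluster by taking the three original nodes (VM-1, VM-2, VM-3) offline;
4. Switch the address that nginx resolves to.

OK! The goal is clear and the process is straightforward, so let's get to work.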

III. Steps (the process was verified in the test environment without problems):

  1. Configure ZooKeeper on the three new servers the same way as on the old cluster, preferably using the same version (the author uses 3.4.6);
  2. Add the addresses of the new nodes to zoo.cfg on each old node (one by one), and then restart the newly added nodes one by one.
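
Step 2 amounts to giving every node the same list of server.N entries. The sketch below only prints such a list; `print_server_entries` is a hypothetical helper, using the node labels as host names is an assumption, and 2888:3888 are the peer and election ports commonly shown in ZooKeeper examples rather than values taken from this cluster.

```shell
#!/bin/sh
# Print one server.N line per host, numbering the hosts from 1.
# Assumption: every host uses ports 2888 (quorum) and 3888 (election).
print_server_entries() {
    id=1
    for host in "$@"; do
        echo "server.${id}=${host}:2888:3888"
        id=$((id + 1))
    done
}

# Example call (host names are placeholders):
#   print_server_entries VM-1 VM-2 VM-3 ZS-1 ZS-2 ZS-3
```

The same lines would then be appended to zoo.cfg on each node, and each node's myid must match its own N.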


IV. Issues

  • ZS-1 started successfully, but checking it with zkServer.sh status reported an error. The output was as follows:
[root@localhost bin]# ./zkServer.sh  status
ZooKeeper JMX enabled by default
Using config: /usr/zookeeper/zookeeper-3.4.6/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.
  • Checking the data at this point showed that data synchronization on ZS-1 was normal, but the node's status information could not be viewed.
  • I suspected that the old nodes had not been restarted. Checking the original cluster nodes showed that their status was abnormal as well. Further investigation showed that the original cluster had been in this abnormal state all along.
  • My initial diagnosis was that leader election in the original cluster was abnormal, which prevented the new nodes from joining properly, so I kept investigating.
  • After restoring the cluster to its initial state, the status of the cluster nodes still could not be viewed. OK, keep troubleshooting. …

V. Troubleshooting process

The following checks were gathered from the Internet.

There are several possible causes:

First, the zoo.cfg configuration: the directory specified by dataLogDir has not been created.

1. zoo.cfg
[root@SIA-215 conf]# cat zoo.cfg
...
dataDir=/app/zookeeperdata/data
dataLogDir=/app/zookeeperdata/log
...

2. Directories
[root@SIA-215 conf]# cd /app/zookeeperdata/
[root@SIA-215 zookeeperdata]# ll
total 8
drwxr-xr-x 3 root root 4096 Apr 23 19:59 data
drwxr-xr-x 3 root root 4096 Aug 29  2015 log

After checking, this cause was ruled out.
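
The manual check above can be sketched as a small script. `cfg_value` and `check_zk_dirs` are hypothetical helper names, and the sketch assumes plain `key=value` lines without spaces around the equals sign.

```shell
#!/bin/sh
# Print the value of a key from a zoo.cfg-style file (last match wins).
cfg_value() {
    # $1 = key, $2 = path to the config file
    grep "^[[:space:]]*$1=" "$2" | tail -n 1 | sed -e 's/.*=//'
}

# Report whether the directories named by dataDir and dataLogDir exist.
# Returns 1 if a configured directory is missing.
check_zk_dirs() {
    cfg=$1
    rc=0
    for key in dataDir dataLogDir; do
        dir=$(cfg_value "$key" "$cfg")
        if [ -z "$dir" ]; then
            echo "$key: not set in $cfg"
        elif [ -d "$dir" ]; then
            echo "$key: $dir exists"
        else
            echo "$key: $dir is missing"
            rc=1
        fi
    done
    return $rc
}

# Example call (path is a placeholder):
#   check_zk_dirs /path/to/conf/zoo.cfg
```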

Second, the integer in the myid file is malformed or does not match the server number in zoo.cfg.

[root@SIA-215 data]# cd /app/zookeeperdata/data
[root@SIA-215 data]# cat myid 
2[root@SIA-215 data]# 

After checking, this cause was ruled out as well.
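
A stricter version of this check can be sketched as follows. `check_myid` is a hypothetical helper; it treats anything other than a bare run of digits on the first line as malformed, which is an assumption chosen so that stray spaces or carriage returns are flagged rather than silently accepted.

```shell
#!/bin/sh
# Succeed only if the myid file holds a plain integer that has a
# matching server.N line in the given zoo.cfg.
check_myid() {
    myid_file=$1
    cfg=$2
    id=$(head -n 1 "$myid_file")
    case $id in
        ''|*[!0-9]*)
            echo "myid is not a plain integer"
            return 1 ;;
    esac
    if grep -q "^[[:space:]]*server\.$id=" "$cfg"; then
        echo "myid $id matches a server.$id entry"
    else
        echo "no server.$id entry in $cfg"
        return 1
    fi
}

# Example call (paths are placeholders):
#   check_myid /path/to/data/myid /path/to/conf/zoo.cfg
```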

Third, the firewall has not been turned off.

Use service iptables stop to shut down the firewall;
Confirm with service iptables status;
Use chkconfig iptables off to keep the firewall disabled after a reboot.

Verify that the firewall is turned off.

[root@localhost ~]# service iptables status
iptables: Firewall is not running.
The firewall is confirmed to be off.

Fourth, the port is occupied.


[root@localhost bin]# netstat -tunlp | grep 2181
tcp        0      0 :::12181                    :::*                        LISTEN      30035/java          
tcp        0      0 :::22181                    :::*                        LISTEN      30307/java 

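The port is confirmed not to be occupied.

Fifth, a wrong host name in the zoo.cfg file.

Testing in the test environment showed that the host names are correct and that multi-domain name resolution works, so this problem does not exist.

Sixth, the hosts file contains two mappings for this machine's host name; only the mapping between the host name and its IP address should be kept.

Testing in the test environment showed that the host names are correct and that multi-domain name resolution works, so this problem does not exist either. Ruled out.

Seventh, there is a problem with the nc command in zkServer.sh.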


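 The nc command may not be installed on the machine. Another suggestion is to find this line in zkServer.sh:
 STAT=`echo stat | nc localhost $(grep clientPort "$ZOOCFG" | sed -e 's/.*=//') 2> /dev/null| grep Mode`
 and add -q 1 between nc and localhost (the digit 1, not the letter l).
 
 The ZooKeeper version here is 3.4.6, and its zkServer.sh does not contain this line at all (the statement that fetches the status does not use the nc command):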

 # -q is necessary on some versions of linux where nc returns too quickly, and no stat result is output
    clientPortAddress=`grep "^[[:space:]]*clientPortAddress[^[:alpha:]]" "$ZOOCFG" | sed -e 's/.*=//'`
    if ! [ $clientPortAddress ]
    then
        clientPortAddress="localhost"
    fi
    clientPort=`grep "^[[:space:]]*clientPort[^[:alpha:]]" "$ZOOCFG" | sed -e 's/.*=//'`
    STAT=`"$JAVA" "-Dzookeeper.log.dir=${ZOO_LOG_DIR}" "-Dzookeeper.root.logger=${ZOO_LOG4J_PROP}" \
             -cp "$CLASSPATH" $JVMFLAGS org.apache.zookeeper.client.FourLetterWordMain \
             $clientPortAddress $clientPort srvr 2> /dev/null    \
          | grep Mode`
    if [ "x$STAT" = "x" ]
    then
        echo "Error contacting service. It is probably not running."
        exit 1
    else
        echo $STAT
        exit 0
    fi
    ;;
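
For comparison, the status query that the excerpt above performs through FourLetterWordMain can also be sketched by hand with the same `srvr` command. Both function names are hypothetical, `zk_mode` only works where nc is installed, and some nc builds may need the `-q 1` option mentioned above.

```shell
#!/bin/sh
# Print the value of the "Mode:" line from srvr output read on stdin.
parse_mode() {
    sed -n 's/^Mode: //p'
}

# Hypothetical helper: send the srvr four-letter word to a node and
# print its mode. $1 = host, $2 = client port.
zk_mode() {
    echo srvr | nc "$1" "$2" 2> /dev/null | parse_mode
}

# Example call (host and port are placeholders):
#   zk_mode localhost 2181
```

An empty result here corresponds to the empty STAT that makes zkServer.sh print its error.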

VI. My own troubleshooting:

At present, data synchronization in the old cluster is normal and leader election works (confirmed from the logs), but the node status cannot be viewed, and the error message is the same as the one shown above. Data cannot be synchronized once the cluster is expanded.

Resolution:

1. Try starting in foreground mode: pick a non-leader node and restart it, so that the startup log can be viewed in the foreground.

zkServer.sh start-foreground

The node started normally with no abnormal output.

2. Read the shell script: analyze zkServer.sh

  • Find where "Error contacting service." is printed:

STAT=`"$JAVA" "-Dzookeeper.log.dir=${ZOO_LOG_DIR}" "-Dzookeeper.root.logger=${ZOO_LOG4J_PROP}" \
             -cp "$CLASSPATH" $JVMFLAGS org.apache.zookeeper.client.FourLetterWordMain \
             $clientPortAddress $clientPort srvr 2> /dev/null    \
          | grep Mode`
    if [ "x$STAT" = "x" ]
    then
        echo "Error contacting service. It is probably not running."
        exit 1
    else
        echo $STAT
        exit 0
    fi
    ;;
  • From this excerpt of the script, we can tentatively conclude that the problem lies in how $STAT is obtained: if the STAT variable is empty, the script prints "Error contacting service."

OK, so what exactly is this $STAT?

if [ "x$STAT" = "x" ]
then
    echo "Error contacting service. It is probably not running."
    exit 1
else
    echo $STAT
    exit 0
fi

3. Try shell debug mode to trace the execution:

  • An excerpt of the execution log follows. The STAT variable is indeed empty, which causes the "Error contacting service." output.

++ grep '^[[:space:]]*clientPort[^[:alpha:]]' /app/zookeeper-3.4.6/bin/../conf/zoo.cfg
+ clientPort=5181
++ grep Mode
++ /opt/jdk1.8.0_131/bin/java -Dzookeeper.log.dir=. -Dzookeeper.root.logger=INFO,CONSOLE -cp '/app/zookeeper-3.4.6/bin/../build/classes:/app/zookeeper-3.4.6/bin/../build/lib/*.jar:/app/zookeeper-3.4.6/bin/../lib/slf4j-log4j12-1.6.1.jar:/app/zookeeper-3.4.6/bin/../lib/slf4j-api-1.6.1.jar:/app/zookeeper-3.4.6/bin/../lib/netty-3.7.0.Final.jar:/app/zookeeper-3.4.6/bin/../lib/log4j-1.2.16.jar:/app/zookeeper-3.4.6/bin/../lib/jline-0.9.94.jar:/app/zookeeper-3.4.6/bin/../zookeeper-3.4.6.jar:/app/zookeeper-3.4.6/bin/../src/java/lib/*.jar:/app/zookeeper-3.4.6/bin/../conf:.:/opt/jdk1.8.0_131/lib/dt.jar:/opt/jdk1.8.0_131/lib/tools.jar' org.apache.zookeeper.client.FourLetterWordMain localhost 5181 srvr
+ STAT=
+ '[' x = x ']'
+ echo 'Error contacting service. It is probably not running.'
Error contacting service. It is probably not running.
+ exit 1
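
A trace like the one above is what `bash -x` produces. The article does not show the exact command that was used, so the following is only a sketch; `trace_var` is a hypothetical helper that runs a script under `bash -x` and keeps the trace lines that assign a given variable.

```shell
#!/bin/sh
# Run a script with xtrace enabled and print only the trace lines
# that assign the named variable.
# $1 = variable name, remaining arguments = script and its arguments.
trace_var() {
    var=$1
    shift
    # The trace goes to stderr: send it down the pipe, drop normal output.
    bash -x "$@" 2>&1 > /dev/null | grep "^+* *$var="
}

# Example call, run from ZooKeeper's bin directory:
#   trace_var STAT ./zkServer.sh status
```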

4. Modify the shell script: add a statement to zkServer.sh that outputs the content of STAT. This time the output is not filtered through grep, and stderr is written to test.log instead of /dev/null.


STAT1=`"$JAVA" "-Dzookeeper.log.dir=${ZOO_LOG_DIR}" "-Dzookeeper.root.logger=${ZOO_LOG4J_PROP}" \
             -cp "$CLASSPATH" $JVMFLAGS org.apache.zookeeper.client.FourLetterWordMain \
             $clientPortAddress $clientPort srvr 2> test.log`

echo "$STAT1"
  • It is best to copy the script and modify the copy, so as not to pollute the original script. This is what I did. Then run the script.

[root@localhost bin]# ./zkServer.sh  status
ZooKeeper JMX enabled by default
Using config: /usr/zookeeper/zookeeper-3.4.10/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.
  • Then look at the generated test.log file: it does contain an exception.

Exception in thread "main" java.lang.NumberFormatException: For input string: "2181
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.parseInt(Integer.java:527)
at org.apache.zookeeper.client.FourLetterWordMain.main(FourLetterWordMain.java:76)
  • Judging from the log, the error comes from the port number: the string read as 2181 is not a valid integer.

zkServer.sh contains this line:

clientPort=`grep "^[[:space:]]*clientPort[^[:alpha:]]" "$ZOOCFG" | sed -e 's/.*=//'`
When the script runs, the command actually executed is:
grep '^[[:space:]]*clientPort[^[:alpha:]]' /app/zookeeper-3.4.6/bin/../conf/zoo.cfg | sed -e 's/.*=//'
  • At this point, it is basically confirmed that the configuration file is the problem.
  • Replacing the configuration file and restarting solved the problem.
  • The likely cause is that editing zoo.cfg changed its encoding or file format, so that the file content was parsed incorrectly.
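
One way to confirm this kind of problem before replacing the file is to extract clientPort with the same grep and sed pipeline and look at the raw bytes. `check_client_port` is a hypothetical helper. A trailing carriage return from Windows-style line endings is one plausible culprit, but that is an assumption; the article does not identify the exact stray character.

```shell
#!/bin/sh
# Extract clientPort the way zkServer.sh does and verify that the
# result consists of digits only. On failure, dump the raw bytes so
# that invisible characters (for example \r) become visible.
check_client_port() {
    cfg=$1
    port=$(grep "^[[:space:]]*clientPort[^[:alpha:]]" "$cfg" | sed -e 's/.*=//')
    case $port in
        ''|*[!0-9]*)
            echo "clientPort is not a clean integer; raw bytes:"
            printf '%s' "$port" | od -c
            return 1 ;;
        *)
            echo "clientPort=$port looks valid"
            return 0 ;;
    esac
}

# Example call (path is a placeholder):
#   check_client_port /path/to/conf/zoo.cfg
```

If the dump shows `\r`, stripping it with something like `tr -d '\r' < zoo.cfg > zoo.cfg.clean` should have the same effect as replacing the file.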

By Mao Zhengwei
