[SUMERU] Yixin's Practice of Distributed Security Service Orchestration

  Security, Service orchestration

Outline

1. The concept of distributed security service orchestration

2. Key implementation ideas behind Sumeru

3. Application scenarios

Preface

In the author's view, one essence of security defense is raising the attacker's cost, especially the time cost. From the defender's perspective, discovering potential security risks as early as possible is therefore critical, and security scanning has to be timely. While you are checking yourself, tens of thousands of attackers are also constantly probing for your weaknesses. Optimists may disagree, but security follows the barrel principle: the shortest stave is the attacker's first choice. Add the time needed to develop and deploy a verification program, and discovery can be delayed further. When a problem does occur, you are racing against time to avoid losses or to stop them promptly.

In addition, distributed technology has long been used to overcome single-machine performance bottlenecks, and developers of security products such as vulnerability scanners have been keen on distribution too: once the miss rate and false-positive rate plateau, scanning speed becomes the next area for a breakthrough.

Long security scanning cycles were also a pain point in our earlier practical work. In addition, security defense covers a whole surface rather than a single point, and covering a surface takes a good many tools, so the high cost of developing and operating scanning tools also stands out. This article therefore introduces the Yixin security team's practical experience with distributed security service orchestration. There are still many shortcomings, but we have achieved many of the expected results. We hope you find something useful or worth referring to.

Brief description of requirements

Shorten the security scanning cycle

  • Port scanning, for example, takes a long time: for a target set of 10,000+ IPs, full-port plus service fingerprint scanning was optimized from 7 hours to 30 minutes. Masscan has to hold a stable rate during port scanning (the right rate depends on the actual network environment), otherwise it misses a large number of ports, so a single-machine multi-process scheme is unreliable.
  • Make full use of server I/O and computing resources.

Reduce the development and operation cost of scanning tools

  • Provide an application development mechanism with one-click import/export and version management.
  • Let security tasks be flexibly orchestrated into a scanning workflow through a visual interface.
  • Meet day-to-day needs for rapid launch and iteration, such as intelligence gathering, target monitoring, and targeted scanning.
  • Provide an SDK and a RESTful API so that other platforms can call the service.

Security service orchestration

Some readers may be unfamiliar with security service orchestration, so here is a brief explanation. Service orchestration is a concept from microservice systems. Choreography refers to controlling how resources interact through the sequence of messages exchanged between them; the resources taking part in the interaction are all peers and there is no centralized control.

Security service orchestration can be understood as a workflow made up of a series of independent security services that call each other. As shown in the following figure, each box represents an independent security service, and the services calling each other form a complete workflow.

[Figure: a workflow of independent security services calling each other]

In general, security risk detection is a complete workflow rather than a single functional module. As the figure above shows, port scanning is used not only to detect exposed high-risk ports but also to supply targets for weak password scanning, PoC scanning, and so on. After scanning, false positives may need to be handled or alerts sent. In practice there are many open-source and commercial security products (tools) with different functions, and most of our requirements amount to combining some of their functions. Our approach is therefore to abstract security products (tools) into application-based security services. For example, two excellent port scanning tools in the industry each have their own strengths:

  1. Masscan: high-speed stateless port scanning
  2. Nmap: rich service fingerprint scanning

How can we combine the strengths of these two tools? The common practice today is to glue them together crudely with a glue language such as Python, which makes integration relatively costly. There are two main problems:

  1. The tools are hard to combine and reuse flexibly.
  2. The scanning scale is limited.

On the other hand, to keep development costs down, most in-house (Party A) teams that are able to build their own tools do secondary development on top of open-source tools. The development cost of deeply integrating tools through code is therefore undoubtedly huge. Is there a more elegant solution? Let's look at the characteristics of orchestration:

The key to orchestration is process + adaptation.

  • The process strings the individual tasks together into a workflow.
  • Adaptation connects the data passed between tasks.
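
As a minimal sketch of "process + adaptation", the Python below chains two stand-in tasks with an adapter between them. The names `Step`, `run_workflow`, `fake_port_scan`, `ports_to_targets`, and `fake_fingerprint` are hypothetical and are not Sumeru's API, and the tasks return canned data instead of calling Masscan or Nmap.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


def identity(data: Any) -> Any:
    """Default adapter: pass the previous output through unchanged."""
    return data


@dataclass
class Step:
    """One node of the workflow: adapt the incoming data, then run the task."""
    name: str
    task: Callable[[Any], Any]
    adapt: Callable[[Any], Any] = identity


def run_workflow(steps: List[Step], initial: Any) -> Any:
    """Process: run the steps in order, feeding each output to the next step."""
    data = initial
    for step in steps:
        data = step.task(step.adapt(data))
    return data


def fake_port_scan(ips):
    # Stand-in for a fast port scanner; returns canned (ip, port) pairs.
    return [(ip, port) for ip in ips for port in (22, 80)]


def ports_to_targets(pairs):
    # Adaptation: reshape (ip, port) pairs into "ip:port" target strings.
    return ["%s:%d" % (ip, port) for ip, port in pairs]


def fake_fingerprint(targets):
    # Stand-in for a service fingerprint scanner; maps each target to a name.
    names = {"22": "ssh", "80": "http"}
    return {t: names[t.rsplit(":", 1)[1]] for t in targets}


workflow = [
    Step("port_scan", fake_port_scan),
    Step("fingerprint", fake_fingerprint, adapt=ports_to_targets),
]
result = run_workflow(workflow, ["192.168.1.1"])
print(result)
```

Keeping the adapter separate from the task is the point of the sketch: either task can be reused in another workflow by swapping only the adapter.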

It seems that orchestration can solve some of our problems. To implement and verify these ideas more conveniently, we developed Sumeru, a distributed task scheduling framework written in Python. Sumeru grew out of Yixin's distributed scanner "Pick Stars", from which the underlying distributed task execution logic was extracted.

Sumeru implements orchestration through visual drag-and-drop. In both the design phase and the result display phase, task execution can be viewed as a tree. The figure below is a screenshot of Sumeru's task orchestration editor, in which applications can be arranged freely by dragging. It shows a daily comprehensive scanning plan that integrates IP parsing, Masscan port scanning, Nmap service fingerprint scanning, sensitive directory scanning, and PoC scanning into one complete workflow:

[Figure: Sumeru task orchestration editor showing the daily comprehensive scanning workflow]

A tree-shaped result view is also provided. It refreshes task execution status in real time and gives users an intuitive overview of task execution. Task status is reflected in the color of each node:

[Figure: tree-shaped result view with color-coded task status]

Implementation of the Sumeru Distributed Task Framework

According to Imperva's survey of GitHub code repositories, over 20% of the network attack tools and PoC code on GitHub are written in Python, which has become the preferred language for attackers developing such tools. Sumeru carries our hopes of optimizing existing work and our vision for future work. We also studied many existing distributed task scheduling frameworks, such as Celery, implemented in Python, and XXL-job and Elastic-job, implemented in Java, and found that none of them met our needs well. In addition, most commonly used open-source security tools are written in Python or call libraries written in Python, so Sumeru is positioned as a Python-based distributed task scheduling framework. Some key features are introduced below, and the functional architecture diagram is included for reference.

[Figure: Sumeru functional architecture diagram]

Applications: security as a service

Re-creating a basic capability that others have already built or optimized is known in the industry as reinventing the wheel. Reuse therefore means avoiding that duplication by abstracting repetitive work into something more general. Our requirement is simple: turn tools with different functions into reusable wheels. This is close to the idea of microservices, with one difference: wheels alone are not enough, they also have to be made to turn. First, how do we turn tools into wheels? There are two key steps:

  1. Transformation, i.e. application development: abstract the interaction interface of a third-party tool into an application.
  2. Assembly, i.e. orchestration design: design the calling relationships between applications.

Application development is the first step in putting this into practice, so Sumeru includes an application center that handles version management and distribution of applications and provides one-click import and export. The screenshot below shows the application center:

[Figure: screenshot of the application center]

The application concept makes security services and tools independent of each other and easier to maintain and iterate on.
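
To make the "transformation" step concrete, here is a rough sketch of wrapping a tool as an application. The `Application` base class and its two methods are hypothetical (the article does not show Sumeru's real base class). The Nmap options `-sV` and `-p` are real, but the parser only handles a simplified form of Nmap's normal text output, and the sample text is canned rather than produced by a scan.

```python
import re
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class Application(ABC):
    """Hypothetical base class: one wrapped tool is one reusable application."""
    name = "base"
    version = "0.0.1"

    @abstractmethod
    def build_command(self, params: Dict[str, Any]) -> List[str]:
        """Translate task parameters into the tool's command line."""

    @abstractmethod
    def parse_output(self, raw: str) -> List[Dict[str, Any]]:
        """Translate the tool's raw output into structured results."""


class NmapServiceApp(Application):
    name = "nmap_service_scan"
    version = "0.1.0"
    # Matches lines such as "22/tcp open  ssh  OpenSSH 7.4" (simplified).
    _open_port = re.compile(r"^(\d+)/(tcp|udp)\s+open\s+(\S+)")

    def build_command(self, params):
        ports = ",".join(str(p) for p in params["ports"])
        return ["nmap", "-sV", "-p", ports, params["target"]]

    def parse_output(self, raw):
        results = []
        for line in raw.splitlines():
            match = self._open_port.match(line.strip())
            if match:
                results.append({
                    "port": int(match.group(1)),
                    "proto": match.group(2),
                    "service": match.group(3),
                })
        return results


app = NmapServiceApp()
command = app.build_command({"target": "192.168.1.1", "ports": [22, 80]})
sample_output = """PORT   STATE SERVICE VERSION
22/tcp open  ssh     OpenSSH 7.4
80/tcp open  http    nginx
"""
services = app.parse_output(sample_output)
print(command, services)
```

With a uniform input and output shape like this, a scheduler only needs to know the base class, not the individual tools.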

Core Implementation: Task Slicing and Failover

Two key concepts in traditional distributed task scheduling frameworks are covered here: task slicing, which improves performance, and failover, which improves availability.

Task slicing

Task slicing splits a large task into finer-grained pieces of data that can run in parallel, raising the throughput of the whole system; it is vital to distributed performance. So what exactly is task slicing? An example illustrates it:

Suppose we have two scan targets, two user names, and two passwords, as follows:

  1. target: 192.168.1.1, 192.168.1.2
  2. username: admin, guest
  3. password: 123456, 111111

If the slicing option is set, Sumeru uses a Cartesian product to slice the task. After slicing there are 2 × 2 × 2 = 8 slices in total:

192.168.1.1, admin, 123456
192.168.1.1, admin, 111111
192.168.1.1, guest, 123456
192.168.1.1, guest, 111111
192.168.1.2, admin, 123456
192.168.1.2, admin, 111111
192.168.1.2, guest, 123456
192.168.1.2, guest, 111111
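
The eight slices above can be reproduced with Python's standard `itertools.product`. This is only an illustration of Cartesian-product slicing; the function name `slice_task` and the dictionary layout are assumptions, not Sumeru's implementation.

```python
from itertools import product


def slice_task(params):
    """Return one dict per combination of the parameter values."""
    keys = list(params)
    combos = product(*(params[key] for key in keys))
    return [dict(zip(keys, combo)) for combo in combos]


params = {
    "target": ["192.168.1.1", "192.168.1.2"],
    "username": ["admin", "guest"],
    "password": ["123456", "111111"],
}
slices = slice_task(params)
print(len(slices))
for item in slices:
    print(item["target"], item["username"], item["password"])
```

Because `product` varies the last parameter fastest, the slices come out in the same order as the list above.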

Without slicing, the eight combinations can only be handled as a single unit: they cannot be distributed across execution nodes, and failover cannot be performed at a finer granularity. After the task is sliced, Sumeru generates a task tree that stores task state based on the scheduling and slicing results, and allocates slices to execution nodes using scheduling algorithms such as load balancing.
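
The article does not say which load-balancing algorithm Sumeru uses, so the sketch below assumes the simplest one: give each slice to the node that currently holds the fewest slices. The function name and node names are hypothetical.

```python
import heapq


def assign_least_loaded(slice_ids, nodes):
    """Give each slice to the node that currently holds the fewest slices."""
    heap = [(0, node) for node in nodes]   # (current load, node name)
    heapq.heapify(heap)
    assignment = {node: [] for node in nodes}
    for slice_id in slice_ids:
        load, node = heapq.heappop(heap)   # least-loaded node; ties by name
        assignment[node].append(slice_id)
        heapq.heappush(heap, (load + 1, node))
    return assignment


plan = assign_least_loaded(range(8), ["node-a", "node-b", "node-c"])
print(plan)
```

A real scheduler would weigh actual node load rather than slice counts, but the shape of the decision is the same.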

Failover

Failover means that when a device or service in the system fails and can no longer operate, another device or service automatically takes over the work of the failed one. Sumeru uses it to keep tasks running to completion during task execution.

We designed two situations that trigger failover, as shown in the following figure (red represents an abnormal state):

  1. An exception occurs in the task, including exceptions caught by the task manager and exceptions deliberately raised by the user.
  2. An exception occurs on the execution node.

One practical scenario, adaptive port scanning of internal and external domain names, is described later.

[Figure: the two failover triggers, with abnormal states shown in red]

Sumeru also supports timeouts. If a task exceeds the specified time limit, it is treated as a task exception, which prevents tasks from hanging for unknown reasons.
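
The failover triggers and the timeout rule can be sketched together. This is a simplified, single-process simulation built on the standard `concurrent.futures` module, not Sumeru's scheduler: `run_with_failover`, `NodeDown`, `simulated_scan`, and the node names are all hypothetical, and "nodes" are simulated inside one worker function.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class NodeDown(Exception):
    """Raised by the simulated runner when an execution node is unavailable."""


def run_with_failover(task, payload, nodes, timeout):
    """Try each node in the group in turn until one finishes within `timeout`."""
    errors = []
    pool = ThreadPoolExecutor(max_workers=len(nodes))
    try:
        for node in nodes:
            future = pool.submit(task, node, payload)
            try:
                return node, future.result(timeout=timeout), errors
            except FutureTimeout:
                errors.append((node, "timeout"))   # a timeout counts as a task exception
            except Exception as exc:               # task error or node error
                errors.append((node, type(exc).__name__))
        raise RuntimeError("all nodes failed: %r" % errors)
    finally:
        pool.shutdown(wait=False)                  # do not block on a hung attempt


def simulated_scan(node, payload):
    if node == "node-1":
        raise NodeDown(node)       # simulated node without network access
    if node == "node-2":
        time.sleep(1.0)            # simulated hang, longer than the timeout
    return "scanned %s on %s" % (payload, node)


node, result, errors = run_with_failover(
    simulated_scan, "192.168.1.1", ["node-1", "node-2", "node-3"], timeout=0.2
)
print(node, result, errors)
```

Note that the sketch abandons a timed-out attempt rather than killing it; a real execution node would also need to stop the hung task.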

Daemon mode

Daemon task mode suits scenarios that provide a service to the outside world, such as a risk control rule engine, which is a data-parallel data processing application whose performance clearly benefits from distributed nodes.

Specifically, this is a daemon process (or thread) mode. Users can choose thread or process mode according to the actual scenario, so we refer to both as daemon mode.

Conventional tasks are generally run once or scheduled to run periodically. In some scenarios, however, we want tasks to keep running all the time, either providing a service to the outside (such as a passive scanning proxy or an HTTP service) or processing data in real time (such as a risk control rule engine). Unlike conventional tasks, these tasks need to be kept alive as far as possible. If daemon mode is checked for a task, the dispatching center ensures that exactly one instance of the task is running within the group. If the node fails, the task fails over to another node.

Sumeru also provides a distribution option. If it is checked, exactly one task instance runs on each node in the node group.
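
The single-instance guarantee and the distribution option described above amount to a placement decision that the dispatching center re-evaluates whenever node health changes. The function below is a hypothetical sketch of that decision only; `reconcile` and its parameters are assumptions, and health checking and task start-up are left out.

```python
def reconcile(group_nodes, alive, running, per_node=False):
    """Return the set of nodes that should run the long-lived task.

    group_nodes: ordered node names in the group
    alive:       set of nodes that currently pass the health check
    running:     set of nodes where a task instance is running now
    per_node:    the distribution option: one instance on every live node
    """
    live = [node for node in group_nodes if node in alive]
    if per_node:
        return set(live)
    survivors = [node for node in live if node in running]
    if survivors:
        return {survivors[0]}              # keep exactly one existing instance
    return {live[0]} if live else set()    # fail over to another live node


group = ["node-a", "node-b", "node-c"]
steady = reconcile(group, {"node-a", "node-b", "node-c"}, {"node-a"})
moved = reconcile(group, {"node-b", "node-c"}, {"node-a"})
spread = reconcile(group, {"node-b", "node-c"}, {"node-a"}, per_node=True)
print(steady, moved, spread)
```

In the second call `node-a` has failed, so the single instance moves to the next live node in the group.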

Other features

Other features are listed briefly below:

  1. Scheduler HA based on etcd: to keep the whole dispatching center highly available, we implemented high availability on top of the distributed key-value store etcd.
  2. Concurrency support: tasks can run at thread or process granularity.
  3. Scheduled tasks with second-level granularity, plus support for custom schedule extensions (e.g. @hourly, @weekly).
  4. Application development kit: base class, debugging, deployment, version management.
  5. SDK, RESTful API, and a complete authorization mechanism.
  6. Email notification, data backup, log shipping to Elasticsearch, etc.
  7. Asynchronous implementation.
  8. Python 2/3 compatible.
  9. Execution logs can be viewed in a web console, as shown in the following figure:

[Figure: execution log in the web console]

Other features include task life cycle management, application availability detection, and secure communication, which are not described in detail here.

Examples of application scenarios

Sumeru was designed from the start to solve problems in specific scenarios, so those scenarios are briefly introduced here.

1. Port scanning: improving scan performance with task slicing

Performance improvement: Python's performance is indeed relatively low, but it mostly becomes a bottleneck in compute-intensive scenarios; I/O-intensive scenarios such as security scanning are not much affected. The principle of slicing was introduced above. Here we measure the performance gain from slicing plus distribution with concrete data. Take port scanning as an example: full-port plus service fingerprint scanning of 10,000+ IPs. As shown in the figure, with {1, 3, 6, 9} nodes the scan takes {25220.28, 5386.728, 3076.681, 1624.101} seconds respectively, reducing the time to 6.4% of the single-node figure.
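
The 6.4% figure follows directly from the measured times. The short calculation below uses only the numbers quoted above; the helper name `summarize` is ours.

```python
def summarize(node_counts, seconds):
    """Return (nodes, minutes, percent of baseline, speedup) for each run."""
    baseline = seconds[0]
    rows = []
    for count, elapsed in zip(node_counts, seconds):
        rows.append((count, elapsed / 60, elapsed / baseline * 100, baseline / elapsed))
    return rows


rows = summarize([1, 3, 6, 9], [25220.28, 5386.728, 3076.681, 1624.101])
for count, minutes, percent, speedup in rows:
    print("%d node(s): %.1f min, %.1f%% of baseline, %.1fx faster"
          % (count, minutes, percent, speedup))
```

The single-node run is about 7 hours and the nine-node run about 27 minutes, which matches the "7 hours to 30 minutes" goal stated in the requirements.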

[Figure: scan time with 1, 3, 6, and 9 nodes]

2. Failover: adapting to the network environment

During scanning, if a node has no network connection or its network is unstable, its tasks are transferred to other nodes in the same group and continue there until all tasks can run normally, as shown in the following figure.

[Figure: tasks failing over to other nodes in the same group]

3. Getting new requirements online quickly

Sumeru provides a complete process for rapid application development and launch.

At present, most in-house (Party A) security teams need a fairly comprehensive platform, so the platform is often built as one large, all-in-one toolset, typically including scanning tools (web scanning, passive scanning, host scanning, port scanning, Git leak scanning), threat intelligence, a knowledge base, and so on.
This also raises maintenance costs. Suppose a new requirement suddenly appears: monitoring dark-web transaction information. At first glance it belongs to threat intelligence, but its format, usage, and deployment are completely different from the existing threat intelligence.
The developer may have to grudgingly add a new function under the threat intelligence sub-menu and press on with development. Such requirements are not rare, and in the end the whole platform becomes more and more bloated and difficult to maintain.

With the application development approach, the function can be abstracted into an application: we only need to write the application, distribute it online, and call it remotely. This separates the service from the platform and makes it easier for the team to maintain it collaboratively.

The RESTful API provided by Sumeru makes it easier to integrate applications as services into CI/CD (continuous integration/continuous deployment). Sumeru is implemented in Python and works well with the Python ecosystem. If you have a suitable scenario, you are welcome to get in touch with us.

Summary

Our original intention was to avoid reinventing the wheel and to focus part of our energy on improving professional capabilities. Sumeru has basically realized the idea of distributed security service orchestration. There is still a lot of room for improvement in performance and stability, and we hope it can deliver more value.

Thank you for reading! Colleagues will continue to share related content after this article, so stay tuned.

Author's WeChat ID: lfzark (please state your purpose when adding). Everyone is welcome to get in touch so that we can make progress together.

Turning an idea into reality takes a range of technologies and resources. Polishing security products inside the security department of an in-house (Party A) company is really not easy. Thanks to the experts who have built products comparable to those of vendors (Party B) and given us examples to learn from.

The website of the Yixin Security Emergency Response Center (CESRC) is https://security.creditease.cn. The platform aims to bring together experts, community organizations, and individuals in the security field to jointly discover potential vulnerabilities, protect the security of all Yixin products and businesses, promote direct communication and cooperation between white hats, security teams, security enthusiasts, and Yixin, and reduce potential security risks.

Yixin Institute of Technology