Massive Data Search-Search Engine

  Search engine

In our daily life and work, search sites such as Baidu and Google have become the places we turn to whenever we have a question. As the saying goes, when you have a question, ask Baidu. So how does Baidu find the data we need in a sea of data, and why is the search so fast? We all know the answer is Baidu's search engine. What, then, is a search engine? Some programmers may immediately think of Elasticsearch (ES), but ES by itself is not a search engine; it is only one of the tools used to build one. That said, it is a genuinely useful and efficient tool.

This article covers the basics of search engines and some approaches to Chinese word segmentation, and then builds a small demo of data retrieval, so that you get a preliminary understanding of how a search engine is implemented.

I. Introduction to Search Engines

1. What is a search engine

Quoting the definition from Baidu Encyclopedia: a search engine is a system that collects information from the Internet according to a certain strategy, using specific computer programs, organizes and processes that information, provides retrieval services to users, and displays the relevant results of a user's search.

2. Classification of search engines

Search engines can be classified into full-text indexes, directory indexes, meta search engines, vertical search engines, aggregate search engines, portal search engines, and free link lists.

Here we mainly introduce the full-text index, which is the type of search engine Baidu uses.

Full-text index:

The first step is collecting data into the database. A search engine's automatic information collection works in two ways. One is periodic crawling: at a fixed interval (for Google, traditionally about every 28 days), the search engine dispatches "spider" programs to crawl websites within a certain range of IP addresses. Once a new website is found, the spider automatically extracts the site's information and address and adds them to the engine's database. The other is submitted crawling: the website owner voluntarily submits the site address to the search engine, which then sends a spider to the site within a certain period (anywhere from 2 days to several months), scans the site, and stores the relevant information in the database for users to query.

When a user searches with keywords, the search engine looks them up in its database. If it finds web pages that match the user's query, it uses a special algorithm (usually based on keyword match degree, position, frequency, and link quality in the page) to calculate each page's relevance and ranking, and then returns the page links to the user in order of relevance. This kind of engine is characterized by fast retrieval.

3. What problems can search engines solve

1 > efficient data querying (various algorithms return results in milliseconds, whether the data set holds tens of millions or hundreds of millions of records)

2 > it is relatively easy to switch from an ordinary database to a search engine.

3 > it handles large data volumes, timeliness requirements, high concurrency, and so on.

4. Application scenarios of search engines:

1 > when the database reaches millions of records
2 > when timeliness, high performance, and millisecond-level response are required

Let's look at how search engines are applied across the Internet services we use every day:

Solr

The search engine we are going to talk about today is Solr. So what is Solr, and what are its advantages and disadvantages compared with Elasticsearch? Let's first briefly introduce Solr:

Solr is a full-text search server based on Lucene. It extends Lucene with a richer query language, is configurable and extensible, optimizes query performance, and provides a complete administration interface. It supports an XML/HTTP protocol as well as a JSON API.

It has the following characteristics:

  • Scalability: Solr can distribute indexing and query processing across multiple servers in a cluster.
  • Rapid deployment: Solr is open-source software that is easy to install and configure. It can be used directly with the sample configuration shipped in the installation package, in either stand-alone or cluster mode.
  • Massive data: Solr is designed for processing massive data sets of 100 million records and above, and handles retrieval over such data well.
  • Optimized search: Solr is fast enough to process complex search queries at the millisecond level; a complex query is usually handled in tens of milliseconds.

II. Introduction to Word Segmentation

Next, let's look at how word segmentation is implemented. Why do we need word segmentation at all, and what does it have to do with search engines? How is a phrase or passage typed into the search box broken down into multiple keywords?

Which word segmenters have you heard of? For example, Lucene's own Chinese word segmenter smartcn, or the most commonly used one, the IK word segmenter. Today we will mainly talk about the IK word segmenter.

IK segmenter

The IK word segmenter maintains several dictionaries of commonly used words, such as the main word list main2012.dic, the quantifier table quantifier.dic, and the stop-word list stopword.dic.

Dictionary is the dictionary management class, which loads each of these dictionaries into an in-memory structure. The concrete dictionary code lives in org.wltea.analyzer.dic.DictSegment. This class implements the segmenter's core data structure, the Trie tree.

The Trie tree (dictionary tree) is a fairly simple tree structure. It is used to build dictionaries and to find words quickly by comparing prefix characters one by one, which is why it is also called a prefix tree. A concrete example follows.

For example, I am the Chinese people in Zhongguancun, Haidian District, Beijing

The dictionary entries we have set up are Beijing, Haidian District, Zhongguancun, China, and the Chinese people, so the dictionary tree built from them looks like this:

(Figure: the dictionary tree built from the entries above)
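To make the prefix-matching idea concrete, here is a minimal dictionary-tree sketch in Java. It is only an illustration of matching characters one by one against a dictionary, not IK's actual DictSegment implementation:

import java.util.HashMap;
import java.util.Map;

// A minimal dictionary-tree node; each child edge is one character.
class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    boolean isWord; // true if the path from the root to this node spells a dictionary word
}

public class SimpleTrie {
    private final TrieNode root = new TrieNode();

    // Insert a dictionary word character by character.
    public void insert(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.isWord = true;
    }

    // Return true if text[start, end) is exactly a dictionary word.
    public boolean matchWord(String text, int start, int end) {
        TrieNode node = root;
        for (int i = start; i < end; i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.isWord;
    }

    public static void main(String[] args) {
        SimpleTrie trie = new SimpleTrie();
        // The dictionary entries from the example above.
        for (String w : new String[]{"北京", "海淀区", "中关村", "中国", "中国人民"}) {
            trie.insert(w);
        }
        System.out.println(trie.matchWord("中国人民", 0, 2)); // true: 中国 is in the dictionary
        System.out.println(trie.matchWord("中国人民", 0, 4)); // true: 中国人民 is in the dictionary
        System.out.println(trie.matchWord("中国人民", 0, 3)); // false: 中国人 is not a dictionary word
    }
}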

We then segment text according to this dictionary tree. The IK word segmenter has two basic modes, smart mode and non-smart mode, which can be selected when the segmenter is initialized in code.

Rather than explain the two modes in the abstract, we can simply print the results of each:

Original sentence:

I am the Chinese people in Zhongguancun, Haidian District, Beijing

Smart mode:

Beijing, Haidian District, Zhongguancun, Chinese People

Non-smart mode:

Beijing, Haidian District, Zhongguancun, China, Chinese People

Clearly, non-smart mode outputs every word that can be split out, while smart mode uses internal rules to produce one reasonable segmentation, which involves ambiguity resolution.
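If you want to reproduce the two outputs yourself, the comparison can be run roughly like this with the IK analyzer. The class names are from the open-source IK analyzer package, and the Chinese text is assumed to be the example sentence above (我是北京市海淀区中关村的中国人民); adjust both to whatever you actually use:

import java.io.IOException;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class IkModeDemo {

    // Segment the text with IK: useSmart = true is smart mode, false is non-smart (fine-grained) mode.
    static void segment(String text, boolean useSmart) throws IOException {
        IKSegmenter ik = new IKSegmenter(new StringReader(text), useSmart);
        StringBuilder result = new StringBuilder();
        Lexeme lexeme;
        while ((lexeme = ik.next()) != null) {
            result.append(lexeme.getLexemeText()).append(" / ");
        }
        System.out.println((useSmart ? "smart: " : "non-smart: ") + result);
    }

    public static void main(String[] args) throws IOException {
        // Assumed Chinese original of the example sentence above.
        String text = "我是北京市海淀区中关村的中国人民";
        segment(text, true);   // smart mode
        segment(text, false);  // non-smart mode
    }
}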

Let's take a more representative example: 张三说的确实在理 (Zhang San's statement is indeed reasonable).

Forward matching produces the following possible lexeme chains:

    

L1: {张三, 张, 三}
L2: {说}
L3: {的确, 的, 确实, 确, 实在, 实, 在理, 在, 理}

First, let's look at the most basic element structure class:

public class Lexeme implements Comparable<Lexeme> {
    ……

    // start offset of the lexeme
    private int offset;
    // relative start position of the lexeme
    private int begin;
    // length of the lexeme
    private int length;
    // text of the lexeme
    private String lexemeText;
    // type of the lexeme
    private int lexemeType;
    ……
}

  
A Lexeme here can be understood as a word or a character, and begin is its position in the input text. Note that it is Comparable: an earlier start position has priority, and for the same start position a longer length has priority. This ordering determines where a lexeme sits in the lexeme chain of a segmentation result, and it gives the order of the words in the segmentation results of the example above.

/*
 * Comparison algorithm for lexemes in a sorted collection
 * @see java.lang.Comparable#compareTo(java.lang.Object)
 */
public int compareTo(Lexeme other) {
    // earlier start position first
    if (this.begin < other.getBegin()) {
        return -1;
    } else if (this.begin == other.getBegin()) {
        // longer lexeme first
        if (this.length > other.getLength()) {
            return -1;
        } else if (this.length == other.getLength()) {
            return 0;
        } else { // this.length < other.getLength()
            return 1;
        }
    } else { // this.begin > other.getBegin()
        return 1;
    }
}

Smart mode disambiguation algorithm:

The lexeme position weight (the sum of position index times lexeme length) is compared, and the chain whose longer lexemes appear toward the back gets the higher weight and is selected:

L31: {的, 确实, 在理}   1*1 + 2*2 + 3*2 = 11
L32: {的确, 实, 在理}   1*2 + 2*1 + 3*2 = 10
L33: {的确, 实在, 理}   1*2 + 2*2 + 3*1 = 9

L31 has the highest weight, so it is chosen.

Final word segmentation result: 张三 / 说 / 的 / 确实 / 在理 (Zhang San / said / de / indeed / reasonable)
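The weight calculation above can be reproduced with a small sketch. This is only the position-weight step described here, not IK's full disambiguation logic, which applies several comparison rules in sequence:

public class PositionWeightDemo {

    // Position weight as described above: sum of (1-based position index * lexeme length).
    static int positionWeight(String[] chain) {
        int weight = 0;
        for (int i = 0; i < chain.length; i++) {
            weight += (i + 1) * chain[i].length();
        }
        return weight;
    }

    public static void main(String[] args) {
        String[] l31 = {"的", "确实", "在理"};
        String[] l32 = {"的确", "实", "在理"};
        String[] l33 = {"的确", "实在", "理"};
        System.out.println(positionWeight(l31)); // 1*1 + 2*2 + 3*2 = 11
        System.out.println(positionWeight(l32)); // 1*2 + 2*1 + 3*2 = 10
        System.out.println(positionWeight(l33)); // 1*2 + 2*2 + 3*1 = 9
    }
}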

That's all for word segmentation. You can read the IK word segmenter source code for a deeper understanding. Source code address: https://github.com/quentinxxz …

III. Inverted Index Algorithm

1. Introduction

We can think of an inverted index as the index you use when looking up a dictionary: once we find the entry for the word we need in the index, we can jump to it quickly. In more formal terms:

The inverted index arises from the practical need to look up records by the value of one of their attributes. Each entry in the index table contains an attribute value and the addresses of all records that have that attribute value. Because records are located from attribute values, rather than attribute values being read from records, it is called an inverted index. A file built with an inverted index is called an inverted index file, or inverted file for short.

In an inverted file (inverted index), the indexed objects are the words of a document or document collection. It stores the locations of these words within the document or collection, and it is the most commonly used indexing mechanism for documents and document collections.

The key step for a search engine is building the inverted index, generally represented as a keyword followed by its frequency (number of occurrences) and its locations (which article or web page it appears in, plus related date, author, and other information). This is equivalent to building an index for the hundreds of billions of pages on the Internet, like the table of contents and tabs of a book: a reader who wants a chapter on a given topic can go straight to the relevant pages from the table of contents, instead of scanning the book page by page from first to last.

2. Lucene inverted index principle

Lucene is an open-source, high-performance, Java-based full-text search toolkit. It is not a complete full-text search engine but a full-text search architecture, providing a complete query engine and index engine along with some text-analysis engines. Its aim is to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text retrieval engine on top of it.

Suppose there are two articles, article 1 and article 2:

The content of article 1 is as follows:

Jack lives in BeiJing,I live in BeiJing too.   

The content of article 2 is as follows:

He lived in Taiyuan.

1 > get keywords  

First, we segment the text using the approach described earlier. Since the text is English, we then filter out function words such as "in", "on", and "of", normalize inflected forms (for example, the third-person "lives" and the past-tense "lived" are both restored to "live"), and strip unnecessary punctuation. After this processing, the remaining keywords are:

All keywords of article 1: [Jack] [live] [BeiJing] [i] [live] [BeiJing]
All keywords of article 2: [he] [live] [Taiyuan]
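As a rough illustration of this step (not Lucene's actual analyzer chain), a naive sketch of stop-word filtering plus normalization might look like this; the stop-word list and the lives/lived-to-live rule are simplified assumptions, and this sketch lowercases everything:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NaiveKeywordExtractor {

    // A tiny stop-word list for this example; real analyzers use much larger lists.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("in", "on", "of", "too"));

    // Very naive normalization: lives/lived -> live. Real engines use proper stemmers.
    private static String normalize(String word) {
        if (word.equals("lives") || word.equals("lived")) {
            return "live";
        }
        return word;
    }

    static List<String> extract(String text) {
        List<String> keywords = new ArrayList<>();
        // Strip punctuation, split on whitespace, lowercase everything.
        for (String token : text.replaceAll("[^a-zA-Z ]", " ").split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            String word = token.toLowerCase();
            if (!STOP_WORDS.contains(word)) {
                keywords.add(normalize(word));
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        System.out.println(extract("Jack lives in BeiJing,I live in BeiJing too."));
        // -> [jack, live, beijing, i, live, beijing]
        System.out.println(extract("He lived in Taiyuan."));
        // -> [he, live, taiyuan]
    }
}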

2 > establish inverted index

The index takes the following form:

Keyword     Article no. [frequency]     Positions
BeiJing     1 [2]                       3, 6
he          2 [1]                       1
i           1 [1]                       4
jack        1 [1]                       1
live        1 [2]                       2, 5
live        2 [1]                       2
Taiyuan     2 [1]                       3

This is the core of Lucene's index structure. Notice that the keywords are stored in character (lexicographic) order; Lucene does not use a B-tree structure, so it can locate a keyword quickly with binary search.
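To make the table concrete, here is a toy in-memory version of such an inverted index. It only illustrates the keyword, article number, frequency, and position structure above, not Lucene's real on-disk format:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ToyInvertedIndex {

    // keyword -> (article number -> positions of the keyword in that article)
    private final Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();

    // Add one article's keyword list (already extracted as in step 1).
    void addArticle(int articleNo, List<String> keywords) {
        for (int pos = 0; pos < keywords.size(); pos++) {
            index.computeIfAbsent(keywords.get(pos), k -> new TreeMap<>())
                 .computeIfAbsent(articleNo, k -> new ArrayList<>())
                 .add(pos + 1); // 1-based position, as in the table above
        }
    }

    // Look up a keyword and print "article number [frequency] positions".
    void lookup(String keyword) {
        index.getOrDefault(keyword, new TreeMap<>()).forEach((articleNo, positions) ->
                System.out.println(keyword + " -> article " + articleNo
                        + " [" + positions.size() + "] at " + positions));
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.addArticle(1, Arrays.asList("jack", "live", "beijing", "i", "live", "beijing"));
        idx.addArticle(2, Arrays.asList("he", "live", "taiyuan"));
        idx.lookup("live");    // article 1 [2] at [2, 5]; article 2 [1] at [2]
        idx.lookup("beijing"); // article 1 [2] at [3, 6]
    }
}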

3 > implementation

In the implementation, Lucene stores the three columns above separately as a term dictionary, a frequencies file, and a positions file. The dictionary file contains not only each keyword but also pointers into the frequency and position files, through which the keyword's frequency and position information can be found.

4 > compression algorithm

In order to reduce the size of the index file, Lucene also uses compression technology for the index.

First, the keywords in the dictionary file are compressed: each keyword is stored as the length of the prefix it shares with the preceding keyword plus its remaining suffix, so the shared prefix is not repeated.

Second, numbers are heavily compressed: only the difference from the previous value is stored, which shortens the number and therefore the number of bytes needed to store it. For example, if the current article number is 16389 (3 bytes if stored uncompressed) and the previous article number is 16382, only the difference 7 is stored after compression, which takes a single byte.
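The delta idea above can be sketched as follows; this is a simplified variable-length encoding of the gaps between sorted article numbers, not Lucene's exact file format:

import java.io.ByteArrayOutputStream;

public class DeltaEncodingDemo {

    // Encode one non-negative number as a variable-length byte sequence:
    // 7 data bits per byte, high bit set on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encode a sorted list of article numbers as gaps from the previous value.
    static byte[] encode(int[] sortedArticleNumbers) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int number : sortedArticleNumbers) {
            writeVInt(out, number - previous); // store only the difference
            previous = number;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 16382 followed by 16389: the second number is stored as the gap 7, which fits in one byte.
        byte[] encoded = encode(new int[]{16382, 16389});
        System.out.println("encoded length in bytes: " + encoded.length); // 2 bytes for 16382 + 1 byte for the gap 7
    }
}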

5 > reason for use

Suppose the word "live" is queried. Lucene first finds the word in the dictionary with a binary search, reads out all article numbers through the pointer into the frequency file, and then returns the result. The dictionary is usually very small, so the whole process takes only milliseconds.

By contrast, an ordinary sequential matching algorithm builds no index and performs string matching over the contents of every article. That process is quite slow, and when the number of articles is large the time is often intolerable.

IV. Basic Configuration and Use of Solr

Here we install Solr on Windows.

Download address: http://lucene.apache.org/solr …

After decompression:

(Screenshot: the extracted Solr directory)

Open cmd, change into Solr's bin directory, and run the command solr start. (For convenience, you can add Solr to the environment variables and then use the solr command directly in cmd.)

(Screenshot: console output of solr start)

Seeing this screen means the Solr service started successfully on port 8983. Visit http://localhost:8983 and it will automatically redirect to http://localhost:8983/solr/#/.

(Screenshot: the Solr admin page)

This page displays basic Solr information, Lucene information, Java information, and so on.

Next, one useful Solr command: solr -h shows Solr's basic usage information.

(Screenshot: output of solr -h)

Configure Solr

Configure a core

solr create -c mycore -d basic_configs: the -c parameter specifies the core name, and the -d parameter specifies the configuration directory.

(Screenshot: output of the solr create command)

After executing the command, there will be an additional test directory.


Open the Solr admin page at http://localhost:8983/ and check the cores: there is now one more core, test.


The conf and data directories are located under the solr-6.3.0\server\solr\test directory and hold the configuration and the data respectively.


Add data to core

Open the directory solr-6.3.0\server\solr\test\conf and add a field:

<field name="name" type="string" indexed="true" stored="true" required="true" multiValued="false" />

Then restart solr: solr restart -p 8983


Go to the Solr admin page, select the test core, open Documents, and fill in the data in Document(s):

{
"id":"1",
"name":"宝马"
}

Click Submit; if it returns Status: success, the data was added successfully.


After adding a few more documents, click Query to query the data:

In the query interface, q is the query condition. For example, enter name:"宝马" (BMW) and execute the query again:


You can also issue the query directly as an HTTP GET: http://localhost:8983/solr/test/select?q=name:宝马

(Screenshot: the query result returned in the browser)
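The same add-and-query flow can also be done from code. Below is a small sketch using the SolrJ client, assuming a solr-solrj dependency matching your Solr version is on the classpath and the test core created above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrTestDemo {
    public static void main(String[] args) throws Exception {
        // Point the client at the "test" core created above.
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/test").build();

        // Add a document, equivalent to submitting it on the admin Documents page.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("name", "宝马");
        client.add(doc);
        client.commit();

        // Query by name, equivalent to q=name:宝马 on the admin Query page.
        SolrQuery query = new SolrQuery("name:宝马");
        QueryResponse response = client.query(query);
        response.getResults().forEach(System.out::println);

        client.close();
    }
}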

Yixin Institute of Technology