[case13] NLP System Architecture and Main Process



This article outlines the architecture and main processing steps of an NLP system.

NLP architecture


The architecture diagram comes from [Legislative Science Popularization: A Brief Introduction to the Architecture of Natural Language System].

Main process steps

  • Word segmentation (Tokenization)
  • Part-of-speech tagging (POS Tagging)
  • Semantic chunking (Chunking)
  • Named entity tagging (Named Entity Tagging)

The tasks above are the shallow analysis tasks of NLP, and they are mostly handled as sequence labeling tasks.

  • Syntactic analysis
  • Text/Semantic Analysis

Chinese word segmentation

Unlike English, Chinese does not use spaces to separate words, so a string of Chinese characters must be split into appropriate words before the text can be analyzed.

Main techniques for word segmentation (from clauses to words)

Dictionary-based word segmentation includes the maximum matching method, the shortest path method and the maximum probability method. In practice, the following are used more often:

  • Open source systems for Chinese word segmentation based on Conditional Random Fields (CRF).
  • Open source systems for Chinese word segmentation based on Zhang Huaping's NShort algorithm (described as the core algorithm of Jieba word segmentation).
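Of the dictionary-based methods mentioned above, forward maximum matching is the easiest to show in a few lines. The following is only a minimal sketch, assuming a tiny hand-made dictionary (`TOY_DICT`) and a hypothetical `max_word_len` parameter; it is not the algorithm of any of the open source systems listed.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Segment text greedily, always taking the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as the fallback.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy dictionary, for illustration only.
TOY_DICT = {"研究", "研究生", "生命", "的", "起源"}

print(forward_max_match("研究生命的起源", TOY_DICT))
# ['研究生', '命', '的', '起源']
```

The output shows the typical weakness of greedy matching: the intended reading is 研究 / 生命 / 的 / 起源, but the longest match at the start of the sentence wins. This kind of ambiguity is one reason statistical methods are preferred in practice.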

Word formation (from characters to words) mainly uses methods based on character sequence labeling.
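To make the idea of character sequence labeling concrete, here is a small sketch of the commonly used B/M/E/S scheme (Begin, Middle, End, Single). The tag names and the helper functions `words_to_bmes` and `bmes_to_words` are assumptions for illustration; a real system would predict the tags with a trained model such as a CRF.

```python
def words_to_bmes(words):
    """Turn a segmented sentence into one tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(chars, tags):
    """Rebuild words from characters and their predicted tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


segmented = ["研究", "生命", "的", "起源"]
tags = words_to_bmes(segmented)
print(tags)
# ['B', 'E', 'B', 'E', 'S', 'B', 'E']
print(bmes_to_words("".join(segmented), tags))
# ['研究', '生命', '的', '起源']
```

With this encoding, segmentation becomes the problem of assigning one of four labels to each character, which is exactly the form a sequence labeling model expects.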

Part-of-speech tagging (POS Tagging)

Part of speech, also known as word class, is the grammatical attribute of a word and the bridge that connects vocabulary to syntax.
Part-of-speech tagging (POS tagging) refers to determining the grammatical role each word plays in a sentence.

Most implementations use an HMM (Hidden Markov Model) with the Viterbi algorithm, or a maximum entropy (Maximum Entropy) model. Two Chinese part-of-speech tagsets are popular at present: the Peking University tagset and the University of Pennsylvania (Penn) tagset.
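The HMM plus Viterbi combination can be sketched as follows. This is a toy example under stated assumptions: the three tags (r for pronoun, v for verb, n for noun) and all probabilities are made up by hand for illustration, whereas a real tagger would estimate them from an annotated corpus. The `floor` parameter is a hypothetical guard against taking the logarithm of zero.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-8):
    """Return the most probable state sequence for the observations."""
    def lp(p):
        # Log-probability with a small floor so unseen events do not give log(0).
        return math.log(max(p, floor))

    # best[t][s] = (log-probability of the best path ending in s at step t, previous state)
    first = observations[0]
    best = [{s: (lp(start_p[s]) + lp(emit_p[s].get(first, 0.0)), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + lp(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + lp(emit_p[s].get(obs, 0.0)), prev)
        best.append(column)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


# Hand-made toy model, for illustration only.
STATES = ["r", "v", "n"]
START_P = {"r": 0.6, "v": 0.1, "n": 0.3}
TRANS_P = {
    "r": {"r": 0.1, "v": 0.7, "n": 0.2},
    "v": {"r": 0.2, "v": 0.1, "n": 0.7},
    "n": {"r": 0.1, "v": 0.5, "n": 0.4},
}
EMIT_P = {
    "r": {"我": 0.9, "你": 0.1},
    "v": {"爱": 0.6, "研究": 0.4},
    "n": {"研究": 0.3, "北京": 0.4, "爱": 0.3},
}

print(viterbi(["我", "爱", "研究"], STATES, START_P, TRANS_P, EMIT_P))
# ['r', 'v', 'n']
```

In this toy model 研究 can be a verb or a noun. The transition probabilities make a noun more likely after a verb, so Viterbi picks n even though the emission probability for v is slightly higher.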

Words in modern Chinese can be divided into two types covering 12 parts of speech. Content words: nouns, verbs, adjectives, numerals, quantifiers and pronouns. Function words: adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeia.

Semantic chunking (Chunking)

Based on the syntactic structure, the part-of-speech-tagged words of a sentence are grouped into chunks such as subject, predicate and object.

The most common method for semantic chunking is Conditional Random Fields (CRF).
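A chunker built on a sequence model usually emits one label per word, often in the B-/I-/O style, and the chunks are then read off from those labels. The sketch below shows only this decoding step, not a CRF. The chunk labels NP and VP and the function name `bio_to_chunks` are assumptions for illustration.

```python
def bio_to_chunks(tokens, tags):
    """Group tokens into (label, [tokens]) chunks from B-/I-/O tags."""
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        # A chunk starts on B-, or on an I- whose label differs from the open chunk.
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if tag == "O" or starts_new:
            if current_tokens:  # close the chunk that was open
                chunks.append((current_label, current_tokens))
            current_tokens, current_label = [], None
        if tag != "O":
            if not current_tokens:
                current_label = tag[2:]
            current_tokens.append(token)
    if current_tokens:
        chunks.append((current_label, current_tokens))
    return chunks


tokens = ["我", "爱", "北京", "天安门"]
tags = ["B-NP", "B-VP", "B-NP", "I-NP"]
print(bio_to_chunks(tokens, tags))
# [('NP', ['我']), ('VP', ['爱']), ('NP', ['北京', '天安门'])]
```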

Named entity tagging (Named Entity Tagging)

Named entity recognition identifies entities with specific meaning in text. Common entities include names of people, places, organizations and other proper nouns. More specifically, the named entity recognition task identifies three major categories (entity, time and number) and seven minor categories (person name, organization name, place name, time, date, currency and percentage) of named entities in the text.

The technique used here is a standard HMM with the Viterbi algorithm.
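The category scheme above can be written down as a simple mapping from minor to major categories. The sketch below pairs that mapping with a few regular expressions for the date, percentage and currency categories. Note that this is a rule-based stand-in chosen for brevity, not the HMM approach mentioned above; the patterns, the name `tag_numeric_entities` and the sample sentence are all assumptions for illustration, and names of people, places and organizations would need a trained model.

```python
import re

# Minor category -> major category, following the scheme described above.
MAJOR_OF = {
    "person": "entity",
    "organization": "entity",
    "place": "entity",
    "time": "time",
    "date": "time",
    "currency": "number",
    "percentage": "number",
}

# Deliberately narrow, illustrative patterns.
PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "percentage": re.compile(r"\d+(?:\.\d+)?%"),
    "currency": re.compile(r"[$¥€]\d+(?:,\d{3})*(?:\.\d+)?"),
}


def tag_numeric_entities(text):
    """Return (surface text, minor category, major category) in order of appearance."""
    found = []
    for minor, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), minor, MAJOR_OF[minor]))
    found.sort()  # order by position in the text
    return [(surface, minor, major) for _, surface, minor, major in found]


print(tag_numeric_entities("On 2019-05-20 the price rose 3.5% to $1,200."))
# [('2019-05-20', 'date', 'time'), ('3.5%', 'percentage', 'number'), ('$1,200', 'currency', 'number')]
```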

Syntactic analysis

Syntactic analysis automatically derives the grammatical structure of a sentence according to a given grammar: it identifies the grammatical units contained in the sentence and the relationships between them, and transforms the sentence into a structured syntax tree.

At present, the main approaches to syntactic analysis are:

  • Phrase structure grammar analysis
  • Dependency grammar analysis
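As an illustration of the "structured tree" that syntactic analysis produces, the sketch below stores a dependency analysis as a list of tokens, each pointing to its head word. The parse of 我 爱 北京 is written by hand rather than produced by a parser, and the relation names (root, nsubj, obj), the `Token` class and the helper functions are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int     # 1-based position in the sentence
    form: str      # the word itself
    head: int      # index of the head word; 0 marks the root
    relation: str  # dependency relation to the head


def children_of(tokens, head_index):
    """All tokens whose head is the token at head_index."""
    return [t for t in tokens if t.head == head_index]


def to_bracketed(tokens, head_index=0):
    """Render the subtree below head_index as nested brackets."""
    parts = []
    for t in children_of(tokens, head_index):
        sub = to_bracketed(tokens, t.index)
        suffix = " " + sub if sub else ""
        parts.append(f"({t.relation} {t.form}{suffix})")
    return " ".join(parts)


# Hand-written parse, for illustration only.
TOKENS = [
    Token(1, "我", 2, "nsubj"),
    Token(2, "爱", 0, "root"),
    Token(3, "北京", 2, "obj"),
]

print(to_bracketed(TOKENS))
# (root 爱 (nsubj 我) (obj 北京))
```

A phrase structure analysis of the same sentence would instead nest constituents such as noun phrases and verb phrases, with the words only at the leaves of the tree.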

Text/Semantic Analysis

This mainly includes text similarity analysis, keyword extraction, text classification, content summarization and sentiment analysis.
Semantic analysis also involves techniques such as anaphora resolution. The Naive Bayes algorithm can be used for text classification.
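A Naive Bayes text classifier is short enough to sketch from scratch. The version below is a multinomial model with add-one (Laplace) smoothing, trained on a hand-made sentiment data set of four already-segmented documents. The class name `NaiveBayes`, the labels pos and neg, and the data are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)     # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for w in tokens:
                if w not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hand-made toy training data, already segmented into words.
DOCS = [["好", "喜欢"], ["很", "好"], ["差", "失望"], ["很", "差"]]
LABELS = ["pos", "pos", "neg", "neg"]

model = NaiveBayes().fit(DOCS, LABELS)
print(model.predict(["喜欢", "很", "好"]))
# pos
print(model.predict(["失望"]))
# neg
```

The classifier works on word lists, so it depends on the earlier word segmentation step for Chinese input.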


This article has briefly analyzed the architecture and main process of an NLP system, as a basis for further targeted study.