awk: One of the Linux Three Swordsmen

  awk, linux, Operation and maintenance

Chapter 1: awk Introduction and Expression Examples

  • A language with a strange name
  • Pattern scanning and processing, data processing and report generation.

awk is not only a command on Linux systems but also a programming language. It can be used to process data and generate reports (similar to Excel). The data to be processed may come from one or more files, directly from standard input, or from standard input through a pipe. awk commands can be typed directly on the command line, or written as an awk program file for more complicated applications.
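A minimal sketch of the input sources just described: a file argument, a pipe, and a program stored in a file and run with -f. The sample data and the file names fruit.txt and sum.awk are assumptions made up for illustration; they are not part of the original article.

```shell
# Sample data and file names are illustrative assumptions.
tmpdir=$(mktemp -d)
printf 'apple 3\npear 5\n' > "$tmpdir/fruit.txt"

# 1) Program on the command line, input taken from a file
from_file=$(awk '{print $1}' "$tmpdir/fruit.txt")

# 2) Program on the command line, input taken from a pipe (standard input)
from_pipe=$(cat "$tmpdir/fruit.txt" | awk '{print $2}')

# 3) Program stored in a file and run with -f
printf '{ total += $2 }\nEND { print "total:", total }\n' > "$tmpdir/sum.awk"
from_prog=$(awk -f "$tmpdir/sum.awk" "$tmpdir/fruit.txt")

printf '%s\n' "$from_file" "$from_pipe" "$from_prog"
rm -rf "$tmpdir"
```

The three results are the first column, the second column, and the sum of the second column.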

sed is a stream editor: it processes text as a stream, like a stream of water.

I. Introduction to the awk Environment

The awk involved in this article is gawk, that is, the GNU version of awk.

[root@creditease awk]# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
[root@creditease awk]# uname -r
3.10.0-862.el7.x86_64
[root@creditease awk]# ll `which awk`
lrwxrwxrwx. 1 root root 4 Nov  7 14:47 /usr/bin/awk -> gawk 
[root@creditease awk]# awk --version
GNU Awk 4.0.2

II. Format of awk

awk instructions consist of a pattern, an action, or a combination of a pattern and an action.

  • A pattern can be understood much like pattern matching in sed. It can be an expression or a regular expression between two forward slashes. For example, NR==1 is a pattern; it can be understood as a condition.
  • An action consists of one or more statements in braces, separated by semicolons. The general form of a rule is pattern {action}.

III. Records and Fields

Name     Meaning
record   A record; by default, one line
field    A field, also called a region or column

1) NF (number of fields) is the number of fields (columns) in the current line; $NF is the last field.

2) The $ symbol selects a field (column): $1, $2, $NF.

3) NR (number of records) is the line number. awk keeps the number of the current record in the built-in variable NR, and NR increases by 1 automatically each time a record is read.

4) FS (set with -F) is the field separator: it defines what splits a line into multiple columns.
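As a small sketch of these four items together, the following prints NR, NF, the first field and the last field for each line. The colon-separated input is an assumption made up for illustration.

```shell
# Illustrative input (not from the article): fields separated by ":".
out=$(printf 'a:b:c\nd:e\n' | awk -F ':' '{print NR, NF, $1, $NF}')
printf '%s\n' "$out"
```

The first line has three fields and the second has two, so $NF refers to a different column on each line.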

3.1 Specify Separator

[root@creditease awk]# awk -F "#" '{print $NF}' awk.txt 
GKL$123
GKL$213
GKL$321
[root@creditease awk]# awk -F '[#$]' '{print $NF}' awk.txt 
123
213
321

3.2 Condition and Action

[root@creditease awk]# cat awk.txt 
ABC#DEF#GHI#GKL$123
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
[root@creditease awk]# awk -F "#" 'NR==1{print $1}' awk.txt
ABC

3.3 Condition Only

 [root@creditease awk]# awk -F "#" 'NR==1' awk.txt
ABC#DEF#GHI#GKL$123

The default action is {print $0}

3.4 Action Only

[root@creditease awk]# awk -F "#" '{print $1}' awk.txt
ABC
BAC
CBA

All rows are processed by default

3.5 Multiple Patterns and Actions

[root@creditease awk]# awk -F "#" 'NR==1{print $NF}NR==3{print $NF}' awk.txt 
GKL$123
GKL$321

3.6 Understanding $0

$0 in awk represents the entire line.

[root@creditease awk]# awk '{print $0}' awk_space.txt
ABC DEF GHI GKL$123
BAC DEF GHI GKL$213
CBA DEF GHI GKL$321

3.7 FNR

FNR is similar to NR, but it does not keep increasing across multiple files: it restarts from 1 for each file (processing multiple files is discussed later).

[root@creditease awk]# awk '{print NR}' awk.txt awk_space.txt 
1
2
3
4
5
6
[root@creditease awk]# awk '{print FNR}' awk.txt awk_space.txt 
1
2
3
1
2
3

IV. Regular Expressions and Operators

awk, like sed, can filter the input text through pattern matching.
awk also supports a large number of regular expression patterns, most of which are similar to the metacharacters supported by sed. Regular expressions are an essential tool for mastering the three swordsmen (grep, sed and awk).

Regular Expression Metacharacters Supported by awk

Older versions of awk do not support the following interval metacharacters by default; an option must be added to enable them. (gawk 4.0 and later, including the 4.0.2 used here, enable them by default.)

Metacharacter  Function                                        Example
x{m}           x repeats exactly m times                       /cool{5}/
x{m,}          x repeats at least m times                      /(cool){2,}/
x{m,n}         x repeats at least m times but at most n times  /(cool){5,6}/

Explanation: note the difference between cool with and without parentheses. Without parentheses, x is only the single preceding character, so /cool{5}/ matches coo followed by five l characters, i.e. coolllll. With parentheses, /(cool){2,}/ matches coolcool, coolcoolcool, and so on. Where these forms are not enabled by default, the option --posix or --re-interval must be specified; without it the pattern cannot be used.

When a regular expression is used as a pattern, the default is to look for a match anywhere in the line; if there is a match, the action is executed. Sometimes, however, only a specific column should be matched against the regular expression.

For example:

To find the lines of /etc/passwd whose fifth column ($5) matches the string mail, the two matching operators below are needed. awk has only these two operators for matching against a regular expression.

Regular matching operators
~    Matches a record or field against a regular expression.
!~   The opposite of ~: true when the record or field does not match.
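A minimal sketch of the /etc/passwd case mentioned above, contrasting a whole-line match with a match restricted to $5. To stay self-contained it uses three passwd-style sample lines that are assumptions for illustration; it does not read the real /etc/passwd.

```shell
# passwd-style sample lines (illustrative, not read from the real /etc/passwd)
sample='root:x:0:0:root:/root:/bin/bash
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
mailnull:x:47:47::/var/spool/mqueue:/sbin/nologin'

# Whole-line match: any line containing "mail" anywhere
whole=$(printf '%s\n' "$sample" | awk -F ':' '/mail/{print $1}')

# Field match: only lines whose fifth field contains "mail"
field5=$(printf '%s\n' "$sample" | awk -F ':' '$5 ~ /mail/{print $1}')

# Negated field match: lines whose fifth field does not contain "mail"
not5=$(printf '%s\n' "$sample" | awk -F ':' '$5 !~ /mail/{print $1}')

printf '%s\n' "$whole" "---" "$field5" "---" "$not5"
```

The mailnull line matches /mail/ as a whole line, but its fifth field is empty, so only the ~ form on $5 excludes it.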

4.1 Regular Expression Examples

1) Display the GHI column in awk.txt

[root@creditease awk]# cat awk.txt 
ABC#DEF#GHI#GKL$123
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
[root@creditease awk]# awk -F "#" '{print $3}' awk.txt 
GHI
GHI
GHI
[root@creditease awk]# awk -F "#" '{print $(NF-1)}' awk.txt 
GHI
GHI
GHI

2) Display the row containing 321

[root@creditease awk]# awk '/321/{print $0}' awk.txt 
CBA#DEF#GHI#GKL$321

3) Using # as the separator, display the lines whose first column begins with B or whose last column ends with 1

[root@creditease awk]# awk -F "#" '$1~/^B/{print $0}$NF~/1$/{print $0}' awk.txt 
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321

4) Use # as the delimiter to display the row with the first column beginning with B or C.

[root@creditease awk]# awk -F "#" '$1~/^B|^C/{print $0}' awk.txt 
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
[root@creditease awk]# awk -F "#" '$1~/^[BC]/{print $0}' awk.txt 
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
[root@creditease awk]# awk -F "#" '$1~/^(B|C)/{print $0}' awk.txt 
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
[root@creditease awk]# awk -F "#" '$1!~/^A/{print $0}' awk.txt 
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321

V. Comparison Expressions

awk is a programming language and can make more complicated judgments. When a condition is true, awk executes the associated action. Typically the judgment is made on a particular field, for example printing only the rows whose score is above 80, which requires a comparison on that field.

The following table lists the relational operators that awk supports; they can be used to compare numbers and strings. When the expression is true its value is 1, otherwise it is 0. awk executes the associated action only if the expression is true.

Relational operators supported by awk

Operator  Meaning                   Example
<         Less than                 x<y
<=        Less than or equal to     x<=y
==        Equal to                  x==y
!=        Not equal to              x!=y
>=        Greater than or equal to  x>=y
>         Greater than              x>y

5.1 Examples of Comparison Expressions
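As a first sketch, the score case mentioned earlier (print only the rows above 80) might look like this. The names and scores are hypothetical sample data, not taken from the article.

```shell
# Hypothetical score list: name and score separated by a space.
scores='tom 92
amy 78
bob 80'

# Print only the rows whose second field is greater than 80
high=$(printf '%s\n' "$scores" | awk '$2 > 80 {print $1, $2}')

# A comparison is itself an expression whose value is 1 (true) or 0 (false)
truth=$(awk 'BEGIN{t = (3 > 2); f = (3 < 2); print t, f}')

printf '%s\n' "$high" "$truth"
```

Note that bob's score of exactly 80 is not printed, because > is strict; >= would include it.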

Show lines 2 and 3 of awk.txt

This can be done by comparing NR, or with a range pattern of the form /start/,/end/.

[root@creditease awk]# awk 'NR==2{print $0}NR==3{print $0}' awk.txt 
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
[root@creditease awk]# awk 'NR>=2{print $0}' awk.txt 
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
[root@creditease awk]# awk '/BAC/,/CBA/{print $0}' awk.txt 
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321

Chapter 2 awk Modules, Variables and Execution

A complete awk program has the following structure: an optional BEGIN module, one or more pattern {action} modules, and an optional END module.

I. BEGIN module

The BEGIN module is executed before awk reads any file. It is often used to set the values of built-in variables such as ORS, RS, FS and OFS. A BEGIN module can run without any input file.

II. awk Built-in Variables (Predefined Variables)

Variable      Meaning
$0            The current record, one full line
$1,$2,...$n   The n-th field of the current record; fields are separated by FS
FS            Input field separator (field separator); defaults to a space
NF            Number of fields in the current record, i.e. the number of columns (number of fields)
NR            Number of records read so far, i.e. the line number, starting from 1 (number of records)
RS            Input record separator (record separator); defaults to a newline
OFS           Output field separator (output field separator); defaults to a space
FNR           Record number within the current file; restarts for each file
FILENAME      Name of the file currently being processed

Special note: FS and RS support regular expressions.
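A short sketch of how FS and OFS interact, assuming one sample line in the same #-separated layout as awk.txt: FS controls how the input line is split, and OFS is what print places between its comma-separated arguments.

```shell
# Sample line reuses the "#"-separated layout of awk.txt from the article.
out=$(printf 'ABC#DEF#GHI\n' | awk 'BEGIN{FS="#"; OFS="-"} {print $1, $3; print NF}')
printf '%s\n' "$out"
```

Setting both variables in BEGIN ensures they are already in effect when the first line is read.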

2.1 First Role: Define Built-in Variables

[root@creditease awk]# awk 'BEGIN{RS="#"}{print $0}' awk.txt 
ABC
DEF
GHI
GKL$123
BAC
DEF
GHI
GKL$213
CBA
DEF
GHI
GKL$321

2.2 Second Role: Print a Marker

[root@creditease awk]# awk 'BEGIN{print "=======start======"}{print $0}' awk.txt 
=======start======
ABC#DEF#GHI#GKL$123
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321

2.3 awk Implements Computing Function

 [root@creditease files]# awk 'BEGIN{a=8;b=90;print a+b,a-c,a/b,a%b}'
98 8 0.0888889 8

III. END module

The END module is executed after awk has read all the input files. It is generally used to output a final result (an accumulated value, the contents of an array). Like the BEGIN module, it can also print a closing marker.

3.1 First Role: Print a Marker

[root@creditease awk]# awk 'BEGIN{print "=======start======"}{print $0}END{print "=======end======"}' awk.txt
=======start======
ABC#DEF#GHI#GKL$123
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
=======end======

3.2 Second Role: Accumulation

1) Count blank lines (in the /etc/services file)

Using grep, sed and awk:

[root@creditease awk]# grep "^$" /etc/services  |wc -l
17
[root@creditease awk]# sed -n '/^$/p' /etc/services |wc -l
17
[root@creditease awk]# awk '/^$/' /etc/services |wc -l
17
[root@creditease awk]# awk '/^$/{i=i+1}END{print i}' /etc/services
17

2) Arithmetic

1+2+3+...+100=5050. How can this be computed with awk?

[root@creditease awk]# seq 100|awk '{i=i+$0}END{print i}'
5050

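IV. Summary of awk Structure

1. A program normally has one BEGIN module and one END module. awk does accept several BEGIN or END modules and runs them in the order they are written, but one of each is the usual, clearer form.

2. There can be multiple pattern {action} modules: which lines to find, and what to do with them.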

V. Summary of the awk Execution Process

awk execution process:

1. Command-line assignments (-F or -v)

2. Execute the contents of the BEGIN module

3. Start reading the files

4. Judge whether the condition (pattern) is true

  • If it is true, the corresponding action is executed.
  • Read the next line and repeat the judgment.
  • Continue until the end of the last file has been read.

5. Finally, execute the contents of the END module
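The order above can be observed with a small sketch. The variable name tag and the two-line input are assumptions for illustration.

```shell
# tag is a hypothetical variable set on the command line with -v.
out=$(printf 'x\ny\n' | awk -v tag="demo" '
BEGIN { print "BEGIN sees tag=" tag ", NR=" NR }
      { print "line " NR ": " $0 }
END   { print "END after " NR " lines" }')
printf '%s\n' "$out"
```

The command-line assignment is already visible in BEGIN, NR is still 0 there because nothing has been read, and END runs once after the last line.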

Chapter 3: awk Arrays and Syntax

I. awk Arrays

1.1 Array Structure

people["police"]=110

people["doctor"]=120

[root@creditease awk]# awk 'BEGIN{word[0]="credit";word[1]="easy";print word[0],word[1]}'
credit easy
[root@creditease awk]# awk 'BEGIN{word[0]="credit";word[1]="easy";for(i in word)print word[i]}'
credit
easy

1.2 Array Classification

Indexed array: subscripted by numbers
Associative array: subscripted by strings

1.3 awk Associative Arrays

The text below has the following format: random letters on the left and random numbers on the right. Add together the numbers that follow the same letter and output the totals in alphabetical order.

a  1
b  3
c  2
d  7
b  5
a  3 
g  2
f  6

Using $1 as the subscript, build an array with a[$1]=a[$1]+$2 (equivalently a[$1]+=$2), then output the results with an END module and a for loop:

[root@creditease awk]# awk '{a[$1]=a[$1]+$2}END{for(i in a)print i,a[i]}' jia.txt 
a 4
b 8
c 2
d 7
f 6
g 2
Note: the for(i in a) loop does not visit elements in the order of the input text. To sort the result, pipe the command's output to sort.
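Because the traversal order of for(i in a) is not guaranteed, a sketch that relies on alphabetical output can pipe the result through sort. The data below is the same as the sample text above, supplied inline so the snippet is self-contained.

```shell
# Same letter/number pairs as the sample text above.
data='a 1
b 3
c 2
d 7
b 5
a 3
g 2
f 6'

# Sum per letter, then sort the lines so the order no longer depends on awk
out=$(printf '%s\n' "$data" | awk '{a[$1]+=$2}END{for(i in a)print i,a[i]}' | sort)
printf '%s\n' "$out"
```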

1.4 awk Indexed Arrays

Arrays indexed by numbers.
seq generates the numbers 1 to 10; display only the odd-numbered lines.

[root@creditease awk]# seq 10|awk '{a[NR]=$0}END{for(i=1;i<=NR;i+=2){print a[i]}}'
1
3
5
7
9

seq generates the numbers 1 to 10; do not display the last 3 lines.

[root@creditease awk]# seq 10|awk '{a[NR]=$0}END{for(i=1;i<=NR-3;i++){print a[i]}}'
1
2
3
4
5
6
7
Analysis: just change the range of i. This is mostly used to omit the last few lines of a file.

1.5 awk Array Deduplication in Practice

a++ and ++a

[root@creditease awk]# awk 'BEGIN{print a++}'
0
[root@creditease awk]# awk 'BEGIN{print ++a}'
1
[root@creditease awk]# awk 'BEGIN{a=1;b=a++;print a,b}'
2 1
[root@creditease awk]# awk 'BEGIN{a=1;b=++a;print a,b}'
2 2

Note:

Both forms increase a by 1.

b=a++ first assigns the value of a to b, then adds 1 to a.

b=++a first adds 1 to a, then assigns the value of a to b.

Deduplicate the following text on the second column:

[root@creditease awk]# cat qc.txt 
2018/10/20   xiaoli     13373305025
2018/10/25   xiaowang   17712215986
2018/11/01   xiaoliu    18615517895 
2018/11/12   xiaoli     13373305025
2018/11/19   xiaozhao   15512013263
2018/11/26   xiaoliu    18615517895
2018/12/01   xiaoma     16965564525
2018/12/09   xiaowang   17712215986
2018/11/24   xiaozhao   15512013263
Solution 1:
[root@creditease awk]# awk '!a[$2]++' qc.txt 
2018/10/20   xiaoli     13373305025
2018/10/25   xiaowang   17712215986
2018/11/01   xiaoliu    18615517895 
2018/11/19   xiaozhao   15512013263
2018/12/01   xiaoma     16965564525
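Analysis:
!a[$2]++ is the pattern (condition). The command can also be written as awk '!a[$2]{print $0}{a[$2]=a[$2]+1}' qc.txt
In a[$2]++ the "++" comes after the operand, so the value is taken first and then 1 is added.
The first time a name appears, a[$2] is empty (0), so !a[$2] is true and the line is printed; a[$2] then becomes 1, so later lines with the same name are not printed.
Note: with this method the result keeps the first occurrence of each line, counting from the beginning of the text.
Solution 2: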
[root@creditease awk]# awk '++a[$2]==1' qc.txt 
2018/10/20   xiaoli     13373305025
2018/10/25   xiaowang   17712215986
2018/11/01   xiaoliu    18615517895 
2018/11/19   xiaozhao   15512013263
2018/12/01   xiaoma     16965564525
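Analysis:
++a[$2]==1 is the pattern (condition). It can also be written as (a[$2]=a[$2]+1)==1, that is, the line is printed only when the result of a[$2]+1 is 1.
In ++a[$2] the "++" comes before the operand, so 1 is added first and the value is taken afterwards.
++a[$2]==1 therefore adds 1 first, then compares the new value of a[$2] with 1.
Note: with this method the result keeps the first occurrence of each line, counting from the beginning of the text.
Solution 3: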
[root@creditease awk]# awk '{a[$2]=$0}END{for(i in a){print a[i]}}' qc.txt
2018/11/12   xiaoli     13373305025
2018/11/26   xiaoliu    18615517895
2018/12/01   xiaoma     16965564525
2018/12/09   xiaowang   17712215986
2018/11/24   xiaozhao   15512013263

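Analysis:
Note: with this method the result keeps the last occurrence of each line, that is, the non-duplicate lines counting from the end of the text.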

1.6 awk Handling Multiple Files (Arrays, NR, FNR)

Use awk to take the first column of file1.txt and the second column of file2.txt, then redirect the result to a new file, new.txt.

[root@creditease awk]# cat file1.txt 
a b
c d
e f
g h
i j
[root@creditease awk]# cat file2.txt 
1 2
3 4
5 6
7 8
9 10
[root@creditease awk]# awk 'NR==FNR{a[FNR]=$1}NR!=FNR{print a[FNR],$2}' file1.txt file2.txt 
a 2
c 4
e 6
g 8
i 10
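Analysis: NR==FNR is true while the first file is being processed; NR!=FNR is true while the second file is being processed.
Note: when the two files have different numbers of lines, put the file with more lines first and the file with fewer lines second.
Write the output to a new file, new.txt: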
[root@creditease awk]# awk 'NR==FNR{a[FNR]=$1}NR!=FNR{print a[FNR],$2>"new.txt"}' file1.txt file2.txt 
[root@creditease awk]# cat new.txt 
a 2
c 4
e 6
g 8
i 10
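A common variant of the same idea, sketched here under the assumption of two small temporary files with the same layout as file1.txt and file2.txt: instead of testing NR!=FNR in a second pattern, the first rule ends with next, so the second rule only ever sees lines of the second file.

```shell
# Temporary copies with the same layout as file1.txt and file2.txt (shortened).
tmpdir=$(mktemp -d)
printf 'a b\nc d\ne f\n' > "$tmpdir/file1.txt"
printf '1 2\n3 4\n5 6\n' > "$tmpdir/file2.txt"

# First file: remember column 1 by line number, then skip to the next line.
# Second file: print the remembered value together with column 2.
out=$(awk 'NR==FNR{a[FNR]=$1; next} {print a[FNR], $2}' \
      "$tmpdir/file1.txt" "$tmpdir/file2.txt")
printf '%s\n' "$out"
rm -rf "$tmpdir"
```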

1.7 awk Analyzes a Log File and Counts Visits per Website

[root@creditease awk]# cat url.txt 
http://www.baidu.com
http://mp4.video.cn
http://www.qq.com
http://www.listeneasy.com
http://mp3.music.com
http://www.qq.com
http://www.qq.com
http://www.listeneasy.com
http://www.listeneasy.com
http://mp4.video.cn
http://mp3.music.com
http://www.baidu.com
http://www.baidu.com
http://www.baidu.com
http://www.baidu.com
[root@creditease awk]# awk -F "[/]+" '{h[$2]++}END{for(i in h) print i,h[i]}' url.txt
www.qq.com 3
www.baidu.com 5
mp4.video.cn 2
mp3.music.com 2
www.listeneasy.com 3

II. awk Simple Syntax

2.1 The sub and gsub Functions

Replacement functions

Format: sub(r, s, target) and gsub(r, s, target); target is optional and defaults to $0.

[root@creditease awk]# cat sub.txt 
ABC DEF AHI GKL$123
BAC DEF AHI GKL$213
CBA DEF GHI GKL$321
[root@creditease awk]# awk '{sub(/A/,"a");print $0}' sub.txt 
aBC DEF AHI GKL$123
BaC DEF AHI GKL$213
CBa DEF GHI GKL$321
[root@creditease awk]# awk '{gsub(/A/,"a");print $0}' sub.txt 
aBC DEF aHI GKL$123
BaC DEF aHI GKL$213
CBa DEF GHI GKL$321
Note: sub replaces only the first match in the line, like sed 's###'.
      gsub replaces every match in the line, like sed 's###g'.
[root@creditease awk]# awk '{sub(/A/,"a",$1);print $0}' sub.txt 
aBC DEF AHI GKL$123
BaC DEF AHI GKL$213
CBa DEF GHI GKL$321

Exercise:

0001|20081223efskjfdj|EREADFASDLKJCV
0002|20081208djfksdaa|JDKFJALSDJFsddf
0003|20081208efskjfdj|EREADFASDLKJCV
0004|20081211djfksdaa1234|JDKFJALSDJFsddf
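Using '|' as the separator, remove the digits that precede the letters in the second field and leave everything else unchanged. The output should be: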
0001|efskjfdj|EREADFASDLKJCV
0002|djfksdaa|JDKFJALSDJFsddf
0003|efskjfdj|EREADFASDLKJCV
0004|djfksdaa1234|JDKFJALSDJFsddf

Solution:
awk -F '|'  'BEGIN{OFS="|"}{sub(/[0-9]+/,"",$2);print $0}' sub_hm.txt
awk -F '|'  -v OFS="|" '{sub(/[0-9]+/,"",$2);print $0}' sub_hm.txt
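Both solutions set OFS to "|". A short sketch of why, using the first line of the exercise: when a field is modified, awk rebuilds $0 by joining the fields with OFS, which is a space unless it has been changed.

```shell
line='0001|20081223efskjfdj|EREADFASDLKJCV'

# Without OFS: the rebuilt line is joined with the default OFS, a single space
no_ofs=$(printf '%s\n' "$line" | awk -F '|' '{sub(/[0-9]+/,"",$2); print $0}')

# With OFS="|": the rebuilt line keeps the original separator
with_ofs=$(printf '%s\n' "$line" | awk -F '|' -v OFS='|' '{sub(/[0-9]+/,"",$2); print $0}')

printf '%s\n' "$no_ofs" "$with_ofs"
```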

2.2 Usage of if and else

Contents:

AA
BC
AA
CB
CC
AA

Results:

AA YES
BC NO YES
AA YES
CB NO YES
CC NO YES
AA YES

1) [root@creditease awk]# awk '{if($0~/AA/){print $0" YES"}else{print $0" NO YES"}}' ifelse.txt 
AA YES
BC NO YES
AA YES
CB NO YES
CC NO YES
AA YES
Analysis: using if and else: if $0 matches AA, print $0 followed by " YES"; otherwise print $0 followed by " NO YES".
2)[root@creditease awk]# awk '$0~/AA/{print $0" YES"}$0!~/AA/{print $0" NO YES"}' ifelse.txt 
AA YES
BC NO YES
AA YES
CB NO YES
CC NO YES
AA YES
Analysis: using regular expression matching: when $0 matches AA, print YES; otherwise print "NO YES".

2.3 Usage of next

The same task as above, implemented with next.

next: stops processing the current line and skips all the code after it.

 [root@creditease awk]# awk '$0~/AA/{print $0" YES";next}{print $0" NO YES"}' ifelse.txt 
AA YES
BC NO YES
AA YES
CB NO YES
CC NO YES
AA YES
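Analysis:
{print $0" NO YES"}: this action has no pattern, so it runs by default. When the preceding $0~/AA/ matches, {print $0" YES";next} is executed.
Because that action contains next, the actions after it are skipped.
If the line matches $0~/AA/, YES is printed and, once next is reached, the later actions are not executed; if the line does not match $0~/AA/, the action after the rule containing next is executed.
In short: if the rule containing next runs (the pattern matches), the later rules do not run; if it does not run (the pattern does not match), the later rules do.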

2.4 printf (Output Without a Newline) and next

printf: does not add a line break after printing

In the following text, if nothing follows Description: on its line, merge the contents of the next line into that line.

Packages: Hello-1
Owner: me me me me
Other: who care?
Description:
Hello world!
Other2: don't care
Desired result:
Packages: Hello-1
Owner: me me me me
Other: who care?
Description: Hello world!
Other2: don't care
1)[root@creditease awk]# awk '/^Desc.*:$/{printf $0}!/Desc.*:$/{print $0}' printf.txt 
Packages: Hello-1
Owner: me me me me
Other: who care?
Description:Hello world!
Other2: don't care
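Analysis: using regular expression matching: lines that match /^Desc.*:$/ are printed with printf (no line break); lines that do not match are printed in full.
2) Using if and else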
[root@creditease awk]# awk '{if(/Des.*:$/){printf $0}else{print $0}}' printf.txt 
Packages: Hello-1
Owner: me me me me
Other: who care?
Description:Hello world!
Other2: don't care
3) Using next
[root@creditease awk]# awk '/Desc.*:$/{printf $0;next}{print $0}' printf.txt 
Packages: Hello-1
Owner: me me me me
Other: who care?
Description:Hello world!
Other2: don't care
Note: this can be shortened to awk '/Desc.*:$/{printf $0;next}1' printf.txt  ## 1 is the pattern; the default action is {print $0}
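The commands above pass $0 itself as the printf format, which works here but would misbehave if the line contained a % character. A slightly more defensive sketch gives printf an explicit format; using "%s " also yields the space after the colon shown in the desired result. The input is a shortened, illustrative copy of the sample text.

```shell
# Shortened copy of the sample text, supplied inline.
text='Owner: me me me me
Description:
Hello world!
Other: who care?'

# Print a bare "Description:" line without a newline, followed by one space
out=$(printf '%s\n' "$text" | awk '/^Description:$/{printf "%s ", $0; next} {print $0}')
printf '%s\n' "$out"
```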

2.5 Deduplicate, Count, and Redirect to Specified Files

The text is as follows. Count how many times each item occurs, then write the items that occur more than 2 times to gt2.txt and those that occur 2 times or fewer to le2.txt.

[root@creditease files]# cat qcjs.txt 
aaa
bbb
ccc
aaa
ddd
bbb
rrr
ttt
ccc
eee
ddd
rrr
bbb
rrr
bbb
[root@creditease awk]# awk '{a[$1]++}END{for(i in a){if(a[i]>2){print i,a[i]>"gt2.txt"}else{print i,a[i]>"le2.txt"}}}' qcjs.txt 
[root@creditease awk]# cat gt2.txt 
rrr 3
bbb 4
[root@creditease awk]# cat le2.txt 
aaa 2
ccc 2
eee 1
ttt 1
ddd 2
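Analysis: inside the braces, the output of print can be redirected directly to a new file; the file name must be enclosed in double quotes. For example: {print $1 >"xin.txt"}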

III. awk Points to Note

a) NR==FNR  ## cannot be written as NR=FNR (= means assignment in awk)

b) NR!=FNR  ## NR is not equal to FNR

c) {a=1;a[NR]} reports an error: a variable and an array cannot share the same name in one command

d) printf does not add a line break to its output

e) Inside the braces, the output of print can be redirected directly to a new file, and the file name must be enclosed in double quotation marks. For example: {print $1 >"xin.txt"}

f) When the pattern (condition) is 0, the action after it is not executed; when it is !0 (non-zero), the action is executed.
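The last point can be checked with a small sketch on a two-line input made up for illustration: a pattern that evaluates to 0 prints nothing, and any non-zero pattern triggers the default action {print $0}.

```shell
# Constant patterns: 0 is false, 1 and !0 are true
zero=$(printf 'a\nb\n' | awk '0')
one=$(printf 'a\nb\n' | awk '1')
notzero=$(printf 'a\nb\n' | awk '!0')

# An arithmetic pattern: NR % 2 is 1 on odd lines and 0 on even lines
odd=$(printf 'a\nb\n' | awk 'NR % 2')

printf '[%s] [%s] [%s] [%s]\n' "$zero" "$one" "$notzero" "$odd"
```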

Author: Qin Wei

Source: Yixin Institute of Technology