Chapter 1 awk Introduction and Expression Examples
- A language with strange names
- Pattern scanning and processing, data processing and report generation.
awk is not only a command on Linux systems but also a programming language. It is used to process data and generate reports (the kind of tabular summaries one might otherwise build in Excel). The input can be one or more files, standard input, or the output of another command through a pipe. awk can be run as a one-line program directly on the command line, or written as an awk script for more complicated applications.
For comparison, sed is a stream editor: it processes text as a stream, line by line.
I. The awk Environment
The awk involved in this article is gawk, that is, the GNU version of awk.
[root@creditease awk]# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
[root@creditease awk]# uname -r
3.10.0-862.el7.x86_64
[root@creditease awk]# ll `which awk`
lrwxrwxrwx. 1 root root 4 Nov 7 14:47 /usr/bin/awk -> gawk
[root@creditease awk]# awk --version
GNU Awk 4.0.2
II. Format of awk
An awk instruction consists of a pattern, an action, or a combination of a pattern and an action.
- A pattern can be understood much like pattern matching in sed: it can be an expression or a regular expression between two forward slashes. For example, NR==1 is a pattern, and can be read as a condition.
- An action consists of one or more statements inside braces, separated by semicolons. The general format is: awk 'pattern{action}' file
III. Records and Fields
Name | Meaning |
---|---|
record | A record; by default, one line |
field | A field (also called a region or column) |
1) NF (number of fields) is the number of fields (columns) in the current record; $NF is the last field.
2) The $ symbol selects a field: $1, $2, $NF.
3) NR (number of records) is the record (line) number. awk keeps it in the built-in variable NR, which is incremented by 1 after each record is read.
4) FS (set with -F) is the field separator: it determines how a line is split into columns.
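The four items above can be seen together in one command. This is a minimal sketch; the two input lines are made up for illustration and only imitate the '#'-separated layout of awk.txt.

```shell
# Hypothetical input: two '#'-separated lines with different field counts.
out=$(printf 'ABC#DEF#GHI\nBAC#DEF\n' |
  awk -F "#" '{print "line " NR ": " NF " fields, first=" $1 ", last=" $NF}')
printf '%s\n' "$out"
```

NR counts the lines, NF changes from line to line, and $NF always follows the last field.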
3.1 Specify Separator
[root@creditease awk]# awk -F "#" '{print $NF}' awk.txt
GKL$123
GKL$213
GKL$321
[root@creditease awk]# awk -F '[#$]' '{print $NF}' awk.txt
123
213
321
3.2 Pattern and Action
[root@creditease awk]# cat awk.txt
ABC#DEF#GHI#GKL$123
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
[root@creditease awk]# awk -F "#" 'NR==1{print $1}' awk.txt
ABC
3.3 Pattern Only
[root@creditease awk]# awk -F "#" 'NR==1' awk.txt
ABC#DEF#GHI#GKL$123
The default action is {print $0}
3.4 Action Only
[root@creditease awk]# awk -F "#" '{print $1}' awk.txt
ABC
BAC
CBA
All rows are processed by default
3.5 Multiple Patterns and Actions
[root@creditease awk]# awk -F "#" 'NR==1{print $NF}NR==3{print $NF}' awk.txt
GKL$123
GKL$321
3.6 Understanding $0
$0 in awk indicates the entire line
[root@creditease awk]# awk '{print $0}' awk_space.txt
ABC DEF GHI GKL$123
BAC DEF GHI GKL$213
CBA DEF GHI GKL$321
3.7 FNR
FNR is similar to NR, but it does not keep increasing across files: it restarts at 1 for each file (processing multiple files is discussed later).
[root@creditease awk]# awk '{print NR}' awk.txt awk_space.txt
1
2
3
4
5
6
[root@creditease awk]# awk '{print FNR}' awk.txt awk_space.txt
1
2
3
1
2
3
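To compare the two counters side by side, both can be printed next to FILENAME. The sketch below creates two throwaway files (one.txt and two.txt are hypothetical names) in a temporary directory.

```shell
# Build two small hypothetical files in a scratch directory.
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/one.txt"
printf 'c\nd\n' > "$dir/two.txt"
# NR keeps counting across files; FNR restarts at 1 for each file.
out=$(cd "$dir" && awk '{print FILENAME, NR, FNR}' one.txt two.txt)
printf '%s\n' "$out"
rm -rf "$dir"
```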
IV. Regular Expressions and Operators
awk, like sed, can filter the input text by pattern matching.
awk supports a large set of regular expression metacharacters, most of which are the same as those supported by sed. Regular expressions are an essential tool for working with the "three swordsmen" (grep, sed, and awk).
Regular Expression Metacharacters Supported by awk
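Most metacharacters work in awk by default. The interval expressions listed below are the exception: gawk releases before 4.0 need the --posix or --re-interval option for them, while gawk 4.0 and later (including the 4.0.2 used here) enable them by default.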
Metacharacters | Function | Example | Explanation |
---|---|---|---|
x{m} | x repeated exactly m times | /cool{5}/ | Note the difference between cool with and without parentheses: without them, x is only a single character, so /cool{5}/ matches coo followed by five l's, i.e. coolllll. /(cool){2,}/ matches coolcool, coolcoolcool, etc. |
x{m,} | x repeated at least m times | /(cool){2,}/ | Same as above |
x{m,n} | x repeated at least m times but no more than n times. In gawk before 4.0 this requires --posix or --re-interval; without the option the pattern cannot be used | /(cool){5,6}/ | Same as above |
When a regular expression is used as a pattern, awk by default looks for a match anywhere in the line and executes the action if one is found. Sometimes, however, only a specific column should be matched against the regular expression.
For example:
To find the rows of /etc/passwd whose fifth column ($5) matches the string mail, the two matching operators below are needed. These are the only two operators awk has for matching regular expressions.
Regular matching operator | Meaning |
---|---|
~ | Matches a record or field against a regular expression. |
!~ | The negation of ~: true when there is no match. |
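The /etc/passwd case described above can be sketched as follows. The three passwd-style lines are invented for illustration (a real /etc/passwd will differ); the third one contains mail only in its sixth column.

```shell
# Hypothetical passwd-style sample data.
data='root:x:0:0:root:/root:/bin/bash
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
demo:x:1000:1000:Demo User:/home/mail:/bin/bash'
# A bare /mail/ looks anywhere in the line.
any=$(printf '%s\n' "$data" | awk -F ":" '/mail/{print $1}')
# $5~/mail/ restricts the match to the fifth column; !~ negates it.
col5=$(printf '%s\n' "$data" | awk -F ":" '$5~/mail/{print $1}')
not5=$(printf '%s\n' "$data" | awk -F ":" '$5!~/mail/{print $1}')
printf 'any: %s\ncol5: %s\nnot5: %s\n' "$any" "$col5" "$not5"
```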
4.1 Regular Examples
1) Display the GHI column in awk.txt
[root@creditease awk]# cat awk.txt
ABC#DEF#GHI#GKL$123
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
[root@creditease awk]# awk -F "#" '{print $3}' awk.txt
GHI
GHI
GHI
[root@creditease awk]# awk -F "#" '{print $(NF-1)}' awk.txt
GHI
GHI
GHI
2) Display the row containing 321
[root@creditease awk]# awk '/321/{print $0}' awk.txt
CBA#DEF#GHI#GKL$321
3) Use # as separator to display the row with the first column beginning with B or the last column ending with 1
[root@creditease awk]# awk -F "#" '$1~/^B/{print $0}$NF~/1$/{print $0}' awk.txt
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
4) Use # as the delimiter to display the row with the first column beginning with B or C.
[root@creditease awk]# awk -F "#" '$1~/^B|^C/{print $0}' awk.txt
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
[root@creditease awk]# awk -F "#" '$1~/^[BC]/{print $0}' awk.txt
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
[root@creditease awk]# awk -F "#" '$1~/^(B|C)/{print $0}' awk.txt
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
[root@creditease awk]# awk -F "#" '$1!~/^A/{print $0}' awk.txt
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
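One caveat about example 3: with two separate rules, a row that satisfies both conditions is printed twice. The sketch below uses slightly altered, hypothetical data (the third row both starts with B and ends with 1) and shows the || operator as one way to print each row once.

```shell
# Hypothetical variant of awk.txt: the last row satisfies both conditions.
data='ABC#DEF#GHI#GKL$123
BAC#DEF#GHI#GKL$213
BCA#DEF#GHI#GKL$321'
# Two rules: each matching rule prints, so the last row appears twice.
two=$(printf '%s\n' "$data" | awk -F "#" '$1~/^B/{print $0}$NF~/1$/{print $0}')
# One rule with ||: each row is printed at most once.
one=$(printf '%s\n' "$data" | awk -F "#" '$1~/^B/ || $NF~/1$/')
printf '%s\n---\n%s\n' "$two" "$one"
```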
V. Comparison Expressions
awk is a programming language and can make more complex decisions: when a condition is true, awk executes the corresponding action. Usually the condition tests a particular field, for example printing only the rows whose score is above 80, which requires a comparison on that field.
The following table lists the relational operators that awk supports. They can compare numbers and strings. A true expression evaluates to 1 and a false one to 0; awk executes the action only when the expression is true.
Relational operators supported by awk
Operator | Meaning | Example |
---|---|---|
< | Less than | x<y |
<= | Less than or equal to | x<=y |
== | Equal to | x==y |
!= | Not equal to | x!=y |
>= | Greater than or equal to | x>=y |
> | Greater than | x>y |
5.1 Comparison Expression Examples
Show lines 2 and 3 of awk.txt
This can be done either by comparing NR or with a range pattern of the form /start/,/end/.
[root@creditease awk]# awk 'NR==2{print $0}NR==3{print $0}' awk.txt
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
[root@creditease awk]# awk 'NR>=2{print $0}' awk.txt
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
[root@creditease awk]# awk '/BAC/,/CBA/{print $0}' awk.txt
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
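Two further variations may help here. The first is the "score above 80" case mentioned at the start of this section, using an invented score list; the second selects lines 2 to 3 from a longer input, where a lower bound alone would not be enough. Both are sketches rather than the only way to write it.

```shell
# Hypothetical score list: name in column 1, score in column 2.
high=$(printf 'alice 92\nbob 78\ncarol 85\n' | awk '$2>80{print $1, $2}')
# Lines 2 to 3 of a five-line input, first with && and then as a range pattern.
by_and=$(seq 5 | awk 'NR>=2 && NR<=3')
by_range=$(seq 5 | awk 'NR==2,NR==3')
printf '%s\n' "$high" "$by_and" "$by_range"
```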
Chapter 2 awk Modules, Variables and Execution
The complete structure of an awk program is: awk 'BEGIN{commands} pattern{action} END{commands}' file
I. BEGIN module
The BEGIN module is executed before awk reads any input. It is often used to set built-in variables such as ORS, RS, FS, and OFS. A program that contains only a BEGIN module does not need an input file.
II. awk Built-in Variables (Predefined Variables)
variable name | Attribute |
---|---|
$0 | Current record, one full line |
$1, $2, $3 ... $n | The n-th field of the current record; fields are separated by FS. |
FS | Input field separator (field separator); defaults to whitespace. |
NF | Number of fields (columns) in the current record (number of fields). |
NR | Number of records read so far, i.e. the line number, starting at 1 (number of records). |
RS | Input record separator (record separator); defaults to a newline. |
OFS | Output field separator (output field separator); defaults to a space. |
ORS | Output record separator (output record separator); defaults to a newline. |
FNR | Record number within the current file; restarts for each file. |
FILENAME | The name of the file currently being processed |
Special note: in gawk, FS and RS can be regular expressions.
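A detail about OFS that the table does not show: it is applied when fields are printed separately or when the record is rebuilt, not when an unmodified $0 is printed. A minimal sketch with a made-up input line; the $1=$1 assignment is a common idiom for forcing the rebuild.

```shell
# Hypothetical single input line.
line='ABC#DEF#GHI'
# $0 is untouched, so the original '#' separators are printed.
untouched=$(printf '%s\n' "$line" | awk 'BEGIN{FS="#";OFS="-"}{print $0}')
# Assigning to any field rebuilds $0 using OFS.
rebuilt=$(printf '%s\n' "$line" | awk 'BEGIN{FS="#";OFS="-"}{$1=$1;print $0}')
printf '%s\n%s\n' "$untouched" "$rebuilt"
```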
2.1 First Use: Setting Built-in Variables
[root@creditease awk]# awk 'BEGIN{RS="#"}{print $0}' awk.txt
ABC
DEF
GHI
GKL$123
BAC
DEF
GHI
GKL$213
CBA
DEF
GHI
GKL$321
2.2 Second Use: Printing a Header
[root@creditease awk]# awk 'BEGIN{print "=======start======"}{print $0}' awk.txt
=======start======
ABC#DEF#GHI#GKL$123
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
2.3 Using awk for Arithmetic
[root@creditease files]# awk 'BEGIN{a=8;b=90;print a+b,a-c,a/b,a%b}'
98 8 0.0888889 8
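Note that the second value printed above is a-c, and c was never assigned. awk does not report an error here: an uninitialized variable counts as 0 in arithmetic and as an empty string in text. The short sketch below puts a-b next to a-c to make the difference visible.

```shell
# c is deliberately left unset: it acts as 0 in a-c and as "" between the brackets.
out=$(awk 'BEGIN{a=8;b=90;print a-b, a-c, "[" c "]"}')
printf '%s\n' "$out"
```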
III. END module
The END module is executed after awk has read all of its input. It is generally used to output a final result (a running total, the contents of an array). Like the BEGIN module, it can also print a marker, in this case a footer.
3.1 First Use: Printing a Footer
[root@creditease awk]# awk 'BEGIN{print "=======start======"}{print $0}END{print "=======end======"}' awk.txt
=======start======
ABC#DEF#GHI#GKL$123
BAC#DEF#GHI#GKL$213
CBA#DEF#GHI#GKL$321
=======end======
3.2 Second Use: Accumulation
1) Count blank lines (/etc/services file)
Solutions with grep, sed, and awk:
[root@creditease awk]# grep "^$" /etc/services |wc -l
17
[root@creditease awk]# sed -n '/^$/p' /etc/services |wc -l
17
[root@creditease awk]# awk '/^$/' /etc/services |wc -l
17
[root@creditease awk]# awk '/^$/{i=i+1}END{print i}' /etc/services
17
2) Arithmetic
1+2+3+...+100=5050. How can this be computed with awk?
[root@creditease awk]# seq 100|awk '{i=i+$0}END{print i}'
5050
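Because NR still holds the number of records in the END module, the same pattern extends naturally to an average. A small sketch (the variable name s is arbitrary):

```shell
# Accumulate in the main rule, report the sum and the mean in END.
out=$(seq 100 | awk '{s+=$0}END{print s, s/NR}')
printf '%s\n' "$out"
```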
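IV. Summary of the awk Structure
1. Normally write a single BEGIN module and a single END module. awk does accept several of each and runs them in the order written, but one of each is almost always enough and easier to read.
2. There can be multiple pattern{action} rules ("what to look for" and "what to do").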
V. Summary of the awk Execution Process
The awk execution process:
1. Command-line assignments (-F or -v)
2. Execute the contents of the BEGIN module
3. Start reading the input files
4. Test whether each pattern (condition) is true
- If it is true, execute the corresponding action.
- Read the next line and test again.
- Continue until the end of the last file.
5. Finally, execute the contents of the END module
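The order above can be observed directly. In this sketch the variable name tag and the two input lines are arbitrary; the point is that the -v assignment is already visible in BEGIN, that NR is still 0 there, and that END runs after the last line.

```shell
# tag is a hypothetical variable set on the command line with -v.
out=$(printf 'x\ny\n' | awk -v tag="T" '
  BEGIN { print "BEGIN sees tag=" tag ", NR=" NR }
  { print "rule sees line " NR ": " $0 }
  END { print "END sees NR=" NR }')
printf '%s\n' "$out"
```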
Chapter 3 awk Arrays and Syntax
I. awk Arrays
1.1 Array Structure
people["police"]=110
people["doctor"]=120
[root@creditease awk]# awk 'BEGIN{word[0]="credit";word[1]="easy";print word[0],word[1]}'
credit easy
[root@creditease awk]# awk 'BEGIN{word[0]="credit";word[1]="easy";for(i in word)print word[i]}'
credit
easy
1.2 Array Classification
Indexed array: subscripted by numbers
Associative array: subscripted by strings
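In practice the two kinds behave the same way in awk, since subscripts are stored as strings. The sketch below reuses the people and word arrays from 1.1 and adds the in operator, which tests for a subscript without creating it; the messages printed are only illustrative.

```shell
out=$(awk 'BEGIN{
  people["police"]=110; people["doctor"]=120
  word[1]="credit"
  # "in" checks whether a subscript exists.
  if ("police" in people) print "police ->", people["police"]
  if (!("nurse" in people)) print "nurse not set"
  # The numeric subscript 1 can also be reached as the string "1".
  if ("1" in word) print "word[1] reachable via string subscript"
}')
printf '%s\n' "$out"
```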
1.3 awk associative array
The text below has letters in the left column and numbers in the right column. Add up the numbers that follow the same letter and output the totals in alphabetical order.
a 1
b 3
c 2
d 7
b 5
a 3
g 2
f 6
Using $1 as the subscript, build the array with a[$1]=a[$1]+$2 (or a[$1]+=$2), then print the results with a for loop in the END module:
[root@creditease awk]# awk '{a[$1]=a[$1]+$2}END{for(i in a)print i,a[i]}' jia.txt
a 4
b 8
c 2
d 7
f 6
g 2
Note: the for(i in a) loop does not visit the elements in the order of the input text; to get sorted output, pipe the command into sort.
1.4 awk Indexed Arrays
Arrays subscripted by numbers
seq generates the numbers 1 to 10; display only the odd-numbered lines
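A minimal sketch of that advice, using a shortened, made-up version of the letter/number data:

```shell
# Sum the second column per letter, then let sort fix the output order.
out=$(printf 'b 3\na 1\nb 5\na 3\n' |
  awk '{a[$1]+=$2}END{for(i in a)print i,a[i]}' | sort)
printf '%s\n' "$out"
```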
[root@creditease awk]# seq 10|awk '{a[NR]=$0}END{for(i=1;i<=NR;i+=2){print a[i]}}'
1
3
5
7
9
seq generates the numbers 1 to 10; do not display the last 3 lines
[root@creditease awk]# seq 10|awk '{a[NR]=$0}END{for(i=1;i<=NR-3;i++){print a[i]}}'
1
2
3
4
5
6
7
Explanation: just change the range of i. This is mostly used to omit the last few lines of a file.
1.5 awk Array Deduplication in Practice
a++ and ++a
[root@creditease awk]# awk 'BEGIN{print a++}'
0
[root@creditease awk]# awk 'BEGIN{print ++a}'
1
[root@creditease awk]# awk 'BEGIN{a=1;b=a++;print a,b}'
2 1
[root@creditease awk]# awk 'BEGIN{a=1;b=++a;print a,b}'
2 2
Note:
Both forms add 1 to a.
b=a++ first assigns the value of a to b, then increments a.
b=++a first increments a, then assigns the new value of a to b.
Deduplicate the following text on the second column
[root@creditease awk]# cat qc.txt
2018/10/20 xiaoli 13373305025
2018/10/25 xiaowang 17712215986
2018/11/01 xiaoliu 18615517895
2018/11/12 xiaoli 13373305025
2018/11/19 xiaozhao 15512013263
2018/11/26 xiaoliu 18615517895
2018/12/01 xiaoma 16965564525
2018/12/09 xiaowang 17712215986
2018/11/24 xiaozhao 15512013263
Solution 1:
[root@creditease awk]# awk '!a[$2]++' qc.txt
2018/10/20 xiaoli 13373305025
2018/10/25 xiaowang 17712215986
2018/11/01 xiaoliu 18615517895
2018/11/19 xiaozhao 15512013263
2018/12/01 xiaoma 16965564525
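Explanation:
!a[$2]++ is the pattern (condition); the command can also be written out in full as awk '!a[$2]++{print $0}' qc.txt
a[$2]++: the ++ comes after, so the value is read first and then incremented.
!a[$2]++: first read a[$2] and test !a[$2], which is true only while a[$2] is still 0 (the first time this value appears), then add 1.
Note: this method keeps, for each value, the first row in which it appears, counting from the top of the text.
Solution 2: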
[root@creditease awk]# awk '++a[$2]==1' qc.txt
2018/10/20 xiaoli 13373305025
2018/10/25 xiaowang 17712215986
2018/11/01 xiaoliu 18615517895
2018/11/19 xiaozhao 15512013263
2018/12/01 xiaoma 16965564525
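Explanation:
++a[$2]==1 is the pattern (condition); it can also be written as (a[$2]=a[$2]+1)==1, so a row is printed only when the result of a[$2]+1 is 1.
++a[$2]: the ++ comes first, so the value is incremented first and then read.
++a[$2]==1: first add 1, then read a[$2] and test whether it equals 1.
Note: this method also keeps the first row in which each value appears.
Solution 3: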
[root@creditease awk]# awk '{a[$2]=$0}END{for(i in a){print a[i]}}' qc.txt
2018/11/12 xiaoli 13373305025
2018/11/26 xiaoliu 18615517895
2018/12/01 xiaoma 16965564525
2018/12/09 xiaowang 17712215986
2018/11/24 xiaozhao 15512013263
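Explanation:
Note: this method keeps the last row in which each value appears, because later rows overwrite earlier ones in the array.
1.6 Processing Multiple Files with awk (Arrays, NR, FNR)
Use awk to take the first column of file1.txt and the second column of file2.txt, then redirect the result to a new file, new.txt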
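The difference between keeping the first and the last occurrence is easiest to see on a tiny input. The three rows below are invented for illustration; the array names seen and row are arbitrary, and sort is added only to make the for-in output order predictable.

```shell
# Hypothetical data: xiaoli appears on the first and third rows.
data='d1 xiaoli
d2 xiaowang
d3 xiaoli'
# Keeps the first row seen for each name.
first=$(printf '%s\n' "$data" | awk '!seen[$2]++')
# Keeps the last row seen for each name (later rows overwrite earlier ones).
last=$(printf '%s\n' "$data" |
  awk '{row[$2]=$0}END{for(k in row)print row[k]}' | sort)
printf '%s\n---\n%s\n' "$first" "$last"
```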
[root@creditease awk]# cat file1.txt
a b
c d
e f
g h
i j
[root@creditease awk]# cat file2.txt
1 2
3 4
5 6
7 8
9 10
[root@creditease awk]# awk 'NR==FNR{a[FNR]=$1}NR!=FNR{print a[FNR],$2}' file1.txt file2.txt
a 2
c 4
e 6
g 8
i 10
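Explanation: NR==FNR is true while the first file is being processed; NR!=FNR is true while the second file is being processed.
Note: when the two files have different numbers of lines, put the file with more lines first and the file with fewer lines second.
Write the output to a new file, new.txt: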
[root@creditease awk]# awk 'NR==FNR{a[FNR]=$1}NR!=FNR{print a[FNR],$2>"new.txt"}' file1.txt file2.txt
[root@creditease awk]# cat new.txt
a 2
c 4
e 6
g 8
i 10
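The example above pairs rows by line number. When the two files share a key column instead, the same NR==FNR idiom can pair rows by key, which does not depend on line order or on the files having the same length. This is a sketch of that variation, not the method used above; left.txt, right.txt and their contents are hypothetical.

```shell
dir=$(mktemp -d)
printf 'k1 apple\nk2 pear\nk3 plum\n' > "$dir/left.txt"
printf 'k3 30\nk1 10\n' > "$dir/right.txt"
# First file: remember column 2 under the key in column 1, then skip to the next line.
# Second file: print only the rows whose key was seen in the first file.
out=$(awk 'NR==FNR{name[$1]=$2;next} ($1 in name){print $1, name[$1], $2}' \
  "$dir/left.txt" "$dir/right.txt")
printf '%s\n' "$out"
rm -rf "$dir"
```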
1.7 Analyzing a Log File with awk: Counting Visits per Website
[root@creditease awk]# cat url.txt
http://www.baidu.com
http://mp4.video.cn
http://www.qq.com
http://www.listeneasy.com
http://mp3.music.com
http://www.qq.com
http://www.qq.com
http://www.listeneasy.com
http://www.listeneasy.com
http://mp4.video.cn
http://mp3.music.com
http://www.baidu.com
http://www.baidu.com
http://www.baidu.com
http://www.baidu.com
[root@creditease awk]# awk -F "[/]+" '{h[$2]++}END{for(i in h) print i,h[i]}' url.txt
www.qq.com 3
www.baidu.com 5
mp4.video.cn 2
mp3.music.com 2
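www.listeneasy.com 3
II. Simple awk Syntax
2.1 The sub and gsub Functions
Substitution functions
Format: sub(r, s, target) gsub(r, s, target), where r is the regular expression, s is the replacement string, and target is the string to modify ($0 when omitted)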
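Since the for-in order is arbitrary, a ranking usually needs a sort step after the count. The sketch below uses a shortened, hypothetical URL list and sorts by count (descending), then by name to break ties; LC_ALL=C is set only to make the ordering reproducible.

```shell
out=$(printf '%s\n' http://www.qq.com http://www.baidu.com http://www.qq.com \
    http://mp3.music.com http://www.qq.com http://www.baidu.com |
  awk -F "[/]+" '{h[$2]++}END{for(i in h)print i,h[i]}' |
  LC_ALL=C sort -k2,2nr -k1,1)
printf '%s\n' "$out"
```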
[root@creditease awk]# cat sub.txt
ABC DEF AHI GKL$123
BAC DEF AHI GKL$213
CBA DEF GHI GKL$321
[root@creditease awk]# awk '{sub(/A/,"a");print $0}' sub.txt
aBC DEF AHI GKL$123
BaC DEF AHI GKL$213
CBa DEF GHI GKL$321
[root@creditease awk]# awk '{gsub(/A/,"a");print $0}' sub.txt
aBC DEF aHI GKL$123
BaC DEF aHI GKL$213
CBa DEF GHI GKL$321
Note: sub replaces only the first match in the line, like sed 's###'
gsub replaces every match in the line, like sed 's###g'
[root@creditease awk]# awk '{sub(/A/,"a",$1);print $0}' sub.txt
aBC DEF AHI GKL$123
BaC DEF AHI GKL$213
CBa DEF GHI GKL$321
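Two more properties of these functions are often useful: both return the number of substitutions made, and an & in the replacement string stands for the matched text. A minimal sketch on a made-up line:

```shell
# n receives the number of replacements; & re-inserts each matched "A".
out=$(printf 'ABC DEF AHI\n' | awk '{n=gsub(/A/,"[&]");print n ": " $0}')
printf '%s\n' "$out"
```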
Exercise:
0001|20081223efskjfdj|EREADFASDLKJCV
0002|20081208djfksdaa|JDKFJALSDJFsddf
0003|20081208efskjfdj|EREADFASDLKJCV
0004|20081211djfksdaa1234|JDKFJALSDJFsddf
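Using '|' as the separator, remove the digits that precede the letters in the second field and leave everything else unchanged. The output should be: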
0001|efskjfdj|EREADFASDLKJCV
0002|djfksdaa|JDKFJALSDJFsddf
0003|efskjfdj|EREADFASDLKJCV
0004|djfksdaa1234|JDKFJALSDJFsddf
Solution:
awk -F '|' 'BEGIN{OFS="|"}{sub(/[0-9]+/,"",$2);print $0}' sub_hm.txt
awk -F '|' -v OFS="|" '{sub(/[0-9]+/,"",$2);print $0}' sub_hm.txt
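Both solutions set OFS for a reason: modifying $2 makes awk rebuild $0, and the rebuilt record is joined with OFS, which is a space unless changed. The sketch below runs the same substitution with and without OFS on one invented line.

```shell
# Hypothetical single record in the exercise's format.
line='0001|2008abc|XYZ'
with=$(printf '%s\n' "$line" |
  awk -F '|' -v OFS='|' '{sub(/[0-9]+/,"",$2);print $0}')
# Without OFS the rebuilt record is joined with spaces.
without=$(printf '%s\n' "$line" |
  awk -F '|' '{sub(/[0-9]+/,"",$2);print $0}')
printf '%s\n%s\n' "$with" "$without"
```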
2.2 Usage of if and else
Contents:
AA
BC
AA
CB
CC
AA
Results:
AA YES
BC NO YES
AA YES
CB NO YES
CC NO YES
AA YES
1) [root@creditease awk]# awk '{if($0~/AA/){print $0" YES"}else{print $0" NO YES"}}' ifelse.txt
AA YES
BC NO YES
AA YES
CB NO YES
CC NO YES
AA YES
Explanation: using if and else, if $0 matches AA, print $0 followed by " YES"; otherwise print $0 followed by " NO YES".
2)[root@creditease awk]# awk '$0~/AA/{print $0" YES"}$0!~/AA/{print $0" NO YES"}' ifelse.txt
AA YES
BC NO YES
AA YES
CB NO YES
CC NO YES
AA YES
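Explanation: using regular expression matching, when $0 matches AA, print YES; otherwise print NO YES.
2.3 Usage of next
The same task as above, implemented with next.
next: stop processing the current record, skip all the code after it, and read the next record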
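A third way to express the same choice is the conditional operator (condition ? a : b), which keeps everything in a single print. This is offered as an alternative sketch on a two-line sample:

```shell
# Pick the suffix with ?: and concatenate it to the line.
out=$(printf 'AA\nBC\n' | awk '{print $0 " " ($0~/AA/ ? "YES" : "NO YES")}')
printf '%s\n' "$out"
```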
[root@creditease awk]# awk '$0~/AA/{print $0" YES";next}{print $0" NO YES"}' ifelse.txt
AA YES
BC NO YES
AA YES
CB NO YES
CC NO YES
AA YES
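Explanation:
{print $0" NO YES"} has no pattern, so it runs by default for every row. When $0~/AA/ matches, {print $0" YES";next} is executed first.
Because that action contains next, the actions after it are skipped for the current row.
If a row matches $0~/AA/, YES is printed and, on reaching next, the later action is not run; if it does not match, the action after next is run.
In short: when the rule before next runs (pattern matched), the rule after it does not; when the rule before next does not run (pattern not matched), the rule after it does.
2.4 printf (Output Without a Newline) and next
printf: does not add a newline after printing
In the following text, if nothing follows Description: on its line, merge the contents of the next line into it.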
Packages: Hello-1
Owner: me me me me
Other: who care?
Description:
Hello world!
Other2: don't care
Desired result:
Packages: Hello-1
Owner: me me me me
Other: who care?
Description: Hello world!
Other2: don't care
1)[root@creditease awk]# awk '/^Desc.*:$/{printf $0}!/Desc.*:$/{print $0}' printf.txt
Packages: Hello-1
Owner: me me me me
Other: who care?
Description:Hello world!
Other2: don't care
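Explanation: using regular expression matching, a row that matches /^Desc.*:$/ is printed with printf (no newline); rows that do not match are printed in full with print.
2) Using if and else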
[root@creditease awk]# awk '{if(/Des.*:$/){printf $0}else{print $0}}' printf.txt
Packages: Hello-1
Owner: me me me me
Other: who care?
Description:Hello world!
Other2: don't care
3) Using next
[root@creditease awk]# awk '/Desc.*:$/{printf $0;next}{print $0}' printf.txt
Packages: Hello-1
Owner: me me me me
Other: who care?
Description:Hello world!
Other2: don't care
Note: this can be shortened to awk '/Desc.*:$/{printf $0;next}1' printf.txt ## 1 is a pattern that is always true; the default action is {print $0}
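One caution about printf $0: the first argument of printf is a format string, so a % inside the line would be interpreted as a format directive. Passing the line through "%s" avoids this. The sketch below uses an invented two-line input whose first line contains a % sign.

```shell
# Hypothetical input: the line ending in ":" contains a literal percent sign.
out=$(printf '%s\n' 'Load 50%:' 'ok' |
  awk '/:$/{printf "%s", $0; next}1')
printf '%s\n' "$out"
```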
2.5 Deduplicate, Count, and Redirect to Specified Files
The text is as follows. Count how many times each item occurs, then write the items that occur more than 2 times to gt2.txt and those that occur 2 times or fewer to le2.txt.
[root@creditease files]# cat qcjs.txt
aaa
bbb
ccc
aaa
ddd
bbb
rrr
ttt
ccc
eee
ddd
rrr
bbb
rrr
bbb
[root@creditease awk]# awk '{a[$1]++}END{for(i in a){if(a[i]>2){print i,a[i]>"gt2.txt"}else{print i,a[i]>"le2.txt"}}}' qcjs.txt
[root@creditease awk]# cat gt2.txt
rrr 3
bbb 4
[root@creditease awk]# cat le2.txt
aaa 2
ccc 2
eee 1
ttt 1
ddd 2
Explanation: the output of print inside braces can be redirected straight to a new file, with the file name enclosed in double quotes. For example: {print $1 >"xin.txt"}
III. awk Notes and Pitfalls
a) NR==FNR ## cannot be written as NR=FNR (= means assignment in awk)
b) NR!=FNR ## NR is not equal to FNR
c) {a=1;a[NR]} ## reports an error: the same name cannot be used for both a variable and an array in one command
d) printf does not add a newline to its output
e) The output of print inside braces can be redirected straight to a new file, with the file name enclosed in double quotes. For example: {print $1 >"xin.txt"}
f) When the pattern (condition) evaluates to 0, the action after it is not executed; when it is non-zero (!0), the action is executed.
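The last point can be checked with constant patterns. In this small sketch, 0 never triggers the default action, while 1 and !0 always do:

```shell
none=$(seq 3 | awk '0')    # pattern is 0: nothing is printed
all=$(seq 3 | awk '1')     # pattern is non-zero: every line is printed
neg=$(seq 3 | awk '!0')    # !0 evaluates to 1, same as above
printf '[%s]\n%s\n%s\n' "$none" "$all" "$neg"
```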
Author: Qin Wei
Source: Yixin Institute of Technology