Network programming 101 with GAWK (GNU AWK)


In this tutorial, we look at networking in GAWK: working with TCP/IP on both the client side and the server side, and reading a web page over HTTP. We also take a brief look at profiling GAWK programs.

This tutorial is an excerpt from a book written by Shiwang Kalkhanda, titled Learning AWK Programming.

The AWK programming language was developed as a pattern-matching language for text manipulation; GAWK adds advanced features, such as file-like handling of network connections. We can perform simple TCP/IP connection handling in GAWK with the help of special filenames. GAWK extends the two-way I/O mechanism used with the |& operator to simple networking through these special filenames, which hide the complex details of socket programming from the programmer.

The special filename for network communication is made up of several fields, all of which are mandatory. Its syntax is as follows:

/net-type/protocol/local-port/remote-host/remote-port

The fields are separated by forward slashes. If a field does not apply to a protocol, or you want the system to pick a default value for it, set it to 0. The fields have the following meanings:


  • net-type: Its value is inet4 for IPv4, inet6 for IPv6, or inet to use the system default (which is generally IPv4).
  • protocol: It is either tcp or udp, for a TCP or UDP connection over IP. TCP is the usual choice for networking; UDP is used when low overhead is a priority.
  • local-port: It decides which port on the local machine is used for communication with the remote system. On the client side, it is generally set to 0, so that the system picks any free port. On the server side, it is non-zero, because the service is provided on a specific, publicly known port number or service name, such as http, smtp, and so on.
  • remote-host: It is the name of the remote host at the other end of the connection. On the server side, it is set to 0 to indicate that the server accepts connections from any host. On the client side, it names one specific remote host, so it is always different from 0. The name can be symbolic, such as www.google.com, or numeric, such as 123.45.67.89.
  • remote-port: It is the port on the remote machine used for communication. For clients, it is non-zero and names the port on the remote machine to connect to. For servers, it can be set to 0 to accept a connection from any client port, as in the examples that follow. It can be a service name, such as ftp or http, or a port number, such as 80 or 21.
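
As a rough illustration of how the five fields fit together, the following shell sketch wraps a one-line AWK program that joins them with forward slashes. The function name net_filename is hypothetical and not part of GAWK; it only builds the string and does not open a connection.

```shell
# net_filename is a hypothetical helper (not part of GAWK): it joins the
# five fields into a special filename and prints it. No socket is opened.
net_filename() {
    awk -v net="$1" -v proto="$2" -v lport="$3" -v rhost="$4" -v rport="$5" \
        'BEGIN { printf "/%s/%s/%s/%s/%s\n", net, proto, lport, rhost, rport }'
}

# Client side: any free local port, fixed remote host and port.
net_filename inet tcp 0 localhost 8080
# Server side: fixed local port, any remote host and port.
net_filename inet tcp 8080 0 0
```

The two calls print /inet/tcp/0/localhost/8080 and /inet/tcp/8080/0/0, which are the client and server filenames used in the TCP example that follows.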

TCP client and server (/inet/tcp)

TCP guarantees that data is received at the other end in the same order in which it was transmitted, so prefer TCP unless you have a specific reason to use UDP.

In the following example, we create a TCP server (sender) that sends its current date and time to the client. The server pipes the output of the strftime() function through the coprocess operator |& to the special filename, which makes GAWK listen on port 8080. The remote host and remote port could belong to any client, so both are set to 0.

The server connection is closed by passing the same special filename to the close() function, as follows:

$ vi tcpserver.awk
# TCP server: send the current date and time to the client that connects
BEGIN {
    print strftime() |& "/inet/tcp/8080/0/0"
    close("/inet/tcp/8080/0/0")
}

Now, open one Terminal and run this program before running the client program as follows:

$ awk -f  tcpserver.awk

Next, we create the TCP client (receiver) to receive the data sent by the TCP server. The client opens the connection and reads the received data with getline through the coprocess operator. The local-port value is set to 0 so that the system chooses it automatically, the remote-host is set to localhost, and the remote-port is set to the TCP server port, 8080. The received message is then printed with print $0, and finally the client connection is closed with close(), as follows:

$ vi tcpclient.awk
# TCP client: read one line from the server and print it
BEGIN {
    "/inet/tcp/0/localhost/8080" |& getline
    print $0
    close("/inet/tcp/0/localhost/8080")
}

Now, execute the tcpclient program in another Terminal as follows:

$ awk -f  tcpclient.awk

The output of the previous code is as follows:

Fri Feb  9 09:42:22 IST 2018
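
The client above reads a single line with getline. To read everything a peer sends, a common pattern is a loop that continues while getline returns a positive value. The sketch below shows that loop with an ordinary command pipe (|) standing in for the socket, so that it runs without a server; read_all is a hypothetical name chosen for illustration. With a socket, the loop condition would use |& and the special filename instead.

```shell
# read_all is a hypothetical helper: it runs the command given as $1 from
# inside AWK and prints every line the command produces, numbered.
read_all() {
    awk -v cmd="$1" 'BEGIN {
        # getline returns 1 for a record, 0 at end of input and -1 on error,
        # so testing "> 0" ends the loop in both of the last two cases.
        while ((cmd | getline line) > 0) {
            n++
            print n ": " line
        }
        close(cmd)
    }'
}

read_all 'echo first; echo second'
```

Testing for "> 0" rather than for any non-zero value matters because an error returns -1, which would otherwise keep the loop running.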

UDP client and server (/inet/udp)

The server and client programs that use the UDP protocol for communication are almost identical to their TCP counterparts, with the only difference being that the protocol is changed to udp from tcp. So, the UDP-server and UDP-client program can be written as follows:

$ vi udpserver.awk
# UDP server: send the date and time, then print the message from the client
BEGIN {
    print strftime() |& "/inet/udp/8080/0/0"
    "/inet/udp/8080/0/0" |& getline
    print $0
    close("/inet/udp/8080/0/0")
}

$ awk -f udpserver.awk

Here, only one addition has been made to the client program: the client first sends the message hello from client! to the server. So, on the Terminal where udpclient.awk is run, we get the date and time of the remote system, and on the Terminal where udpserver.awk is run, we get the hello message from the client:

$ vi udpclient.awk
# UDP client: send a greeting, then print the reply from the server
BEGIN {
    print "hello from client!" |& "/inet/udp/0/localhost/8080"
    "/inet/udp/0/localhost/8080" |& getline
    print $0
    close("/inet/udp/0/localhost/8080")
}

$ awk -f udpclient.awk

GAWK can be used to open direct sockets only. Currently, there is no way to access services available over an SSL connection such as https, smtps, pop3s, imaps, and so on.

Reading a web page using HttpService

To read a web page, we use the Hypertext Transfer Protocol (HTTP) service, which runs on port 80. First, we redefine the record separators RS and ORS, because HTTP requires CR-LF to separate lines. The program opens a connection to the IP address 35.164.82.168 (www.grymoire.com), a static website, and sends a GET request for the web page http://35.164.82.168/Unix/donate.html. GET is the HTTP method that tells the web server to transmit the page donate.html. The response is read with getline through the coprocess operator and printed on the screen, line by line, in a while loop. Finally, we close the connection. The following is the program to retrieve the web page:

$ vi  view_webpage.awk
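BEGIN {
    # HTTP separates lines with CR-LF, for both input and output
    RS = ORS = "\r\n"
    http = "/inet/tcp/0/35.164.82.168/80"
    # send the request to the web server
    print "GET http://35.164.82.168/Unix/donate.html" |& http
    # read the response line by line until the server closes the connection
    while ((http |& getline) > 0)
        print $0
    close(http)
}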

$ awk -f view_webpage.awk

On execution, the program fills the screen with the source code of the page, as follows:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML lang="en-US">
<HEAD>
<TITLE> Welcome to The UNIX Grymoire!</TITLE>
<meta name="keywords" content="grymoire, donate, unix, tutorials, sed, awk">
<META NAME="Description" CONTENT="Please donate to the Unix Grymoire" >
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> 
<link href="myCSS.css" rel="stylesheet" type="text/css"> 
<!-- Place this tag in your head or just before your close body tag -->
<script type="text/javascript" src="https://apis.google.com/js/plusone.js"></script>
<link rel="canonical" href="http://www.grymoire.com/Unix/donate.html">
<link href="myCSS.css" rel="stylesheet" type="text/css"> 
........
........
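
The program above sends a bare GET line. Many servers expect a fuller request: a request line with the protocol version and a Host header, each terminated by CR-LF and followed by an empty line. The sketch below only prints such a request to standard output so that the bytes can be inspected; the function name http_request and the host www.example.com are placeholders, not taken from the book. In a GAWK program, the same print statements would be sent with |& to the special filename.

```shell
# http_request is a hypothetical helper: it prints an HTTP/1.0 GET request
# for the given host and path, with every line terminated by CR-LF.
http_request() {
    awk -v host="$1" -v path="$2" 'BEGIN {
        ORS = "\r\n"                    # HTTP separates header lines with CR-LF
        print "GET " path " HTTP/1.0"   # request line
        print "Host: " host             # names the site the request is meant for
        print ""                        # an empty line ends the header section
    }'
}

# Show the request with control characters made visible.
http_request www.example.com /index.html | od -c
```

Setting ORS once keeps the line endings consistent, which is the same reason the original program assigns RS and ORS before sending anything.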

Profiling in GAWK

Profiling shows how often each part of a program is executed, which helps with optimization. In GAWK, we profile a program by passing the --profile option when running it. GAWK then creates a file named awkprof.out. Because GAWK is profiling the code, execution is up to 45% slower than normal.

Let's understand profiling by looking at an example. In the following example, we create a program with four functions: two arithmetic functions, one function that prints an array, and one function that calls all of them. The program also contains two BEGIN and two END rules: first one BEGIN and one END rule, then a pattern-action rule, and then the second BEGIN and END rules, as follows:

$ vi codeprof.awk
func z_array(){
    arr[30] = "volvo"
    arr[10] = "bmw"
    arr[20] = "audi"
    arr[50] = "toyota"
    arr["car"] = "ferrari"

    n = asort(arr)
    print "Array begins...!"
    print "====================="
    for ( v in arr )
        print v, arr[v]
    print "Array Ends...!"
    print "====================="
}

function mul(num1, num2){
    result = num1 * num2
    printf ("Multiplication of %d * %d : %d\n", num1,num2,result)
}
function all(){
    add(30,10)
    mul(5,6)
    z_array()
}
BEGIN { print "First BEGIN statement"
    print "====================="
}
END { print "First END statement " 
    print "====================="
}
/maruti/{print $0 }

BEGIN {
    print "Second BEGIN statement"
    print "====================="
    all()
}
END { print "Second END statement"
    print "====================="
}
function add(num1, num2){
    result = num1 + num2
    printf ("Addition of %d + %d : %d\n", num1,num2,result)
}

$ awk --profile -f codeprof.awk cars.dat

The output of the previous code is as follows:

First BEGIN statement
=====================
Second BEGIN statement
=====================
Addition of 30 + 10 : 40
Multiplication of 5 * 6 : 30
Array begins...!
=====================
1 audi
2 bmw
3 ferrari
4 toyota
5 volvo
Array Ends...!
=====================
maruti          swift       2007        50000       5
maruti          dezire      2009        3100        6
maruti          swift       2009        4100        5
maruti          esteem      1997        98000       1
First END statement 
=====================
Second END statement
=====================

Execution of the previous program also creates a file with the name awkprof.out. If we want to create this profile file with a custom name, then we can specify the filename as an argument to the --profile option as follows:

$ awk --profile=codeprof.prof -f codeprof.awk cars.dat

Now, upon execution of the preceding command, we get a new file with the name codeprof.prof. Let's try to understand the contents of the file codeprof.prof created by the profiler, which are as follows:

# gawk profile, created Fri Feb  9 11:01:41 2018
# BEGIN rule(s)

BEGIN {
1 print "First BEGIN statement"
1 print "====================="
}

BEGIN {
1 print "Second BEGIN statement"
1 print "====================="
1 all()
}
# Rule(s)

12 /maruti/ { # 4
4 print $0
}

# END rule(s)

END {
1 print "First END statement "
1 print "====================="
}

END {
1 print "Second END statement"
1 print "====================="
}
# Functions, listed alphabetically

1 function add(num1, num2)
{
1 result = num1 + num2
1 printf "Addition of %d + %d : %d\n", num1, num2, result
}

1 function all()
{
1 add(30, 10)
1 mul(5, 6)
1 z_array()
}

1 function mul(num1, num2)
{
1 result = num1 * num2
1 printf "Multiplication of %d * %d : %d\n", num1, num2, result
}
1 function z_array()
{
1 arr[30] = "volvo"
1 arr[10] = "bmw"
1 arr[20] = "audi"
1 arr[50] = "toyota"
1 arr["car"] = "ferrari"
1 n = asort(arr)
1 print "Array begins...!"
1 print "====================="
5 for (v in arr) {
5 print v, arr[v]
}
1 print "Array Ends...!"
1 print "====================="
}

This profiling example explains the various basic features of profiling in GAWK. They are as follows:

  • Reading the file from top to bottom shows the order in which GAWK lists the rules. BEGIN rules come first, followed by BEGINFILE rules, if any. Then the pattern-action rules are listed, followed by ENDFILE rules and END rules. Finally, functions are listed in alphabetical order. Multiple BEGIN and END rules are kept as separate rules. The same is true for BEGINFILE and ENDFILE rules.
  • The pattern-action rules have two counts. The first number, to the left of the rule, tells how many times the rule's pattern was tested against an input record. The second number, in a comment to the right of the rule's opening brace, shows how many times the rule's action was executed, that is, how many times the pattern evaluated to true. The difference between the two is the number of times the rule's pattern evaluated to false.
  • For an if-else statement, the count to the left shows how many times the condition was tested. To the right of the opening brace of its body is a count showing how many times the condition was true. The count for the else statement tells how many times the test failed.
  • The count at the beginning of a loop header (for or while loop) shows how many times the loop's conditional expression was evaluated.
  • In user-defined functions, the count before the function keyword tells how many times the function was called. The counts next to the statements in the body show how many times those statements were executed.
  • The layout of each block uses C-style tabs for code alignment. Braces mark the opening and closing of a code block, as in C.
  • Parentheses are used only where the precedence rules and the structure of the program require them.
  • The arguments of a print or printf statement are enclosed in parentheses only if the statement is followed by a redirection.
  • GAWK also adds leading comments before the BEGIN and END rules, the BEGINFILE and ENDFILE rules, the pattern-action rules, and the functions.
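
The two counts on a pattern-action rule can be reproduced by hand, which may help when reading a profile. The sketch below counts how many records a pattern was tested against and how many times it matched. The function name rule_counts and the three sample records are made up for illustration; they are not the cars.dat file used above.

```shell
# rule_counts is a hypothetical helper: for the pattern given as $1 it prints
# "tested matched failed" for the records read from standard input.
rule_counts() {
    awk -v pat="$1" '
        { tested++ }                 # every record is tested against the pattern
        $0 ~ pat { matched++ }       # the action runs only when the pattern matches
        END { print tested + 0, matched + 0, tested - matched }
    '
}

# Three made-up records, two of which contain "maruti".
printf '%s\n' 'maruti swift 2007' 'honda city 2005' 'maruti esteem 1997' |
    rule_counts maruti
```

For the sample records this prints 3 2 1. In the profile shown earlier, the corresponding numbers are 12 and 4, so the /maruti/ pattern evaluated to false 8 times.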

The profiled version of a program is printed in GAWK's standard layout. GAWK also accepts another option, --pretty-print. The following is an example of pretty-printing an AWK program:

$ awk  --pretty-print  -f  codeprof.awk cars.dat

When GAWK is called with --pretty-print, it generates awkprof.out, but this time without any execution counts in the output. Pretty-printed output also preserves the comments of the original program, while the --profile option omits them. The file created with the --pretty-print option is as follows:

# gawk profile, created Fri Feb  9 11:04:19 2018
# BEGIN rule(s)

BEGIN {
print "First BEGIN statement"
print "====================="
}

BEGIN {
print "Second BEGIN statement"
print "====================="
all()
}
# Rule(s)

/maruti/ {
print $0
}

# END rule(s)

END {
print "First END statement "
print "====================="
}

END {
print "Second END statement"
print "====================="
}
# Functions, listed alphabetically

function add(num1, num2)
{
result = num1 + num2
printf "Addition of %d + %d : %d\n", num1, num2, result
}

function all()
{
add(30, 10)
mul(5, 6)
z_array()
}

function mul(num1, num2)
{
result = num1 * num2
printf "Multiplication of %d * %d : %d\n", num1, num2, result
}
function z_array()
{
arr[30] = "volvo"
arr[10] = "bmw"
arr[20] = "audi"
arr[50] = "toyota"
arr["car"] = "ferrari"
n = asort(arr)
print "Array begins...!"
print "====================="
for (v in arr) {
print v, arr[v]
}
print "Array Ends...!"
print "====================="
}

To summarize, we looked at the basics of network programming in GAWK and at profiling GAWK programs.

Do check out the book Learning AWK Programming to know more about the intricacies of AWK programming for text processing.
