josuah.net

AWK

AWK is a surprisingly efficient language, in both performance and code size. This comes from the ubiquitous array structure and from splitting the input into fields by default.

Not everything is parsed efficiently with AWK, Type-Length-Value for instance, but many things are. I use it for multiple projects.

Below are multiple ways of using awk to get the best out of it. Some are my own, some were collected from what I saw in the wild.

CSV fields with header

Instead of trying to remember the number of each column, using its name is much easier, and allows new columns to be inserted in the .csv file without breaking the script.

$ cat input.txt
domain_name,expiry_date,creation_date,owner,account_id
nowhere.com,2020-03,2019-05,me,23535
perdu.com,2020-04,2018-03,you,23535
pa.st,2020-09,2014-05,them,23535

$ awk '
	BEGIN { FS = "," }
	NR == 1 { for (i = 1; i <= NF; i++) F[$i] = i; next }
	$F["domain_name"] ~ /\.com$/ {
		print $F["expiry_date"], $F["owner"], $F["domain_name"]
	}
' input.txt
2020-03 me nowhere.com
2020-04 you perdu.com

UCL-style configuration

Parsing data that is not organised in lines and columns is also convenient and efficient with awk, for instance to select one kind of value out of a configuration file:

$ cat input.txt
connections {
	conn-faraway {
		children {
			localnet = fe80:123d:35d3::%vio1/64
			localnet = fe80:2e46:1d23::%vio2/64
		}
		children {
			localnet = fe80:546:23e4::%vio3/64
		}
	}
	conn-veryclose {
		children {
			localnet = fe80:b536:243f::%vio3/64
			localnet = fe80:34f3:23c3::%vio3/64
			localnet = fe80:546a:343d::%vio3/64
		}
	}
}
$ awk '
	$2 == "{" { F[lv++] = $1 }
	$1 == "}" { delete F[--lv] }
	F[0] == "connections" && F[2] == "children" && $1 == "localnet" {
		print F[1], $3
	}
' input.txt
conn-faraway fe80:123d:35d3::%vio1/64
conn-faraway fe80:2e46:1d23::%vio2/64
conn-faraway fe80:546:23e4::%vio3/64
conn-veryclose fe80:b536:243f::%vio3/64
conn-veryclose fe80:34f3:23c3::%vio3/64
conn-veryclose fe80:546a:343d::%vio3/64

Key-Value splitter

Key-value pairs map rather directly to an awk array, for instance to extract a summary from a basic iCal file:

$ cat input.txt
BEGIN:VEVENT
METHOD:PUBLISH
UID:9189@FOSDEM20@fosdem.org
TZID:Europe-Brussels
DTSTART:20200201T170000
DTEND:20200201T175000
SUMMARY:State of the Onion
DESCRIPTION:Building usable free software to fight surveillance and censorship.
CLASS:PUBLIC
STATUS:CONFIRMED
CATEGORIES:Internet
LOCATION:Janson
END:VEVENT
$ awk '
	BEGIN { FS = ":" }
	{ F[$1] = $2 }
	$1 == "END" {
		print F["SUMMARY"] " - " F["DESCRIPTION"]
		print F["DTSTART"], "(" F["TZID"] ")"
	}
' input.txt
State of the Onion - Building usable free software to fight surveillance and censorship.
20200201T170000 (Europe-Brussels)

Edit variables passed to functions

In languages that support references, pointers, or objects, a function can edit a variable passed to it, so that the change is also visible in the function that called it.

void increment(int *i) { (*i)++; }

Awk passes numbers and strings by value, so a function cannot change them for its caller, but arrays are passed by reference, so their elements can be edited:

function increment_first(arr) { arr[1]++ }
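
A minimal sketch of the difference, runnable as a shell snippet. The names bump_scalar() and bump_first() and the file name byref.awk are made up for this illustration:

```shell
cat >byref.awk <<'EOF'
# scalars are copied: the caller's variable is left untouched
function bump_scalar(n) { n++ }

# arrays are shared: the caller sees the modified element
function bump_first(arr) { arr[1]++ }

BEGIN {
	n = 1
	arr[1] = 1
	bump_scalar(n)
	bump_first(arr)
	print n, arr[1]
}
EOF
awk -f byref.awk
```

This prints "1 2": the scalar kept its value, the array element did not.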

Local variables in functions

By default, all awk variables are global, which is inconvenient for writing functions. The solution is to add an extra function argument at the end for each local variable we need.

Functions can be called with fewer arguments than they declare; the missing ones start out empty.

$ awk '
	function concat3(arg1, arg2, arg3,
		local1)
	{
		local1 = arg1 arg2 arg3
		return local1
	}

	BEGIN {
		local1 = 1
		print(concat3("a", "w", "k"))
		print(local1)
	}
'
awk
1

I learned this with the project.

A sort() function

A very convenient feature missing from awk is support for sorting the members of an array. It is possible to implement sort() in awk (this is a quicksort):

function swap(array, a, b,
	tmp)
{
	tmp = array[a]
	array[a] = array[b]
	array[b] = tmp
}

function sort(array, beg, end,
	a, b) # locals: recursive calls must not clobber them
{
	if (beg >= end) # end recursion
		return
	a = beg + 1 # 1st is the pivot, so +1
	b = end
	while (a < b) {
		while (a < b && array[a] <= array[beg]) # beg: skip lesser
			a++
		while (a < b && array[b] > array[beg]) # end: skip greater
			b--
		swap(array, a, b) # found 2 misplaced
	}
	if (array[beg] > array[a]) # put the pivot back
		swap(array, beg, a)
	sort(array, beg, a - 1) # sort lower half
	sort(array, a, end) # sort higher half
}

This sorts the array values using integer keys: array[1], array[2], ... It sorts from array[beg] to array[end] inclusive, so array indices can start at 0 or 1, and it is possible to sort just a part of the array.

Example usage, with both functions above:

{
	LINES[NR] = $0
}

END {
	sort(LINES, 1, NR)
	for (i = 1; i <= NR; i++)
		print(LINES[i])
}

Performance is far from terrible!

$ od -An /dev/urandom | head -n 1000000 | time ./test.awk >/dev/null
real    0m 19.23s
user    0m 17.90s
sys     0m 0.12s

$ od -An /dev/urandom | head -n 1000000 | time sort >/dev/null
real    0m 4.39s
user    0m 3.00s
sys     0m 0.10s

Fill a static array

C and many other languages have a convenient, concise syntax such as { "a", "b", "c", ... } to fill an array with values. In awk, a well-known way is:

split("a b c ...", array, " ")

Note that this does not save the length returned by split(), but in practice I realized I rarely need it:

for (i = 1; i in array; i++)
	print(i, array[i])
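
For the rare case where the length is needed after all, split() returns it. A small sketch that shows both ways of walking the array (the file name fill.awk and the variable n are arbitrary choices for this example):

```shell
cat >fill.awk <<'EOF'
BEGIN {
	# split() returns the element count; keep it only if needed
	n = split("a b c", array, " ")
	# walk the array without the count: stop at the first missing index
	for (i = 1; i in array; i++)
		print i, array[i]
	print "length:", n
}
EOF
awk -f fill.awk
```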

A fold_line() function

Convenient to work with text documents or emails. This version does not truncate very long words, such as some https:// links.

function fold_line(str, len,
	head, tail, i)
{
	head = substr(str, 1, len + 1)
	sub(" *$", "", head)
	if (length(head) == len + 1)
		sub(" *[^ ]*$", "", head)
	if (length(head) == 0) {
		tail = substr(str, len + 1)
		head = substr(str, 1, len)
		if ((i = index(tail, " ")) == 0)
			return str
		return head substr(tail, 1, i - 1) # stop before the space
	}
	return head
}

To use it, call it in a while loop like this:

{
	while (line = fold_line($0, 72)) {
		print line
		$0 = substr($0, length(line) + 2)
	}
}

A gmtime() function

POSIX awk as well as many implementations lack the time functions present in GNU awk. This gmtime() function splits an epoch integer value (1587302158) into the fields year, mon, mday, hour, min, sec (2020-04-19T13:15:58Z):

function isleap(year)
{
	return (year % 4 == 0) && (year % 100 != 0) || (year % 400 == 0)
}

function mdays(mon, year)
{
	return (mon == 2) ? (28 + isleap(year)) : (30 + (mon + (mon > 7)) % 2)
}

function gmtime(sec, tm,
	s)
{
	tm["year"] = 1970
	while (sec >= (s = 86400 * (365 + isleap(tm["year"])))) {
		tm["year"]++
		sec -= s
	}
	tm["mon"] = 1
	while (sec >= (s = 86400 * mdays(tm["mon"], tm["year"]))) {
		tm["mon"]++
		sec -= s
	}
	tm["mday"] = 1
	while (sec >= (s = 86400)) {
		tm["mday"]++
		sec -= s
	}
	tm["hour"] = 0
	while (sec >= 3600) {
		tm["hour"]++
		sec -= 3600
	}
	tm["min"] = 0
	while (sec >= 60) {
		tm["min"]++
		sec -= 60
	}
	tm["sec"] = sec
}

The tm array is filled with fields named after those of the C gmtime() function, as seen above.

A localtime() function

To print times in the user's timezone, gmtime's time needs to be shifted. This can also be done in standard awk by calling the date(1) command:

function localtime(sec, tm,
	tz, h, m)
{
	if (!TZ) {
		"date +%z" | getline tz
		close("date +%z")
		h = substr(tz, 2, 2)
		m = substr(tz, 4, 2)
		TZ = substr(tz, 1, 1) (h * 3600 + m * 60)
	}
	return gmtime(sec + TZ, tm)
}

Note that date(1) will only be called the first time localtime() is called, and the TZ global variable will be used for the next calls.
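
To make the offset arithmetic concrete, here is a small sketch that converts a "date +%z" style string into seconds the same way as above. The tz_offset() name, the tz.awk file name and the sample offsets are assumptions made for the example:

```shell
cat >tz.awk <<'EOF'
# convert "+HHMM" or "-HHMM" into a signed number of seconds
function tz_offset(tz,
	h, m)
{
	h = substr(tz, 2, 2)
	m = substr(tz, 4, 2)
	# glue the sign character to the number, then "+ 0" makes it numeric
	return (substr(tz, 1, 1) (h * 3600 + m * 60)) + 0
}

BEGIN {
	print tz_offset("+0200")
	print tz_offset("-0530")
}
EOF
awk -f tz.awk
```

This prints 7200 and -19800, the values that would be added to the epoch before calling gmtime().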

A timegm() function

The complement of gmtime is timegm, which converts a tm[] array back to an integer representation. This is useful for turning parsed time values back into a Unix timestamp:

function isleap(year)
{
	return (year % 4 == 0) && (year % 100 != 0) || (year % 400 == 0)
}

function mdays(mon, year)
{
	return (mon == 2) ? (28 + isleap(year)) : (30 + (mon + (mon > 7)) % 2)
}

function timegm(tm,
	sec, mon, day)
{
	sec = tm["sec"] + tm["min"] * 60 + tm["hour"] * 3600
	day = tm["mday"] - 1
	for (mon = tm["mon"] - 1; mon > 0; mon--)
		day = day + mdays(mon, tm["year"])
	day = day + int(tm["year"] / 400) * 146097
	day = day + int(tm["year"] % 400 / 100) * 36524
	day = day + int(tm["year"] % 100 / 4) * 1461
	day = day + int(tm["year"] % 4 / 1) * 365
	# the year sums above wrongly include the leap day of tm["year"] itself
	return sec + (day - 719527 - isleap(tm["year"])) * 86400
}

All the following fields of tm[] must be defined: "year", "mon", "mday", "hour", "min", "sec".

Convert MAC address to brand name

MAC addresses are composed of a leading Organizationally Unique Identifier (OUI) of 3 bytes and a trailing 3-byte number, unique for that OUI.

Each vendor has its own OUI, so each OUI maps to a vendor. With the reference list the IEEE publishes, it is possible to convert MAC address OUI digits to a human-readable name:

function oui_table(path,
	url)
{
	url = "http://standards-oui.ieee.org/oui/oui.txt"
	if (system("test -f '" path "'") > 0)
		if (system("curl -L -o '" path "' " url) != 0)
			return -1
	while ((getline <path) > 0) # stop at end of file (0) and on read error (-1)
		if ($2 " " $3 == "(base 16)")
			OUI[$1] = substr($0, 22)
	return 0
}

A global OUI array then maps MAC address prefixes to vendor names:

BEGIN {
	if (oui_table("/var/tmp/oui.txt") < 0)
		exit(1)
	print(OUI[toupper("84a991")])
}
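
In practice the address rarely comes as six bare hex digits. A possible helper, sketched under the assumption that addresses use ":" or "-" as separators, normalises a full MAC address into the key format used by the OUI array. The mac_oui() name, the oui.awk file name and the sample addresses are invented for this example:

```shell
cat >oui.awk <<'EOF'
# "84:a9:91:12:34:56" -> "84A991", ready for an OUI[] lookup
function mac_oui(mac)
{
	gsub(/[:-]/, "", mac)	# drop the separators
	return toupper(substr(mac, 1, 6))
}

BEGIN {
	print mac_oui("84:a9:91:12:34:56")
	print mac_oui("00-22-72-ab-cd-ef")
}
EOF
awk -f oui.awk
```

This prints 84A991 and 002272. The lookup itself would then read OUI[mac_oui(addr)], with addr being whatever variable holds the address.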