tinyapps.org / blog


Who needs a database when you've got sed and awk? ;-)

As a simple text-based site, TinyApps.Org is ideal for parsing with *nix text processing commands. Here is a one-liner to calculate the average file size of the Palm apps:

grep -o "\[.*\]" palm.html | sed 's/[^0-9]//g' | awk '{s+=$1}END{print s/NR}'

Let's break it down:

First, grep finds every stretch of text in palm.html that matches an open bracket, followed by anything, followed by a close bracket. The -o option (only-matching) tells grep to print only the part of each line that matches the pattern, rather than the entire line. This gives us output like:
[58k]
[62k]
[296k]
...
Next, we'll use sed to (s)ubstitute any non-numeric character with nothing, (g)lobally:

sed 's/[^0-9]//g'

which returns:
58
62
296
...
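To try these first two stages without the real page, here is a minimal sketch. The three HTML lines are made up for illustration (they are not copied from palm.html), and the variable names sample and sizes are hypothetical.

```shell
# Hypothetical sample standing in for a few lines of palm.html
sample='<li><a href="a.zip">AppOne</a> [58k] a tiny app</li>
<li><a href="b.zip">AppTwo</a> [62k] another one</li>
<li><a href="c.zip">AppThree</a> [296k] a bigger one</li>'

# grep -o keeps only the bracketed part; sed strips every non-digit
sizes=$(printf '%s\n' "$sample" | grep -o "\[.*\]" | sed 's/[^0-9]//g')
printf '%s\n' "$sizes"
```

Note that grep -o is not part of POSIX, though both GNU and BSD grep support it.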
Finally, we'll use awk to total and average the numbers:

awk '{s+=$1}END{print s/NR}'

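{s+=$1} runs on every line and adds the first (and only) column to s. END{print s/NR} runs once after the last line and prints the sum divided by NR, a built-in awk variable holding the Number of Records (i.e., lines) read so far, which at that point is the total line count.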
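As a quick check of the awk stage on its own, the sketch below feeds it just the three sample sizes shown earlier rather than the whole page, so the result differs from the real Palm average. The variable name avg is only for illustration.

```shell
# Only the three sample sizes, not the full list from palm.html
avg=$(printf '58\n62\n296\n' | awk '{s+=$1}END{print s/NR}')
echo "$avg"   # (58 + 62 + 296) / 3, printed as 138.667
```

awk prints non-integer results with six significant digits by default, which is why the averages in this post have that shape.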

And the result is:
112.324
This worked beautifully on all of the other app pages as well (osx.html, internet.html, etc.) save for one: system.html. On that page, one of the entries reads: "[various sizes]". This is the only instance in all of the app pages where a number is not listed within the brackets. As a result,

grep -o "\[.*\]" system.html | sed 's/[^0-9]//g'

displays a blank for that line:
...
1454
     <-- Note the blank line here
30
...
which means that awk divides (in this case) 18729 by 55 records instead of the correct 54.
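The effect is easy to reproduce on a small scale. This sketch assumes a made-up three-entry excerpt (the entries variable is hypothetical, not the real contents of system.html) and shows the blank line being counted as a record:

```shell
# Hypothetical excerpt of bracketed entries, including the odd one out
entries='[1454k]
[various sizes]
[30k]'

# The non-numeric entry becomes an empty line, which awk still counts in NR
count=$(printf '%s\n' "$entries" | sed 's/[^0-9]//g' | awk 'END{print NR}')
avg=$(printf '%s\n' "$entries" | sed 's/[^0-9]//g' | awk '{s+=$1}END{print s/NR}')
echo "records: $count, average: $avg"
```

Here awk sees 3 records and reports 494.667, although only two sizes (1454 and 30) were actually listed.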

In order to remedy this problem, we will add a second command to sed, separated from the first by a semicolon:

sed 's/[^0-9]//g;/./!d'

/./!d deletes any line which does not contain at least one character, thus removing the single blank line. Now awk will divide the total by the correct number of lines to arrive at the true average for system.html (346.833). The full one-liner, shown again with palm.html, is now:

grep -o "\[.*\]" palm.html | sed 's/[^0-9]//g;/./!d' | awk '{s+=$1}END{print s/NR}'
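To see the fix at work, the sketch below runs the corrected sed command on a made-up three-entry excerpt (the entries variable is hypothetical). It also shows one possible variation that is not from the original post: leaving sed alone and letting awk count only non-empty lines, with a guard against dividing by zero.

```shell
# Hypothetical excerpt of bracketed entries, including a non-numeric one
entries='[1454k]
[various sizes]
[30k]'

# The post's fix: sed deletes lines left empty after stripping non-digits
fixed=$(printf '%s\n' "$entries" | sed 's/[^0-9]//g;/./!d' | awk '{s+=$1}END{print s/NR}')

# Variation (an assumption, not the author's method): awk skips empty
# lines itself (NF is 0 for them) and keeps its own counter n
alt=$(printf '%s\n' "$entries" | sed 's/[^0-9]//g' | awk 'NF{s+=$1;n++}END{if(n)print s/n}')

echo "$fixed $alt"
```

Both approaches average only the two real sizes and print 742.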

Do you have a faster/leaner/tinier method? Or perhaps another interesting one-liner? Please share the goodness!

/nix | Mar 28, 2009

