Expedient Set Operations in Unix Shell

Set theory rocks. It especially rocks for programmers, because set theory and computers have a special relationship. You see, if you can describe an answer you want in set-theoretic terms, using a programming language with support for sets, the computer can execute your description and give you an answer.

Compare this with what we normally need to do to get an answer from the computer: describe, with conditionals and jumps and counters, exactly the steps the computer must perform in order to answer our question.

Fortunately, most modern managed programming languages include a set data structure and set operations. The Unix shell does not, but Peteris Krumins

Here, we use Python's support for sets to find the students who are both on the Honor Roll and in the Drama Club (aka \(honorroll \cap dramaclub\)):

honorroll = set(['Bob', 'Joe', 'Sally'])
dramaclub = set(['Bob', 'Sally'])
print honorroll.intersection(dramaclub)
honorroll = ['Bob', 'Joe', 'Sally']
dramaclub = ['Bob', 'Sally']

def intersection(xs, ys):
    l, s = [xs, ys] if len(xs) > len(ys) else [ys, xs]
    result = []
    for x in l:
        for y in s:
            if x == y:
                result.append(x)
    return result

print intersection(honorroll, dramaclub)

For getting answers from the computer clearly and quickly, without many opportunities for us to introduce bugs, set-theoretic

Instead of describing how to get answers from the computer at a low level, with conditionals, counters, and jumps, we can do it at a really convenient level — set theory —

Most programming languages these days have a set data structure and operations to go with it.

Usually we need to instruct computers to give us answers at a low level: the level of conditionals, jumps, and counters.

The language of sets is a convenient middle-ground between us and the computer. We can state problems intuitively without resorting to the language of the computer: that of conditionals, jumps, and counters.

No conditionals, jumps, or counters needed.

You see, set theory gives us a vocabulary we can use to describe many kinds of computing problems in very clear ways. Beyond having an awesome way to describe our problems, we have an awesome way to get the answers we need. That's because computers are awesome at performing set operations.

State the problem in terms of set and set operations, and the computer evaluates the problem and gives us our answer. Radical, no?

Many problems can be solved easily by expressing them in set-theoretic terms. Even in humble Unix shell.

It's useful to compare files by their contents in shell. A typical question might be: what lines are in file A, but not in file B? Maybe A and B contain IP addresses, one per line, and we want to know which IPs appear in B but not in A, because we work at an ISP and we're tracking down people who illegally downloaded the latest Taylor Swift album or some other redonk thing.

We can think about the question in terms of set theory. The cool set theory way to pose the question is: what's the set of IP addresses that are in A but not in B?

There's a particular set-theoretic operation — relative complement — to express exactly this relationship. It's also known as difference, and it's written like this: \(B \setminus A\).

$$B \cap A$$

The awesome part about expressing answers we want in set-theoretic terms is that computers can easily perform these operations. It's almost like our questions become executable. Even in the humble Unix shell!

Peteris Krummins wrote a comprehensive article that demonstrates many set operations that can be performed in shell using various programs. It's a terrific article, but I found it hard to remember everything well enough to be able to use in a pinch.

Below, I demonstrate only union, intersection, and relative complement (a.k.a. difference), using only cat, sort, and uniq. My hope is that you can completely internalize the examples, and that you'll never need to refer to this article again.

The examples assume the following two files, A and B:

A contains:

a
b
c

B contains:

b
c
d

Sets in Shell

Commands you need:

  • cat
  • sort
  • uniq

Union

In either A or B

Intersection

In both A and B

Difference

In A but not in B

The shell It's frequently useful to compare files to one another. In many cases, the comparisons

Peteris Krummins wrote an excellent and comprehensive article

Like the article?

Get notified of future blog posts. Don't worry - we won't make it hard to get to inbox zero: no more than 2 e-mails a month. We promise.