Sep 17 2010

Unix/Linux Sort Multiple Columns, Tab Delimited and Reverse Sort Order

Published by michael at 8:57 am under Unix/Linux

Sorting a tab delimited file using the Unix sort command is easy once you which parameters to use. An advanced file sort can get difficult if it has multiple columns, uses tab characters as the column separator, you want to reverse the sort order on some columns, and where you want the columns sorted in non-sequential order.

Assume that we have the following file where each column is separated by a [TAB] character:

Group-ID   Category-ID   Text        Frequency
----------------------------------------------
200        1000          oranges     10
200        900           bananas     5
200        1000          pears       8
200        1000          lemons      10
200        900           figs        4
190        700           grapes      17

I’d like to have this file sorted by these columns and in this specific order (note that column 4 is sorted before column 3 and that column 4 is sorted in reverse order):

  • Group ID (integer)
  • Category ID (integer)
  • Frequency “sorted in reverse order” (integer)
  • Text (alpha-numeric)

This should sort the file into this format:

Group-ID   Category-ID   Text        Frequency
----------------------------------------------
190        700           grapes      17
200        900           bananas     5
200        900           figs        4
200        1000          lemons      10
200        1000          oranges     10
200        1000          pears       8

The quick answer is that these sort arguments would solve the problem:

sort -t $'\t' -k 1n,1 -k 2n,2 -k4rn,4 -k3,3 <my-file>

A description of what it all means please read on. The first thing we need to do is to tell sort to use TAB as a column separator (column separated or delimited) which we can do using:

sort -t $'\t' <my-file>

If our input file was comma separated we could have used:

sort -t "," <my-file>

The next step is define that we want the file sorted by columns 1, 2, 4 and 3 and in this particular order. The key argument “-k” allows us to do this. The tricky part is that you have to define the column index twice to limit the sort to any given column, e.g. like this “-k 1,1″. If you only specify it once like this “-k 1″ you’re telling Unix “sort” to sort the file from column 1 and until the end of the line which is not what we want. If you want to sort column 1 and 2 together you’d use “-k 1,2″.  To tell sort to sort multiple columns we have to define the key argument “-k” multiple times. The sort arguments required to sort our file in column order 1, 2, 4 and 3 will therefore look like this:

sort -t $'\t' -k 1,1 -k 2,2 -k 4,4 -k 3,3 <my-file>

We however want the 4th column sorted in reverse order which we can instruct sort to do by changing the argument from “-k 4,4″ to “-k 4r,4″. The “r” option reverses the sort order that column only. There’s only one problem left to solve and that is that sort by default will interpret numbers as text and will sort e.g.  the number 10 ahead of 2. We solve this by adding the “n” option to tell “sort” to sort a column using its numerical values e.g. “-k 1n,1″. Note that the “n” option is only attached to the first number to the left of the comma. Since the 4th column is sorted in both reversed order and using numerical values we can combine the options like this “-k 4rn,4″

So by adding all of these options together with end up with:

sort -t $'\t' -k 1n,1 -k 2n,2 -k 4rn,4 -k 3,3 <my-file>

I hope someone will find this useful. I tested this solution on both Linux and OS X. The documentation for the Unix sort command can be found using your man command “man sort” and “info sort”.

Share and Enjoy:
  • del.icio.us
  • Google Bookmarks
  • Technorati
  • Twitter
  • StumbleUpon

Related posts:

  1. How to automatically provide an answer to Unix commands How do you automatically provide an answer to a Unix...

4 responses so far

4 Responses to “Unix/Linux Sort Multiple Columns, Tab Delimited and Reverse Sort Order”

  1. thusharaon 16 Mar 2011 at 4:41 pm

    try sorting this by 2nd column:

    dora,4,0.9,5
    alice,3,5,0
    george,5,0,0
    alan,20,3,0

    with command:

    cat | sort -n -k 2 -t ,

    the 0.9 seems to confuse sort

  2. michaelon 16 Mar 2011 at 11:33 pm

    This will solve your sort problem

    sort -t , -k 2n,2 file-to-be-sorted.txt

    You need “-k 2,2″ to tell sort to only consider the second column. The “n” in “-k 2n,2″ tells the sort command to sort the column numerically. If you omit the “n” 20 gets sorted before 3.

    I did some testing and it seems that when you sort multiple numerical columns the numbers gets combined. When you specify “-k 2″ you tell sort to use all columns from column two until the end of the line. You also specified comma to be the column separator. When you do this then I think sort see the data this way:

    dora,4,0.9,5
    alice,3,5,0
    george,5,0,0
    alan,20,3,0

    becomes

    dora,40.95
    alice,350
    george,500
    alan,2030

    Which is in order. I did a test with different numbers and still got the same behavior.

  3. thusharaon 17 Mar 2011 at 4:29 pm

    I found a case where even LC_ALL=C does not help…

    echo -e “alan\t20\t3\t0\ngeorge\t5\t0\t0\nalice\t3\t5\t0\ndora\t4\t9\t5″ | LC_ALL=C sort -n -k 2 -t \t

    alan 20 3 0
    alice 3 5 0
    dora 4 9 5
    george 5 0 0

  4. shubhaon 09 May 2011 at 10:12 pm

    This was a great example. I really got what information I was looking for especially with sort -k options.

    Thanks,

Trackback URI | Comments RSS

Leave a Reply