Unix/Linux Sort Multiple Columns, Tab Delimited and Reverse Sort Order
Sorting a tab delimited file using the Unix sort command is easy once you know which parameters to use. An advanced file sort can get difficult to define if it has multiple columns, uses tab characters as column separators, uses reverse sort order on some columns, and where you want the columns sorted in non-sequential order.
Assume that we have the following file where each column is separated by a [TAB] character:
Group-ID Category-ID Text Frequency ---------------------------------------------- 200 1000 oranges 10 200 900 bananas 5 200 1000 pears 8 200 1000 lemons 10 200 900 figs 4 190 700 grapes 17
I’d like to have this file sorted by these columns and in this specific order. I want column 4 sorted before column 3, and column 4 to be sorted in reverse order:
- Group ID (integer)
- Category ID (integer)
- Frequency “sorted in reverse order” (integer)
- Text (alpha-numeric)
I want the file sorted this way:
Group-ID Category-ID Text Frequency ---------------------------------------------- 190 700 grapes 17 200 900 bananas 5 200 900 figs 4 200 1000 lemons 10 200 1000 oranges 10 200 1000 pears 8
To sort the file that way we have to define the sort parameters like this:
sort -t $'\t' -k 1n,1 -k 2n,2 -k4rn,4 -k3,3 <my-file>
The first thing we need to do is to tell sort to use TAB as a column separator (column separated or delimited) which we can do using:
sort -t $'\t' <my-file>
If our input file was comma separated we could have used:
sort -t "," <my-file>
The next step is define that we want the file sorted by columns 1, 2, 4 and 3 and in this particular order. The key argument “-k” allows us to do this. The tricky part is that you have to define the column index twice to limit the sort to any given column, e.g. like this “-k 1,1”. If you only specify it once like this “-k 1” you’re telling Unix “sort” to sort the file from column 1 and until the end of the line which is not what we want. If you want to sort column 1 and 2 together you’d use “-k 1,2”. To tell sort to sort multiple columns we have to define the key argument “-k” multiple times. The sort arguments required to sort our file in column order 1, 2, 4 and 3 will therefore look like this:
sort -t $'\t' -k 1,1 -k 2,2 -k 4,4 -k 3,3 <my-file>
We however want the 4th column sorted in reverse order. We instruct sort to do by changing the argument from “-k 4,4” to “-k 4r,4”. The “r” option reverses the sort order for that column only. There’s only one problem left to solve and that is that sort by default will interpret numbers as text and will sort e.g. the number 10 ahead of 2. We solve this by adding the “n” option to tell “sort” to sort a column using its numerical values e.g. “-k 1n,1”. Note that the “n” option is only attached to the first number to the left of the comma. Since the 4th column is sorted in both reversed order and using numerical values we can combine the options like this “-k 4rn,4”
So by adding all of these options together with end up with:
sort -t $'\t' -k 1n,1 -k 2n,2 -k 4rn,4 -k 3,3 <my-file>
I hope someone will find this useful. I tested this solution on both Linux and OS X. The documentation for the Unix sort command can be found using your man command “man sort” and “info sort”.
Comments
17 Comments
try sorting this by 2nd column:
dora,4,0.9,5
alice,3,5,0
george,5,0,0
alan,20,3,0
with command:
cat | sort -n -k 2 -t ,
the 0.9 seems to confuse sort
This will solve your sort problem
sort -t , -k 2n,2 file-to-be-sorted.txt
You need “-k 2,2” to tell sort to only consider the second column. The “n” in “-k 2n,2” tells the sort command to sort the column numerically. If you omit the “n” 20 gets sorted before 3.
I did some testing and it seems that when you sort multiple numerical columns the numbers gets combined. When you specify “-k 2” you tell sort to use all columns from column two until the end of the line. You also specified comma to be the column separator. When you do this then I think sort see the data this way:
dora,4,0.9,5
alice,3,5,0
george,5,0,0
alan,20,3,0
becomes
dora,40.95
alice,350
george,500
alan,2030
Which is in order. I did a test with different numbers and still got the same behavior.
I found a case where even LC_ALL=C does not help…
echo -e “alant20t3t0ngeorget5t0t0nalicet3t5t0ndorat4t9t5” | LC_ALL=C sort -n -k 2 -t t
alan 20 3 0
alice 3 5 0
dora 4 9 5
george 5 0 0
This was a great example. I really got what information I was looking for especially with sort -k options.
Thanks,
great! well explained!
Thanks ! That was helpful !
I did something like this,
sort -t $’t’ -n -k 3nr,3 -k1,1 -k2,2 STop_nonX > STop_nonX_sorted
3rd column is descending and then sort by first two columns ascending.
Clearly explained by taking a very good example..! Well done!
I have googled the SORT feature so much, especially on multi-column sorting. This has got to be the BEST tutorial — thoroughly explained, step-by-step, using examples — I’ve seen on it!! I truly appreciate it!!
In a tab-delimited file containing four columns, how would you sort the first column alphabetically ascending, and then second column by date descending?
Thanks again.
As far as I know the Unix sort command does not have an option for date sorting. However if you were to use a date format that looks like this YYYY-MM-DD or YYYYMMDD then you could do it like this:
sort -t $'\t' -k1,1 -kr2,2
Simple and well explained..! Thanks
Very well explained. I understood completely. Was checking for this from 1 day.
Thanks a lot.
I like the detailing you have put in there to explain parameters. Job Well done.
What is I want to sort using a column which is greater that 10? Say I want to sort my file with the 11th column,What will the command look like?
If tab separated columns it’s:
sort -t $'\t' -k 11,11
[…] But, what if your data has millions of rows of data? One solution is to set the -Xmx JVM flag to a big enough value so that all of your data fits into memory. Another solution is to use command line tools to sort the data for you e.g. the Unix/Linux sort command. I documented in an earlier post the various flags that gives you fine grained control over how the data can be sorted using the sort command. […]
Simple and well explained..!
Now, I want to sort my file (file A) based on a given order in the file B for the first column of the file A. How can I do that? any hints? Thanks!
File A:
3 3 0 5 5.416666667
19 1 0 5 5.67
31 3 0 5 5.833333333
104 14 0 5 5.5
37 18 0 5 5.333333333
111 12 0 6 5.75
7311 22 0 6 5.85
7309 42 0 6 5.85
File B:(order of column 1 in file A)
3
7309
19
31
37
104
7311
111
Thank you so much for the detailed explanation. All UNIX systems should have this as their explanation.
Leave a Comment