Jan 10 2014
How do you sort a lot of data in Java? It’s easy to sort small amounts of data in memory. For example if your data is in a List object, then with the help of the Collections class you can easily sort the data. You can provide a custom Comparator to control in what order your data is sorted.
But, what if your data has millions of rows of data? One solution is to set the -Xmx JVM flag to a big enough value so that all of your data fits into memory. Another solution is to use command line tools to sort the data for you e.g. the Unix/Linux sort command. I documented in an earlier post the various flags that gives you fine grained control over how the data can be sorted using the sort command.
You can also call the Unix sort command from your Java application using the Java API Runtime class. This will only work if your Java application is running on a machine that has the “sort” command available on the command line. You may also run into issues where different machines use different options to control how sorting is performed. This solution is not very portable.
I found myself in a situation where I needed to sort many millions of rows of data, and I wanted a portable solution that would work on both Windows, OS X and Linux machines. So I wrote a small library that performs an external file sort. To use it in your own application you can either add a build dependency, or add the JAR file to your class path. In this example I’m going to assume that you use Maven, but various other dependency types are supported and you can find them in the Maven repository.
Add a dependency on “data-util” to your pom.xml
<dependency> <groupId>com.btaz.util</groupId> <artifactId>data-util</artifactId> <version>0.3.11</version> </dependency>
Then in your class import the SortController class
Here’s the actual sort method:
SortController.sortFile(File sortDir, File inputFile, File outputFile, Comparator<String> comparator, boolean skipHeader, long maxBytes, int mergeFactor)
If the file you’re sorting is really big then in order to sort it first has to be split into smaller files that can be sorted independently. This method takes care of all the details for you. Here is a quick explanation what all of the parameters are used for:
- sortDir – this is a work directory for temporary files that the sort command uses
- inputFile – the file you want to sort
- outputFile – the sorted output file
- comparator – is used to determine how to sort the data. This is the standard Java API comparator. For a quick test you can use the built-in comparator “Lexical.ascending()”
- skipHeader – if set to true the sorting algorithm will assume that the first row in the input file is a header row and skip it
- maxBytes – to control how much memory this method is allowed to use. If set to 40,000 then the sort algorithm will attempt to not use more than 40 MB of memory per sort. The more memory you give the sort algorithm the faster it will run.
- mergeFactor – really big input files combined with a low maxBytes value will lead to many small files in the sortDir directory. Once all of them have been sorted they’re merged into the final output file. However, in order to minimize how many open file handles are needed you can specify the merge factor. A number in the 4-12 range is pretty decent for files that has 1-10 million rows of data.
The library is released under the MIT license which puts very few restrictions on what you can do with the source code. You can find it on GitHub here: https://github.com/btaz/data-util
If you want to control sorting to an even higher degree you will find that the underlying algorithm has it’s own APIs. The file split, file sort and file merge operations can be accessed directly for alternate methods of sorting your data. For example let’s say that you have a process that output many files where each one already is sorted e.g. HTTP access logs that are already sorted by timestamp, but that you want to combine them into a single file. If this is the case you can invoke the merge method directly since it knows how to merge pre-sorted files into a single output file.