Jan 10 2014

External File Sort In Java

Published by under data-util,Java

[Image: merge sort animation]

How do you sort a lot of data in Java? It’s easy to sort small amounts of data in memory. For example, if your data is in a List object, you can sort it with the help of the Collections class, and you can provide a custom Comparator to control the order in which the data is sorted.
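As a minimal sketch of the in-memory case, the class below sorts a copy of a small list, first in natural order and then with a custom Comparator. The class name InMemorySortExample and the sample rows are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class InMemorySortExample {

    // Returns a sorted copy so the caller's list is left untouched.
    static List<String> sortedCopy(List<String> rows, Comparator<String> order) {
        List<String> copy = new ArrayList<>(rows);
        Collections.sort(copy, order);
        return copy;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("pears", "Figs", "bananas");

        // Natural String order compares character codes, so uppercase sorts first.
        System.out.println(sortedCopy(rows, Comparator.naturalOrder()));
        // prints [Figs, bananas, pears]

        // A custom Comparator changes the order: here it ignores case.
        System.out.println(sortedCopy(rows, String.CASE_INSENSITIVE_ORDER));
        // prints [bananas, Figs, pears]
    }
}
```

This works well as long as the whole list fits on the heap, which is exactly the limitation the rest of this post is about.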

But what if your data has millions of rows? One solution is to set the -Xmx JVM flag to a value big enough that all of your data fits into memory. Another solution is to let a command line tool sort the data for you, e.g. the Unix/Linux sort command. In an earlier post I documented the various flags that give you fine-grained control over how the sort command sorts the data.

You can also call the Unix sort command from your Java application using the Java API Runtime class. This will only work if your Java application is running on a machine that has the “sort” command available on the command line. You may also run into issues where different machines use different options to control how sorting is performed. This solution is not very portable.
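For reference, here is a rough sketch of what shelling out to sort could look like. It uses ProcessBuilder rather than the Runtime class mentioned above, because it makes the argument list and environment easier to control. The class name UnixSortRunner is made up, and the sketch assumes a POSIX-style sort (with the -o option) is on the PATH, which is precisely the portability problem.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class UnixSortRunner {

    // Builds the argument list for: sort -o <output> <input>
    static List<String> sortCommand(Path input, Path output) {
        return Arrays.asList("sort", "-o", output.toString(), input.toString());
    }

    // Runs the external sort and fails loudly on a non-zero exit code.
    static void sortFile(Path input, Path output) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(sortCommand(input, output));
        // Pin the locale so the ordering does not vary between machines.
        builder.environment().put("LC_ALL", "C");
        // Show any error output from sort on our own console.
        builder.inheritIO();
        int exitCode = builder.start().waitFor();
        if (exitCode != 0) {
            throw new IOException("sort exited with code " + exitCode);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: UnixSortRunner <input> <output>");
            return;
        }
        sortFile(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

On a machine without a sort command on the PATH, starting the process throws an IOException.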

I found myself in a situation where I needed to sort many millions of rows of data, and I wanted a portable solution that would work on Windows, OS X and Linux machines. So I wrote a small library that performs an external file sort. To use it in your own application you can either add a build dependency or add the JAR file to your class path. In this example I’m going to assume that you use Maven, but various other dependency types are supported and you can find them in the Maven repository.

Add a dependency on “data-util” to your pom.xml

<dependency>
    <groupId>com.btaz.util</groupId>
    <artifactId>data-util</artifactId>
    <version>0.3.11</version>
</dependency>

Then in your class import the SortController class

import com.btaz.util.files.SortController;

Here’s the actual sort method:

SortController.sortFile(File sortDir, File inputFile, File outputFile, Comparator<String> comparator,
 boolean skipHeader, long maxBytes, int mergeFactor)

If the file you’re sorting is really big, it first has to be split into smaller files that can be sorted independently. This method takes care of all the details for you. Here is a quick explanation of what each parameter is used for:

  • sortDir – this is a work directory for temporary files that the sort command uses
  • inputFile – the file you want to sort
  • outputFile – the sorted output file
  • comparator – is used to determine how to sort the data. This is the standard Java API comparator. For a quick test you can use the built-in comparator “Lexical.ascending()”
  • skipHeader – if set to true the sorting algorithm will assume that the first row in the input file is a header row and skip it
  • maxBytes – controls how much memory this method is allowed to use. As the name suggests, the value is given in bytes, so if it is set to 40,000,000 the sort algorithm will attempt to use no more than about 40 MB of memory per sort. The more memory you give the sort algorithm, the faster it will run.
  • mergeFactor – really big input files combined with a low maxBytes value will lead to many small files in the sortDir directory. Once all of them have been sorted they’re merged into the final output file. To limit how many file handles are open at the same time you can specify the merge factor. A number in the 4-12 range is pretty decent for files that have 1-10 million rows of data.
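To make the split step more concrete, here is a simplified sketch of how an input file can be cut into sorted chunk files in a work directory. This is not the data-util implementation: the class ChunkSplitter, its method names and the maxChars budget (a crude stand-in for maxBytes that simply counts characters) are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ChunkSplitter {

    // Reads the input line by line and writes a sorted chunk file to workDir
    // each time the buffered lines reach roughly maxChars characters.
    static List<Path> splitIntoSortedChunks(Path input, Path workDir,
            Comparator<String> order, long maxChars) throws IOException {
        List<Path> chunks = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        long buffered = 0;
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.add(line);
                buffered += line.length();
                if (buffered >= maxChars) {
                    chunks.add(writeSortedChunk(buffer, workDir, order));
                    buffer.clear();
                    buffered = 0;
                }
            }
        }
        // Flush whatever is left over as the last, usually smaller, chunk.
        if (!buffer.isEmpty()) {
            chunks.add(writeSortedChunk(buffer, workDir, order));
        }
        return chunks;
    }

    // Sorts one buffer in memory and writes it to a new temporary file.
    private static Path writeSortedChunk(List<String> lines, Path workDir,
            Comparator<String> order) throws IOException {
        lines.sort(order);
        Path chunk = Files.createTempFile(workDir, "chunk-", ".txt");
        Files.write(chunk, lines, StandardCharsets.UTF_8);
        return chunk;
    }
}
```

A smaller budget produces more chunk files, which is why a low maxBytes value makes the merge factor matter: the merge phase then has more files to combine.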

The library is released under the MIT license which puts very few restrictions on what you can do with the source code. You can find it on GitHub here: https://github.com/btaz/data-util

If you want to control sorting to an even higher degree you will find that the underlying algorithm has its own APIs. The file split, file sort and file merge operations can be accessed directly for alternate methods of sorting your data. For example, let’s say that you have a process that outputs many files, each of which is already sorted, e.g. HTTP access logs that are already sorted by timestamp, and that you want to combine them into a single file. In that case you can invoke the merge method directly, since it knows how to merge pre-sorted files into a single output file.
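The idea behind merging pre-sorted files can be sketched with a priority queue that always hands back the smallest current line across all inputs. This is an illustration of the general k-way merge technique, not the library's merge API; the class SortedFileMerger and its method are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class SortedFileMerger {

    // Pairs a reader with the line it is currently positioned on.
    private static final class Cursor {
        final BufferedReader reader;
        String line;

        Cursor(BufferedReader reader) {
            this.reader = reader;
        }

        // Moves to the next line; returns false once the file is exhausted.
        boolean advance() throws IOException {
            line = reader.readLine();
            return line != null;
        }
    }

    // Merges inputs that are each already sorted by `order` into one output file.
    static void merge(List<Path> sortedInputs, Path output, Comparator<String> order)
            throws IOException {
        Comparator<Cursor> byCurrentLine = (a, b) -> order.compare(a.line, b.line);
        PriorityQueue<Cursor> heap = new PriorityQueue<>(byCurrentLine);
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path input : sortedInputs) {
                BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
                readers.add(reader);
                Cursor cursor = new Cursor(reader);
                if (cursor.advance()) {
                    heap.add(cursor);
                }
            }
            while (!heap.isEmpty()) {
                // The head of the heap holds the smallest line not yet written.
                Cursor smallest = heap.poll();
                writer.write(smallest.line);
                writer.newLine();
                if (smallest.advance()) {
                    heap.add(smallest);
                }
            }
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
        }
    }
}
```

Note that this sketch opens every input at once. Capping the number of open files, which is what a merge factor is for, would mean merging the inputs in groups over several passes.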

Happy sorting!

Sorting Image by: © CobaltBlue / http://en.wikipedia.org/wiki/File:Merge_sort_animation2.gif / CC-BY-SA-3.0


Feb 09 2011

Los Angeles Hadoop Users Group (LA-HUG)

Published by under Hadoop

LA now has its own HUG. The first meetup will be held on Wednesday 2/9/2011. This is a great opportunity for anyone in the Los Angeles area with an interest in Hadoop and related technologies to meet and talk.

The first talk is: “Operationalizing Hadoop” with Charles Zedlewski (Cloudera’s VP Product)

http://www.meetup.com/LA-HUG/


Nov 16 2010

fatal: The remote end hung up unexpectedly

Published by under Miscellaneous

I committed changes to my Git project, tried to push them to the remote server (git push) and got the following cryptic error message:

fatal: The remote end hung up unexpectedly

The GitFaq states that:

Git push fails with “fatal: The remote end hung up unexpectedly”?
If, when attempting git push, you get a message that says:
fatal: The remote end hung up unexpectedly

There are a couple of reasons for that, but the most common is that authorization failed. You might be using a git:// URL to push, which has no authorization whatsoever and hence has write access disabled by default. Or you might be using an ssh URL, but either your public key was not installed correctly, or your account does not have write access to that repository/branch.

I used “git config --list” to review my project configuration:

core.repositoryformatversion=0
core.filemode=true
core.bare=false
core.logallrefupdates=true
core.ignorecase=true
remote.origin.fetch=+refs/heads/*:refs/remotes/origin/*
remote.origin.url=git://git.some-domain.com/my-project
branch.master.remote=origin
branch.master.merge=refs/heads/master
branch.v2.remote=origin
branch.v2.merge=refs/heads/v2
gui.geometry=1374x727+34+82 301 201

The culprit here is the “remote.origin.url” property: it’s pointing to a read-only repository. We can change this using “git config --edit”. We want to change:

remote.origin.url=git://git.some-domain.com/my-project

to

remote.origin.url=git@git.some-domain.com:/my-project

Please note that the user “git” is specific to my setup, and your Git setup/configuration is most likely different. After this change I was able to successfully push my changes to the remote server.

The reason I ended up with the read-only version of the project was that I used “git clone git://git.some-domain.com/my-project” when I should have used “git clone git@git.some-domain.com:/my-project”.


Oct 17 2010

Hadoop World

Published by under Miscellaneous

[Image: Hadoop logo]

I just came back from the Hadoop World conference in New York and I have to say that it was quite exciting. Processing huge amounts of data used to be a problem for just a few companies like Google, Yahoo and Facebook, but it has now become a problem for many. The conference topics were interesting and the training held by Cloudera was really good. My personal recommendation is to get up to speed quickly on Hadoop and related technologies, e.g. HBase, Hive and Pig, since I think that ever-growing data sizes will soon make these tools commonplace. It takes time to learn how to think at scale and to use these tools properly. I’ve now seen how “big data” has grown to such sizes that not even big clustered databases like Oracle RAC can process and extract information quickly enough for our needs. Hadoop is not a universal tool for big data problems, but for a certain set of problems it’s quite powerful and provides almost linear performance as you scale up your compute cluster. Cloudera has excellent videos on Hadoop to get you started here: http://www.cloudera.com/resources/?media=Video. Tom White’s “Hadoop: The Definitive Guide (2nd Edition)” is excellent and I can highly recommend it.


Sep 17 2010

Unix/Linux Sort Multiple Columns, Tab Delimited and Reverse Sort Order

Published by under Unix/Linux

Sorting a tab delimited file using the Unix sort command is easy once you know which parameters to use. An advanced file sort can get difficult to define if the file has multiple columns, uses tab characters as column separators, needs reverse sort order on some columns, and has to be sorted by columns in non-sequential order.

Assume that we have the following file where each column is separated by a [TAB] character:

Group-ID   Category-ID   Text        Frequency
----------------------------------------------
200        1000          oranges     10
200        900           bananas     5
200        1000          pears       8
200        1000          lemons      10
200        900           figs        4
190        700           grapes      17

I’d like to have this file sorted by these columns and in this specific order. I want column 4 sorted before column 3, and column 4 to be sorted in reverse order:

  • Group ID (integer)
  • Category ID (integer)
  • Frequency “sorted in reverse order” (integer)
  • Text (alpha-numeric)

I want the file sorted this way:

Group-ID   Category-ID   Text        Frequency
----------------------------------------------
190        700           grapes      17
200        900           bananas     5
200        900           figs        4
200        1000          lemons      10
200        1000          oranges     10
200        1000          pears       8

To sort the file that way we have to define the sort parameters like this:

sort -t $'\t' -k 1n,1 -k 2n,2 -k 4rn,4 -k 3,3 <my-file>

The first thing we need to do is to tell sort to use TAB as the column separator (delimiter), which we can do using:

sort -t $'\t' <my-file>

If our input file was comma separated we could have used:

sort -t "," <my-file>

The next step is to define that we want the file sorted by columns 1, 2, 4 and 3, in this particular order. The key argument “-k” allows us to do this. The tricky part is that you have to define the column index twice to limit the sort to any given column, e.g. like this “-k 1,1”. If you only specify it once like this “-k 1” you’re telling Unix “sort” to sort the file from column 1 until the end of the line, which is not what we want. If you want to sort column 1 and 2 together you’d use “-k 1,2”. To tell sort to sort multiple columns we have to define the key argument “-k” multiple times. The sort arguments required to sort our file in column order 1, 2, 4 and 3 will therefore look like this:

sort -t $'\t' -k 1,1 -k 2,2 -k 4,4 -k 3,3 <my-file>

However, we want the 4th column sorted in reverse order. We instruct sort to do so by changing the argument from “-k 4,4” to “-k 4r,4”. The “r” option reverses the sort order for that column only. There’s only one problem left to solve: by default sort interprets numbers as text and will sort e.g. the number 10 ahead of 2. We solve this by adding the “n” option to tell “sort” to sort a column using its numerical values, e.g. “-k 1n,1”. Note that the “n” option is only attached to the first number, to the left of the comma. Since the 4th column is sorted both in reverse order and using numerical values we can combine the options like this “-k 4rn,4”.

So by adding all of these options together we end up with:

sort -t $'\t' -k 1n,1 -k 2n,2 -k 4rn,4 -k 3,3 <my-file>

I hope someone will find this useful. I tested this solution on both Linux and OS X. The documentation for the Unix sort command can be found using your man command “man sort” and “info sort”.
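If you need the same ordering inside a Java program, for example as a custom Comparator for in-memory or external sorting, the sort keys translate fairly directly into a chained Comparator. The sketch below is an approximation under a few assumptions: every row has four tab-separated columns, columns 1, 2 and 4 parse as integers, and the text column is compared with plain String ordering rather than locale collation. The class name TabDelimitedComparator is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class TabDelimitedComparator {

    // Parses column `index` (0-based) of a tab-delimited row as an integer.
    private static int intColumn(String row, int index) {
        return Integer.parseInt(row.split("\t")[index].trim());
    }

    // Returns column `index` (0-based) of a tab-delimited row as text.
    private static String textColumn(String row, int index) {
        return row.split("\t")[index];
    }

    // Mirrors: sort -t $'\t' -k 1n,1 -k 2n,2 -k 4rn,4 -k 3,3
    static final Comparator<String> ORDER =
            Comparator.<String>comparingInt(row -> intColumn(row, 0))
                    .thenComparingInt(row -> intColumn(row, 1))
                    .thenComparing(
                            Comparator.<String>comparingInt(row -> intColumn(row, 3)).reversed())
                    .thenComparing(row -> textColumn(row, 2));

    // Returns a sorted copy of the data rows (header rows must be removed first).
    static List<String> sortRows(List<String> rows) {
        List<String> copy = new ArrayList<>(rows);
        copy.sort(ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "200\t1000\toranges\t10",
                "200\t900\tbananas\t5",
                "200\t1000\tpears\t8",
                "200\t1000\tlemons\t10",
                "200\t900\tfigs\t4",
                "190\t700\tgrapes\t17");
        for (String row : sortRows(rows)) {
            System.out.println(row);
        }
    }
}
```

Only the frequency key is wrapped in reversed(), which plays the same role as attaching “r” to a single “-k” argument. Splitting each row on every comparison is wasteful for large inputs, so parsing rows once up front would be a sensible refinement.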


Jul 16 2010

ls full path

Published by under Unix/Linux

How do you get the Unix command ls to show you the full path? Unfortunately there’s no argument for ls that will do this directly.
However this will work fine and give you what you want.

ls -d "$PWD"/*

or

ls -ld "$PWD"/*
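If you need the same listing from inside a Java program rather than from the shell, a rough equivalent is to list a directory and convert each entry to an absolute path. The class name FullPathLister is made up for this sketch. Unlike the shell glob, Files.list also returns hidden files whose names start with a dot.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class FullPathLister {

    // Returns the absolute path of every entry directly inside dir, sorted.
    static List<Path> listFullPaths(Path dir) throws IOException {
        // Files.list holds an open directory handle, so close the stream when done.
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.map(Path::toAbsolutePath)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // An empty path refers to the current working directory, like $PWD.
        for (Path path : listFullPaths(Paths.get(""))) {
            System.out.println(path);
        }
    }
}
```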


Jul 07 2010

Window Stuck Under Toolbar in OS X

Published by under OS X / Apple OS

Sometimes a window can get stuck under the top toolbar in OS X. This often happens when I use Citrix in OS X to run Windows applications. When this happens it’s not possible to grab the window or to close it. A simple solution is to press [fn] [shift] [F2], which moves the application window a bit so that you can grab it.


Apr 14 2010

fatal: git checkout: updating paths is incompatible with switching branches.

Published by under GIT

Using Git I tried to pull down a new remote branch using:

git checkout --track -b my-branch-name origin/my-branch-name

When I did this I got this error message:

fatal: git checkout: updating paths is incompatible with switching branches.
Did you intend to checkout 'origin/my-branch-name' which can not be resolved as commit?

This error message was a tad confusing. The solution in my case was simple though. Git could not resolve “origin/my-branch-name” because my local repository had not yet fetched the new branch from the remote, and a pull fetches from the remote first, so performing:

git pull

resolved the issue and after this I was able to successfully pull down the remote branch.


Mar 04 2010

DbVisualizer auto commit problem

Published by under JDBC

I had some issues with DbVisualizer and auto commit. I wanted to be able to turn it off from the SQL commander. The official documentation states that you can do this using:

The Auto Commit setting is enabled by default and can be adjusted in the Connection Properties. You may also adjust the auto commit state for the SQL editor you are using in the SQL Commander with the following command:

@set autocommit true/false

Unfortunately this didn’t work for me in either 6.5.12 or 7.04 (I’m using OS X and Java 6) against an Oracle 10g database. I got an error alert stating “/application/set autocommit false (No such file or directory)”.
I was finally able to figure out that you can get it to work using:

@set autocommit off/on

I’m not sure if this is a problem that only occurs on OS X.


Feb 04 2010

GIT Fatal You Have not Concluded Your Merge MERGE_HEAD Exists

Published by under GIT

fatal: You have not concluded your merge. (MERGE_HEAD exists)

I got this message when I performed a “git pull”. I searched for a solution to this problem on the Internet and it wasn’t until I found this post that I was able to resolve the issue. The problem was that I:

  1. Performed a “git pull”, the automatic merge failed, and I ended up with merge conflicts
  2. Resolved the merge conflicts and added the resolved files back using “git add”
  3. Performed a new “git pull” and got the “fatal: You have not concluded your merge. (MERGE_HEAD exists)” error

Apparently step 3 overrides MERGE_HEAD, starting a new merge with a dirty index. According to the post this is a common mistake made by programmers who are used to version control systems where the user follows an “update” and “commit” workflow.

So how do we resolve this issue? What worked for me was to follow the instructions for how to “Undo a merge or pull inside a dirty work tree” found here.

  1. I used “git reset --merge ORIG_HEAD”
  2. I resolved the merge conflicts again and added the resolved files back using “git add”
  3. I was then finally able to “push” my changes!

According to the documentation, running “git reset --hard ORIG_HEAD” will take you back to where you were before you tried to commit your changes, but you will lose your local changes. That is most likely not what you want. Using “git reset --merge” lets you keep your local changes. You will, however, have to re-resolve your conflicting merge files.

Some additional information on this topic can be found here.

