Java - Sort Strings like Windows Explorer
Asked Answered
G

5

11

I am trying to use code suggested by Sander Pham on another question. I need my java ArrayList of string names to be sorted like Windows Explorer does. His code worked for everything but for one issue. I would have liked to comment onto that question, but I need more reputation points to comment. Anyways... He suggested to use a custom comparator implemented class and use that to compare the string names. Here is the code of that class:

class IntuitiveStringComparator implements Comparator<String>
{
    private String str1, str2;
    private int pos1, pos2, len1, len2;

    public int compare(String s1, String s2)
    {
        str1 = s1;
        str2 = s2;
        len1 = str1.length();
        len2 = str2.length();
        pos1 = pos2 = 0;

        int result = 0;
        while (result == 0 && pos1 < len1 && pos2 < len2)
        {
            char ch1 = str1.charAt(pos1);
            char ch2 = str2.charAt(pos2);

            if (Character.isDigit(ch1))
            {
                result = Character.isDigit(ch2) ? compareNumbers() : -1;
            }
            else if (Character.isLetter(ch1))
            {
                result = Character.isLetter(ch2) ? compareOther(true) : 1;
            }
            else
            {
                result = Character.isDigit(ch2) ? 1
                : Character.isLetter(ch2) ? -1
                : compareOther(false);
            }

            pos1++;
            pos2++;
        }

        return result == 0 ? len1 - len2 : result;
    }

    private int compareNumbers()
    {
        // Find out where the digit sequence ends, save its length for
        // later use, then skip past any leading zeroes.
        int end1 = pos1 + 1;
        while (end1 < len1 && Character.isDigit(str1.charAt(end1)))
        {
            end1++;
        }
        int fullLen1 = end1 - pos1;
        while (pos1 < end1 && str1.charAt(pos1) == '0')
        {
            pos1++;
        }

        // Do the same for the second digit sequence.
        int end2 = pos2 + 1;
        while (end2 < len2 && Character.isDigit(str2.charAt(end2)))
        {
            end2++;
        }
        int fullLen2 = end2 - pos2;
        while (pos2 < end2 && str2.charAt(pos2) == '0')
        {
            pos2++;
        }

        // If the remaining subsequences have different lengths,
        // they can't be numerically equal.
        int delta = (end1 - pos1) - (end2 - pos2);
        if (delta != 0)
        {
            return delta;
        }

        // We're looking at two equal-length digit runs; a sequential
        // character comparison will yield correct results.
        while (pos1 < end1 && pos2 < end2)
        {
            delta = str1.charAt(pos1++) - str2.charAt(pos2++);
            if (delta != 0)
            {
                return delta;
            }
        }

        pos1--;
        pos2--;

        // They're numerically equal, but they may have different
        // numbers of leading zeroes. A final length check will tell.
        return fullLen2 - fullLen1;
    }

    private int compareOther(boolean isLetters)
    {
        char ch1 = str1.charAt(pos1);
        char ch2 = str2.charAt(pos2);

        if (ch1 == ch2)
        {
            return 0;
        }

        if (isLetters)
        {
            ch1 = Character.toUpperCase(ch1);
            ch2 = Character.toUpperCase(ch2);
            if (ch1 != ch2)
            {
                ch1 = Character.toLowerCase(ch1);
                ch2 = Character.toLowerCase(ch2);
            }
        }

        return ch1 - ch2;
    }   
}

In using this, it works great except for if the string name does not have a number after it. If it does not have a number, it is put at the end of the list, which is wrong. If it doesn't have a number, it should be at the beginning.

i.e.

filename.jpg
filename2.jpg
filename03.jpg
filename3.jpg

Currently it sorts that...

filename2.jpg
filename03.jpg
filename3.jpg
filename.jpg

What do I need to change in the code to correct this behavior?

Thanks

Geibel answered 21/4, 2014 at 20:2 Comment(3)
Is there a ruleset for this kind of sorting available? What if there are names like file5b7.jpg, what about other extensions? Is always the last number before the extension point special treated? Wouldnt it be much more simpler to split the filename in three parts name, number, ext and let the Comparator compare first name and on equality go for the number and then the ext. The number would be converted to int.Muslim
Right. The filename and number before the extension point is the point thats being sorted. Pretty much just needs to mimic exactly how Windows Explorer sorts. More examples would be... filename00.jpg filename0.jpg filename0b.jpg filename0b1.jpg filename0b02.jpg filename0c.jpg filename1.jpg I believe, in the current code given, that is the behavior. The only thing I noticed not working, is that if it doesn't have a number after the filename at all, it sorts after everything else, instead of before.Geibel
There will not be other extensions. So different extensions aren't really a concern.Geibel
M
20

This is my second try to answer this. I used http://www.interact-sw.co.uk/iangblog/2007/12/13/natural-sorting as a start. Unfortunatly I think I found there problems as well. But I think in my code these problems are correctly adressed.

Info: Windows Explorer uses the API function StrCmpLogicalW() function to do its sorting. There it is called natural sort order.

So here is my unterstanding of the WindowsExplorerSort - Algorithm:

  • Filenames are compared part wise. As for now I identified the following parts: numbers, '.', spaces and the rest.
  • Each number within the filename is considered for a possible number compare.
  • Numbers are compared as numbers but if they are equal, the longer base string comes first. This happens with leading zeros.
    • filename00.txt, filename0.txt
  • If one compares a number part with a non number part, it will be compared as text.
  • Text will be compared case insensitive.

This list is based partly on try and error. I increased the number of test filenames, to adress more of the in comments mentioned pitfalls and the result was checked against a Windows Explorer.

So here is the output of this:

filename
filename 00
filename 0
filename 01
filename.jpg
filename.txt
filename00.jpg
filename00a.jpg
filename00a.txt
filename0
filename0.jpg
filename0a.txt
filename0b.jpg
filename0b1.jpg
filename0b02.jpg
filename0c.jpg
filename01.0hjh45-test.txt
filename01.0hjh46
filename01.1hjh45.txt
filename01.hjh45.txt
Filename01.jpg
Filename1.jpg
filename2.hjh45.txt
filename2.jpg
filename03.jpg
filename3.jpg

The new comparator WindowsExplorerComparator splits the filename in the already mentioned parts and does a part wise comparing of two filenames. To be correct, the new comparator uses Strings as its input so one has to create an adaptor Comparator like

new Comparator<File>() {
    private final Comparator<String> NATURAL_SORT = new WindowsExplorerComparator();

    @Override
    public int compare(File o1, File o2) {;
        return NATURAL_SORT.compare(o1.getName(), o2.getName());
    }
}

So here is the new Comparators source code and its test:

import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WindowsSorter {

    public static void main(String args[]) {
        //huge test data set ;)
        List<File> filenames = Arrays.asList(new File[]{new File("Filename01.jpg"),
            new File("filename"), new File("filename0"), new File("filename 0"),
            new File("Filename1.jpg"), new File("filename.jpg"), new File("filename2.jpg"), 
            new File("filename03.jpg"), new File("filename3.jpg"), new File("filename00.jpg"),
            new File("filename0.jpg"), new File("filename0b.jpg"), new File("filename0b1.jpg"),
            new File("filename0b02.jpg"), new File("filename0c.jpg"), new File("filename00a.jpg"),
            new File("filename.txt"), new File("filename00a.txt"), new File("filename0a.txt"),
            new File("filename01.0hjh45-test.txt"), new File("filename01.0hjh46"),
            new File("filename2.hjh45.txt"), new File("filename01.1hjh45.txt"),
            new File("filename01.hjh45.txt"), new File("filename 01"),
            new File("filename 00")});

        //adaptor for comparing files
        Collections.sort(filenames, new Comparator<File>() {
            private final Comparator<String> NATURAL_SORT = new WindowsExplorerComparator();

            @Override
            public int compare(File o1, File o2) {;
                return NATURAL_SORT.compare(o1.getName(), o2.getName());
            }
        });

        for (File f : filenames) {
            System.out.println(f);
        }
    }

    public static class WindowsExplorerComparator implements Comparator<String> {

        private static final Pattern splitPattern = Pattern.compile("\\d+|\\.|\\s");

        @Override
        public int compare(String str1, String str2) {
            Iterator<String> i1 = splitStringPreserveDelimiter(str1).iterator();
            Iterator<String> i2 = splitStringPreserveDelimiter(str2).iterator();
            while (true) {
                //Til here all is equal.
                if (!i1.hasNext() && !i2.hasNext()) {
                    return 0;
                }
                //first has no more parts -> comes first
                if (!i1.hasNext() && i2.hasNext()) {
                    return -1;
                }
                //first has more parts than i2 -> comes after
                if (i1.hasNext() && !i2.hasNext()) {
                    return 1;
                }

                String data1 = i1.next();
                String data2 = i2.next();
                int result;
                try {
                    //If both datas are numbers, then compare numbers
                    result = Long.compare(Long.valueOf(data1), Long.valueOf(data2));
                    //If numbers are equal than longer comes first
                    if (result == 0) {
                        result = -Integer.compare(data1.length(), data2.length());
                    }
                } catch (NumberFormatException ex) {
                    //compare text case insensitive
                    result = data1.compareToIgnoreCase(data2);
                }

                if (result != 0) {
                    return result;
                }
            }
        }

        private List<String> splitStringPreserveDelimiter(String str) {
            Matcher matcher = splitPattern.matcher(str);
            List<String> list = new ArrayList<String>();
            int pos = 0;
            while (matcher.find()) {
                list.add(str.substring(pos, matcher.start()));
                list.add(matcher.group());
                pos = matcher.end();
            }
            list.add(str.substring(pos));
            return list;
        }
    }
}
Muslim answered 23/4, 2014 at 15:33 Comment(7)
Eureka! Perfection! Thank you so much for your help. You are the best.Geibel
You are right. But as you see, I tested around and there is no concrete doc, at least at that time, how this natural sort works. I am sure there are more problematic characters, e.g. $.Muslim
Hey, your solution work at some extend.....i come up with name showing in window are different from showing using this solution..Maunsell
So give some examples. However it is a try to mimic the behaviour.Muslim
Thank you for your solution. I tried it with these filenames, and it almost works. but only some of it has problems. see if you can come up with a solution.Blate
csvExamp: 01_arrow_top_left-512.png,1-2.png,1-3.png,1-4.png,1f188ecbd724b074e650415915638fb2.jpg,2.jpg,4-512-300x300.png,4a0272ab317c20a334f9388eb3f673e3.png,64.png,160_F_86503119_tjcyu7YgwrShWH7c9GzcHn8UAjVpSSQC.jpg,164-colorful-abstract-logo-vector-design.png,00205-3d-art-logo-design-free-logos-online-04.png,457ec0ef2a9a4891bc3f3d5a41c0f9cc.png,51217c2f4b0007850ae2e8e2f959cac0.jpg,356425_wookmark.jpg,469906_11015239_2085273_0a2f8603_thumbnail.png,95496784.png,95496784-1.png,db34461f830ec59c518cf33e7e35f487.png,eagle_wings_art_vector_logo.png,education-infographic-template_23-2147488675.jpgBlate
csvExamp Continue: education-logo-with-books_23-2147502567.jpg,education_leaf_school_inspiration_vector_logo_design.png,education_logo_6826483.jpg,FUEL-Education-logo-square-pantone.png,ictu_vector_logo.png,logo(1).png,logo.png,logo1.jpg,Logo-People-3.jpg,picture12963389781683.jpg,picture14014813586956.jpg,shot_1288979562.jpg,stock-vector-pencil-logo-585907025.jpg,stock-vector-vector-pencil-logo-design-template-263931704.jpg,Untitled-5.png,Untitled-8.png,v4l-125347.jpg,vector-abstract-logo-icon-design-set-10984762.jpg,vector_education_book_tree.pngBlate
C
0

Switch the signs of the first -1 and 1 in the compare method:

if (Character.isDigit(ch1))
{
    result = Character.isDigit(ch2) ? compareNumbers() : 1;
}
else if (Character.isLetter(ch1))
{
    result = Character.isLetter(ch2) ? compareOther(true) : 1;
}

These determine the ordering when the first string has a number but the second one does not, or the first one does not but the second one does.

Cardiomegaly answered 21/4, 2014 at 20:20 Comment(4)
Thanks. That did work for that issue, but it also ended up breaking something else. When I switched the -1 and 1 around, the ordering of 01 and 01a got messed up. It was sorting correctly being 01 and then 01a. When switching the -1 and 1, it sorted 01a and then 01.Geibel
Nevermind. With your help, I figured it out. Instead of switching the -1 and 1 around, I just changed the -1 to 1 and left the other 1 alone. This fixed the issue with the 01 and 01a sorting. ThanksGeibel
Ah. It's a bit unclear to me why that is, but I'm glad you solved it. I've edited my answer to reflect the solution you found.Cardiomegaly
As it turns out, this method of having both 1 and 1, instead of -1 and 1 only worked sometimes. Then eventually, it produced an exception saying that the comparison method violates its general contract. If -1 is for "isDigit", then filename will come after filename2 and filename 3, instead of before. If -1 is for "isLetter", then filename0a comes before filename0, instead of after.Geibel
F
0

If what you're sorting is or can be represented as a Collection of Files, you might want to take a look at the Apache Commons IO library NameFileComparator class. This provides several pre-built comparators that you can probably leverage to accomplish what you're looking for. For example, the NAME_INSENSITIVE_COMPARATOR should do what you want.

List<File> filenames = Arrays.asList(new File[] {
        new File("Filename01.jpg"), 
        new File("Filename1.jpg"), 
        new File("filename.jpg"),
        new File("filename2.jpg"),
        new File("filename03.jpg"),
        new File("filename3.jpg")});
Collections.sort(filenames, NameFileComparator.NAME_INSENSITIVE_COMPARATOR);
for (File f : filenames) {
    System.out.println(f);
}

Output:

filename.jpg
Filename01.jpg
filename03.jpg
Filename1.jpg
filename2.jpg
filename3.jpg
Fatsoluble answered 21/4, 2014 at 20:34 Comment(1)
Thanks. That would probably be the closest to what I want in that library. However, the filename03.jpg would need to be sorted right before the 3, instead of the 1.Geibel
M
0

Just to complete my suggestion from the comment. Here is a IMHO better readable version of the Comparator that (hopefully) sorts the way you need. The main logic is like I suggested it:

//Compare the namepart caseinsensitive.
int result = data1.name.compareToIgnoreCase(data2.name);
//If name is equal, then compare by number
if (result == 0) {
    result = data1.number.compareTo(data2.number);
}
//If numbers are equal then compare by length text of number. This
//is valid because it differs only by heading zeros. Longer comes
//first.
if (result == 0) {
    result = -Integer.compare(data1.numberText.length(), data2.numberText.length());
}
//If all above is equal, compare by ext.
if (result == 0) {
    result = data1.ext.compareTo(data2.ext);
}

As you see, this is a dynamic version, that handles names and extensions as well without any assumptions. I included in this little test program your first and your in the comments added test datas.

So here is the sorted output for your test data:

filename.jpg
filename00.jpg
filename0.jpg
Filename01.jpg
Filename1.jpg
filename2.jpg
filename03.jpg
filename3.jpg
filename0b.jpg
filename0b1.jpg
filename0b02.jpg
filename0c.jpg

And last but not least the complete code:

import java.io.File;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WindowsSorter {

    public static void main(String args[]) {
        List<File> filenames = Arrays.asList(new File[]{new File("Filename01.jpg"),
            new File("Filename1.jpg"), new File("filename.jpg"), new File("filename2.jpg"),
            new File("filename03.jpg"), new File("filename3.jpg"), new File("filename00.jpg"),
            new File("filename0.jpg"), new File("filename0b.jpg"), new File("filename0b1.jpg"),
            new File("filename0b02.jpg"), new File("filename0c.jpg")});
        Collections.sort(filenames, new WindowsLikeComparator());
        for (File f : filenames) {
            System.out.println(f);
        }
    }

    private static class WindowsLikeComparator implements Comparator<File> {

        //Regexp to make the 3 part split of the filename.
        private static final Pattern splitPattern = Pattern.compile("^(.*?)(\\d*)(?:\\.([^.]*))?$");

        @Override
        public int compare(File o1, File o2) {
            SplitteFileName data1 = getSplittedFileName(o1);
            SplitteFileName data2 = getSplittedFileName(o2);

            //Compare the namepart caseinsensitive.
            int result = data1.name.compareToIgnoreCase(data2.name);
            //If name is equal, then compare by number
            if (result == 0) {
                result = data1.number.compareTo(data2.number);
            }
            //If numbers are equal then compare by length text of number. This
            //is valid because it differs only by heading zeros. Longer comes
            //first.
            if (result == 0) {
                result = -Integer.compare(data1.numberText.length(), data2.numberText.length());
            }
            //If all above is equal, compare by ext.
            if (result == 0) {
                result = data1.ext.compareTo(data2.ext);
            }
            return result;
        }

        private SplitteFileName getSplittedFileName(File f) {
            Matcher matcher = splitPattern.matcher(f.getName());
            if (matcher.matches()) {
                return new SplitteFileName(matcher.group(1), matcher.group(2), matcher.group(3));
            } else {
                return new SplitteFileName(f.getName(), null, null);
            }
        }

        static class SplitteFileName {

            String name;
            Long number;
            String numberText;
            String ext;

            public SplitteFileName(String name, String numberText, String ext) {
                this.name = name;
                if ("".equals(numberText)) {
                    this.number = -1L;
                } else {
                    this.number = Long.valueOf(numberText);
                }

                this.numberText = numberText;
                this.ext = ext;
            }
        }
    }
}

Edit 1: The algorithm was changed to adress the filename00, filename0 sorting issue.

Edit 2: After diving deeper into Windows Explorers sorting algorithm it is clear, that this answer is indeed a solution for the original post and testdata - thats why I will not delete it - but not a complete solution for mimicing Windows Explorers behaviour. Therefore I will provide another hopefully more complete solution to the problem.

Muslim answered 22/4, 2014 at 7:48 Comment(7)
Wow thanks. That does work quite nice. There is only one issue that I can see with this. And that is that filename00 is coming after filename0, when it should come before. Is there an easy fix for this?Geibel
Found another slight issue. filename00a comes at the end of the list, instead of right after filename00. So it sorts filename > filename0 > filename00 > filename2 > filename03 > filename3 > filename00a Need to sort filename > filename00 > filename00a > filename0 > filename2 > filename03 > filename3Geibel
But you said that first compare the name and because filename00a is the name (no number at the end) this follows the original rules. Or am I missing something? For the 00 issue I will change my post in a while.Muslim
Thanks for the 00 update. I guess it's difficult to explain. If you have a windows system, you could see the way it sorts files in its folders. I want it to mimic that. I'll break it down.. I guess its more like comparing every character, than if there's a number at the end or not. In the example below, you can see that the the sorter is first seeing that there are 3 zeroes.. So, anything that has 3 zeroes, will come before anything with 2 zeroes, rather or not it has a letter following it. Then it will sort after the zeroes, to see which filename goes first. (filename000 > filename000a)Geibel
filename000 > filename000a >filename00 > filename00a > filename00b > filename0 > filename1b2a3 > filename1b2b > filename1c > filename2 I hope this is clearer.Geibel
I googled a bit around and found some explanations on how it is implemented. So if this answer did you any good vote it up. I will provide a solution as a new answer because it is a complete new algorithm. But now its time to sleep.Muslim
Thank you for your time. I appreciate it. I would vote up, but I don't have enough reputation points. I encourage others viewing this to vote this answer up, though.Geibel
S
0

A Windows-only solution using OS native calls: https://mcmap.net/q/670675/-java-file-list-same-order-like-window-explorer

Sorting by name in Windows is tricky and far more complicated than your implementation. It's also configurable and version dependent.

NOTE: I created a demo for what follows in this post. Check it out on GitHub.

Sorting file names using StrCmpLogicalWComparator function

According to some (e.g. here) Windows uses StrCmpLogicalW for sorting files by name.

You could try to implement your comparator by calling this system function using JNA (don't forget to include JNA library in your project):

Comparator:

public class StrCmpLogicalWComparator implements Comparator<String> {

    @Override
    public int compare(String o1, String o2) {
        return Shlwapi.INSTANCE.StrCmpLogicalW(
            new WString(o1), new WString(o2));
    }
}

JNA part:

import com.sun.jna.WString;
import com.sun.jna.win32.StdCallLibrary;

public interface Shlwapi extends StdCallLibrary {

    Shlwapi INSTANCE = Native.load("Shlwapi", Shlwapi.class);

    int StrCmpLogicalW(WString psz1, WString psz2);
}

Handling file names that contain digits

I mentioned earlier that the way that Windows Explorer sorts files is configurable. You can change how numbers in file names are handled and toggle so-called "numerical sorting". You can read how to configure this here. Numerical sorting as explained in the docs:

Treat digits as numbers during sorting, for example, sort "2" before "10".

-- https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-comparestringex#SORT_DIGITSASNUMBERS

With numerical sorting enabled the result is:

file names sorted with numerical sorting enabled

whereas with numerical sorting disabled it looks like this:

file names sorted with numerical sorting disabled

This makes me think that Windows Explorer in fact uses CompareStringEx function for sorting which can be parameterized to enable this feature.

Sorting file names using CompareStringEx function

JNA part:

import com.sun.jna.Pointer;
import com.sun.jna.WString;
import com.sun.jna.win32.StdCallLibrary;

public interface Kernel32 extends StdCallLibrary {

    Kernel32 INSTANCE = Native.load("Kernel32", Kernel32.class);
    WString INVARIANT_LOCALE = new WString("");

    int CompareStringEx(WString lpLocaleName,
                        int dwCmpFlags,
                        WString lpString1,
                        int cchCount1,
                        WString lpString2,
                        int cchCount2,
                        Pointer lpVersionInformation,
                        Pointer lpReserved,
                        int lParam);

    default int CompareStringEx(int dwCmpFlags,
                                String str1,
                                String str2) {
        return Kernel32.INSTANCE
            .CompareStringEx(
                INVARIANT_LOCALE,
                dwCmpFlags,
                new WString(str1),
                str1.length(),
                new WString(str2),
                str2.length(),
                Pointer.NULL,
                Pointer.NULL,
                0);
    }
}

Numeric sort comparator:

public class CompareStringExNumericComparator implements Comparator<String> {

    private static int SORT_DIGITSASNUMBERS = 0x00000008;

    @Override
    public int compare(String o1, String o2) {
        int compareStringExComparisonResult =
            Kernel32.INSTANCE.CompareStringEx(SORT_DIGITSASNUMBERS, o1, o2);

        // CompareStringEx returns 1, 2, 3 respectively instead of -1, 0, 1
        return compareStringExComparisonResult - 2;
    }
}

Non-numeric sort comparator:

public class CompareStringExNonNumericComparator implements Comparator<String> {

    private static String INVARIANT_LOCALE = "";
    private static int NO_OPTIONS = 0;

    @Override
    public int compare(String o1, String o2) {
        int compareStringExComparisonResult =
            Kernel32.INSTANCE.CompareStringEx(NO_OPTIONS, o1, o2);

        // CompareStringEx returns 1, 2, 3 respectively instead of -1, 0, 1
        return compareStringExComparisonResult - 2;
    }
}

References

-- https://mcmap.net/q/670675/-java-file-list-same-order-like-window-explorer

Shores answered 6/2, 2020 at 16:53 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.