Sorting for Localized Websites

Web Development | February 22, 2017

When I first started out as a software developer, I asked one of the senior developers I worked with at the time, “What is the most important thing I should know really well?” I fully expected him to name some cutting-edge web framework or emerging technology, or to point me toward complex data structures and algorithms. To my surprise, his response was only two words: “String manipulation.”

At the time, I didn’t give it much thought and in fact found his answer rather humorous. What I soon found out is that there is much, much more to those two words.

As I grew in my career, I found there was a lot more to learn: pattern matching and replacement, converting strings to numbers and dates, parsing data, sorting, and internationalization (i18n) and localization (l10n) of strings.

When you get right down to it, string manipulation covers many different topics that developers face on a daily basis.

In Java, when we want to sort strings we would typically create a String array and pass it to the sort method of the Arrays class, like so:

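import java.util.Arrays;

// Array initializers use braces, not brackets
String[] myArray = {"Antelope", "Cat", "Bear"};
// Sorts the array in place using String's natural ordering
Arrays.sort(myArray);
System.out.println("Sorted Output: " + Arrays.toString(myArray));

Sorted Output: [Antelope, Bear, Cat]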

Most of the time this is perfectly acceptable and gives us the result we are looking for, except when it comes to sorting strings that have been translated into other languages and/or other alphabets. The sort method of the Arrays class uses the class’s “natural ordering”, which for strings means our array will be sorted lexicographically by the Unicode (UTF-16) value of each character.
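As a small illustration of what ordering by character value means in practice, the sketch below sorts three words with mixed case. The class and method names are made up for this example. Because uppercase Latin letters have lower values than lowercase ones, a capitalized word jumps ahead of words that would come first in a dictionary.

```java
import java.util.Arrays;

public class NaturalOrderDemo {
    // Returns a sorted copy using String's natural ordering (compareTo).
    static String[] naturalSort(String[] input) {
        String[] copy = input.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"bear", "Cat", "antelope"};
        // 'C' is U+0043, while 'a' is U+0061 and 'b' is U+0062,
        // so "Cat" sorts ahead of both lowercase words.
        System.out.println(Arrays.toString(naturalSort(words)));
        // prints [Cat, antelope, bear]
    }
}
```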

Now, why is this a problem?

If you are designing a website that will be localized and have visitors from all around the world, there is a good chance you will end up with some mixed content where more than one language and/or alphabet is represented on the same site.

For instance, suppose we have a page that lists technical articles from different authors on a site localized for Russia. Eventually we may start to see some of those articles in English while others are in the native language for that site. If we rely solely on the “natural ordering” when sorting these articles, all articles in the Latin alphabet will be listed first, followed by all articles written in the Cyrillic alphabet. But is this correct?

Depending on the business needs, the product owner(s) may determine this is perfectly acceptable. Then again, since the site is for Russian visitors, the product owner(s) may determine that the Russian language should take precedence in any sorted list.
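The Latin-first listing described above can be reproduced with a few made-up article titles. This is only a sketch: the titles and the class name are invented for illustration, and the two Russian titles (Алгоритмы and Языки) are written as Unicode escapes so the source compiles whatever encoding the file is saved in.

```java
import java.util.Arrays;

public class MixedAlphabetDemo {
    // Hypothetical Russian titles ("Algorithms" and "Languages"),
    // spelled with Unicode escapes to keep the source file ASCII only.
    static final String ALGORITMY = "\u0410\u043b\u0433\u043e\u0440\u0438\u0442\u043c\u044b";
    static final String YAZYKI = "\u042f\u0437\u044b\u043a\u0438";

    // Returns a sorted copy using String's natural ordering.
    static String[] naturalSort(String[] titles) {
        String[] copy = titles.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        String[] titles = {YAZYKI, "Zero Downtime", ALGORITMY, "Java Basics"};
        // Latin letters sit below U+0080 and Cyrillic letters start at U+0400,
        // so both English titles end up ahead of both Russian titles.
        System.out.println(Arrays.toString(naturalSort(titles)));
    }
}
```

On a site aimed at Russian readers, this puts every English title above every Russian one, which is exactly the ordering the product owner may object to.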

How do we solve this issue?

We could write our own Comparator, but it would take a while to ensure all the rules are complete and accurate. Or we could simply look to the Java APIs and use the RuleBasedCollator class.
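Before getting into custom rules, it may help to see that a Collator is itself a Comparator and can be handed straight to the sort method. The sketch below, with a hypothetical class name and helper, compares natural ordering against the collator for Locale.ENGLISH on the same input.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollatorDemo {
    // Sorts a copy of the input with the collation rules of the given locale.
    static String[] sortFor(Locale locale, String[] input) {
        String[] copy = input.clone();
        Collator collator = Collator.getInstance(locale);
        Arrays.sort(copy, collator);
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"banana", "apple", "Cherry"};

        String[] natural = words.clone();
        Arrays.sort(natural);
        // Natural ordering puts the capitalized word first.
        System.out.println(Arrays.toString(natural)); // [Cherry, apple, banana]

        // The collator compares letters first and only then looks at case.
        System.out.println(Arrays.toString(sortFor(Locale.ENGLISH, words))); // [apple, banana, Cherry]
    }
}
```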

The RuleBasedCollator gives us a few options to overcome the ‘natural order’ of sorting a string array. We can manually define the rules on a character-by-character basis by spelling out the ordering. The example rule below is meant simply to give you an idea of what is possible; it does not cover every character and may not be correct for your purposes.

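Example rule:

“< A,a < B,b < E,e;è=ë”

Explanation:

In the rule syntax, ‘<’ marks a primary (letter) difference, ‘;’ a secondary (accent) difference, ‘,’ a tertiary (case) difference, and ‘=’ marks two characters as identical. Each character sorts after the one to its left, and the rule has to start with one of these relation characters, which is why it opens with ‘<’.

    • ‘A’ sorts before ‘a’ as a matter of case difference (tertiary)
    • ‘a’ sorts before ‘B’ as a matter of letter difference (primary)
    • ‘B’ sorts before ‘b’ as a matter of case difference
    • ‘b’ sorts before ‘E’ as a matter of letter difference
    • ‘E’ sorts before ‘e’ as a matter of case difference
    • ‘e’ sorts before ‘è’ as a matter of accent difference (secondary)
    • ‘è’ is equal to ‘ë’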
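To see a rule like this in action, here is a minimal sketch that builds a RuleBasedCollator from an ASCII-only subset of the example rule (the accented characters are left out to keep the demo simple). The class name and the build helper are made up for this example.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;

public class CustomRuleDemo {
    // ASCII-only subset of the example rule; note the leading '<'.
    static final String RULES = "< A, a < B, b < E, e";

    // Wraps the checked ParseException so callers get an unchecked one.
    static RuleBasedCollator build() {
        try {
            return new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = build();
        String[] letters = {"e", "B", "a", "E", "b", "A"};
        Arrays.sort(letters, collator);
        // Uppercase sorts before lowercase because the rule lists it first.
        System.out.println(Arrays.toString(letters)); // [A, a, B, b, E, e]
        // Primary (letter) difference: 'a' sorts before 'B'.
        System.out.println(collator.compare("a", "B") < 0); // true
    }
}
```

Characters that the rule does not mention are not covered by this collator's ordering, so a rule this small is only useful as a demonstration.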

This granular detail for sorting is great but not very dynamic. Since we expect our website to scale across many locales, we need something more flexible. For our next example we will use ‘en_US’ as our system default Locale and ‘ru_RU’ as the Locale for our localized site in Russian. In this example we simply want the localized language/alphabet sorted first, followed by the default.

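import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Arrays;
import java.util.Collections;
import java.util.Locale;

String[] alphabets = {"C", "Ц", "б", "A"};
// Collator.getInstance is declared to return Collator; the casts assume the JDK's built-in RuleBasedCollator implementation
RuleBasedCollator ru = (RuleBasedCollator) Collator.getInstance(new Locale("ru", "RU"));
RuleBasedCollator defaultColl = (RuleBasedCollator) Collator.getInstance();
// The constructor throws java.text.ParseException if the combined rules cannot be parsed
RuleBasedCollator ruledCollator = new RuleBasedCollator(ru.getRules() + defaultColl.getRules());
// Arrays.asList returns a view backed by the array, so sorting the list reorders the array itself
Collections.sort(Arrays.asList(alphabets), ruledCollator);
System.out.println("Output: " + Arrays.toString(alphabets));

Output: [б, Ц, A, C]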

In the above example we created three different RuleBasedCollator objects. The first is based on the ‘ru_RU’ Locale. The second is created from the default Locale of the JVM, which is the equivalent of passing ‘Locale.getDefault()’ to ‘Collator.getInstance’.

To create the third RuleBasedCollator, we concatenated the rules for our target Locale and the default Locale and passed the result to the constructor.

Finally, we use the Java Collections class to sort our strings based on this new RuleBasedCollator and print out the sorted array.

While this isn’t entirely dynamic, with a little work you could obtain the targeted Locale from the website and use it to build your RuleBasedCollator. You should now have a good idea of how to provide your localized site(s) with better sorting than simply relying on the ‘natural ordering’ of things, and you have hopefully learned something new in the realm of string manipulation.
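As a rough sketch of that last step, the helper below takes the site's language tag (for example "ru-RU") and a fallback Locale, and combines their rules the same way as the earlier example. Everything named here (SiteCollatorFactory, forSite, sortTitles) is hypothetical, and how the tag is obtained from your site or CMS is left out. If either collator is not rule based, or the combined rules are rejected, the helper simply returns the site's own collator.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class SiteCollatorFactory {
    // Builds a collator from the site locale's rules followed by the
    // fallback locale's rules.
    static Collator forSite(String siteLanguageTag, Locale fallbackLocale) {
        Collator site = Collator.getInstance(Locale.forLanguageTag(siteLanguageTag));
        Collator fallback = Collator.getInstance(fallbackLocale);
        if (site instanceof RuleBasedCollator && fallback instanceof RuleBasedCollator) {
            String rules = ((RuleBasedCollator) site).getRules()
                    + ((RuleBasedCollator) fallback).getRules();
            try {
                return new RuleBasedCollator(rules);
            } catch (ParseException e) {
                // Combined rules were rejected; use the site's collator unchanged.
                return site;
            }
        }
        // Not rule based (for example a custom provider); nothing to combine.
        return site;
    }

    // Returns a sorted copy of the titles, leaving the input list untouched.
    static List<String> sortTitles(List<String> titles, String siteLanguageTag,
                                   Locale fallbackLocale) {
        List<String> sorted = new ArrayList<>(titles);
        sorted.sort(forSite(siteLanguageTag, fallbackLocale));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("banana", "apple", "Cherry");
        System.out.println(sortTitles(titles, "en-US", Locale.getDefault()));
    }
}
```

Building a RuleBasedCollator from a long rule string is not free, so on a real site you would probably want to create each collator once per locale and reuse it rather than rebuilding it on every request.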

Brien McCurdy is a Program Team Lead at iCiDIGITAL. He develops enterprise solutions in the Web experience management (WEM) space and has over seven years of experience developing enterprise applications.