Thursday, October 30, 2014

Sorting Algorithms: Ascii, Alphabetical, and Title

It bugs me when I see people use the phrases "sort alphabetically" and "ascii sort" or "by character code" interchangeably, they are not the same thing. But instead of ranting about it I decided to define them. I will be assuming all sorts are ascending. The definition for Ascii is fairly well accepted and this is the simplest sort.

Sort by Character code


Input must only be strings of known length.

Compare the character codes of each character until they are different at which point the lower code number comes first. If one string starts with the other then the shorter string comes first. Note that this includes the null character.

Implementation problem: some programming languages use null terminated strings but this sorting algorithm allows null within the string therefore special care will be needed.

Note that the algorithm can correctly sort strings of any fixed width character encoded without needing to know how to read the characters. Variable width character encoding, on the other hand, would need to be interpreted to get the correct order. Because of the ability to handle other encodings beyond ascii it would more accurately be called character code sorting.

Here is a javascript implementation for it (javascript allows nulls in strings):
function characterCodeAscending(a,b)
{
   for (var i=0; i < a.length; i++)
   {
       if(i >= b.length) return 1;
       if(a.charAt(i) !== b.charAt(i)) return (a.charCodeAt(i) - b.charCodeAt(i));
   }
    if(b.length > a.length) return -1;
    return 0;
}
//that was for the sake of illustration. for brevity use this instead:
function compareTo(a,b)
{
    if(a === b) return 0;
    if(a > b) return 1;
    return -1;
}


Alphabetical sort algorithm definition.


Input must only be strings of known length. Strings can't contain non-printable characters (such as null). Case is ignored and the rest is the same as character code sorting. This definition is accepted as far as I know.

Since strings don't normally contain non-printable characters the similarity of definitions is why people sometimes confuse them. But the big difference between them is that Alphabetical is case insensitive. This means that "STRING VALUE" has the same order as "string value".

Note that null terminated strings don't need special handling for alphabetical. But encoding must always be known to determine upper and lower case.

javascript implementation (input is not validated for printable characters):
function alphabetical(a,b){return compareTo(a.toLowerCase(),b.toLowerCase());}  //see above for compareTo definition


Book Title sort algorithm definition.


It is used for books, movies, and other media. Input must only be strings of known length. Strings can't contain non-printable characters (such as null). The only whitespace allowed is a space. A space may not be next to another space. Trailing and leading spaces are not allowed. Start with the definition for alphabetical with some differences.

Punctuation is ignored, only alphabetical and numeric symbols are used when determining order. Eg, "Joe's?" is the same as "j!O,e...s". What is considered alphabetical is determined by the language (linguistic). Note that the number of alphanumeric characters must be counted to determine the effective string length.

Leading articles are ignored: "a", "an", and "the". If the string starts with one of these 3 articles (or equivalent if not using English) it will be ignored for name comparison. Only the first word will be ignored and only if it contains at least 2 words. Eg, "The the name" only ignores the first word but for "The" and "But I..." nothing is ignored. If 2 titles differ only by leading article then they should be sorted with the article included thus "The Wind" will come before "Wind".

Titles should be split by the first colon. The part before the colon is the series name and the part after is the subtitle. The subtitle is ignored unless the series name is the same. If a colon does not exist then the entire thing is considered the series name. The series name may end in a number in which case they should be sorted numerically ascending. For example "1999 Encyclopedia Volume 2" comes before "1999 Encyclopedia Volume 10" even though alphabetically 10 comes before 2. This should take into account any numbering system such as Roman numerals, numeric, and numbers as words. Humans can do their best to determine the numbering system but this is where computer implementation becomes impossible. If the series names are the same then the subtitle is compared alphabetically (not via title sort, eg do not ignore the word "the" etc). A series name that does not have a number should come first (ie be considered 1).

Example difficulties. There is a video game named "Final Fantasy X-2" (ten two) it is a sequel to ten but is not related to eleven, a human would sort them "Final Fantasy X", "Final Fantasy X-2", "Final Fantasy XI" (apparently FF11 has a box). But even if a computer recognized the Roman numeral numbering it would see "X-2" as unrelated text and place it before all of the ones in the number sequence (it would actually make more sense to have "-2" be the subtitle and therefore be placed after 10). These Metal Gear Solid games should be in this order:
Metal Gear
Metal Gear 2: Solid Snake
Metal Gear Rising: Revengeance
Metal Gear Solid
Metal Gear Solid: Peace Walker
Metal Gear Solid 2: Sons of Liberty
Metal Gear Solid 3: Snake Eater
Metal Gear Solid 4: Guns of the Patriots
Metal Gear Solid V: Ground Zeroes
Metal Gear Solid V: The Phantom Pain
A title without a numeric should be considered 1 (or rather negative infinity). Notice how it switches from numeric digits to Roman numerals and how there are 2 "Metal Gear Solid V"s (and 1s even though Peace Walker isn't actually).

Despite these programmatic difficulties the algorithm should not go to great lengths to account for everything. It is impossible to account for all possible scenarios therefore use a fairly simple algorithm and then manually rearrange the exceptions afterwards.

Here is a javascript implementation for Book Title sorting. It only uses non-extended ascii and is English.
function titleSort(titleA, titleB)
{
    var stripA = titleA.toLowerCase().replace(/[^a-z0-9 :]/g, '').trim().replace(/ +/g, ' ');
    var stripB = titleB.toLowerCase().replace(/[^a-z0-9 :]/g, '').trim().replace(/ +/g, ' ');

    if(stripA !== stripB && stripA.replace(/^(?:an?|the) /, '') === stripB.replace(/^(?:an?|the) /, '')) return compareTo(stripA, stripB);
    stripA = stripA.replace(/^(?:an?|the) /, '');
    stripB = stripB.replace(/^(?:an?|the) /, '');

    var seriesA, seriesB, subA, subB;
   if (stripA.indexOf(':') !== -1)
   {
       seriesA = stripA.split(':')[0];
       subA = stripA.substr(stripA.indexOf(':')+1);
   }
    else{seriesA = stripA; subA = '';}
   if (stripB.indexOf(':') !== -1)
   {
       seriesB = stripB.split(':')[0];
       subB = stripB.substr(stripB.indexOf(':')+1);
   }
    else{seriesB = stripB; subB = '';}

    var numberA, numberB;

    if((/\d+$/).test(seriesA)){numberA = parseInt((/ (\d+)$/).exec(seriesA)[1]); seriesA = seriesA.replace(/ \d+$/, '');}
    else numberA = -Infinity;
    if((/\d+$/).test(seriesB)){numberB = parseInt((/ (\d+)$/).exec(seriesB)[1]); seriesB = seriesB.replace(/ \d+$/, '');}
    else numberB = -Infinity;

    if(seriesA !== seriesB) return compareTo(seriesA, seriesB);
    if(numberA !== numberB) return compareTo(numberA, numberB);
    //only if from the same series. don't subtract due to -Infinity
    //note that if you have different numbering schemes then set numberA/B to the numeric value for comparison sake

    return compareTo(subA, subB);
    //see above for compareTo definition
}
//This one has too much meat for plain text so paste it into a syntax highlighter.

No comments:

Post a Comment