In programming, it is very common to deal with strings. One often comes across the concept of lexicographic order, which is a way of ordering sequences as words appear in a regular dictionary. Lexicographic order is important when talking about string sorting, comparisons and various operations in programming. However, below the surface, string ordering involves a fascinating teamwork between a programming language, Unicode, and Machine-level operations.

This post will emphasize how exactly a programming language like Javascript manages lexicographic order right down to the hardware language.

What is Lexicographic Order?

Lexicographic order, also known as dictionary order, is a technique to order sequences of strings by comparing each of their elements. In a dictionary, we find that words are generally ordered based on their alphabetical order. In programming, it’s the same, strings are ordered by comparing the numeric values of their characters' representations. Each character’s code value is compared from left to right, and the sequence is ordered based on these comparisons.

lexicographic order explanation

In the example above, we can see that numbers that arithmetically should appear first in the ordered list (such as 3), are displayed after bigger numbers like 11, or even looking appetizer at the bottom of the list when actually it starts with a, this is because, like in a dictionary, computers compare the digits one by one, not thinking about how many digits a number has.

Let’s see another example using 2 words.

lexicographic order explanation for words

When there is no match while comparing 2 graphemes, things get interesting, and it’s here where we start wondering about how computers define which should go first and why. In the above example, we can notice that in the dictionary, the e goes before i, and that’s how we (as humans) can infer that relocated should be listed before relocation. But, how does it work behind the scenes in programming languages?

The Role of Unicode

Most modern programming languages, like Javascript, use the Unicode standard to represent characters. Unicode assigns a unique Code Point (a numeric value) to every character in every writing system. For example, the letter “A” has a code point of 65, while “a” has a code point of 97. These code points are essential for determining the lexicographic order of strings; these code points are encoded as code units to be stored and transmitted as pieces of data with a fixed length.[1]

The lexicographic order of Unicode characters is generally based on these code point values. Here’s a small table showing some Unicode code points:

unicode code points

Additionally, Unicode introduces another layer of complexity with combining characters. A grapheme is what we define as a single character, however, it might be composed of multiple Unicode code points.

For example, é can be represented in two ways:

  • As a single code point: U+00E9.
  • As two code points: U+0065 (lowercase ‘e’) + U+0301 (combining acute accent). [2]

This can affect lexicographic ordering, as different representations of the same visual character might be sorted differently. Let’s see how Javascript Engines handle Code Points when sorting strings.

Code Points and Lexicographic Comparison in Javascript

In Javascript, strings are sequences of 16-bit code units. Each character in a string is represented by one or more code units, this is because sometimes, a code point is too large to fit in a single code unit, so it needs to be broken down into multiple units, so the number of code units needed to represent a single code point may be different.

When comparing strings lexicographically, Javascript compares their code units one by one until it finds a difference. Strings are compared lexicographically using the less than (<) and greater than (>) operators.

Let’s see a few examples:

console.log("apple" < "banana"); // true

console.log("come" < "came"); // false

console.log("9" < "apple"); // true
  1. In the first example, Javascript compares the strings apple and banana by comparing their characters’ Unicode code points. a (97) vs. b (98) - Since 97 is less than 98, apple is considered smaller than banana and will appear first when ordering them.

  2. When the first characters are the same, Javascript moves to the next characters in both strings and repeats the process until a difference is found or the end of the strings is reached. o (111) vs. a (97), “came” goes first.

  3. Last but not least, this example demonstrates how numbers (when treated as strings) are ordered in lexicographic order based on their Unicode code points as well. 9 (57) vs. a (97) reflects that 9 is considered smaller than apple. Consequently, 9 would appear before apple in a lexicographically sorted list.

Note: In Javascript, when comparing strings lexicographically, the comparison process generally stops as soon as it finds the first differing characters between the two strings. This is because the lexicographic order is determined by the first position where the strings differ. In other languages like SQL, it’s possible to compare strings using a variety of collations.

Now that we understand the basics, let’s explore how Javascript performs these comparisons at a lower level, closer to the machine language.

Javascript Engines and Machine Code

Javascript code runs inside a Javascript engine, with V8 being one of the most popular. The V8 engine is the powerhouse behind JavaScript execution in Chrome and Node.js. Developed by Google, V8 compiles JavaScript directly to native machine code, providing high performance and efficiency. [3]

Understanding how lexicographic order works requires us to look at how V8 handles string comparisons under the hood.

  1. String Representation: V8 uses different internal representations for strings, like ASCII and UTF-8 encodings to efficiently text data in character sets.

  2. Comparison Algorithm: When comparing strings, V8 first checks if the strings have the same encoding. If they do, it can perform a fast memory comparison. If not, it falls back to a more complex comparison routine.

  3. Branching: Depending on the result of the comparison, the CPU may execute a branch to either continue comparing the next characters or to return the result of the comparison.

  4. Unicode Processing: For Unicode strings, V8 follows the ECMAScript specification and the Unicode Collation Algorithm to handle comparisons.

  5. JIT Compilation: In many cases, string comparisons are optimized by V8’s just-in-time (JIT) compiler, translating them into efficient machine code.

Down to the Metal: Machine Code

At the lowest level, when Javascript code is compiled down to machine code, string comparison translates into a series of memory loads and comparisons. Here’s a simplified example of what the machine code might look like for comparing two strings:

loop:
    mov al, [rsi]    ; Load a byte from string 1
    mov bl, [rdi]    ; Load a byte from string 2
    cmp al, bl       ; Compare the bytes
    jne done         ; If not equal, we're done
    test al, al      ; Check if we've reached the end of string 1
    jz equal         ; If so, strings are equal
    inc rsi          ; Move to next character in string 1
    inc rdi          ; Move to next character in string 2
    jmp loop         ; Continue loop

done:
    sbb rax, rax     ; Set result based on last comparison
    or rax, 1        ; Ensure result is 1 (greater) or -1 (less)
    ret

equal:
    xor rax, rax     ; Set result to 0 (equal)
    ret

This assembly code demonstrates the character-by-character comparison that happens at the machine level.

In this pseudo-assembly, mov instructions load the characters into registers, cmp compares them, and jne branches to handle the result if they are not equal.

Conclusion

Lexicographic order in JavaScript is an important concept that connects how we deal with strings in our code to how computers actually operate at low-level. Understanding how programming languages like JavaScript compares strings from the functions we write to CPU instructions, allows us to provide a clearer perspective into both the performance and behavior of our code. The next time you sort a list of strings or compare them, think about the steps involved, from Unicode code points to CPU instructions. That way, you can write better and faster code and understand how the language interacts with the machine’s hardware.

References

  1. Chapter 24. Unicode and JavaScript
  2. Unicode Character “é”
  3. Understanding the V8 Engine: Optimizing JavaScript for Peak Performance