Sunday, April 12, 2015

Chess Notation

First you must know how to play Western Chess (link). Then you need to know chess Algebraic Notation (link) you only need to read the Algebraic Notation section.

I'll start by complaining about SAN (Standard Algebraic Notation) which is used by PGN (Portable Game Notation). It is designed to be slightly compact however it only saves 2 characters (3 for pawns) per move but doing so makes the definition complicated. SAN does not allow the source coordinate unless necessary due to ambiguity (link). This makes the notation difficult for both computers and humans to read and write because they must first determine every possible move of every piece of the same kind. So you have to jump through a flaming hoop to save a few pennies. Usability is much more important than a few bytes especially in modern times! The obvious defense is that this notation was made a long time ago when bytes were important but this notation is not even close to compressed enough for that excuse (see my Binary Compressed Coordinate Format below). Now if the source coordinate was ignored when unneeded I would not be complaining. There is no reason for it to not allow the source coordinate, programs (and humans) already know the coordinates but they must use extra code (or thought) in order to not allow the source coordinate based on these rules.

</Rant> So let's make the world a better place by using a better notation. Here are the ones I've defined (only BCCF and FCN are useful). Note that PGN allows comments and turn numbers and stuff none of which I'll talk about.

MCN: Minimum Coordinate Notation

This is what PGN should be using. This is the minimum information required while maintaining human usability. It uses coordinates without hyphen and ignores case using King's position for castling. The source and destination coordinates are both always required (therefore is easy to use unlike above mentioned SAN) there is no piece indicator except for pawn promotion. Example of a pawn being promoted to a queen: a7A8q, example of a King's Castle: e1g1.
Here is the Perl regular expression definition: /^[A-H][1-8][A-H][1-8][QBNR]?$/i
Rational: This notation is the smallest that can uniquely do everything (and be usable) but is not reversible because capturing isn't recorded.

BCCF: Binary Compressed Coordinate Format

This is only useful for computers and is not human usable. Based on MCN, each move takes 2 bytes.
The first 3 bits are the octal source file followed by 3 bits of the octal source rank (total of 6 bits so far). Then 6 bits for the octal destination coordinate. The next 2 bits are what the pawn was promoted to (ignored if does not apply). The last 2 bits have no defined meaning (thus are always ignored). Note that the rank/file are 0 indexed. The promotion pieces in octal are: 0 Rook, 1 Knight, 2 Bishop, 3 Queen. 00 should be used when promotion does not apply.

The last 2 bits exist to keep the bits divisible by bytes, they have no official meaning however an optional meaning can be given to them. For example a bit could represent white/black or yes/no for if a piece was captured. Or they could be yes/no this move caused check and yes/no this move ended the game, so that all 4 states are: 00 normal state, 10 check, 11 check mate, 01 stale mate. They could represent en passant or castling occurred.

Example: in binary, grouped by meaning 000 000 111 111 10 00 (grouped by nibble: 0000 0011 1111 1000 in hex: 0x03F8) this translates to A1H8B (with optional bits as 00) of course this is not a valid move. In order to send a byte array over a network look up Base64 encoding.
Compression compared to SAN: BCCF always takes 2 bytes, SAN takes between 2 and 6 bytes (queen's castle is 0-0-0 the next longest is Rc1c3).
Rational: an entire chess game can be compressed and stored or sent over a network.

FCN: Friendly Coordinate Notation

This is the verbose form of MCN, it is reversible and is the most human friendly notation (it is also computer usable). Begin with the definition of MCN but the notation always starts with the piece symbol. It also has a hyphen between the source and destination positions (example: Pa2-a3). If a capture occurred indicate it after the destination with an "x" then the piece symbol (such as Ra1-a8xQ). For an en passant use "en" instead of "xP" for example: Pa5-b6en. If pawn promotion occurred add an equal sign between capture indication (if any) and the piece symbol promoted to (eg Pa7-B8xR=q or Pa7-A8=N). For castling instead of using the King's coordinates special notation is used: either "KC" for King's Castle or "QC" for Queen's Castle. The notation may end (after pawn promotion) with the 2 symbols: "+#" where "+" means check and "#" means that the game is over. Therefore all 4 states are covered: Ra1-a8+ is check, Ra1-a8+# is check mate, Ra1-a8# is stale mate, and Ra1-a8 is used otherwise.

Here is the Perl regular expression definition:
/^(?:[KQ]C|P[A-H][1-8]-[A-H][1-8](?:EN|(?:X[QBNRP])?(?:=[QBNR])?)|[KQBNR][A-H][1-8]-[A-H][1-8](?:X[QBNRP])?)\+?#?$/i

And yes it is possible to win the game by performing a castling. Observe this FEN: 4rkr1/4p1p1/8/8/8/8/8/3K2R1 w K - ? ? then 4rkr1/4p1p1/8/8/8/8/8/5RK1 b - - ? ? checkmate. Which brings me to my next notation definition.
Rational: this contains as much information as possible, includes "=x-" for padding, and includes special notations "en", "KC", "QC" all of which maximized human usability.

SFEN: Shortened FEN

Since this is based on Forsyth-Edwards Notation you will first need to know how that works (link) (don't worry about the Half move clock and Full move number). FEN is useful for board snapshots (such the example above in FCN) but I would make some modifications. The board is the same (and still required) but it is now the only part that is required. The rest is a big chunk that is optional. The same things exist in the same order except that Half move clock and Full move number are not allowed and en passant is optional. I'm also adding the "+#" at the end of the notation which functions the same as in my FCN (see above).

Here is a Perl regular expression for some validation:
/^(?:[KQBNRPkqbnrp1-8]{1,8}\/){7}[KQBNRPkqbnrp1-8]{1,8}(?: [WBwb] (?:-|K?Q?k?q?)(?: [a-hA-H][1-8])?)?(?: (?:\+#|\+|#))?$/

Here are some examples:
4rkr1/4p1p1/8/8/8/8/8/5RK1 b - +#
4rkr1/4p1p1/8/8/8/8/8/5RK1 +#
4rkr1/4p1p1/8/8/8/8/8/5RK1 B -
rnbqkbnr/pppppppp/8/8/8/P7/1PPPPPPP/RNBQKBNR b KQkq a2
4rkr1/4p1p1/8/8/8/8/8/5RK1

Although a regular expression could do more validation for castling ability it would be wasteful to list each possibility and even then it can't account for invalid piece movement. Therefore it doesn't account for a single invalid case of the empty string instead of a hyphen for castling ability.
Rational: for completeness and minor improvements. The main thing is adding "+#". I removed the Full move number for being useless and the Half move clock for being virtually useless (is only used in tournaments which wouldn't be using board snapshots especially not for this number). I made the en passant indicator optional because it is insane not to be and made the turn indicator (and en passant coordinate) case insensitive because why not.

BCFEN: Binary Compressed FEN

This is only useful for computers and is not human usable. Based on FEN/SFEN, each rank takes 4 bytes and each board takes 32 bytes. There are 6 pieces: Rook, Knight, Bishop, Queen, King, Pawn. It needs to account for both colors and an empty square. Therefore the total number of square states are 13 which needs 4 bits. These states are defined as (binary hex):
0000 0: Empty square
0001 1: White Rook
0010 2: White Knight
0011 3: White Bishop
0100 4: White Queen
0101 5: White King
0110 6: White Pawn
0111 7
1000 8
1001 9: Black Rook
1010 A: Black Knight
1011 B: Black Bishop
1100 C: Black Queen
1101 D: Black King
1110 E: Black Pawn
1111 F
There are 3 unused square states which are: 0x7, 0x8, 0xF. White and black differ by a sign bit with 0x0 being empty square. Feature: an empty rank is all 0, also that the negative 0 is not used. In this notation there is no rank separator (forward slash) since it must be binary and there is no simple way to represent multiple blank squares. Additionally unlike FEN/SFEN the information after the board can't be provided (since it isn't necessary and this format must stay condensed and simple).

The unused states have no official definition but an optional meaning can be given to them such as: threatened by white or black (7/F), possible to en passant here (the empty square that was double moved over), or possible to castle across this square (note that even then castling ability must be tracked separately for example the starting board has none of these squares).

The entire board will be 8*8=64 hex digits (32 bytes). For example here is the starting board in hex (grouped by rank): 9ABCDBA9 EEEEEEEE 00000000 00000000 00000000 00000000 66666666 12345321
Compression comparison: BCFEN always takes 32 bytes per board, SFEN (board only) takes between 17 and 71 bytes. The extra information of SFEN takes 0 to 13 bytes. FEN's half clock goes up to 100 (3 bytes) and the full move clock has a huge unknown limit (in free play these numbers have no limit at all).
Rational: for completeness mostly. For compressing a game BCCN is much shorter and I doubt a large amount of boards will need to be stored or sent. Actually it might be nice to store the boards like this to determine threefold repetition (could need to store 99 half moves) although strings are double size it isn't that bad to store another 1 KiB or so. But this compression can be stored in a binary search tree for fast comparison.