Sunday, May 10, 2015

Chess Notation Updates (1.1)

After reading the specification for Portable Game Notation (PGN; see previous blog post for more information) it got me thinking about some things I hadn't considered when I originally wrote my chess notations. So now I think I'll update them (the new versions of the notations are 1.1).

FCN and SFEN have the symbol + for check and # for "the game is over" but now I'm adding another digit which is either +, #, or absent. The first 2 digits are the same, for the final digit the + means a winner was determined but not because of a checkmate and # means the game is a draw but not because of a stalemate. The new digit allows +# to unambiguously mean checkmate. Examples of non-checkmate wins are by disqualification, surrender, or running out of time. Examples of non-stalemate draws are by agreement, unwinnable game (eg insufficient material), 3 fold repetition, or the 50 moves rule. Most games would not need to change their notation because all of these are rare outside of tournaments. The final digit being either + or # aids memory: +# is the most common win and #+ is uncommon, # is stalemate and ## is another type of draw. The first 2 digits have not changed meaning but the second digit being # is required for a third digit to exist. So Ra1-a8+#+ means that a check occurred, the game is over, and a winner was determined without checkmate. This makes a total of 6 end game combinations and 2 non-end game (instead of 2 and 2). An unfinished game should not use ## since that would mean the game is over instead in VGN use "*" (for non VGN just have no more moves).

BCFEN game termination (something used for VGN etc) is 0x88 (previously unused) although only 0x8 would be necessary, it needs to be divisible by bytes (even though it isn't divisible by board size). PGC uses 0x06 assuming you are at the top level in the game text section (ie not inside of a string, move number, etc). For BCCF the last bit (previously unused) will be 1: the game is over 0: it isn't. Therefore the game termination symbol is a single bit and doesn't change the length of the input at all.

I'll make another change to BCCF. Previously promotion bits of 00 either meant promoted to a rook or no promotion. Thus to determine if promotion was occurring the parser needed to check if the destination was either rank 1 or 8 (depending on color) and if the piece moving was a pawn. But now I'll use the penultimate bit (the bit right after the promotion bits) to be 1: promotion occurred 0: it didn't. This makes parsing easier. So now all of the bits of BCCF are used.

A note on size comparison. PGN explained SAN better. I was a bit off in my initial "Chess Notation" blog post. SAN maximum length is 7 (eg Qa6xb7# and fxg1=Q+) with an average of slightly more than 3. PGC was poorly explained but it sounded like the moves are variable length (thus slower parsing), includes NAG and other unnecessary things (thus larger), the moves are the index of every possible SAN move if sorted alphabetically (thus much more difficult than SAN), and each move takes either 2 or 3 bytes (thus is on average larger than BCCF). This all means PGC is inferior in pretty every way to BCCF. If you want a compressed game use BCCF and store any additional data elsewhere. Or just use VGN with MCN and zip the file or something but there's no need to go through the pain of PGC.

For SFEN I decided that no one cares about en passant and therefore it is not required even if a double move occurs. If you would like to use "-" for en passant (to mark that there isn't one) then you must also have the castling information so that it is not ambiguous. The castling information is now also optional. Thus the new regex for SFEN looks like this:
/^(?:[KQBNRPkqbnrp1-8]{1,8}\/){7}[KQBNRPkqbnrp1-8]{1,8}(?: [WBwb])?(?: (?:-|K?Q?k?q?)(?: -| [a-hA-H][1-8])?)?(?: \+?(?:#[+#]?)?)?$/
and again the empty string is not valid castling information and it should not have trailing spaces. FCN's regex is now:
/^(?:P[A-H][1-8]-[A-H][1-8](?:EN|(?:X[QBNRP])?(?:=[QBNR])?)|[KQBNR][A-H][1-8]-[A-H][1-8](?:X[QBNRP])?|[KQ]C)\+?(?:#[+#]?)?$/i
the castling was moved to the end because otherwise Qc1-a1 would match as a queen's castle. FCN has a minimum length of 2 (eg KC) with a maximum of 13 (eg Pa7-B8xR=q+#+) with the average being more than 6.

For the record, of my notations I just made changes to: BCCF, FCN, SFEN, and BCFEN (they are all now version 1.1). My only notation that didn't change was MCN because there's nothing to change, it does exactly what it is intended to do perfectly: express a chess move with a notation as simple as possible, as short as possible, and in a human friendly way (in that order). (I also didn't change VGN because I just made it).

VGN: Variable Game Notation

I looked up and read the (original) specification for Portable Game Notation (PGN) the original website was taken down but can be read here (link) and other places. As I started reading I thought I would make a rant blog post called "PGN: Painful Game Notation" and start with a quote of God's response to Babel but it turned out that PGN really isn't that bad. The problem is that PGN uses SAN for the move text section (and little else) despite being a terrible format.

Actually I did have to face palm particularly hard in the PGN spec section 3.2.3 Speed of processing which states that programs can "easily scan PGN data without having to have a full chess engine or other complex parsing routines" aaaarg it still hurts to look at. How could it even say such a thing? Was this part written before the writer knew what SAN was and this section wasn't updated afterward? Parsing (and writing) SAN ALWAYS requires a complex parsing routine in addition to a full chess engine (AI, GUI, etc not required). To extract just the coordinates from SAN requires this regex:
/^[QBNRK]?([A-H]?)([1-8]?)[x-]([A-H][1-8])(?:=[QBNR])?[+#]?$/i
which is far more complex parsing than Long Algebraic Notation (LAN) which would be (in javascript):
move(text.substr(1, 2), text.substr(4, 2))
assuming the pawn was padded to have a symbol (MCN is even more simple).

PGN also violates a couple of its goals. Mainly 2.2.1 because SAN is designed to be unnecessarily complex. There's also a problem with goal 2.2.3 because it doesn't allow any move text format except SAN (actually there are 2 others: FEN and PGC but the support is lacking as explained later).

There's also a list of things to nitpick. Section 6 defines a completely useless token "%" which does exactly the same thing as an already existing token ";" (semicolon) which is for a single line comment. There was no reason to reserve <> (in 7) since stray character are invalid anyway. Illegal moves are not allowed by SAN but PGN itself shouldn't care. (Petty paragraphs like this are not the reason this blog post is 5ish pages in length excluding formal definitions.)

VGN: Variable Game Notation

Based on PGN, this is a game notation specification which is different than move text notations. Although PGN is almost usable there is 1 really important difference that is required: the ability to use move text notations that aren't SAN.

This is done by the use of a new tag named MoveFormat eg: [MoveFormat "FCN:Friendly Coordinate Notation"]. Note that it is called MoveFormat even though FEN and others describe the board instead of the moves. The first value is what the program reads to determine which format the move text will be in, the second value is the human readable name and is not used by the parser (there are only 2 values). Because the second value is ignored by the parser it is optional. If the tag doesn't exist then the default is SAN for the sake of compatibility with PGN. In the event that multiple notations use the same acronym then number them such as: SAN, SAN2, SAN3 a parser should make public which formats are supported and how the duplicates are mapped. Multiple versions of the same format should be treated in the same way: numbers should be used (if multiple versions are supported) and make public how they map.

If the move format is a binary format (instead of plain text) then the MoveFormat tag must be the last tag and the encoding will start immediately after the closing square brace, no plain text comments of either kind are allowed after it and there is also no end line. Some example binary formats are BCCF, BCFEN, and PGC. After a binary format is terminated the parsing goes back to plain text, if there is another game to follow it is a good idea to use some end lines to separate the games so that a human can more easily find the plain text games in a file (although it would be better to segregate the games into 2 files for binary and plain text). Each binary format specifies its own game termination marker.

Binary move text sections are required to have at least 1 byte for the game termination marker (just like plain text). A binary move section is also required to be divisible by bytes for the sake of sanity. If a binary format is not normally divisible by bytes then it must contain 1 to 7 padding bits after the game termination marker. VGN parsers will send data to the binary parser byte by byte (which is one reason why it needs to be as a whole divisible by byte) so it would be a good idea for the binary format to also be divisible by bytes.

In PGN the SetUp tag always requires the FEN tag so a parser can ignore the SetUp and detect FEN because SetUp was 100% redundant/useless. For VGN the SetUp tag contains the board string directly. The first value of SetUp is the move format abbreviation and the second value is the board text. If the first value is omitted SFEN is used (for brevity and FEN compatibility). The board string may contain end game markers (the move text kind such as +# not the game terminators such as *) in which case the move text section will be empty (with only a game terminator). Note that even without this tag it is possible to have a game without any moves.

Minor things. % is no longer a symbol use semicolon instead (a semicolon is also valid PGN). Each format has its own parser but in all cases stray text is considered invalid (this includes < and >). Since the Result tag wasn't being used by parsers it should contain human readable information (such as "black ran out of time", "white surrendered", "Stalemate", or "White wins") instead of the game termination symbols (however the game termination symbols located at the end of the move text section remain the same and are still required). The Date tag is the start date. There is also a tag for EndDate, if omitted it is assumed to be the same as the start date. VGN does not disallow illegal moves, it is the move text format to decide that, and they should all allow it (SAN has a limitation of not allowing illegal moves thus making some games impossible to record in SAN).

About PGN move text formats. Earlier I said that PGN had partial support for notations that are not SAN. What PGN does is change the extension type of the file. ".pgn" is for normal games (that use SAN) ".fen" is for FEN and ".pgc" for PGC. While it is a good idea to separate binary from plain text it isn't necessary to require it. But more importantly using the extension type is bad design since not all of the information a parser needs is included in the file's contents. This is especially problematic for tools that allow text to be pasted since there will be no file extension at all. Additionally with many types of supported move text formats there would be many extension types all tied to the same program (your Operating System will require a lot of file type associations). Tags already have information about the game, the format should be located in a tag. Therefore instead of all of this VGN has only a single extension type which is ".vgn" and has the move text format included in a tag.

Which brings up a good point: VGN should indicate itself within the file contents. The tag is named GameFormat with the value defaulted to "PGN" for compatibility. It has 2 values just like the MoveFormat tag: [GameFormat "VGN:Variable Game Notation"] with the second being human readable and parser ignored. Note that this tag is required for each game (not once per file but once per game). If the file contains at least 1 VGN then the extension should be ".vgn"to indicate that a PGN parser can't read it but a VGN parser can. Note that having both FEN and SAN in the same file requires VGN since PGN doesn't have the MoveFormat tag.

VGN is very similar to PGN and is mostly backwards compatible. So you might ask "Why specify a whole new format? Why not simply call this PGN version 1.1?" because I don't own PGN. If I was to call this PGN 1.1 then later the owners of PGN may come out with an official 1.1 then my format would be incorrectly labeled.

More random stuff. White's move number is still optional and should have a single period; black's move number (if present) should be the whole move number followed by ".5" this is shorter than 3 periods and intuitively matches the half moves. Since I'm actually writing a program to parse VGN I've simplified a few things from PGN: comments can only appear before or between tags, before white's move number, or after black's NAG (PGN was unclear about where comments could exist since they weren't defined in the incomplete formal definition section). NAG can only appear after the move.

I ought to rewrite the Numeric Annotation Glyphs (NAG) entirely. NAG should be composite (taking multiple arguments) so that the NAG list doesn't need every possible commentary but only some formats. For example PGN NAG 26 is "White has a moderate space advantage" it should be more like $5.1.2.1.1 where 5 is the pattern "{0} has a {1} {2} {3}", 1 means white, 2 means moderate, 1 means space, 1 means advantage. Another example: $5.0.3.2.0 means "Black has a great piece disadvantage" (which should really be reworded as White's advantage). The text parsing would be pretty much the same, the look up table would be more complicated (but not unbearably so), the NAG list can be shortened and simplified since parameters can be reused, and it would be easier to translate into other human languages. Composite NAG also allows things that can't be otherwise done such as $1.1.300.20 to mean "White has 300 seconds and 20 milliseconds of time left". But I haven't finalized all the details or thought about how to improve or simplify it. Composite NAG is just an idea, for now VGN (1.0) will use PGN's NAG as is.

There should also be NAG to represent "only legal move" (7 and 8 are not this). There should also be NAGs for "ended turn in check" (which is legal in timed games) and "illegal move".

For the sake of versatility VGN can also support chess variants and fairy chess with the tags named "Rules" and "Pieces". For the Rules tag take every rule of normal western chess and number them (the normal rules should occupy the lowest numbers) then make a numbered list for every other rule supported. The Rules tag's values would be a list of which rules are used and there should be some short hand for starting with the normal rules. A number base above 10 could be used for compression since the value won't be human usable anyway. For the Pieces tag the normal pieces should use their normal symbol and there should be a short hand for starting with the normal pieces. The other supported piece types need to be labeled in some way. I'm not sure how pawn promotion would work but I'm thinking in the Pieces tag the symbol should start with an "=" (equal sign) to show that pawns can promote to it. Also note that without the SetUp tag the board will be the normal starting one and will not contain any special pieces. I'm a bit vague in this paragraph because my parser doesn't support it so anyone who wants to be able to do so will need to work out the details.

VGN Formal Definition 1

(* Using Extended Backus–Naur Form. And yes I realize the blog isn't wide enough for this. So paste it somewhere else.  *)
(* Note that only required white space is included, there are many places with optional whiteSpace *)
(* Special characters in ?? were borrowed from Augmented Backus–Naur Form *)
vgn_file = {vgn_game};
vgn_game = tag_section, (move_plain_text_section | move_binary_section);
(* Unfortunately the binary section will cause false positives for validation since invalid plain text will match as binary. *)

comment = ('{', {? VCHAR ? | ? WSP ? | ? EOL ?}, '}')
        | (';', {? VCHAR ? | ? WSP ?});  (* block comments can have '{' but not '}' *)
(* EOL is any kind of end line *)
tag_section = {{comment}, tag_pair, {comment}}; (* The tag section for a binary move format can't end with a comment or whitespace. *)
tag_pair = '[', identifier, ? WSP ?, tag_value, ']';
tag_value = '"', {? VCHAR ? | ? WSP ?}, '"'; (* the value can't have '"' unless preceded by '\' *)

game_termination = '1-0' | '0-1' | '1/2-1/2' | '*';
numeric_annotation_glyph = '$', digit, {digit}, ? WSP ?;
white_move_number = digit, {digit}, '.', ? WSP ?;  (* Can't be '0.' and must be in order *)
black_move_number = digit, {digit}, '.5', ? WSP ?;  (* Can't be '0.5' and must be in order *)
(* move = ???; This syntax depends on the move text format *)

move_element = [white_move_number], move, ? WSP ?, {numeric_annotation_glyph},
               [black_move_number], move, ? WSP ?, {numeric_annotation_glyph}, {comment};
last_move_element = [white_move_number], move, ? WSP ?, {numeric_annotation_glyph},
                   [[black_move_number], move, ? WSP ?, {numeric_annotation_glyph}], {comment};

recursive_element = move_element | recursive_annotation_variation | comment;
recursive_annotation_variation = '(', {recursive_element}, [last_move_element, {recursive_annotation_variation}, [game_termination]], {comment}, ')';

move_plain_text_section = {comment},
    [
       {move_element | recursive_annotation_variation},
       last_move_element,
       {recursive_annotation_variation}
    ],
    game_termination;
(* Note that comments that appear after game-termination are considered part of the next game (even if there isn't one). *)
move_binary_section = ? OCTET ?, {? OCTET ?};
(* OCTET is any byte *)

VGN Formal Definition 2

//in JavaScript. Paste into a syntax highlighter for best results.
var comment = '(?:\\s*(?:\\{[^}]*\\}|;.*)\\s*)';
var tag_value = '"(?:\\\\"|[^"\\r\\n])*"';  //tag_value can't contain end lines
var tag_pair = '\\[\\w+\\s+' + tag_value + '\\s*\\]';
var tag_section = '(?:\\s*' + comment + '*' + tag_pair + comment + '*\\s*)*';
    //The tag section for a binary move format can't end with a comment or whitespace.

var game_termination = '(?:1-0|0-1|1/2-1/2|\\*)';
var numeric_annotation_glyph = '(?:$\\d+\\s)';
var white_move_number = '(?:[1-9]\\d*\\.\\s)';
var black_move_number = '(?:[1-9]\\d*\\.5\\s)';
//var move is regex based on the move text format

var move_element = '(?:' + white_move_number + '?\\s*' + move.source + '\\s+' + numeric_annotation_glyph + '*\\s*' +
                           black_move_number + '?\\s*' + move.source + '\\s+' + numeric_annotation_glyph + '*' + comment + '*)';
var last_move_element = '(?:' + white_move_number + '?\\s*' + move.source + '\\s+' + numeric_annotation_glyph + '*\\s*' +
                        '(?:' + black_move_number + '?\\s*' + move.source + '\\s+' + numeric_annotation_glyph + '*)?' + comment + '*)';

var recursive_annotation_variation = '(?:\\s+|[()]|' + comment + '|' + last_move_element + '|' + game_termination + ')';
//note that rav can't be perfectly validated by regex because the rav can't contain itself
var move_plain_text_section = comment + '*(?:' +
    '(?:\\s*' + move_element + '\\s*|' + recursive_annotation_variation + ')*' +
    '\\s*' + last_move_element + '\\s*' + recursive_annotation_variation + '*)?\\s*' +
    game_termination;
//Note that comments that appear after game-termination are considered part of the next game (even if there isn't one).

var move_binary_section = '[\\s\\S]*';
//Unfortunately the binary regex will cause vgn_file.test() to always return true

var vgn_game = tag_section + '(?:' + move_plain_text_section + '|' + move_binary_section + ')';
var vgn_file = new RegExp('^(?:' + vgn_game + ')*$', 'i');