After reading the specification for Portable Game Notation (PGN; see previous blog post for more information) it got me thinking about some things I hadn't considered when I originally wrote my chess notations. So now I think I'll update them (the new versions of the notations are 1.1).
FCN and SFEN have the symbol + for check and # for "the game is over" but now I'm adding another digit which is either +, #, or absent. The first 2 digits are the same, for the final digit the + means a winner was determined but not because of a checkmate and # means the game is a draw but not because of a stalemate. The new digit allows +# to unambiguously mean checkmate. Examples of non-checkmate wins are by disqualification, surrender, or running out of time. Examples of non-stalemate draws are by agreement, unwinnable game (eg insufficient material), 3 fold repetition, or the 50 moves rule. Most games would not need to change their notation because all of these are rare outside of tournaments. The final digit being either + or # aids memory: +# is the most common win and #+ is uncommon, # is stalemate and ## is another type of draw. The first 2 digits have not changed meaning but the second digit being # is required for a third digit to exist. So Ra1-a8+#+ means that a check occurred, the game is over, and a winner was determined without checkmate. This makes a total of 6 end game combinations and 2 non-end game (instead of 2 and 2). An unfinished game should not use ## since that would mean the game is over instead in VGN use "*" (for non VGN just have no more moves).
BCFEN game termination (something used for VGN etc) is 0x88 (previously unused) although only 0x8 would be necessary, it needs to be divisible by bytes (even though it isn't divisible by board size). PGC uses 0x06 assuming you are at the top level in the game text section (ie not inside of a string, move number, etc). For BCCF the last bit (previously unused) will be 1: the game is over 0: it isn't. Therefore the game termination symbol is a single bit and doesn't change the length of the input at all.
I'll make another change to BCCF. Previously promotion bits of 00 either meant promoted to a rook or no promotion. Thus to determine if promotion was occurring the parser needed to check if the destination was either rank 1 or 8 (depending on color) and if the piece moving was a pawn. But now I'll use the penultimate bit (the bit right after the promotion bits) to be 1: promotion occurred 0: it didn't. This makes parsing easier. So now all of the bits of BCCF are used.
A note on size comparison. PGN explained SAN better. I was a bit off in my initial "Chess Notation" blog post. SAN maximum length is 7 (eg Qa6xb7# and fxg1=Q+) with an average of slightly more than 3. PGC was poorly explained but it sounded like the moves are variable length (thus slower parsing), includes NAG and other unnecessary things (thus larger), the moves are the index of every possible SAN move if sorted alphabetically (thus much more difficult than SAN), and each move takes either 2 or 3 bytes (thus is on average larger than BCCF). This all means PGC is inferior in pretty every way to BCCF. If you want a compressed game use BCCF and store any additional data elsewhere. Or just use VGN with MCN and zip the file or something but there's no need to go through the pain of PGC.
For SFEN I decided that no one cares about en passant and therefore it is not required even if a double move occurs. If you would like to use "-" for en passant (to mark that there isn't one) then you must also have the castling information so that it is not ambiguous. The castling information is now also optional. Thus the new regex for SFEN looks like this:
/^(?:[KQBNRPkqbnrp1-8]{1,8}\/){7}[KQBNRPkqbnrp1-8]{1,8}(?: [WBwb])?(?: (?:-|K?Q?k?q?)(?: -| [a-hA-H][1-8])?)?(?: \+?(?:#[+#]?)?)?$/
and again the empty string is not valid castling information and it should not have trailing spaces. FCN's regex is now:
/^(?:P[A-H][1-8]-[A-H][1-8](?:EN|(?:X[QBNRP])?(?:=[QBNR])?)|[KQBNR][A-H][1-8]-[A-H][1-8](?:X[QBNRP])?|[KQ]C)\+?(?:#[+#]?)?$/i
the castling was moved to the end because otherwise Qc1-a1 would match as a queen's castle. FCN has a minimum length of 2 (eg KC) with a maximum of 13 (eg Pa7-B8xR=q+#+) with the average being more than 6.
For the record, of my notations I just made changes to: BCCF, FCN, SFEN, and BCFEN (they are all now version 1.1). My only notation that didn't change was MCN because there's nothing to change, it does exactly what it is intended to do perfectly: express a chess move with a notation as simple as possible, as short as possible, and in a human friendly way (in that order). (I also didn't change VGN because I just made it).
Sunday, May 10, 2015
VGN: Variable Game Notation
I looked up and read the (original) specification for Portable Game Notation (PGN) the original website was taken down but can be read here (link) and other places. As I started reading I thought I would make a rant blog post called "PGN: Painful Game Notation" and start with a quote of God's response to Babel but it turned out that PGN really isn't that bad. The problem is that PGN uses SAN for the move text section (and little else) despite being a terrible format.
Actually I did have to face palm particularly hard in the PGN spec section 3.2.3 Speed of processing which states that programs can "easily scan PGN data without having to have a full chess engine or other complex parsing routines" aaaarg it still hurts to look at. How could it even say such a thing? Was this part written before the writer knew what SAN was and this section wasn't updated afterward? Parsing (and writing) SAN ALWAYS requires a complex parsing routine in addition to a full chess engine (AI, GUI, etc not required). To extract just the coordinates from SAN requires this regex:
/^[QBNRK]?([A-H]?)([1-8]?)[x-]([A-H][1-8])(?:=[QBNR])?[+#]?$/i
which is far more complex parsing than Long Algebraic Notation (LAN) which would be (in javascript):
move(text.substr(1, 2), text.substr(4, 2))
assuming the pawn was padded to have a symbol (MCN is even more simple).
PGN also violates a couple of its goals. Mainly 2.2.1 because SAN is designed to be unnecessarily complex. There's also a problem with goal 2.2.3 because it doesn't allow any move text format except SAN (actually there are 2 others: FEN and PGC but the support is lacking as explained later).
There's also a list of things to nitpick. Section 6 defines a completely useless token "%" which does exactly the same thing as an already existing token ";" (semicolon) which is for a single line comment. There was no reason to reserve <> (in 7) since stray character are invalid anyway. Illegal moves are not allowed by SAN but PGN itself shouldn't care. (Petty paragraphs like this are not the reason this blog post is 5ish pages in length excluding formal definitions.)
This is done by the use of a new tag named MoveFormat eg: [MoveFormat "FCN:Friendly Coordinate Notation"]. Note that it is called MoveFormat even though FEN and others describe the board instead of the moves. The first value is what the program reads to determine which format the move text will be in, the second value is the human readable name and is not used by the parser (there are only 2 values). Because the second value is ignored by the parser it is optional. If the tag doesn't exist then the default is SAN for the sake of compatibility with PGN. In the event that multiple notations use the same acronym then number them such as: SAN, SAN2, SAN3 a parser should make public which formats are supported and how the duplicates are mapped. Multiple versions of the same format should be treated in the same way: numbers should be used (if multiple versions are supported) and make public how they map.
If the move format is a binary format (instead of plain text) then the MoveFormat tag must be the last tag and the encoding will start immediately after the closing square brace, no plain text comments of either kind are allowed after it and there is also no end line. Some example binary formats are BCCF, BCFEN, and PGC. After a binary format is terminated the parsing goes back to plain text, if there is another game to follow it is a good idea to use some end lines to separate the games so that a human can more easily find the plain text games in a file (although it would be better to segregate the games into 2 files for binary and plain text). Each binary format specifies its own game termination marker.
Binary move text sections are required to have at least 1 byte for the game termination marker (just like plain text). A binary move section is also required to be divisible by bytes for the sake of sanity. If a binary format is not normally divisible by bytes then it must contain 1 to 7 padding bits after the game termination marker. VGN parsers will send data to the binary parser byte by byte (which is one reason why it needs to be as a whole divisible by byte) so it would be a good idea for the binary format to also be divisible by bytes.
In PGN the SetUp tag always requires the FEN tag so a parser can ignore the SetUp and detect FEN because SetUp was 100% redundant/useless. For VGN the SetUp tag contains the board string directly. The first value of SetUp is the move format abbreviation and the second value is the board text. If the first value is omitted SFEN is used (for brevity and FEN compatibility). The board string may contain end game markers (the move text kind such as +# not the game terminators such as *) in which case the move text section will be empty (with only a game terminator). Note that even without this tag it is possible to have a game without any moves.
Minor things. % is no longer a symbol use semicolon instead (a semicolon is also valid PGN). Each format has its own parser but in all cases stray text is considered invalid (this includes < and >). Since the Result tag wasn't being used by parsers it should contain human readable information (such as "black ran out of time", "white surrendered", "Stalemate", or "White wins") instead of the game termination symbols (however the game termination symbols located at the end of the move text section remain the same and are still required). The Date tag is the start date. There is also a tag for EndDate, if omitted it is assumed to be the same as the start date. VGN does not disallow illegal moves, it is the move text format to decide that, and they should all allow it (SAN has a limitation of not allowing illegal moves thus making some games impossible to record in SAN).
About PGN move text formats. Earlier I said that PGN had partial support for notations that are not SAN. What PGN does is change the extension type of the file. ".pgn" is for normal games (that use SAN) ".fen" is for FEN and ".pgc" for PGC. While it is a good idea to separate binary from plain text it isn't necessary to require it. But more importantly using the extension type is bad design since not all of the information a parser needs is included in the file's contents. This is especially problematic for tools that allow text to be pasted since there will be no file extension at all. Additionally with many types of supported move text formats there would be many extension types all tied to the same program (your Operating System will require a lot of file type associations). Tags already have information about the game, the format should be located in a tag. Therefore instead of all of this VGN has only a single extension type which is ".vgn" and has the move text format included in a tag.
Which brings up a good point: VGN should indicate itself within the file contents. The tag is named GameFormat with the value defaulted to "PGN" for compatibility. It has 2 values just like the MoveFormat tag: [GameFormat "VGN:Variable Game Notation"] with the second being human readable and parser ignored. Note that this tag is required for each game (not once per file but once per game). If the file contains at least 1 VGN then the extension should be ".vgn"to indicate that a PGN parser can't read it but a VGN parser can. Note that having both FEN and SAN in the same file requires VGN since PGN doesn't have the MoveFormat tag.
VGN is very similar to PGN and is mostly backwards compatible. So you might ask "Why specify a whole new format? Why not simply call this PGN version 1.1?" because I don't own PGN. If I was to call this PGN 1.1 then later the owners of PGN may come out with an official 1.1 then my format would be incorrectly labeled.
More random stuff. White's move number is still optional and should have a single period; black's move number (if present) should be the whole move number followed by ".5" this is shorter than 3 periods and intuitively matches the half moves. Since I'm actually writing a program to parse VGN I've simplified a few things from PGN: comments can only appear before or between tags, before white's move number, or after black's NAG (PGN was unclear about where comments could exist since they weren't defined in the incomplete formal definition section). NAG can only appear after the move.
I ought to rewrite the Numeric Annotation Glyphs (NAG) entirely. NAG should be composite (taking multiple arguments) so that the NAG list doesn't need every possible commentary but only some formats. For example PGN NAG 26 is "White has a moderate space advantage" it should be more like $5.1.2.1.1 where 5 is the pattern "{0} has a {1} {2} {3}", 1 means white, 2 means moderate, 1 means space, 1 means advantage. Another example: $5.0.3.2.0 means "Black has a great piece disadvantage" (which should really be reworded as White's advantage). The text parsing would be pretty much the same, the look up table would be more complicated (but not unbearably so), the NAG list can be shortened and simplified since parameters can be reused, and it would be easier to translate into other human languages. Composite NAG also allows things that can't be otherwise done such as $1.1.300.20 to mean "White has 300 seconds and 20 milliseconds of time left". But I haven't finalized all the details or thought about how to improve or simplify it. Composite NAG is just an idea, for now VGN (1.0) will use PGN's NAG as is.
There should also be NAG to represent "only legal move" (7 and 8 are not this). There should also be NAGs for "ended turn in check" (which is legal in timed games) and "illegal move".
For the sake of versatility VGN can also support chess variants and fairy chess with the tags named "Rules" and "Pieces". For the Rules tag take every rule of normal western chess and number them (the normal rules should occupy the lowest numbers) then make a numbered list for every other rule supported. The Rules tag's values would be a list of which rules are used and there should be some short hand for starting with the normal rules. A number base above 10 could be used for compression since the value won't be human usable anyway. For the Pieces tag the normal pieces should use their normal symbol and there should be a short hand for starting with the normal pieces. The other supported piece types need to be labeled in some way. I'm not sure how pawn promotion would work but I'm thinking in the Pieces tag the symbol should start with an "=" (equal sign) to show that pawns can promote to it. Also note that without the SetUp tag the board will be the normal starting one and will not contain any special pieces. I'm a bit vague in this paragraph because my parser doesn't support it so anyone who wants to be able to do so will need to work out the details.
Actually I did have to face palm particularly hard in the PGN spec section 3.2.3 Speed of processing which states that programs can "easily scan PGN data without having to have a full chess engine or other complex parsing routines" aaaarg it still hurts to look at. How could it even say such a thing? Was this part written before the writer knew what SAN was and this section wasn't updated afterward? Parsing (and writing) SAN ALWAYS requires a complex parsing routine in addition to a full chess engine (AI, GUI, etc not required). To extract just the coordinates from SAN requires this regex:
/^[QBNRK]?([A-H]?)([1-8]?)[x-]([A-H][1-8])(?:=[QBNR])?[+#]?$/i
which is far more complex parsing than Long Algebraic Notation (LAN) which would be (in javascript):
move(text.substr(1, 2), text.substr(4, 2))
assuming the pawn was padded to have a symbol (MCN is even more simple).
PGN also violates a couple of its goals. Mainly 2.2.1 because SAN is designed to be unnecessarily complex. There's also a problem with goal 2.2.3 because it doesn't allow any move text format except SAN (actually there are 2 others: FEN and PGC but the support is lacking as explained later).
There's also a list of things to nitpick. Section 6 defines a completely useless token "%" which does exactly the same thing as an already existing token ";" (semicolon) which is for a single line comment. There was no reason to reserve <> (in 7) since stray character are invalid anyway. Illegal moves are not allowed by SAN but PGN itself shouldn't care. (Petty paragraphs like this are not the reason this blog post is 5ish pages in length excluding formal definitions.)
VGN: Variable Game Notation
Based on PGN, this is a game notation specification which is different than move text notations. Although PGN is almost usable there is 1 really important difference that is required: the ability to use move text notations that aren't SAN.This is done by the use of a new tag named MoveFormat eg: [MoveFormat "FCN:Friendly Coordinate Notation"]. Note that it is called MoveFormat even though FEN and others describe the board instead of the moves. The first value is what the program reads to determine which format the move text will be in, the second value is the human readable name and is not used by the parser (there are only 2 values). Because the second value is ignored by the parser it is optional. If the tag doesn't exist then the default is SAN for the sake of compatibility with PGN. In the event that multiple notations use the same acronym then number them such as: SAN, SAN2, SAN3 a parser should make public which formats are supported and how the duplicates are mapped. Multiple versions of the same format should be treated in the same way: numbers should be used (if multiple versions are supported) and make public how they map.
If the move format is a binary format (instead of plain text) then the MoveFormat tag must be the last tag and the encoding will start immediately after the closing square brace, no plain text comments of either kind are allowed after it and there is also no end line. Some example binary formats are BCCF, BCFEN, and PGC. After a binary format is terminated the parsing goes back to plain text, if there is another game to follow it is a good idea to use some end lines to separate the games so that a human can more easily find the plain text games in a file (although it would be better to segregate the games into 2 files for binary and plain text). Each binary format specifies its own game termination marker.
Binary move text sections are required to have at least 1 byte for the game termination marker (just like plain text). A binary move section is also required to be divisible by bytes for the sake of sanity. If a binary format is not normally divisible by bytes then it must contain 1 to 7 padding bits after the game termination marker. VGN parsers will send data to the binary parser byte by byte (which is one reason why it needs to be as a whole divisible by byte) so it would be a good idea for the binary format to also be divisible by bytes.
In PGN the SetUp tag always requires the FEN tag so a parser can ignore the SetUp and detect FEN because SetUp was 100% redundant/useless. For VGN the SetUp tag contains the board string directly. The first value of SetUp is the move format abbreviation and the second value is the board text. If the first value is omitted SFEN is used (for brevity and FEN compatibility). The board string may contain end game markers (the move text kind such as +# not the game terminators such as *) in which case the move text section will be empty (with only a game terminator). Note that even without this tag it is possible to have a game without any moves.
Minor things. % is no longer a symbol use semicolon instead (a semicolon is also valid PGN). Each format has its own parser but in all cases stray text is considered invalid (this includes < and >). Since the Result tag wasn't being used by parsers it should contain human readable information (such as "black ran out of time", "white surrendered", "Stalemate", or "White wins") instead of the game termination symbols (however the game termination symbols located at the end of the move text section remain the same and are still required). The Date tag is the start date. There is also a tag for EndDate, if omitted it is assumed to be the same as the start date. VGN does not disallow illegal moves, it is the move text format to decide that, and they should all allow it (SAN has a limitation of not allowing illegal moves thus making some games impossible to record in SAN).
About PGN move text formats. Earlier I said that PGN had partial support for notations that are not SAN. What PGN does is change the extension type of the file. ".pgn" is for normal games (that use SAN) ".fen" is for FEN and ".pgc" for PGC. While it is a good idea to separate binary from plain text it isn't necessary to require it. But more importantly using the extension type is bad design since not all of the information a parser needs is included in the file's contents. This is especially problematic for tools that allow text to be pasted since there will be no file extension at all. Additionally with many types of supported move text formats there would be many extension types all tied to the same program (your Operating System will require a lot of file type associations). Tags already have information about the game, the format should be located in a tag. Therefore instead of all of this VGN has only a single extension type which is ".vgn" and has the move text format included in a tag.
Which brings up a good point: VGN should indicate itself within the file contents. The tag is named GameFormat with the value defaulted to "PGN" for compatibility. It has 2 values just like the MoveFormat tag: [GameFormat "VGN:Variable Game Notation"] with the second being human readable and parser ignored. Note that this tag is required for each game (not once per file but once per game). If the file contains at least 1 VGN then the extension should be ".vgn"to indicate that a PGN parser can't read it but a VGN parser can. Note that having both FEN and SAN in the same file requires VGN since PGN doesn't have the MoveFormat tag.
VGN is very similar to PGN and is mostly backwards compatible. So you might ask "Why specify a whole new format? Why not simply call this PGN version 1.1?" because I don't own PGN. If I was to call this PGN 1.1 then later the owners of PGN may come out with an official 1.1 then my format would be incorrectly labeled.
More random stuff. White's move number is still optional and should have a single period; black's move number (if present) should be the whole move number followed by ".5" this is shorter than 3 periods and intuitively matches the half moves. Since I'm actually writing a program to parse VGN I've simplified a few things from PGN: comments can only appear before or between tags, before white's move number, or after black's NAG (PGN was unclear about where comments could exist since they weren't defined in the incomplete formal definition section). NAG can only appear after the move.
I ought to rewrite the Numeric Annotation Glyphs (NAG) entirely. NAG should be composite (taking multiple arguments) so that the NAG list doesn't need every possible commentary but only some formats. For example PGN NAG 26 is "White has a moderate space advantage" it should be more like $5.1.2.1.1 where 5 is the pattern "{0} has a {1} {2} {3}", 1 means white, 2 means moderate, 1 means space, 1 means advantage. Another example: $5.0.3.2.0 means "Black has a great piece disadvantage" (which should really be reworded as White's advantage). The text parsing would be pretty much the same, the look up table would be more complicated (but not unbearably so), the NAG list can be shortened and simplified since parameters can be reused, and it would be easier to translate into other human languages. Composite NAG also allows things that can't be otherwise done such as $1.1.300.20 to mean "White has 300 seconds and 20 milliseconds of time left". But I haven't finalized all the details or thought about how to improve or simplify it. Composite NAG is just an idea, for now VGN (1.0) will use PGN's NAG as is.
There should also be NAG to represent "only legal move" (7 and 8 are not this). There should also be NAGs for "ended turn in check" (which is legal in timed games) and "illegal move".
For the sake of versatility VGN can also support chess variants and fairy chess with the tags named "Rules" and "Pieces". For the Rules tag take every rule of normal western chess and number them (the normal rules should occupy the lowest numbers) then make a numbered list for every other rule supported. The Rules tag's values would be a list of which rules are used and there should be some short hand for starting with the normal rules. A number base above 10 could be used for compression since the value won't be human usable anyway. For the Pieces tag the normal pieces should use their normal symbol and there should be a short hand for starting with the normal pieces. The other supported piece types need to be labeled in some way. I'm not sure how pawn promotion would work but I'm thinking in the Pieces tag the symbol should start with an "=" (equal sign) to show that pawns can promote to it. Also note that without the SetUp tag the board will be the normal starting one and will not contain any special pieces. I'm a bit vague in this paragraph because my parser doesn't support it so anyone who wants to be able to do so will need to work out the details.
VGN Formal Definition 1
(* Using Extended Backus–Naur Form. And yes I realize the blog isn't wide enough for this. So paste it somewhere else. *) (* Note that only required white space is included, there are many places with optional whiteSpace *) (* Special characters in ?? were borrowed from Augmented Backus–Naur Form *) vgn_file = {vgn_game}; vgn_game = tag_section, (move_plain_text_section | move_binary_section); (* Unfortunately the binary section will cause false positives for validation since invalid plain text will match as binary. *) comment = ('{', {? VCHAR ? | ? WSP ? | ? EOL ?}, '}') | (';', {? VCHAR ? | ? WSP ?}); (* block comments can have '{' but not '}' *) (* EOL is any kind of end line *) tag_section = {{comment}, tag_pair, {comment}}; (* The tag section for a binary move format can't end with a comment or whitespace. *) tag_pair = '[', identifier, ? WSP ?, tag_value, ']'; tag_value = '"', {? VCHAR ? | ? WSP ?}, '"'; (* the value can't have '"' unless preceded by '\' *) game_termination = '1-0' | '0-1' | '1/2-1/2' | '*'; numeric_annotation_glyph = '$', digit, {digit}, ? WSP ?; white_move_number = digit, {digit}, '.', ? WSP ?; (* Can't be '0.' and must be in order *) black_move_number = digit, {digit}, '.5', ? WSP ?; (* Can't be '0.5' and must be in order *) (* move = ???; This syntax depends on the move text format *) move_element = [white_move_number], move, ? WSP ?, {numeric_annotation_glyph}, [black_move_number], move, ? WSP ?, {numeric_annotation_glyph}, {comment}; last_move_element = [white_move_number], move, ? WSP ?, {numeric_annotation_glyph}, [[black_move_number], move, ? WSP ?, {numeric_annotation_glyph}], {comment}; recursive_element = move_element | recursive_annotation_variation | comment; recursive_annotation_variation = '(', {recursive_element}, [last_move_element, {recursive_annotation_variation}, [game_termination]], {comment}, ')'; move_plain_text_section = {comment}, [ {move_element | recursive_annotation_variation}, last_move_element, {recursive_annotation_variation} ], game_termination; (* Note that comments that appear after game-termination are considered part of the next game (even if there isn't one). *) move_binary_section = ? OCTET ?, {? OCTET ?}; (* OCTET is any byte *)
VGN Formal Definition 2
//in JavaScript. Paste into a syntax highlighter for best results. var comment = '(?:\\s*(?:\\{[^}]*\\}|;.*)\\s*)'; var tag_value = '"(?:\\\\"|[^"\\r\\n])*"'; //tag_value can't contain end lines var tag_pair = '\\[\\w+\\s+' + tag_value + '\\s*\\]'; var tag_section = '(?:\\s*' + comment + '*' + tag_pair + comment + '*\\s*)*'; //The tag section for a binary move format can't end with a comment or whitespace. var game_termination = '(?:1-0|0-1|1/2-1/2|\\*)'; var numeric_annotation_glyph = '(?:$\\d+\\s)'; var white_move_number = '(?:[1-9]\\d*\\.\\s)'; var black_move_number = '(?:[1-9]\\d*\\.5\\s)'; //var move is regex based on the move text format var move_element = '(?:' + white_move_number + '?\\s*' + move.source + '\\s+' + numeric_annotation_glyph + '*\\s*' + black_move_number + '?\\s*' + move.source + '\\s+' + numeric_annotation_glyph + '*' + comment + '*)'; var last_move_element = '(?:' + white_move_number + '?\\s*' + move.source + '\\s+' + numeric_annotation_glyph + '*\\s*' + '(?:' + black_move_number + '?\\s*' + move.source + '\\s+' + numeric_annotation_glyph + '*)?' + comment + '*)'; var recursive_annotation_variation = '(?:\\s+|[()]|' + comment + '|' + last_move_element + '|' + game_termination + ')'; //note that rav can't be perfectly validated by regex because the rav can't contain itself var move_plain_text_section = comment + '*(?:' + '(?:\\s*' + move_element + '\\s*|' + recursive_annotation_variation + ')*' + '\\s*' + last_move_element + '\\s*' + recursive_annotation_variation + '*)?\\s*' + game_termination; //Note that comments that appear after game-termination are considered part of the next game (even if there isn't one). var move_binary_section = '[\\s\\S]*'; //Unfortunately the binary regex will cause vgn_file.test() to always return true var vgn_game = tag_section + '(?:' + move_plain_text_section + '|' + move_binary_section + ')'; var vgn_file = new RegExp('^(?:' + vgn_game + ')*$', 'i');
Sunday, April 12, 2015
Chess Notation
First you must know how to play Western Chess (link). Then you need to know chess Algebraic Notation (link) you only need to read the Algebraic Notation section.
I'll start by complaining about SAN (Standard Algebraic Notation) which is used by PGN (Portable Game Notation). It is designed to be slightly compact however it only saves 2 characters (3 for pawns) per move but doing so makes the definition complicated. SAN does not allow the source coordinate unless necessary due to ambiguity (link). This makes the notation difficult for both computers and humans to read and write because they must first determine every possible move of every piece of the same kind. So you have to jump through a flaming hoop to save a few pennies. Usability is much more important than a few bytes especially in modern times! The obvious defense is that this notation was made a long time ago when bytes were important but this notation is not even close to compressed enough for that excuse (see my Binary Compressed Coordinate Format below). Now if the source coordinate was ignored when unneeded I would not be complaining. There is no reason for it to not allow the source coordinate, programs (and humans) already know the coordinates but they must use extra code (or thought) in order to not allow the source coordinate based on these rules.
</Rant> So let's make the world a better place by using a better notation. Here are the ones I've defined (only BCCF and FCN are useful). Note that PGN allows comments and turn numbers and stuff none of which I'll talk about.
Here is the Perl regular expression definition: /^[A-H][1-8][A-H][1-8][QBNR]?$/i
Rational: This notation is the smallest that can uniquely do everything (and be usable) but is not reversible because capturing isn't recorded.
The first 3 bits are the octal source file followed by 3 bits of the octal source rank (total of 6 bits so far). Then 6 bits for the octal destination coordinate. The next 2 bits are what the pawn was promoted to (ignored if does not apply). The last 2 bits have no defined meaning (thus are always ignored). Note that the rank/file are 0 indexed. The promotion pieces in octal are: 0 Rook, 1 Knight, 2 Bishop, 3 Queen. 00 should be used when promotion does not apply.
The last 2 bits exist to keep the bits divisible by bytes, they have no official meaning however an optional meaning can be given to them. For example a bit could represent white/black or yes/no for if a piece was captured. Or they could be yes/no this move caused check and yes/no this move ended the game, so that all 4 states are: 00 normal state, 10 check, 11 check mate, 01 stale mate. They could represent en passant or castling occurred.
Example: in binary, grouped by meaning 000 000 111 111 10 00 (grouped by nibble: 0000 0011 1111 1000 in hex: 0x03F8) this translates to A1H8B (with optional bits as 00) of course this is not a valid move. In order to send a byte array over a network look up Base64 encoding.
Compression compared to SAN: BCCF always takes 2 bytes, SAN takes between 2 and 6 bytes (queen's castle is 0-0-0 the next longest is Rc1c3).
Rational: an entire chess game can be compressed and stored or sent over a network.
Here is the Perl regular expression definition:
/^(?:[KQ]C|P[A-H][1-8]-[A-H][1-8](?:EN|(?:X[QBNRP])?(?:=[QBNR])?)|[KQBNR][A-H][1-8]-[A-H][1-8](?:X[QBNRP])?)\+?#?$/i
And yes it is possible to win the game by performing a castling. Observe this FEN: 4rkr1/4p1p1/8/8/8/8/8/3K2R1 w K - ? ? then 4rkr1/4p1p1/8/8/8/8/8/5RK1 b - - ? ? checkmate. Which brings me to my next notation definition.
Rational: this contains as much information as possible, includes "=x-" for padding, and includes special notations "en", "KC", "QC" all of which maximized human usability.
Here is a Perl regular expression for some validation:
/^(?:[KQBNRPkqbnrp1-8]{1,8}\/){7}[KQBNRPkqbnrp1-8]{1,8}(?: [WBwb] (?:-|K?Q?k?q?)(?: [a-hA-H][1-8])?)?(?: (?:\+#|\+|#))?$/
Here are some examples:
4rkr1/4p1p1/8/8/8/8/8/5RK1 b - +#
4rkr1/4p1p1/8/8/8/8/8/5RK1 +#
4rkr1/4p1p1/8/8/8/8/8/5RK1 B -
rnbqkbnr/pppppppp/8/8/8/P7/1PPPPPPP/RNBQKBNR b KQkq a2
4rkr1/4p1p1/8/8/8/8/8/5RK1
Although a regular expression could do more validation for castling ability it would be wasteful to list each possibility and even then it can't account for invalid piece movement. Therefore it doesn't account for a single invalid case of the empty string instead of a hyphen for castling ability.
Rational: for completeness and minor improvements. The main thing is adding "+#". I removed the Full move number for being useless and the Half move clock for being virtually useless (is only used in tournaments which wouldn't be using board snapshots especially not for this number). I made the en passant indicator optional because it is insane not to be and made the turn indicator (and en passant coordinate) case insensitive because why not.
0000 0: Empty square
0001 1: White Rook
0010 2: White Knight
0011 3: White Bishop
0100 4: White Queen
0101 5: White King
0110 6: White Pawn
0111 7
1000 8
1001 9: Black Rook
1010 A: Black Knight
1011 B: Black Bishop
1100 C: Black Queen
1101 D: Black King
1110 E: Black Pawn
1111 F
There are 3 unused square states which are: 0x7, 0x8, 0xF. White and black differ by a sign bit with 0x0 being empty square. Feature: an empty rank is all 0, also that the negative 0 is not used. In this notation there is no rank separator (forward slash) since it must be binary and there is no simple way to represent multiple blank squares. Additionally unlike FEN/SFEN the information after the board can't be provided (since it isn't necessary and this format must stay condensed and simple).
The unused states have no official definition but an optional meaning can be given to them such as: threatened by white or black (7/F), possible to en passant here (the empty square that was double moved over), or possible to castle across this square (note that even then castling ability must be tracked separately for example the starting board has none of these squares).
The entire board will be 8*8=64 hex digits (32 bytes). For example here is the starting board in hex (grouped by rank): 9ABCDBA9 EEEEEEEE 00000000 00000000 00000000 00000000 66666666 12345321
Compression comparison: BCFEN always takes 32 bytes per board, SFEN (board only) takes between 17 and 71 bytes. The extra information of SFEN takes 0 to 13 bytes. FEN's half clock goes up to 100 (3 bytes) and the full move clock has a huge unknown limit (in free play these numbers have no limit at all).
Rational: for completeness mostly. For compressing a game BCCN is much shorter and I doubt a large amount of boards will need to be stored or sent. Actually it might be nice to store the boards like this to determine threefold repetition (could need to store 99 half moves) although strings are double size it isn't that bad to store another 1 KiB or so. But this compression can be stored in a binary search tree for fast comparison.
I'll start by complaining about SAN (Standard Algebraic Notation) which is used by PGN (Portable Game Notation). It is designed to be slightly compact however it only saves 2 characters (3 for pawns) per move but doing so makes the definition complicated. SAN does not allow the source coordinate unless necessary due to ambiguity (link). This makes the notation difficult for both computers and humans to read and write because they must first determine every possible move of every piece of the same kind. So you have to jump through a flaming hoop to save a few pennies. Usability is much more important than a few bytes especially in modern times! The obvious defense is that this notation was made a long time ago when bytes were important but this notation is not even close to compressed enough for that excuse (see my Binary Compressed Coordinate Format below). Now if the source coordinate was ignored when unneeded I would not be complaining. There is no reason for it to not allow the source coordinate, programs (and humans) already know the coordinates but they must use extra code (or thought) in order to not allow the source coordinate based on these rules.
</Rant> So let's make the world a better place by using a better notation. Here are the ones I've defined (only BCCF and FCN are useful). Note that PGN allows comments and turn numbers and stuff none of which I'll talk about.
MCN: Minimum Coordinate Notation
This is what PGN should be using. This is the minimum information required while maintaining human usability. It uses coordinates without hyphen and ignores case using King's position for castling. The source and destination coordinates are both always required (therefore is easy to use unlike above mentioned SAN) there is no piece indicator except for pawn promotion. Example of a pawn being promoted to a queen: a7A8q, example of a King's Castle: e1g1.Here is the Perl regular expression definition: /^[A-H][1-8][A-H][1-8][QBNR]?$/i
Rational: This notation is the smallest that can uniquely do everything (and be usable) but is not reversible because capturing isn't recorded.
BCCF: Binary Compressed Coordinate Format
This is only useful for computers and is not human usable. Based on MCN, each move takes 2 bytes.The first 3 bits are the octal source file followed by 3 bits of the octal source rank (total of 6 bits so far). Then 6 bits for the octal destination coordinate. The next 2 bits are what the pawn was promoted to (ignored if does not apply). The last 2 bits have no defined meaning (thus are always ignored). Note that the rank/file are 0 indexed. The promotion pieces in octal are: 0 Rook, 1 Knight, 2 Bishop, 3 Queen. 00 should be used when promotion does not apply.
The last 2 bits exist to keep the bits divisible by bytes, they have no official meaning however an optional meaning can be given to them. For example a bit could represent white/black or yes/no for if a piece was captured. Or they could be yes/no this move caused check and yes/no this move ended the game, so that all 4 states are: 00 normal state, 10 check, 11 check mate, 01 stale mate. They could represent en passant or castling occurred.
Example: in binary, grouped by meaning 000 000 111 111 10 00 (grouped by nibble: 0000 0011 1111 1000 in hex: 0x03F8) this translates to A1H8B (with optional bits as 00) of course this is not a valid move. In order to send a byte array over a network look up Base64 encoding.
Compression compared to SAN: BCCF always takes 2 bytes, SAN takes between 2 and 6 bytes (queen's castle is 0-0-0 the next longest is Rc1c3).
Rational: an entire chess game can be compressed and stored or sent over a network.
FCN: Friendly Coordinate Notation
This is the verbose form of MCN, it is reversible and is the most human friendly notation (it is also computer usable). Begin with the definition of MCN but the notation always starts with the piece symbol. It also has a hyphen between the source and destination positions (example: Pa2-a3). If a capture occurred indicate it after the destination with an "x" then the piece symbol (such as Ra1-a8xQ). For an en passant use "en" instead of "xP" for example: Pa5-b6en. If pawn promotion occurred add an equal sign between capture indication (if any) and the piece symbol promoted to (eg Pa7-B8xR=q or Pa7-A8=N). For castling instead of using the King's coordinates special notation is used: either "KC" for King's Castle or "QC" for Queen's Castle. The notation may end (after pawn promotion) with the 2 symbols: "+#" where "+" means check and "#" means that the game is over. Therefore all 4 states are covered: Ra1-a8+ is check, Ra1-a8+# is check mate, Ra1-a8# is stale mate, and Ra1-a8 is used otherwise.Here is the Perl regular expression definition:
/^(?:[KQ]C|P[A-H][1-8]-[A-H][1-8](?:EN|(?:X[QBNRP])?(?:=[QBNR])?)|[KQBNR][A-H][1-8]-[A-H][1-8](?:X[QBNRP])?)\+?#?$/i
And yes it is possible to win the game by performing a castling. Observe this FEN: 4rkr1/4p1p1/8/8/8/8/8/3K2R1 w K - ? ? then 4rkr1/4p1p1/8/8/8/8/8/5RK1 b - - ? ? checkmate. Which brings me to my next notation definition.
Rational: this contains as much information as possible, includes "=x-" for padding, and includes special notations "en", "KC", "QC" all of which maximized human usability.
SFEN: Shortened FEN
Since this is based on Forsyth-Edwards Notation you will first need to know how that works (link) (don't worry about the Half move clock and Full move number). FEN is useful for board snapshots (such the example above in FCN) but I would make some modifications. The board is the same (and still required) but it is now the only part that is required. The rest is a big chunk that is optional. The same things exist in the same order except that Half move clock and Full move number are not allowed and en passant is optional. I'm also adding the "+#" at the end of the notation which functions the same as in my FCN (see above).Here is a Perl regular expression for some validation:
/^(?:[KQBNRPkqbnrp1-8]{1,8}\/){7}[KQBNRPkqbnrp1-8]{1,8}(?: [WBwb] (?:-|K?Q?k?q?)(?: [a-hA-H][1-8])?)?(?: (?:\+#|\+|#))?$/
Here are some examples:
4rkr1/4p1p1/8/8/8/8/8/5RK1 b - +#
4rkr1/4p1p1/8/8/8/8/8/5RK1 +#
4rkr1/4p1p1/8/8/8/8/8/5RK1 B -
rnbqkbnr/pppppppp/8/8/8/P7/1PPPPPPP/RNBQKBNR b KQkq a2
4rkr1/4p1p1/8/8/8/8/8/5RK1
Although a regular expression could do more validation for castling ability it would be wasteful to list each possibility and even then it can't account for invalid piece movement. Therefore it doesn't account for a single invalid case of the empty string instead of a hyphen for castling ability.
Rational: for completeness and minor improvements. The main thing is adding "+#". I removed the Full move number for being useless and the Half move clock for being virtually useless (is only used in tournaments which wouldn't be using board snapshots especially not for this number). I made the en passant indicator optional because it is insane not to be and made the turn indicator (and en passant coordinate) case insensitive because why not.
BCFEN: Binary Compressed FEN
This is only useful for computers and is not human usable. Based on FEN/SFEN, each rank takes 4 bytes and each board takes 32 bytes. There are 6 pieces: Rook, Knight, Bishop, Queen, King, Pawn. It needs to account for both colors and an empty square. Therefore the total number of square states are 13 which needs 4 bits. These states are defined as (binary hex):0000 0: Empty square
0001 1: White Rook
0010 2: White Knight
0011 3: White Bishop
0100 4: White Queen
0101 5: White King
0110 6: White Pawn
0111 7
1000 8
1001 9: Black Rook
1010 A: Black Knight
1011 B: Black Bishop
1100 C: Black Queen
1101 D: Black King
1110 E: Black Pawn
1111 F
There are 3 unused square states which are: 0x7, 0x8, 0xF. White and black differ by a sign bit with 0x0 being empty square. Feature: an empty rank is all 0, also that the negative 0 is not used. In this notation there is no rank separator (forward slash) since it must be binary and there is no simple way to represent multiple blank squares. Additionally unlike FEN/SFEN the information after the board can't be provided (since it isn't necessary and this format must stay condensed and simple).
The unused states have no official definition but an optional meaning can be given to them such as: threatened by white or black (7/F), possible to en passant here (the empty square that was double moved over), or possible to castle across this square (note that even then castling ability must be tracked separately for example the starting board has none of these squares).
The entire board will be 8*8=64 hex digits (32 bytes). For example here is the starting board in hex (grouped by rank): 9ABCDBA9 EEEEEEEE 00000000 00000000 00000000 00000000 66666666 12345321
Compression comparison: BCFEN always takes 32 bytes per board, SFEN (board only) takes between 17 and 71 bytes. The extra information of SFEN takes 0 to 13 bytes. FEN's half clock goes up to 100 (3 bytes) and the full move clock has a huge unknown limit (in free play these numbers have no limit at all).
Rational: for completeness mostly. For compressing a game BCCN is much shorter and I doubt a large amount of boards will need to be stored or sent. Actually it might be nice to store the boards like this to determine threefold repetition (could need to store 99 half moves) although strings are double size it isn't that bad to store another 1 KiB or so. But this compression can be stored in a binary search tree for fast comparison.
Subscribe to:
Posts (Atom)