Assembler specification — benbridle.com

This is the specification for the standard assembler of the Bedrock computer system.

This document is aimed at people who are implementing the assembler from scratch. For people who are learning about or writing programs for the Bedrock system, the assembler manual will generally be more useful.

Overview

The assembler converts a UTF-8 encoded text file into Bedrock bytecode.

The file extension used for Bedrock source code files is .brc, and the file extension used for assembled Bedrock programs is .br.

Concepts

Token

A token is an instance of a language element. Each token is separated from the next by any of the whitespace characters tab (0x09), newline (0x0A), carriage-return (0x0D), or space (0x20), or by the next token beginning with a delimiter character (, ), [, ], {, }, or ;, or by the previous token ending with a delimiter character or the terminator character :.

String and comment tokens are not split by these rules.
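
The splitting rules above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name split_tokens is hypothetical, and it assumes that a comment ends at the first ) character (no nesting) and that a quote character only opens a string when it appears at the start of a token, neither of which is stated by this specification.

```python
WHITESPACE = "\t\n\r "
DELIMITERS = "()[]{};"
TERMINATOR = ":"

def split_tokens(source: str) -> list[str]:
    """Split source text into token strings (hypothetical helper)."""
    tokens = []
    i, n = 0, len(source)
    while i < n:
        c = source[i]
        if c in WHITESPACE:
            i += 1
        elif c == "(":
            # Assumption: a comment runs to the first ')' and does not nest.
            end = source.find(")", i + 1)
            if end == -1:
                raise ValueError("unmatched ( character")
            tokens.append(source[i:end + 1])
            i = end + 1
        elif c == ")":
            raise ValueError("unmatched ) character")
        elif c in "'\"":
            # Strings are kept whole, whitespace and delimiters included.
            end = source.find(c, i + 1)
            if end == -1:
                raise ValueError("unclosed string")
            tokens.append(source[i:end + 1])
            i = end + 1
        elif c in DELIMITERS:
            # Every other delimiter is a token on its own.
            tokens.append(c)
            i += 1
        else:
            start = i
            while i < n and source[i] not in WHITESPACE and source[i] not in DELIMITERS:
                i += 1
                if source[i - 1] == TERMINATOR:
                    break  # the terminator ends the token that contains it
            tokens.append(source[start:i])
    return tokens
```

Under these assumptions, the text PSH:01 splits into the two tokens PSH: and 01, because the first token ends with the terminator character.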

Identifier

An identifier is a sequence of zero or more characters. There is no restriction on the characters that can be used in an identifier. The maximum length of an identifier is 63 characters.

An identifier longer than 63 characters is an error.

Address

The address of a language element is equal to the number of bytes that were assembled before that element.

Character

A character is a Unicode scalar value.

Hexadecimal digit

A hexadecimal digit is a character in the ranges 0-9, A-F, or a-f.

Language elements

Literal

A byte literal is two hexadecimal digits. It assembles to a byte with value equal to the hexadecimal value.

A double literal is four hexadecimal digits. It assembles to a double with value equal to the hexadecimal value.
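
As an illustration, a token could be recognised as a literal along the following lines. The function name parse_literal and the returned tuple shape are hypothetical. The digits are checked against an explicit set because Python's int(text, 16) also accepts forms such as a 0x prefix, which are not hexadecimal digits as defined above.

```python
from typing import Optional

HEX_DIGITS = set("0123456789ABCDEFabcdef")

def parse_literal(token: str) -> Optional[tuple[str, int]]:
    """Return ("byte", value) or ("double", value), or None if not a literal."""
    if len(token) not in (2, 4):
        return None
    if not all(c in HEX_DIGITS for c in token):
        return None
    kind = "byte" if len(token) == 2 else "double"
    return kind, int(token, 16)
```

For example, parse_literal("2A") gives ("byte", 42), while a three-character token such as ABC is not a literal and would instead be treated as a symbol.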

Spacer

A spacer is a # character followed by two or four hexadecimal digits. It assembles to a number of zero bytes equal to the hexadecimal value.
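
A spacer could be assembled as sketched below. The function name assemble_spacer is hypothetical, and raising ValueError for a malformed token is a choice made for this example only.

```python
HEX_DIGITS = set("0123456789ABCDEFabcdef")

def assemble_spacer(token: str) -> bytes:
    """Assemble a spacer token such as '#04' into a run of zero bytes."""
    digits = token[1:]
    if (not token.startswith("#")
            or len(digits) not in (2, 4)
            or any(c not in HEX_DIGITS for c in digits)):
        raise ValueError(f"not a spacer: {token!r}")
    # bytes(n) produces n zero bytes.
    return bytes(int(digits, 16))
```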

Mark

A mark is a single [ or ] character. It assembles to nothing.

Comment

A comment is a ( character, followed by zero or more characters, followed by a ) character. It assembles to nothing.

An unmatched ( or ) character is an error.

Block

A block is a { character called the block start, matched with a } character called the block end. Each block end matches the most recent unmatched block start. The block start assembles to a double with value equal to the address of the block end.

An unmatched block start or block end is an error. A block start or block end in a macro definition that is not matched in the same definition is an error. A block end with an address greater than 0xFFFF is an error.
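
Block matching lends itself to a stack, as in the sketch below. The function name block_targets and the simplified input format are hypothetical: each element is either "{", "}", or an integer giving the number of bytes assembled by some other language element. The sketch assumes that a double occupies two bytes, and it does not cover the rule about blocks inside macro definitions.

```python
def block_targets(elements: list) -> list[int]:
    """Return the block-end address for each block start, in source order."""
    address = 0
    open_starts = []  # positions in `targets` of block starts not yet matched
    targets = []
    for element in elements:
        if element == "{":
            open_starts.append(len(targets))
            targets.append(None)
            address += 2  # assumption: the double emitted here is two bytes
        elif element == "}":
            if not open_starts:
                raise ValueError("unmatched block end")
            if address > 0xFFFF:
                raise ValueError("block end address is greater than 0xFFFF")
            # A block end matches the most recent unmatched block start.
            targets[open_starts.pop()] = address
        else:
            address += element
    if open_starts:
        raise ValueError("unmatched block start")
    return targets
```

For the input ["{", 3, "{", 1, "}", "}"], both block ends sit at address 8, so both block starts would assemble to a double with value 8.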

String

A raw string is a ' character, followed by zero or more characters called the string content, followed by a ' character. It assembles to the string content as a UTF-8 encoded byte sequence.

A terminated string is a " character, followed by zero or more characters called the string content, followed by a " character. It assembles to the string content as a UTF-8 encoded byte sequence, followed by a zero byte.

An unclosed string is an error.
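
The two string forms differ only in the trailing zero byte, which the following sketch makes explicit. The function name assemble_string is hypothetical, and the token is assumed to have already been isolated by the tokenizer with its quote characters intact.

```python
def assemble_string(token: str) -> bytes:
    """Assemble a raw ('...') or terminated ("...") string token."""
    if len(token) < 2 or token[0] not in "'\"" or token[-1] != token[0]:
        raise ValueError("unclosed string")
    content = token[1:-1].encode("utf-8")
    if token[0] == '"':
        return content + b"\x00"  # terminated strings end with a zero byte
    return content
```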

Label

A global label is a @ character followed by an identifier. The name of the label is given by the identifier. It assembles to nothing.

A local label is a & character followed by an identifier. The name of the label is the name of the most recent global label if any, followed by a / character, followed by the identifier. It assembles to nothing.

A label that shares a name with another label or macro definition is an error. A label with an address greater than 0xFFFF is an error.
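
The naming rule for local labels can be sketched as below. The function name collect_label_names is hypothetical, and the input is assumed to be a list of tokens that have already been classified so that strings and comments are excluded. Where no global label has been seen yet, the sketch reads "if any" as producing a name that starts with the / character. Clashes with macro definitions and the address limit are not checked here.

```python
def collect_label_names(tokens: list[str]) -> list[str]:
    """Return the full name of every label, in source order."""
    names = []
    current_global = ""  # empty until the first global label is seen
    for token in tokens:
        if token.startswith("@"):
            current_global = token[1:]
            name = current_global
        elif token.startswith("&"):
            name = current_global + "/" + token[1:]
        else:
            continue
        if name in names:
            raise ValueError(f"duplicate label name: {name}")
        names.append(name)
    return names
```

With this reading, the same local identifier can be reused under different global labels, since main/loop and draw/loop are distinct names.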

Macro definition

A macro definition is a % character, followed by an identifier, followed by zero or more macro body tokens called the macro body, followed by a ; character. The name of the macro is given by the identifier. It assembles to nothing.

A macro body token can be a literal, spacer, mark, comment, block, string, or symbol.

A macro definition that shares a name with a label or another macro definition is an error. A macro definition with a body that contains a label or a macro definition is an error. An unmatched ; character is an error.

Symbol

A symbol is a plain identifier: any token that is not another language element. The name of a symbol is given by the identifier. If the identifier begins with a ~ character, the ~ character is replaced with the name of the most recent global label if any, followed by a / character.

If the symbol shares a name with a label, it assembles to a double with value equal to the address of that label. If the symbol shares a name with a macro definition, the symbol is replaced with the macro body and is then assembled.

A symbol with the same name as a macro definition, but that comes before that definition in the program, is an error. A symbol that does not share a name with a label or a macro definition is an error.
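
Symbol resolution could look roughly like the sketch below. The function name resolve_symbol, the dictionary-based lookup tables, and the returned tuple shape are all hypothetical. The macros argument is assumed to hold only the macro definitions that precede the symbol in the program, which is how the ordering rule above would be enforced by the caller.

```python
def resolve_symbol(token: str, current_global: str,
                   labels: dict[str, int], macros: dict[str, list[str]]):
    """Resolve a symbol to a label address or to a macro body."""
    if token.startswith("~"):
        # '~' expands to the most recent global label name plus '/'.
        name = current_global + "/" + token[1:]
    else:
        name = token
    if name in labels:
        return ("address", labels[name])
    if name in macros:
        return ("macro", macros[name])
    raise ValueError(f"undefined symbol: {name}")
```

For example, with "main" as the most recent global label, the symbol ~loop resolves to the same label as the symbol main/loop.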

Instruction mnemonics

The following macro definitions must be predefined by the assembler:

%HLT 00;  %NOP  20;  %DB1  40;  %DB2   60;  %DB3  80;  %DB4   A0;  %DB5   C0;  %DB6    E0;
%PSH 01;  %PSH: 21;  %PSH* 41;  %PSH*: 61;  %PSHr 81;  %PSHr: A1;  %PSHr* C1;  %PSHr*: E1;
             %: 21;                %*: 61;                %r: A1;                 %r*: E1;
%POP 02;  %POP: 22;  %POP* 42;  %POP*: 62;  %POPr 82;  %POPr: A2;  %POPr* C2;  %POPr*: E2;
%CPY 03;  %CPY: 23;  %CPY* 43;  %CPY*: 63;  %CPYr 83;  %CPYr: A3;  %CPYr* C3;  %CPYr*: E3;
%DUP 04;  %DUP: 24;  %DUP* 44;  %DUP*: 64;  %DUPr 84;  %DUPr: A4;  %DUPr* C4;  %DUPr*: E4;
%OVR 05;  %OVR: 25;  %OVR* 45;  %OVR*: 65;  %OVRr 85;  %OVRr: A5;  %OVRr* C5;  %OVRr*: E5;
%SWP 06;  %SWP: 26;  %SWP* 46;  %SWP*: 66;  %SWPr 86;  %SWPr: A6;  %SWPr* C6;  %SWPr*: E6;
%ROT 07;  %ROT: 27;  %ROT* 47;  %ROT*: 67;  %ROTr 87;  %ROTr: A7;  %ROTr* C7;  %ROTr*: E7;
%JMP 08;  %JMP: 28;  %JMP* 48;  %JMP*: 68;  %JMPr 88;  %JMPr: A8;  %JMPr* C8;  %JMPr*: E8;
%JMS 09;  %JMS: 29;  %JMS* 49;  %JMS*: 69;  %JMSr 89;  %JMSr: A9;  %JMSr* C9;  %JMSr*: E9;
%JCN 0A;  %JCN: 2A;  %JCN* 4A;  %JCN*: 6A;  %JCNr 8A;  %JCNr: AA;  %JCNr* CA;  %JCNr*: EA;
%JCS 0B;  %JCS: 2B;  %JCS* 4B;  %JCS*: 6B;  %JCSr 8B;  %JCSr: AB;  %JCSr* CB;  %JCSr*: EB;
%LDA 0C;  %LDA: 2C;  %LDA* 4C;  %LDA*: 6C;  %LDAr 8C;  %LDAr: AC;  %LDAr* CC;  %LDAr*: EC;
%STA 0D;  %STA: 2D;  %STA* 4D;  %STA*: 6D;  %STAr 8D;  %STAr: AD;  %STAr* CD;  %STAr*: ED;
%LDD 0E;  %LDD: 2E;  %LDD* 4E;  %LDD*: 6E;  %LDDr 8E;  %LDDr: AE;  %LDDr* CE;  %LDDr*: EE;
%STD 0F;  %STD: 2F;  %STD* 4F;  %STD*: 6F;  %STDr 8F;  %STDr: AF;  %STDr* CF;  %STDr*: EF;
%ADD 10;  %ADD: 30;  %ADD* 50;  %ADD*: 70;  %ADDr 90;  %ADDr: B0;  %ADDr* D0;  %ADDr*: F0;
%SUB 11;  %SUB: 31;  %SUB* 51;  %SUB*: 71;  %SUBr 91;  %SUBr: B1;  %SUBr* D1;  %SUBr*: F1;
%INC 12;  %INC: 32;  %INC* 52;  %INC*: 72;  %INCr 92;  %INCr: B2;  %INCr* D2;  %INCr*: F2;
%DEC 13;  %DEC: 33;  %DEC* 53;  %DEC*: 73;  %DECr 93;  %DECr: B3;  %DECr* D3;  %DECr*: F3;
%LTH 14;  %LTH: 34;  %LTH* 54;  %LTH*: 74;  %LTHr 94;  %LTHr: B4;  %LTHr* D4;  %LTHr*: F4;
%GTH 15;  %GTH: 35;  %GTH* 55;  %GTH*: 75;  %GTHr 95;  %GTHr: B5;  %GTHr* D5;  %GTHr*: F5;
%EQU 16;  %EQU: 36;  %EQU* 56;  %EQU*: 76;  %EQUr 96;  %EQUr: B6;  %EQUr* D6;  %EQUr*: F6;
%NQK 17;  %NQK: 37;  %NQK* 57;  %NQK*: 77;  %NQKr 97;  %NQKr: B7;  %NQKr* D7;  %NQKr*: F7;
%SHL 18;  %SHL: 38;  %SHL* 58;  %SHL*: 78;  %SHLr 98;  %SHLr: B8;  %SHLr* D8;  %SHLr*: F8;
%SHR 19;  %SHR: 39;  %SHR* 59;  %SHR*: 79;  %SHRr 99;  %SHRr: B9;  %SHRr* D9;  %SHRr*: F9;
%ROL 1A;  %ROL: 3A;  %ROL* 5A;  %ROL*: 7A;  %ROLr 9A;  %ROLr: BA;  %ROLr* DA;  %ROLr*: FA;
%ROR 1B;  %ROR: 3B;  %ROR* 5B;  %ROR*: 7B;  %RORr 9B;  %RORr: BB;  %RORr* DB;  %RORr*: FB;
%IOR 1C;  %IOR: 3C;  %IOR* 5C;  %IOR*: 7C;  %IORr 9C;  %IORr: BC;  %IORr* DC;  %IORr*: FC;
%XOR 1D;  %XOR: 3D;  %XOR* 5D;  %XOR*: 7D;  %XORr 9D;  %XORr: BD;  %XORr* DD;  %XORr*: FD;
%AND 1E;  %AND: 3E;  %AND* 5E;  %AND*: 7E;  %ANDr 9E;  %ANDr: BE;  %ANDr* DE;  %ANDr*: FE;
%NOT 1F;  %NOT: 3F;  %NOT* 5F;  %NOT*: 7F;  %NOTr 9F;  %NOTr: BF;  %NOTr* DF;  %NOTr*: FF;
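
The table above is regular enough to be generated rather than typed in by hand. The sketch below relies on a pattern read off the table itself, not on any statement in this specification: the low five bits select the operation, and the suffix characters :, *, and r correspond to the bits 0x20, 0x40, and 0x80. The first row (HLT, NOP, DB1 to DB6) and the four shorthand names on the third line are handled as special cases. The names OPERATIONS, ROW_ZERO, and predefined_macros are hypothetical.

```python
OPERATIONS = (
    "HLT PSH POP CPY DUP OVR SWP ROT JMP JMS JCN JCS LDA STA LDD STD "
    "ADD SUB INC DEC LTH GTH EQU NQK SHL SHR ROL ROR IOR XOR AND NOT"
).split()
ROW_ZERO = ["HLT", "NOP", "DB1", "DB2", "DB3", "DB4", "DB5", "DB6"]

def predefined_macros() -> dict[str, int]:
    """Map each predefined macro name to the byte value of its body."""
    macros = {}
    for mode in range(8):
        flags = mode << 5  # 0x00, 0x20, 0x40, ... 0xE0
        suffix = (("r" if flags & 0x80 else "")
                  + ("*" if flags & 0x40 else "")
                  + (":" if flags & 0x20 else ""))
        # Operation 0 has a distinct name in each column.
        macros[ROW_ZERO[mode]] = flags
        for opcode, name in enumerate(OPERATIONS[1:], start=1):
            macros[name + suffix] = flags | opcode
        # Shorthand names ':', '*:', 'r:' and 'r*:' duplicate the PSH forms.
        if flags & 0x20:
            macros[suffix] = flags | 0x01
    return macros
```

The result holds 260 names: 256 that cover every byte value exactly once, plus the four shorthand names.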