
MIPS Assembly parser

The MIPS architecture is used in many embedded systems today, including gaming consoles, routers, and other devices. It is a RISC architecture, which makes instruction decoding easier, and its set of basic instructions is fairly small (in contrast with CISC architectures such as x86). This makes MIPS a good architecture for learning, which is why it is often used in university compiler classes for code generation. As part of a course project on implementing a virtual machine for a subset of MIPS instructions, I have built a parser for its assembly.

The parser is available as the mips-parser NPM package; for development, see the corresponding GitHub repo.

To install the parser globally:

npm install -g mips-parser

mips-parser --help

You can also try it online.

MIPS Assembly

MIPS is a load-store architecture (also sometimes called “register-register”), meaning it only performs arithmetic and logic operations between CPU registers, requiring load/store instructions to access memory (as opposed to a “register-memory” architecture, like x86).

Let’s take a look at the “hello-world” example in MIPS assembly:

# Hello, world!
# MIPS Assembly

      .data               # Data segment

# String to be printed:
message: .asciiz "Hello, world!"

      .text               # Code segment (instructions go here)

main:                     # Main entry point

      li $v0, 4           # System call code for printing string = 4
      la $a0, message     # Load address of string into $a0
      syscall             # Call operating system to perform operation
                          # specified in $v0
                          # syscall takes its arguments from $a0, $a1, etc.

      li $v0, 10          # Terminate the program
      syscall

In order to implement a virtual machine for this code, we first need to parse it.

With mips-parser it’s as easy as:

mips-parser -f ~/hello-world.s

The full parsed AST (abstract syntax tree) for this program is:

{
  "type": "Program",
  "segments": {
    ".text": {
      "instructions": [
        {
          "type": "Instruction",
          "opcode": "li",
          "operands": [
            {
              "type": "Register",
              "value": "$v0",
              "kind": "Name"
            },
            {
              "type": "Number",
              "kind": "decimal",
              "value": 4
            }
          ]
        },
        {
          "type": "Instruction",
          "opcode": "la",
          "operands": [
            {
              "type": "Register",
              "value": "$a0",
              "kind": "Name"
            },
            {
              "type": "Identifier",
              "value": "message"
            }
          ]
        },
        {
          "type": "Instruction",
          "opcode": "syscall"
        },
        {
          "type": "Instruction",
          "opcode": "li",
          "operands": [
            {
              "type": "Register",
              "value": "$v0",
              "kind": "Name"
            },
            {
              "type": "Number",
              "kind": "decimal",
              "value": 10
            }
          ]
        },
        {
          "type": "Instruction",
          "opcode": "syscall"
        }
      ]
    },
    ".data": {
      "instructions": [
        {
          "type": "Data",
          "mode": ".asciiz",
          "value": {
            "type": "String",
            "value": "Hello, world!"
          }
        }
      ]
    }
  },
  "labels": {
    "message": {
      "address": 0,
      "segment": ".data"
    },
    "main": {
      "address": 0,
      "segment": ".text"
    }
  },
  "directives": [
    {
      "type": "Segment",
      "value": ".data"
    },
    {
      "type": "Segment",
      "value": ".text"
    }
  ]
}

As you can see, instructions in this AST format are grouped by segment, and each label records the address it points to and the segment it belongs to.
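
If you prefer to use the parser programmatically rather than via the CLI, something along these lines should work. Note that the parse function name here is an assumption based on typical parser packages; check the package README for the exact API.

// Assumption: mips-parser exposes a `parse` function that takes the source
// string and returns an AST like the one shown above (see the package README).
const MIPSParser = require('mips-parser');

const source = `
      .text
main:
      li $v0, 10    # terminate the program
      syscall
`;

const ast = MIPSParser.parse(source);

console.log(ast.segments['.text'].instructions[0].opcode); // "li"
console.log(ast.labels.main.segment);                      // ".text"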

Parser implementation

Language parsers are closely related to language grammars; programming languages, in particular, are usually described by BNF grammars (a notation for context-free grammars).

Brief overview of parsing algorithms

Two approaches to parser implementation are used in practice today: hand-written parsers, and automatically generated ones.

The former are usually implemented using the recursive-descent technique, which typically corresponds to LL(1) parsing.

LL(1) stands for a Left-to-right scan of the string, using Leftmost derivation, with 1 lookahead token.

Such parsers are also known as "top-down" parsers. Their main advantage is ease of implementation; their disadvantages are heavy use of recursion, which can make parsing slower, and the fact that you have to write all of the parser code yourself. That said, for some complicated grammars a recursive-descent parser can work even where a canonical LR parser would not (e.g. when some AST nodes need to be reinterpreted with backtracking, though in that case the parser is no longer LL(1)).
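
For instance, a hand-written recursive-descent parser for just the instruction rule could look roughly like the sketch below. This is illustrative JavaScript operating on a pre-tokenized input, not the actual code of mips-parser.

// A minimal recursive-descent sketch for: Instruction -> OPCODE Operands,
// Operands -> Op (',' Op)* | <empty>. Illustrative only.
function parseInstruction(tokens) {
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  // Instruction : OPCODE Operands
  function Instruction() {
    const opcode = next();                 // the first token is the opcode
    return { type: 'Instruction', opcode, operands: Operands() };
  }

  // Operands : Op (',' Op)* | empty
  function Operands() {
    if (peek() === undefined) {
      return [];                           // e.g. `syscall` has no operands
    }
    const ops = [Op()];
    while (peek() === ',') {
      next();                              // consume the comma
      ops.push(Op());
    }
    return ops;
  }

  // Op : register | immediate | address (simplified to the raw token here)
  function Op() {
    return next();
  }

  return Instruction();
}

parseInstruction(['li', '$v0', ',', '4']);
// => { type: 'Instruction', opcode: 'li', operands: ['$v0', '4'] }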

Automatically generated parsers, on the other hand, are produced for you by a tool from a given grammar. We still need to write some code, but usually much less than a full parser implementation: only the grammar file. This family of parsers is known as "bottom-up" (the most canonical variant being "shift-reduce" parsers). A shift-reduce parser is implemented using an LR algorithm. Note, though, that an LL parser can also be built by a parser generator. The advantages of this approach, as mentioned, are less code to write and no recursion (such parsers are usually table-driven, implementing a state machine), which makes them faster.

LR(1) stands for a Left-to-right scan of the string, using Rightmost derivation (constructed in reverse), with 1 lookahead token.

Using Syntax to implement MIPS parser

We used the Syntax tool to build an automatically generated LR parser (specifically, using the most practical LALR(1) algorithm). The main grammar file contains all the rules needed to parse MIPS code. Let's take a look at some of them.

If we take a typical MIPS statement, we see that it consists of an optional label, an actual command (known as an "opcode"), and the operands for that command:

# label
main:
      #opcode  #op1   #op2
      li       $v0,   4

Each instruction accepts from 0 to 3 operands, as defined by the following grammar rules:

Instruction : OPCODE Operands { $$ = {type: 'Instruction', opcode: $1, operands: $2} };

Operands    : Op              { $$ = [$1] }
            | Op , Op         { $$ = [$1, $2] }
            | Op , Op , Op    { $$ = [$1, $2, $3] }
            |                 /* empty list of operands */
            ;

And an operand can be a register, an immediate value, or an address.
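
The distinction between these operand kinds is largely lexical. As a rough illustration (not the actual lexer used by mips-parser), a register, a decimal immediate, and a label reference could be told apart like this:

// Illustrative only: a simplified operand classifier mirroring the node
// shapes in the AST above; the real lexer handles more cases (hex numbers,
// offset(base) addressing, etc.).
function classifyOperand(token) {
  if (/^\$[a-z0-9]+$/i.test(token)) {
    return { type: 'Register', value: token, kind: 'Name' };
  }
  if (/^-?\d+$/.test(token)) {
    return { type: 'Number', kind: 'decimal', value: parseInt(token, 10) };
  }
  // Anything else is treated as a label reference (an address).
  return { type: 'Identifier', value: token };
}

classifyOperand('$v0');     // { type: 'Register', value: '$v0', kind: 'Name' }
classifyOperand('4');       // { type: 'Number', kind: 'decimal', value: 4 }
classifyOperand('message'); // { type: 'Identifier', value: 'message' }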

In addition, a statement can be a data declaration in the .data segment:

message: .asciiz "Hello, world!"

Or, for example, this is how we can define an array of bytes:

          .data

my_array: .byte 10, 15, 23

As we can see, the my_array label plays the role of a "variable name" in this case, and the comma-separated values, allocated contiguously in memory, are the elements of the array. Parsing this code gives us the following AST:

{
  "type": "Program",
  "segments": {
    ".text": {
      "instructions": []
    },
    ".data": {
      "instructions": [
        {
          "type": "Data",
          "mode": ".byte",
          "value": [
            {
              "type": "Number",
              "kind": "decimal",
              "value": 10
            },
            {
              "type": "Number",
              "kind": "decimal",
              "value": 15
            },
            {
              "type": "Number",
              "kind": "decimal",
              "value": 23
            }
          ]
        }
      ]
    }
  },
  "labels": {
    "my_array": {
      "address": 0,
      "segment": ".data"
    }
  },
  "directives": [
    {
      "type": "Segment",
      "value": ".data"
    }
  ]
}

To access the elements of the array, an interpreter would refer to the address of the my_array label plus the needed offset:

const {
  my_array: {
    address,
    segment,
  },
} = program.labels;

const dataSegment = program.segments[segment].instructions;

console.log(dataSegment[address].value[1].value); // 15

By traversing the AST and handling each node type, we can, as mentioned, either implement a direct interpreter for this program that produces the result, or generate the actual binary code corresponding to the MIPS instructions and then implement a virtual machine for that bytecode. Alternatively, we could even translate it to other lower-level languages, e.g. asm.js or WebAssembly.
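
As a taste of the first option, here is a bare-bones sketch of a direct AST interpreter that can run the hello-world program from the beginning of the article. It assumes the AST shape shown above, handles only li, la, and two syscalls, and treats a label's address simply as an index into the .data instruction list.

// A minimal AST interpreter sketch for the hello-world program above.
// Heavily simplified; real MIPS memory layout and registers are not modeled.
function interpret(program) {
  const registers = {};                          // simplified register file
  const data = program.segments['.data'].instructions;
  const text = program.segments['.text'].instructions;

  for (const instr of text) {
    switch (instr.opcode) {
      case 'li':                                 // load immediate
        registers[instr.operands[0].value] = instr.operands[1].value;
        break;
      case 'la': {                               // load "address" of a label
        const label = program.labels[instr.operands[1].value];
        registers[instr.operands[0].value] = label.address;
        break;
      }
      case 'syscall':
        if (registers['$v0'] === 4) {            // print string at address in $a0
          console.log(data[registers['$a0']].value.value);
        } else if (registers['$v0'] === 10) {    // exit
          return;
        }
        break;
      default:
        throw new Error(`Unsupported opcode: ${instr.opcode}`);
    }
  }
}

// interpret(ast); // prints "Hello, world!" for the parsed hello-world program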

Conclusion

Parsers, low-level interpretation, and code generation might be complex, but they are truly fun topics. They help you understand how most programming languages work under the hood, and what happens to high-level code once it is translated down to machine code. Even if you work with a high-level language such as JavaScript, which we chose to implement the parser here, it can still be very useful to dig into the lower level to see how this "microcosm" is built from within, and, for example, to be able to fix a bug at that level. As an exercise, I recommend implementing a recursive-descent version of the same MIPS parser (having the grammar, this should be relatively straightforward), and then an AST interpreter for it.

Have fun with compilers 😉
