Compiler Engineer Path

Available coupons:

$169.00 on Teachable

Building an Interpreter from scratch
Building a Transpiler from scratch
Building a Parser from scratch
Building a Typechecker from scratch
Building a Virtual Machine from scratch
Programming Language with LLVM
Parsing Algorithms
Garbage Collection Algorithms
Automata Theory: inside a RegExp machine
Recommended literature

Many books on engineering a compiler usually start from the topic of Lexical Analysis, going deeper into formalism of Automata Theory. Having stuck in formal state machines semantics, students may lose interest in building programming languages, attributing them to “compiler engineers” only. I believe it is always good to start with a language runtime, understanding a bigger picture of its semantics, and only after that to go down and deeper into theoretical aspects.

In this note we cover the courses from the Programming Languages Design and Implementation series, and tell in which order to take them for the best practical outcomes.

Courses in this series are divided into “… from scratch”, and also “… Theory” and “… Algorithms” parts. We recommend starting from the practical “from scratch” classes, and go to the theory after that.

Note: you may also find all courses combined in the single Programming Languages: Ultimate – 3rd Edition bundle.

Building an Interpreter from scratch

In this course we start our journey into the interesting world of programming languages, talking about different parsing and transformation pipelines, VMs, and eventually implement an AST interpreter, from scratch.

Available coupons:

$24.00 $59.00 on Teachable

To skip the parsing stage altogether and focus specifically on the runtime (that is, the actual semantics of a programming language), we use S-expression format for the lightweight syntax. We implement the interpreter in one of the most popular programming languages, JavaScript — and the semantics of our language — Eva — is very inspired by the JavaScript itself, Python, and other dynamic PLs.

By the end of this course students learn how to build own programming language from scratch at higher (AST) level. You will learn and understand how closures work, what is an activation record and lexical environment, how to call a function, and add support for OOP.

You may find more info and enroll into the course on the class page.

Building a Transpiler from scratch

Having implemented the interpreter runtime manually, it’s worth exploring high-level compilation process, covered by transpilers.

Available coupons:

$24.00 $59.00 on Teachable

In this class we build a concurrent, process-based programming language, by compiling it down to JavaScript.

By the end of this class students learn how the translation pipeline works at higher (AST) level. You will learn and understand how code generations works, understand semantics of concurrent processes, and compile your language to JS.

You may find more info and enroll into the course on the class page.

Building a Parser from scratch

Once you have a fully working runtime of a programming language, it is a good time to think about ergonomic syntax.

Yes, often syntax and runtime of a language may deeply intersect and depend on each other, however for the same runtime you may have multiple syntaxes. This also allows working with concept of syntactic sugar, i.e. inventing ergonomic syntactic constructs which delegate to the same runtime semantics.

Available coupons:

$24.00 $59.00 on Teachable

In this second course we build a manual Recursive-descent parser, also from scratch. You will learn how to construct Abstract Syntax Tree (AST) nodes, about Tokenizer and Parsing process, and also see different AST formats. As an implementation we also use JavaScript.

Again, you may find more info and enroll into the course on the class page.

In addition, the Interpreter and the Parser classes may be taken as the Programming Language Bundle.

Building a Typechecker from scratch

On top of your dynamic programming language, you may consider adding a static typechecker module.

Untyped programs are often prone to errors, and may fail at runtime. Having an extra type checker may improve robustness of your programs, improve documentation and ergonomics.

Available coupons:

None at the moment

In this course we build a Typechecker for the Eva programming language, also from scratch. You will learn about type systems and Type theory, see the difference between static and dynamic type checking, and get familiar with checking and inference algorithms. As an implementation we use JavaScript.

You may find more info and enroll into the course on the class page.

Building a Virtual Machine from scratch

By this moment you should already have a fully working programming language with ergonomic syntax, and understand how the languages work at higher level. It is time now to go to production-level VMs, improving performance, and understanding low-level semantics.

Available coupons:

None at the moment

In this course we build a Stack-based Virtual Machine, from scratch. Basic C++ experience and the Interpreters course are required as prerequisites for this class. However, we do not use too specific C++ constructs, and the code should be easily transferable to other languages.

Students will learn concepts of bytecode, compilation process, and lower-level interpretation. Concepts of stack- and heap-allocated values, and also Garbage Collection are considered. As a result we implement a VM similar in semantics to Python, JavaScript and other languages – with full support for functional programming, closures, and OOP.

You may find more info and enroll into the course on the class page.

In addition, the runtime classes can be taken as the Interpreters and Virtual Machines bundle.

Programming Language with LLVM

After implementing a custom low-level VM, it is worth considering using a production-level compiler infrastructure, LLVM.

Available coupons:

$29.00 $64.00 on Teachable

In this course we compile our programming language down to LLVM IR. Basic C++ experience and the Interpreters course are required as prerequisites for this class. However, we do not use too specific C++ constructs, and the code should be easily transferable to other languages.

Students will learn concepts of LLVM frontend, IR, and lower-level compilation. Details of stack- and heap-allocated values, and also Garbage Collection are considered. As a result you will understand how Clang, Rust, and other compilers work.

You may find more info and enroll into the course on the class page.

By the end of the four practical courses you should have a fully working programming language. The next classes may be taken in any order to extend learned concepts and to deeper understand the topics.

Parsing Algorithms

This course focuses on theory of parsing process, going deeper into discussions of the automated LL and LR parsers. The class combines theory and practice.

Available coupons:

$24.00 $59.00 on Teachable

Students will learn and understand full algorithms of LL(1), LALR(1) and other parsers, and also will implement a parser for a full programming language, using parser generator tool, Syntax. For implementation we use JavaScript.

You may find more info and enroll into the course on the class page.

Garbage Collection Algorithms

In the Virtual Machines class we implemented Mark-Sweep garbage collector. It is a good time to go deeper into GC algorithms, and learn the theory and implementation strategies behind them.

Available coupons:

None at the moment

Students will learn all the major GC techniques used today on practice. The class is mainly theoretical, explaining concepts of Tracing and Direct collectors, talking about Generational, Region-based collectors, Tri-color abstraction, and other topics. Specific algorithms of Mark-Sweep, Mark-Compact, Reference Counting, Copying, Immix, and G1 collectors are considered.

The course will be interesting to any curious programmer and is also recommended for professional compiler engineers.

You can find more info and enroll into the course on the class page.

Automata Theory: inside a RegExp machine

This theoretical course is devoted to the formalism of Automata, talking about State machines used to implement concept of Regular Expressions.

Available coupons:

None at the moment

The concepts of NFA (Non-deterministic Finite Automata) and DFA (Deterministic Finite Automata) are considered. The modules are also used for implementing tokenizer modules in parsing process. In addition, the topic of DFA minimization is described. Implementation skeleton is presented showing how to implement a RegExp machine.

Again, you can find more info and enroll into the course on the class page.

Recommended literature

The following classic and practical books are recommended for further reading:

“Compiler Design: Principles, Techniques and Tools” (aka “Dragon Book”), Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman
“The Garbage Collection Handbook: The Art of Automatic Memory Management”, Antony Hosking, Eliot Moss, and Richard Jones
“Parsing Techniques: A Practical Guide”, Dick Grune, Ceriel J.H. Jacobs
“Structure and Interpretation of Computer Programs (SICP)”, Harold Abelson and Gerald Jay Sussman
“Programming Languages: Application and Interpretation (PLAI)”, Shriram Krishnamurthi
“Types and Programming Languages”, Benjamin C. Pierce

I hope you find the journey into the world of compilers, virtual machines, and programming languages interesting and engaging — and these courses would be a great learning material for your own programming language, which you invent.

Thanks, and see you in the class!

Sincerely,
Dmitry Soshnikov

JavaScript object model in one line of code

Parsing lookaheads
Property access: dot and square brackets
Primitives and Objects
Dynamic objects
Prototypes
This value
Conclusion

Level: introductory

Take a look at the following JS code snippet:

  1..toString(); // "1"

Being small, this one-liner describes many major topics from the mysterious JavaScript language, which we’re about to cover in today’s note.

Note: if you’d like to take a deeper dive into ECMAScript ecosystem, you may consider the JavaScript. The Core article.

Parsing lookaheads

Let’s start from the weird .. double-dots method access. Is it a parse error?

Actually, the opposite. The parse error would be had we only one dot:

  1.toString(); // Syntax Error: Invalid or unexpected token

The reason is, ECMAScript numbers are based on the floating point IEEE 754 format, which means:

1 === 1.0; // true

The fractional part can be omitted, and due to JS parsing grammar, the dot followed immediately after a number is considered the fractional part delimiter:

1 === 1.; // true

The integer part can also be omitted:

.1 === 0.1; // true

(Wait, what’s the === triple-equal? — Go here)

So the parser looks ahead for a dot following a number and treats it as the fractional part delimiter.

And the next dot after the fractional part is already toString property access on the number object, hence the weird “double-dots method call” in this case:

  1..toString(); // "1"

If the fractional part is explicitly defined, the parser doesn’t fail:

1.5.toString(); // "1.5"

Takeaway: this is purely ECMAScript design flaw in the parsing grammar. For example, the issue doesn’t exist in other languages, such as Rust, where we can normally write the:

1.to_string();   // "1"

1.5.to_string(); // "1.5"

Beside the (mainly academical interest for the) double-dots case, the JS parse error can be bypassed by some other syntactic constructs, the most practical of which is extracting raw numbers into constants:

(1).toString();   // "1"

const RATIO = 1;
RATIO.toString(); // "1"

Including even weirder cases of:

1['toString']();  // "1"

1.['toString'](); // "1"

Property access: dot and square brackets

Yes, JavaScript has dot . and square bracket [] property access. The former is used when the property or method name is a valid identifier and is known before runtime. The later can be used for array access, when the property name is not a valid identifier, when a property is a symbol, or when it’s dynamic:

const data = [1, 2, 3];

// data.0;   // error, not a valid identifier

data[0];     // 1
data['0'];   // 1

const k = 0; 

data.k;      // undefined, "k" name doesn't exist
data[k];     // 1, OK dynamic access

Takeaway: JavaScript doesn’t distinguish property access and “subscript” operators as some other languages do (e.g. Ruby). Both, the dot and the square bracket notation, are just property access and resolve to the same toString method in our initial example — just inside the square brackets can be any expression (variables, strings, function calls), which evaluate to a property name.

An arrow function is created, immediately invoked, returning the 'toString' value as the property name:

1[(() => 'toString')()](); // "1"

However, this article is not about parsing issues in JS, rather about other (hidden from the first glance) ECMAScript internals, so let’s get down to them.

Primitives and Objects

“Everything in JS is an object.” — wrong.

The fact we are able to access the toString method directly from the 1 literal value, may mistakenly bring us to quick conclusions that “everything in JS is an object”.

JS however distinguish primitive values and objects, and the only reason why we can access methods from primitives is the temporary objects wrapping. In other languages the process is also known as “auto-boxing”, i.e. a boxed (heap-allocated) value of some class is created, and can now provide access to the methods of this class.

Our initial example is equivalent to:

1..toString(); // "1"

// Same as:

const __temp = new Number(1);
__temp.toString(); // "1"

// later __temp is destroyed by GC

Yes, every time we access methods or properties on primitives, a temporary wrapper object is created, and which is then destroyed by a garbage collector.

This can be demonstrated on the following example:

const n = 1;

n.toString(); // yay, everything in JS is an object!

n.prop = 2; // wow, can even create new properties!

n.prop; // wait, where is it? - undefined

Knowing about the auto-boxing process, we can describe what’s happened using the following example:

const n = 1;

const __temp1 = new Number(n);
__temp1.toString(); // access toString on __temp1

const __temp2 = new Number(n);
__temp2.prop = 2; // create prop on __temp2

const __temp3 = new Number(n);
__temp3.prop; // access prop on __temp3, but it's not here

Observing the Number constructor is called three times, it might be tempting to create an actual object only once, for the performance optimizations:

const RATIO = new Number(1);
RATIO.toString(); // "1"

However, JavaScript VMs usually may optimize such common cases and avoid allocating the wrappers. So in general try to avoid premature micro-optimizations and leave it to JS internals.

This example though with methods and objects takes us deeper to JS world of Object-oriented programming, classes, prototype and inheritance. Let’s take a look.

Dynamic objects

The line from the example above:

const __temp2 = new Number(n);
__temp2.prop = 2; // create prop on __temp2

shows that JS objects are completely dynamic and can hold any properties they need, regardless on what property shape is defined on their “class”:

class Number extends Object {
  constructor(value) {
    super();
    // The class shape defines only _value property:
    this._value = value;
  }
  toString() {
    // implementation
  }
}

const n = 1;

const __temp2 = new Number(n);
__temp2.prop = 2; // create prop on __temp2

Once a property is not needed, it can be deleted:

delete __temp2.prop;

However, the toString method we call from 1 is not actually defined on the number instance, instead it’s inherited.

Prototypes

To confirm that the toString method is not own for the number instances, we can do the following:

const n = new Number(1);

n.hasOwnProperty('toString'); // false

n.toString(); // OK, "1"

The fact that we instantiate the Number class (or function-constructor in legacy terms) gives us a hint that the toString method should sit there. And it’s true, however it’s not directly on the class, but on the extra storage for shared methods, known as the prototype.

The prototype is accessible via the special property __proto__ from any object (which on practice should be replaced with Reflect.getPrototypeOf API method):

n.__proto__; // object

n.__proto__.hasOwnProperty('toString'); // true

The rule is simple: if a property (the toString in this case) is not found directly on the object, it’s searched in the prototype. Further in the prototype of the prototype, etc — until the whole prototype chain is considered. If eventually a property is not found in the chain, the special undefined value is returned:

n.toSuperString; // undefined

And this is how inheritance mechanism is implemented in JavaScript — through the prototype chain, also known as delegation chain in computer science.

The Number class stores the reference to the prototype of its instances in the explicit prototype property:

Number.prototype === n.__proto__; // true

Number.prototype.toString === n.toString; // true

This value

Since objects in JavaScript are dynamic, we can extend (“monkey-patch” on jargon) any object with needed properties and methods. For example, let’s define our custom method for all numbers:

const n = new Number(1);

const prefix = 'Super: ';

Number.prototype.toSuperString = function () {
  return prefix + this.toString();
};

n.toSuperString(); // "Super: 1"

Note: never monkey-patch standard classes with custom methods — to avoid backward compatibility issues and to make the coding predictable. The dynamic extension only makes sense when you need to patch a missing method for older JS engines.

There are some interesting observations here.

The toSuperString method is accessible from n instance even if it’s defined after the n was created itself:

n.toSuperString(); // "Super: 1"

This underlines dynamic nature of the prototype chain, and the property lookup at runtime.

Next, our new method is normally can access the prefix variable defined outside:

return prefix + this.toString();

This brings us to the wonderful topic of closures, detailed descriptions of which in the view of the Funarg problem, you can find here.

Yes, a closure is when a function can access variables not being local to this function, nor being parameters of this functions (so called, free variables), and not only “when a function is returned from another function”.

And we also see the current instance is accessible via this value inside methods:

 this.toString();

which again brings us to the topic of inheritance, since toString is not an own method of number instances, and is found on the same Number.prototype.

The topic of this value is one of the most complex topics in JS, and there are some related caveats. For example, there is no auto-binding in JS as e.g. in Python:

n.toSuperString(); // OK

// Same as:

Number.prototype.toSuperString.call(n); // OK

// Extract the method to a variable:

const toSuperString = Number.prototype.toSuperString;

toSuperString(); // not OK :(

// Explicitly bind `this` value as `n`:

const toSuperStringBound = Number.prototype.toSuperString.bind(n);

toSuperStringBound(); // OK

The issue above is known as the problem of losing correct context object. In most of the cases explicit binding helps, or mentioned arrow functions have the lexical this and capture it automatically as other closured values (note: this use-case doesn’t fit much for methods though, where this still should be dynamic).

Conclusion

Of course we only scratched the surface of the interesting world of JavaScript, and there are many more details and related topics. In this article I wanted to show that even such small line of code as 1..toString() may hold inside tones of aspects of JS internals. I hope you enjoyed the reading, and if you got interested in JS deeper, I invite you to explore its Fundamental Core.

Written by: Dmitry Soshnikov
Published on: May 29, 2021

Building a Parser from scratch

Available coupons:

$24.00 on Teachable

Course overview

Parsing or syntactic analysis is one of the first stages in designing and implementing a compiler. A well-designed syntax of your programming language is a big motivation why users would prefer and choose exactly your language.

Note: this is a practical class on building a manual Recursive-descent parser. If you’re interested in parsing theory and automated algorithms you may also consider the [ Parsing Algorithms ] class.

Recursive descent parsers are the group of parsers which are widely used on practice in many production programming languages. In contrast with automated parsing algorithms, the manual implementation allows having full control over the parsing process, and handling complex constructs, which may not be possible in the automatic parsers.

Besides, implementing a full manual parser from scratch allows understanding and seeing this process from inside, demystifying internal structures, and turning building parsers into an interesting engineering task.

In the Building a Parser from scratch class we dive into pure practical implementation, building and learning different aspects of parsers.

In this class you will learn concept of Recursive descent parsing, understand what is Tokenizer and how it cooperates with Parser module, learn what is Abstract Syntax Tree (AST), and how to have different formats of these ASTs, what is “lookahead” and the predictive parsing, and eventually build a parser for a full programming language, similar to Java or JavaScript.

Implementing a parser would also make your practical usage of other programming languages more professional.

Continue reading →

Writing a Mark-Sweep Garbage Collector

Video lecture
Tracing vs. Direct collectors
C++ specifics
Implementation
Conclusion

In this articles we’re going to implement a Mark-Sweep garbage collector in C++.

This is the 9th lecture from the Essentials of Garbage Collectors course where we discuss in detail all the aspects of automatic memory management.

Continue reading →

Automata Theory: inside a RegExp machine

Course overview

Available coupons:

None at the moment

State machines — the fundamental concept used today in many practical applications, starting from UI programming in React, automated reply systems, lexical analysis in parsers and formal language theory — i.e. the RegExp machines, — and up to real life use cases, such as simple traffic lights, vending machines, and others.

The state machines are backed by the larger theoretical field of computer science known as Theory of Computation, and also by its direct theoretical model — the Automata Theory.

In this class we study the Automata Theory on the practical example of implementing a Regular Expressions machine.

See also: Essentials of Garbage Collectors class devoted to automatic memory management.

Continue reading →

Garbage Collection Algorithms

Available coupons:

None at the moment

Course overview

Memory leaks and dangling pointers are the main issues of the manual memory management. You delete a parent node in a linked list, forgetting to delete all its children first — and your memory is leaking. You delete an object chain in correct order — but suddenly your program crashes since you forgot about second owner of this resource, which now tries to dereference a null-pointer.

Continue reading →

Writing a Memory Allocator

Video lecture
Mutator, Allocator, Collector
Memory block
Allocator interface
Memory alignment
Memory mapping
Freeing the objects
Blocks reuse
First-fit search
Assignments
Conclusion

This is the 6th lecture from the Garbage Collection Algorithms class, devoted to the automatic memory management.

Before discussing algorithms of collecting the garbage, we need to see how these objects (which eventually become a garbage) are allocated on the heap. In today’s lecture we consider mechanisms of memory allocation.

Continue reading →

JavaScript. Ядро: 2-ое издание

Read this article in: English, German.

Данная статья является вторым изданием обзорной лекции JavaScript. Ядро, посвященной языку программирования ECMAScript, и ключевым компонентам его рантайм-системы.

Continue reading →

JavaScript. The Core: 2nd Edition

Available coupons:

$24.00 on Teachable

Read this article in: Russian, German.

This is the second edition of the JavaScript. The Core overview lecture, devoted to ECMAScript programming language and core components of its runtime system.

Note: see also Essentials of Interpretation course, where we build a programming language similar to JavaScript, from scratch.

Continue reading →

JavaScript. The Core.

Read this article in: German, Russian, French, Polish.

Note: a new 2nd Edition of this article is available.

An object
A prototype chain
Constructor
Execution context stack
Execution context
Variable object
Activation object
Scope chain
Closures
This value
Conclusion

This note is an overview and summary of the “ECMA-262-3 in detail” series. Every section contains references to the appropriate matching chapters so you can read them to get a deeper understanding. Continue reading →