How are programming languages created?
Creating a programming language is a complex yet fascinating task that requires a deep understanding of computer science principles, compilers, and the intricacies of human-computer interaction. In this guide, we'll walk through the essential ste...
1. Preparation
1.1. ANTLR
ANTLR is a tool that generates parsers from grammar definitions. It can handle various languages and provides a solid foundation for building language tools. Whether you're developing a compiler or a custom DSL (Domain-Specific Language), ANTLR offers a structured approach to parsing and interpreting languages.
In this case, we define rules for tokens or syntax of a programming language using ANTLR and then use these rules to generate the necessary code for subsequent steps.
2. Jasmin
Jasmin is a Java assembler that translates Jasmin assembly language into Java bytecode. It provides a low-level way to understand and manipulate Java bytecode, making it valuable for educational purposes, performance optimization, and debugging. With Jasmin, you write code in a human-readable assembly language, which is then converted into .class files executable by the JVM. This tool is useful for learning how Java code translates into bytecode, manually optimizing code, or debugging bytecode issues. Jasmin is distributed as a JAR file and can be run from the command line, making it accessible for developers interested in exploring the internals of the JVM.
2. Overview process
To create a programming language and compile it to Java bytecode, we follow several critical steps:
Define the Lexer and Parser: The initial step involves defining the lexer and parser for our programming language. The lexer breaks down the source code into tokens, which represent the smallest units of meaningful data. The parser then processes these tokens according to the grammar rules of the language, creating a syntactic structure that can be analyzed further.
Generate the Abstract Syntax Tree (AST): After parsing, we generate the Abstract Syntax Tree (AST). The AST represents the hierarchical structure of the source code, capturing its syntactic and semantic information. It is an essential data structure used for further analysis and transformations of the code.
Define Exception Checks: With the AST in place, we define exception checks to handle any errors or anomalies that may occur during compilation. These checks ensure that the code adheres to the language's rules and constraints, catching issues such as type mismatches, syntax errors, and invalid operations before the code is executed.
Use Jasmin for Bytecode Generation: Once the code is thoroughly analyzed and error-free, we use Jasmin. Jasmin allows us to write in a simple assembler-like syntax based on the Java Virtual Machine (JVM) instruction set. It translates this assembly code into binary Java class files. These files are then suitable for loading and execution by the Java runtime environment.
3 Define Lexer and Parser
3.1 Lexer
A lexer, also known as a lexical analyzer, is the first phase in the process of compiling source code. Its primary function is to break down the source code into manageable chunks, known as tokens. Tokens are the building blocks of syntax, representing keywords, identifiers, operators, and other significant elements in the code.
The lexer’s job is to scan the input source code and produce a sequence of tokens that are used by the parser in the next phase. This tokenization process simplifies the parsing phase by breaking the code into discrete, meaningful symbols.
In ANTLR, we can write Lexer at a G4 file and here is example how can I define some Lexer
BODY: 'Body' ;
BREAK: 'Break' ;
CONTINUE: 'Continue' ;
DO: 'Do' ;
ELSE: 'Else' ;
ELSEIF: 'ElseIf' ;
ENDBODY: 'EndBody' ;
ENDIF: 'EndIf' ;
ENDFOR: 'EndFor' ;
ENDWHILE: 'EndWhile' ;
FOR: 'For' ;
FUNCTION: 'Function' ;
IF: 'If' ;
PARAMETER: 'Parameter' ;
RETURN: 'Return' ;
THEN: 'Then' ;
VAR: 'Var' ;
WHILE: 'While' ;
TRUE: 'True' ;
FALSE: 'False' ;
ENDDO: 'EndDo' ;
Role of a Lexer:
- Tokenization: It breaks down the input text into tokens.
- Lexical Analysis: It analyzes the structure of tokens to ensure they conform to the language's rules.
- Filtering: It removes unnecessary whitespace and comments.
- Error Detection: It detects and reports lexical errors.
3.2 Parser
A parser is a software component that reads and interprets input data to convert it into a format that is easier for computers to process. Essentially, it takes raw input (often text) and translates it into a structured representation that can be understood and manipulated by other software components.
A parser is a part of a compiler or interpreter that processes the input data (source code, text files, etc.) and converts it into a format that a computer program can work with.
The main purpose of a parser is to ensure that the input data adheres to specific syntax rules and to transform it into a meaningful structure. This is essential for tasks such as language translation, data extraction, and configuration management.
We can write Parser in same file G4 with Lexer in case we want to generate material code in same class for ease of use. Below are some Parser examples:
if_stmt: IF exp THEN list_stmt
(ELSEIF exp THEN list_stmt)*
(ELSE list_stmt)?
ENDIF DOT;
exp: relational_exp;
relational_exp : logical_binary_exp EQUAL logical_binary_exp | ...
logical_binary_exp : logical_binary_exp AND adding_exp | ......
Here is an example. We use a lexer to define parsing, and you can define your parser in more detail and complexity to make your programming language more realistic. By describing your language with more reasonable rules,
4. Generate the Abstract Syntax Tree (AST)
The Abstract Syntax Tree (AST) plays a crucial role in the process of understanding and transforming code. The AST is a fundamental data structure that represents the syntactic structure of source code in a tree-like format. It is used extensively in compiler design, static analysis, and code transformation tools. This article will delve into the concept of AST, its importance, how it works, and its applications.
An Abstract Syntax Tree (AST) is a hierarchical representation of the abstract syntactic structure of source code. Unlike the concrete syntax tree (CST), which represents the exact structure of the source code, the AST abstracts away some of the syntactic details and focuses on the essential structure and relationships between language constructs.
In our project, we utilize ANTLR to generate material code and subsequently construct an Abstract Syntax Tree based on it.
For example, the expression x = 5 + 3 would be represented in an AST as:
= (assignment)
x (variable)
+ (addition)
5 (number)
3 (number)
To generate a material code, we need to run this command.
java -jar <Path to antlr jar file ex: antlr-4.13.2-complete.jar> -o <place you want to generate material code> -no-listener -visitor <Your G4 file directory>
After that, we has
Where
- Lexer.py: This file contains the implementation of the parser for your grammar. The parser takes the sequence of tokens generated by the lexer and builds a parse tree (or syntax tree) based on the grammar rules defined in your .g4 file.
- Visitor.py: This file provides a base implementation of the visitor pattern for your grammar. The visitor pattern is a way to traverse the parse tree or AST. The MyGrammarBaseVisitor class includes default implementations of the visitor methods for each rule in the grammar.
- Parser.py: This file contains the implementation of the parser for your grammar. The parser takes the sequence of tokens generated by the lexer and builds a parse tree (or syntax tree) based on the grammar rules defined in your .g4 file.
Here is sample result when we generate relate to if_statement,
class If_stmtContext(ParserRuleContext):
slots = 'parser'
def init(self, parser, parent:ParserRuleContext=None, invokingState:int=-1):
super().init(parent, invokingState)
self.parser = parser
def IF(self):
return self.getToken(Parser.IF, 0)
def exp(self, i:int=None):
if i is None:
return self.getTypedRuleContexts(Parser.ExpContext)
else:
return self.getTypedRuleContext(Parser.ExpContext,i)
def THEN(self, i:int=None):
if i is None:
return self.getTokens(Parser.THEN)
else:
return self.getToken(Parser.THEN, i)
def list_stmt(self, i:int=None):
if i is None:
return self.getTypedRuleContexts(Parser.List_stmtContext)
else:
return self.getTypedRuleContext(Parser.List_stmtContext,i)
def ENDIF(self):
return self.getToken(Parser.ENDIF, 0)
def DOT(self):
return self.getToken(Parser.DOT, 0)
def ELSEIF(self, i:int=None):
if i is None:
return self.getTokens(Parser.ELSEIF)
else:
return self.getToken(Parser.ELSEIF, i)
def ELSE(self):
return self.getToken(Parser.ELSE, 0)
def getRuleIndex(self):
return Parser.RULE_if_stmt
def accept(self, visitor:ParseTreeVisitor):
if hasattr( visitor, "visitIf_stmt" ):
return visitor.visitIf_stmt(self)
else:
return visitor.visitChildren(self)
We create a class that inheritance from ParseTreeVisitor to generate AST. Here is a example about if_statement.
def visitIf_stmt(self, ctx:Parser.If_stmtContext):
# ifthenStmt:List[Tuple[Expr,List[VarDecl],List[Stmt]]]
lstExp = [exp.accept(self) for exp in ctx.exp()]
lstTupVardecStmt = [tup.accept(self) for tup in ctx.list_stmt()] #[tuple(lstVardec,lstStmt)]
lstOfLstVardec = [tup[0] for tup in lstTupVardecStmt]
lstOfLstStmt = [tup[1] for tup in lstTupVardecStmt]
ifthenStmt = zip(lstExp,lstOfLstVardec,lstOfLstStmt)
if(ctx.ELSE()):
# elseStmt:Tuple[List[VarDecl],List[Stmt]]
elseStmt = lstTupVardecStmt[-1]
return If(ifthenStmt,elseStmt)
else:
elseStmt = ([],[])
return If(ifthenStmt,elseStmt)
5. Define Exception Checks
Exception handling is a crucial aspect of robust software development. It ensures that a program can handle unexpected situations gracefully, providing a smoother user experience and improving application reliability. In this guide, we'll explore the concept of exception checks, including their definition, importance, implementation strategies, and best practices. We’ll also cover real-world examples and common pitfalls to avoid.
Exception checks refer to the mechanisms and practices used to handle and manage exceptions in a program. An exception is an event that disrupts the normal flow of a program’s execution, often due to errors such as invalid input, unavailable resources, or unexpected conditions.
So that, when the AST tree is built, we can handle a concise code snippet to check the validity of the if_statement.
def visitIf(self, ast, o):
self.cur_stmt = ast
inside_env = o
for tup in ast.ifthenStmt:
if(isinstance(tup[0],CallExpr)):
idSymbol = self.getSymbolById(tup[0].method,o,Function())
else:
idSymbol = self.visit(tup[0],o)
if(not isinstance(idSymbol.mtype,ArrayType)):
self.inferType(idSymbol,BoolType())
if(isinstance(idSymbol.mtype,MType)):
if(not isinstance(idSymbol.mtype.return_type,BoolType)):
raise TypeMismatchInStatement(ast)
else:
if(not isinstance(idSymbol.mtype,BoolType)):
raise TypeMismatchInStatement(ast)
if(isinstance(tup[0],CallExpr)):
self.visit(tup[0],o)
lstVardecl = tup[1]
...
...
Our goal in this step is to examine the AST from step 2 for any anomalies that might make the code snippet incorrect. Typical examples of these anomalies are
- Redeclared Variable/Function/Parameter:
- Undeclared Identifier/Function:
- Type Cannot Be Inferred
- Type Mismatch In Statement
- Type Mismatch In Expression
- No Entry Point
6. Use Jasmin for Bytecode Generation
Jasmin is a Java assembler that enables developers to write Java bytecode directly. Bytecode is a low-level, platform-independent code generated from Java source code, which the Java Virtual Machine (JVM) interprets and executes. Jasmin provides a way to generate and manipulate this bytecode, offering more control over the compilation process.
Jasmin uses a low-level assembly-like language to write bytecode. This language is less abstract than Java and allows developers to specify bytecode instructions directly. It provides fine-grained control over the bytecode generation process.
Here’s a simple example of a Jasmin code snippet:
.class public Example
.super java/lang/Object
.method public static main([Ljava/lang/String;)V
.limit stack 1
.limit locals 1
ldc "Hello, world!"
invokevirtual java/io/PrintStream.println(Ljava/lang/String;)V
return
.end method
.class defines the class.
.method specifies a method with its signature.
ldcand invokevirtualare bytecode instructions for loading a constant and invoking a method, respectively.
.limit stack and .limit locals specify the stack and local variables' limits for the method.
7. Conclusion
Creating a programming language from scratch involves several meticulous steps, from defining its purpose and designing syntax to building a lexer, parser, AST, interpreter, or compiler. This guide provided a deep dive into each stage, showcasing code examples and techniques to help you understand how a new language is born.
If you have any questions or need further clarification, feel free to comment below.
Read more at : How are programming languages created?