Resolve Symbols and Calculate Types with Sharplasu
Parsing is typically where we begin and invest much of our enthusiasm. However, completing the parser is just the beginning. Sadly, I must inform you that additional steps are necessary. One step that enhances the value of the Abstract Syntax Trees (ASTs) obtained from parsing is semantic enrichment. In this article, we’ll explore what semantic enrichment is, what it enables us to do, and we are going to share some code.
As always, the code for this article is available on GitHub. You can find it at https://github.com/Strumenta/an-entity-language-sharplasu. In this article we just talk about symbol resolution, so if you want details about the parsing and creation of the AST you will have to look in the repository.
It All Starts With Parsing, but the Assembly Line Continue
When we parse, we recognize the structure of the code, by analyzing its syntax.
But what does it mean to parse? It means to:
- To check the code is syntactically correct
- To build the Abstract Syntax Tree (AST), we make nodes representing the structures we recognize. For example, we could process the code
dog.fly()
and return a node representing a method call.
All is good, except for the fact that dogs cannot fly.
The code is syntactically correct and semantically incorrect.
When we process code we start by verifying that it is syntactically correct.
If it is, we move to the semantic analysis. When the code is semantically correct, we can enrich the AST with semantic information.
In essence, we are interested in two things:
- Connecting references to the corresponding declarations. We call this symbol resolution
- Calculating the types of elements that can be typed. We call this type calculation
The Link Between Symbol Resolution and Type Calculation
I am an engineer by trade, and I really like the step-by-step approach to problem-solving: you solve one part of the problem and only then move to the next one. So you may wonder why I am conflating two apparently different problems, like symbol resolution and type calculation. As for most things I do, it is because there is no better alternative.
The two mechanisms are interconnected: one depends on the other. For example, let’s say that I have this code:
class TimeMachine { Time move(TimeIncrement) Point move(Direction) } class Point { Point add(Direction) } class Time { Time add(TimeIncrement) } myTimeMachine.move(foo).add(bar)
Let’s say my goal is to figure out the type of the overall expression myTimeMachine.move(foo).add(bar)
.
To answer this question I need to figure out which add
method we are referring to. If it is the one declared in Point
, then the overall result will be of type Point
. If instead we are referring to the add method declared in Time
, then the overall result will be of type Time
.
Ok, but how do I know which add
method I am calling? Well, it depends on the type of myTimeMachine.move(foo)
:
- If that expression has type
Point
andbar
has typeDirection
, then we are calling theadd
method inPoint
. - If instead that expression has type
Time
, and the expressionbar
has typeTimeIncrement
, then we are calling theadd
method inTime
.
This means, I need to figure out the type of myTimeMachine.move(foo)
. To do so, I need first to figure out if I am calling the first or the second overloaded variant of TimeMachine.move
. And that depends on the type of foo
.
So, you see, I cannot extricate the two problems: they affect each other results, and therefore, in principle, we treat them together. In practice, for very simple languages we can get away by treating them separately. Typically, you need to treat the problems in a combined way if there are composite types or cascading functions/method calls.
If you want to read about symbol resolution for a language like Java, you can look at How to Build a Symbol Solver for Java, in Clojure or Resolve method calls in Java code using the JavaSymbolSolver.
A Prerequisite for Any Non-Trivial Operation
Semantic Enrichment is a prerequisite for most programmatic handling of code.
Perhaps you may implement a linter or a code formatter without semantic enrichment, but for the most typical language engineering applications you need semantic enrichment:
- Interpretation: To execute code, we need to connect function invocation to their definition
- Migrations: To migrate code in any nontrivial way, we want to take into account the type of the different elements. For example, if we were translating the operation
a + b
, depending on the target language and the type of the operands, we may want to translate it asa + b
or perhapsa.concat(b)
or evena + b.toDouble()
. - Static analysis and refactoring: Automated code modifications, such as renaming variables or moving functions, depend on knowing which references are linked to which declarations.
- Editors: Autocompletion or error-checking depends on semantic enrichment. But also go-to-definition or find-usages. In essence the difference between a basic editor and an advanced one is in their support for semantic enrichment for the language of interest.
The StarLasu Approach and Semantic Enrichment
When it comes to parsing, at Strumenta we apply the principles of the Chisel Method. They are quite established at this point, after years of refining. For Semantic Enrichment, things are not as crystallized as we evolve the approach at each new project, finding new solutions to new challenges. That said, we are finding patterns that work and incorporating them in our core libraries and in Sharplasu in particular.
At this stage, Sharplasu has a module called SymbolResolution which has a reasonably good approach to symbol resolution. Type calculation is instead still implemented ad-hoc for each project, at this time. So we call the type calculation logic from symbol resolution (and viceversa). It is just that we have standard APIs for symbol resolution and not for type calculation.
Let’s See an Example of Semantic Enrichment
In our example we will work with a simple language that permits us to define entities. These entities have fields, called features with types. They can be initialized with expressions.
This is an example:
module example import standard type address class base { name string description string } class aged : base { age integer } class person : aged { speed integer = 2 } class athlete : person { speed integer = person.speed * 2 } class car : aged { kilometers integer = age * 10000 }
This language, while simple, contains some of the elements that we can find in most languages:
- We can import modules
- We can define new types
- We have built-in types such as
string
andinteger
- We can reference features without specifying the context (therefore using the current object as context) or specifying it
Symbol Resolution
Let’s take a look at a portion of our SymbolResolver
.
Scope ModuleLevelTypes(Node ctx) { var scope = new Scope(); var module = ctx.FindAncestorOfType<Module>(); if (module != null) { // let's define types module.Types.ForEach(type => scope.Define(type)); foreach (var import in module.Imports) { SymbolResolver.ResolveSymbols(import); if (import.Module.Referred?.Entities != null) { foreach (var entity in import.Module.Referred.Entities) { scope.Define(entity); } } if (import.Module.Referred?.Types != null) { foreach (var type in import.Module.Referred.Types) { scope.Define(type); } } } } return scope; } [..] public ExampleSemantics(IModuleFinder moduleFinder) { ModuleFinder = moduleFinder; SymbolResolver = new DeclarativeLocalSymbolResolver(Issues); [..] SymbolResolver.ScopeFor(typeof(FeatureDecl).GetProperty("Type"), (FeatureDecl feature) => { var scope = ModuleLevelTypes(feature); return scope; }); [..] SymbolResolver.ScopeFor(typeof(Import).GetProperty("Module"), (Import import) => { var scope = new Scope(); if(moduleFinder.FindModule(import.Module.Name) != null) { scope.Define(moduleFinder.FindModule(import.Module.Name)); } return scope; }); TypeCalculator = new EntityTypeCalculator(SymbolResolver); }
Here we want to see the basics of symbol resolution and how importing symbols from other elements works. We first see that symbol resolution depends on moduleFinder
. The moduleFinder
is the thing containing the list of available code for a project. This means files of the project and any library available for the project. You can see on the repository that, for this project, is just a Dictionary that keeps tracks of the name and the corresponding Module
object. A Module
object is the root of the AST. Given the previous example file, there will a Module
with name "example"
representing that. The important part is that is the object that tells you if and where modules outside the current one are located.
You can see that to solve imports, as in:
import standard
To solve a symbol you need a scope. You can think of scope as a container of available definitions. In Sharplasu, a Scope
can have a parent Scope
, so you can properly nest them. For example, there could be global scope to solve imports and a class scope to solve features.
So, we create a scope, and then we ask the moduleFinder
if there is a module with that name, and, if so, we define that module. Defining a symbol means telling our SymbolResolver
object that there is a definition of the argument in the current scope. The SymbolResolver
object is based on a Sharplasu class, so you can use that class for all your projects. Later, when we will ask our symbol resolver to resolve the symbols, it will look in the scope of that reference and check if there is valid definition for a reference with that name.
How Importing Modules Affects Types
You will notice that defining module is necessary to solve the type of features.
class base { name string description string }
So, string
after name
and description
are types and name string
is a definition of a feature.
To solve the references to the types of features you call the ModuleLevelTypes
method. This method will:
- look for definition of types in the current module
- loop through imported module, make sure that all references in imported modules are solved
- then it will define the types in each imported module
Solving imports is therefore crucial to solve types. Particularly as in this example, as in many real languages, types are often the ones from the standard library.
class athlete : person { speed integer = person.speed * 2 } class car : aged { kilometers integer = age * 10000 }
Solving ReferenceExpression
In our language a ReferenceExpression
, like person.speed
or age
can only have:
- an optional parent/
context
element that is a class (likeperson
) - a
target
element that references a feature (likespeed
orage
)
No nesting or multiple levels are allowed.
So, solving either a reference to the context or target element is similar.
SymbolResolver.ScopeFor(typeof(ReferenceExpression).GetProperty("Context"), (ReferenceExpression reference) => { var scope = new Scope(); var classParent = reference.FindAncestorOfType<ClassDecl>(); if (classParent != null) scope = ClassHierarchyEntities(reference.FindAncestorOfType<ClassDecl>()); else Issues.Add(Issue.Semantic("The class containing this expression has no superclasses. The Context cannot be solved.", reference.Position)); return scope; }); SymbolResolver.ScopeFor(typeof(ReferenceExpression).GetProperty("Target"), (ReferenceExpression reference) => { var scope = new Scope(); if (reference.Context == null) { var classParent = reference.FindAncestorOfType<ClassDecl>(); if (classParent != null) scope.Parent = ClassLevelTypes(classParent); } else if (reference.Context.Resolved) { reference.Context.Referred.Features.ForEach(it => scope.Define(it)); } return scope; });
One difference is just that if the current expression contains a Context
element, but the class containing the expression has no superclass, we have an issue, because the reference cannot be solved. The other one is that for solving Target
, we must first solve Context
, to make sure we only consider the features in the Context
element.
Otherwise, we look for the proper elements in the parent class. Let’s just look at how to solve the hierarchy of classes.
Scope ClassHierarchyEntities(ClassDecl ctx) { var scope = new Scope(); var superclass = ctx.Superclass; if (superclass != null && superclass.Resolved) { // let's define the superclass scope.Define(superclass.Referred); scope.Parent = ClassHierarchyEntities(superclass.Referred); } return scope; }
If the reference to the superclass of the current class has been solved, we define the current superclass. Then we rise up through the hierarchy of classes to define them all.
class base { name string description string } class aged : base { age integer } class person : aged { speed integer = 2 } class athlete : person { speed integer = person.speed * 2 }
Basically, considering the previous example, to solve the reference to person
, in person.speed
we define person
, because is the superclass of athlete
, the class containing the expression, then aged
and base
.
Symbol Resolution Patterns
We can then see a few rules:
- The way in which we resolve imports is by delegating to the moduleFinder object. This is the case because we need to use some logic to find other files and parse them on demand, possibly managing loops (what if a file import itself, directly or indirectly?)
- When looking for a superclass we use a global scope, for all classes in the module
- For solving references to feature we do not specify any element declared at that level. We just look for features declared in the classes containing them. So to do this, we define a parent scope
- The case of
ModuleLevelTypes
is interesting because there we can see a combination of elements:- We get all the types declared in the module
- We get all the types declared in the imported modules
- We could also get all the built-in entities, but we choose to force the user to import a standard module to get them instead
These few rules cover many of the most common patterns we see in languages we work with, either Domain Specific Languages we design or legacy languages for which we build parsers.
Type Calculation
Let’s take a look at our simple Type Calculator. Notice that we managed to separate our type calculation from symbol resolution. Our type calculation needs symbol resolution, but not vice versa. We are going to see this simpler case first, then what happens in the standard case.
public override IType CalculateType(Node node) { switch (node) { case OperatorExpression opExpr: var leftType = GetType(opExpr.Left); var rightType = GetType(opExpr.Right); if (leftType == null || rightType == null) return null; switch (opExpr.Operator) { case Operator.Addition: if (leftType == EntityStandardLibrary.StringType && rightType == EntityStandardLibrary.StringType) return EntityStandardLibrary.StringType; else if (leftType == EntityStandardLibrary.StringType && rightType == EntityStandardLibrary.IntegerType) return EntityStandardLibrary.StringType; else if (leftType == EntityStandardLibrary.IntegerType && rightType == EntityStandardLibrary.IntegerType) return EntityStandardLibrary.IntegerType; else throw new NotImplementedException($"Unsupported operand types for addition: {leftType}, {rightType}"); [..] } case ReferenceExpression refExpr: if(refExpr.Context == null) return GetTypeOfReference<ReferenceExpression, FeatureDecl>(refExpr, typeof(ReferenceExpression).GetProperty("Target")); else { SymbolResolver.ResolveNode(refExpr); return GetTypeOfReference<ReferenceExpression, FeatureDecl>(refExpr, typeof(ReferenceExpression).GetProperty("Target")); } [..] case StringLiteralExpression _: return EntityStandardLibrary.StringType; case BooleanLiteralExpression _: return EntityStandardLibrary.BooleanType; case IntegerLiteralExpression _: return EntityStandardLibrary.IntegerType; default: throw new NotImplementedException($"Type calculation not implemented for node type {node.GetType()}"); } }
It shows common patterns for type calculation:
- On the bottom you can see that we solve types for literals: we assign a standard type for each literal
- We solve types for binary operations by finding types for the two individual elements (
left
andright
) of the expression and then defining rules for the combination. For each an addition of a string and an integer is considered a concatenation, so the resulting type is an integer. This will vary very much by language and how you choose to handle type conversion between compatible types - To solve the type of reference expressions, we need to solve the reference and then get its type
Calculating the Type of References
To solve the type of references we get a look at GetTypeOfReference
method. This method has type argument the class that can hold the reference and the information about the property that will hold the reference in the first argument. Notice that in this case we could avoid using the first type argument, since we ReferenceExpression
is the only kind of expression containing a reference. However, this code shows that it is easy to generalize this method.
private IType GetTypeOfReference<T, S>(T refHolder, PropertyInfo refAccessor) where T : Node where S : Node, Named { ReferenceByName<S> refValue = refAccessor.GetValue(refHolder) as ReferenceByName<S>; if (refValue != null && !refValue.Resolved) { SymbolResolver?.ResolveProperty(refAccessor, refHolder); } else if (refValue != null && refValue.Resolved != false) return GetType(refValue.Referred as Node); return null; }
The method uses a bit of reflection. In essence, it checks if the providing property corresponds to a reference that has been solved. If it is not, then it triggers the symbol resolution, supposing we have a SymbolResolver
. Then we get the type referred to. GetType
is simply a way to access a Dictionary
matching nodes with types.
These are the common patterns to calculate a type. One that we are missing is having a special type for void
or unit
, that represents no type.
Again, it may sound quite boring, but these are the kind of patterns we routinely see. Of course, things can get more exciting if we throw generics and type inference in the mix, but for this time, let’s keep things simple.
Our types are all classes that inherits from an interface IType
. This is not an interface part of Sharplasu, we created it for this example, but it is so simple that you can look it up on your own in the repository.
When Type Calculation and Symbol Resolution Intertwine
We avoided making type calculation and symbol resolution be dependent on each other for a few reasons. Our references had only two levels and we knew that the first one was always a superclass of the current one. Imagine we change that.
class address { note base city string street string number integer } class person : aged { location address speed integer = 2 } class athlete : person { deliveryNote string = location.note.description luckyNumber integer = location.number + 3 }
Now, the features can have as a type a class, other than scalar types. Our references now have a Context
property, which is an Expression
. So, the node ReferenceExpression
for location.note.description
will have this structure. At the first level we have a ReferenceExpression
with Target description
and Context
another ReferenceExpression
with Target note
and so on.
This means that now we cannot determine statically what will be the actual type of a Context
object: it could be a scalar type or a class. So, to solve a reference in Target
we need to dynamically define the type of Context, which will depend on what the reference in Context
resolves to. So, how do we accomplish this? For starters, we need to make a small change in the ModuleLevelTypes
method and make sure that we define all entities at the module level.
module.Entities.ForEach(type => scope.Define(type));
Apart from that all we need to change is how we solve symbols for the Target
property.
SymbolResolver.ScopeFor(typeof(ReferenceExpression).GetProperty("Target"), (ReferenceExpression reference) => { var scope = new Scope(); var classParent = reference.FindAncestorOfType<ClassDecl>(); if (classParent != null) scope.Parent = ClassLevelTypes(classParent); if (reference.Context != null) { SymbolResolver.ResolveNode(reference.Context); var type = TypeCalculator.GetType(reference.Context) as ClassDecl; if (type != null) { type.Features.ForEach(it => scope.Define(it)); } } return scope; });
The pattern is simple:
- We ensure we have resolved the
Context
node, so we know which feature Context resolves to - This allows us to get the type, i.e. the class the features has
- We can now define the features of the class
And voilà, we can now solve the current reference.
Using Semantic Enrichment
It is easy to glue together symbol resolution and type calculation.
public List<Issue> SemanticEnrichment(Node node) { SymbolResolver.ResolveSymbols(node); node.WalkDescendants<Expression>().ToList().ForEach(expression => { TypeCalculator.SetTypeIfNeeded(expression); }); return Issues; }
So, we can take any module and trigger symbol resolution. We will then get an AST containing references that have been resolved. Also, the types will be stored in a cache in TypeCalculator
.
public abstract class TypeCalculator { public virtual IType GetType(Node node) { return SetTypeIfNeeded(node); } public IType StrictlyGetType(Node node) { var type = SetTypeIfNeeded(node); if (type == null) throw new InvalidOperationException($"Cannot get type for node {node}"); return type; } public abstract IType CalculateType(Node node); public virtual IType SetTypeIfNeeded(Node node) { if (node.GetTypeSemantics() == null) { var calculatedType = CalculateType(node); node.SetTypeSemantics(calculatedType); } return node.GetTypeSemantics(); } }
This means that after invoking the semantic enrichment, we can look each of our nodes of class Expression
and get their type using mynode.GetTypeSemantics()
. Easy, right?
A Simple Test
What is life without tests? Here at Strumenta we do not want to imagine such a sorry existence, so our repository has a few tests. Let’s see a simple one:
[TestMethod] public void TestTypeCalculation() { EntitySharplasuParser parser = new EntitySharplasuParser(); string code = @"module example import standard class base { name string description string } class aged : base { age integer } class person : aged { location address speed integer = 2 } class address { note base city string street string number integer } class athlete : person { deliveryNote string = location.note.description luckyNumber integer = location.number + 3 } class car : aged { kilometers integer = age * 10000 }"; var result = parser.Parse(code); SimpleModuleFinder moduleFinder = new SimpleModuleFinder(); ExampleSemantics semantics = new ExampleSemantics(moduleFinder); List<Issue> issues = semantics.SemanticEnrichment(result.Root); Assert.AreEqual(0, issues.Count); result.Root.AssertAllExpressionsHaveTypes(); Assert.AreEqual("string", result.Root.Entities[4].Features[0].Value.GetTypeSemantics().Name ); }
We test that all expressions have a type, then we check one specific expression. The highlighted expression should have type string
, since location
is of class address
which has the note
feature of class base
, which in turn has a feature description
of type string
.
Conclusions
While parsing organizes code into a syntactic structure, Semantic Enrichment uncovers its meaning by resolving symbols and determining types. Without this critical step, advanced operations such as code generation, interpretation, and refactoring would be impossible. Semantic Enrichment is not trivial to implement.
With this article, we wanted to share some of the principles behind it. And through the support built-in in Sharplasu, we want to provide a way to simplify the implementation of advanced Language Engineering solutions. For us, it has been working pretty well, and we hope it will work similarly well for you.
Have fun with your Language Engineering project!
The post Resolve Symbols and Calculate Types with Sharplasu appeared first on Strumenta.