Abstract
Traditional compiler front-ends are designed around near-deterministic parsing algorithms, restricting the form of the parsing grammar. The corresponding lexical analysers are required to provide only a single partitioning of an input string into tokens. This requires the lexical analyser to apply lexical disambiguation techniques.
A generalised parser can in principle generate all derivations from a grammar whose tokens are the characters of the underlying alphabet (so called character-level parsing) but this results in a very large data structure. This thesis investigates a form of lexical analysis that can offer all tokenisations of a string to a parser, as well as a corresponding multiple input parsing algorithm, MGLL, which can simultaneously construct derivations over all tokenisations. The goal is to provide equivalent generality to a character-level parser, but with signicantly improved performance in terms of both the space required to represent the derivations and the time required to construct them. In addition, lexical-level disambiguation constructs are provided which may be
used to model traditional lexical disambiguation strategies under the new framework.
The thesis will also discuss the translation of derivation trees into abstract syntax forms suitable for use in structural semantics, by annotating a grammar with a small set of local operations called GIFT operators. Some preliminary work on how to specify ambiguity reduction at syntactic level shall also be described.
The thesis concludes with two case studies. One demonstrating the application of this new form of lexical analysis to the C# 2.0 language specication. The other drawn from the PLanCompS project, which shows how to generate derivations in an appropriate abstract syntax for the C# 1.2 language, from a parser which uses the concrete grammar from the C# 1.2 language standard specication.
A generalised parser can in principle generate all derivations from a grammar whose tokens are the characters of the underlying alphabet (so called character-level parsing) but this results in a very large data structure. This thesis investigates a form of lexical analysis that can offer all tokenisations of a string to a parser, as well as a corresponding multiple input parsing algorithm, MGLL, which can simultaneously construct derivations over all tokenisations. The goal is to provide equivalent generality to a character-level parser, but with signicantly improved performance in terms of both the space required to represent the derivations and the time required to construct them. In addition, lexical-level disambiguation constructs are provided which may be
used to model traditional lexical disambiguation strategies under the new framework.
The thesis will also discuss the translation of derivation trees into abstract syntax forms suitable for use in structural semantics, by annotating a grammar with a small set of local operations called GIFT operators. Some preliminary work on how to specify ambiguity reduction at syntactic level shall also be described.
The thesis concludes with two case studies. One demonstrating the application of this new form of lexical analysis to the C# 2.0 language specication. The other drawn from the PLanCompS project, which shows how to generate derivations in an appropriate abstract syntax for the C# 1.2 language, from a parser which uses the concrete grammar from the C# 1.2 language standard specication.
Original language | English |
---|---|
Qualification | Ph.D. |
Awarding Institution |
|
Supervisors/Advisors |
|
Thesis sponsors | |
Award date | 19 Feb 2016 |
Publication status | Unpublished - 2016 |
Keywords
- parsing
- lexical analysis
- PhD
- disambiguation