Adapting Compiler Front Ends for Generalised Parsing

Robert Michael Walsh

Adapting Compiler Front Ends for Generalised Parsing

Robert Michael Walsh

Department of Computer Science

Research output: Thesis › Doctoral Thesis

587 Downloads (Pure)

Abstract

Traditional compiler front-ends are designed around near-deterministic parsing algorithms, restricting the form of the parsing grammar. The corresponding lexical analysers are required to provide only a single partitioning of an input string into tokens. This requires the lexical analyser to apply lexical disambiguation techniques.

A generalised parser can in principle generate all derivations from a grammar whose tokens are the characters of the underlying alphabet (so called character-level parsing) but this results in a very large data structure. This thesis investigates a form of lexical analysis that can offer all tokenisations of a string to a parser, as well as a corresponding multiple input parsing algorithm, MGLL, which can simultaneously construct derivations over all tokenisations. The goal is to provide equivalent generality to a character-level parser, but with signicantly improved performance in terms of both the space required to represent the derivations and the time required to construct them. In addition, lexical-level disambiguation constructs are provided which may be
used to model traditional lexical disambiguation strategies under the new framework.

The thesis will also discuss the translation of derivation trees into abstract syntax forms suitable for use in structural semantics, by annotating a grammar with a small set of local operations called GIFT operators. Some preliminary work on how to specify ambiguity reduction at syntactic level shall also be described.

The thesis concludes with two case studies. One demonstrating the application of this new form of lexical analysis to the C# 2.0 language specication. The other drawn from the PLanCompS project, which shows how to generate derivations in an appropriate abstract syntax for the C# 1.2 language, from a parser which uses the concrete grammar from the C# 1.2 language standard specication.

Original language	English
Qualification	Ph.D.
Awarding Institution	Royal Holloway, University of London
Supervisors/Advisors	Johnstone, Adrian, Supervisor Scott, Elizabeth, Supervisor Watkins, Chris, Advisor
Thesis sponsors	Eng & Phys Sci Res Council EPSRC
Award date	19 Feb 2016
Publication status	Unpublished - 2016

Keywords

parsing
lexical analysis
PhD
disambiguation

Access to Document

WalshRMPhDThesisOther version, 1.34 MB

Cite this

@phdthesis{49bce35e2f99425a8dd6bb616fbb9b26,

title = "Adapting Compiler Front Ends for Generalised Parsing",

abstract = "Traditional compiler front-ends are designed around near-deterministic parsing algorithms, restricting the form of the parsing grammar. The corresponding lexical analysers are required to provide only a single partitioning of an input string into tokens. This requires the lexical analyser to apply lexical disambiguation techniques.A generalised parser can in principle generate all derivations from a grammar whose tokens are the characters of the underlying alphabet (so called character-level parsing) but this results in a very large data structure. This thesis investigates a form of lexical analysis that can offer all tokenisations of a string to a parser, as well as a corresponding multiple input parsing algorithm, MGLL, which can simultaneously construct derivations over all tokenisations. The goal is to provide equivalent generality to a character-level parser, but with signicantly improved performance in terms of both the space required to represent the derivations and the time required to construct them. In addition, lexical-level disambiguation constructs are provided which may beused to model traditional lexical disambiguation strategies under the new framework.The thesis will also discuss the translation of derivation trees into abstract syntax forms suitable for use in structural semantics, by annotating a grammar with a small set of local operations called GIFT operators. Some preliminary work on how to specify ambiguity reduction at syntactic level shall also be described.The thesis concludes with two case studies. One demonstrating the application of this new form of lexical analysis to the C# 2.0 language specication. The other drawn from the PLanCompS project, which shows how to generate derivations in an appropriate abstract syntax for the C# 1.2 language, from a parser which uses the concrete grammar from the C# 1.2 language standard specication.",

keywords = "parsing, lexical analysis, PhD, disambiguation",

author = "Walsh, {Robert Michael}",

year = "2016",

language = "English",

school = "Royal Holloway, University of London",

}

TY - BOOK

T1 - Adapting Compiler Front Ends for Generalised Parsing

AU - Walsh, Robert Michael

PY - 2016

Y1 - 2016

N2 - Traditional compiler front-ends are designed around near-deterministic parsing algorithms, restricting the form of the parsing grammar. The corresponding lexical analysers are required to provide only a single partitioning of an input string into tokens. This requires the lexical analyser to apply lexical disambiguation techniques.A generalised parser can in principle generate all derivations from a grammar whose tokens are the characters of the underlying alphabet (so called character-level parsing) but this results in a very large data structure. This thesis investigates a form of lexical analysis that can offer all tokenisations of a string to a parser, as well as a corresponding multiple input parsing algorithm, MGLL, which can simultaneously construct derivations over all tokenisations. The goal is to provide equivalent generality to a character-level parser, but with signicantly improved performance in terms of both the space required to represent the derivations and the time required to construct them. In addition, lexical-level disambiguation constructs are provided which may beused to model traditional lexical disambiguation strategies under the new framework.The thesis will also discuss the translation of derivation trees into abstract syntax forms suitable for use in structural semantics, by annotating a grammar with a small set of local operations called GIFT operators. Some preliminary work on how to specify ambiguity reduction at syntactic level shall also be described.The thesis concludes with two case studies. One demonstrating the application of this new form of lexical analysis to the C# 2.0 language specication. The other drawn from the PLanCompS project, which shows how to generate derivations in an appropriate abstract syntax for the C# 1.2 language, from a parser which uses the concrete grammar from the C# 1.2 language standard specication.

AB - Traditional compiler front-ends are designed around near-deterministic parsing algorithms, restricting the form of the parsing grammar. The corresponding lexical analysers are required to provide only a single partitioning of an input string into tokens. This requires the lexical analyser to apply lexical disambiguation techniques.A generalised parser can in principle generate all derivations from a grammar whose tokens are the characters of the underlying alphabet (so called character-level parsing) but this results in a very large data structure. This thesis investigates a form of lexical analysis that can offer all tokenisations of a string to a parser, as well as a corresponding multiple input parsing algorithm, MGLL, which can simultaneously construct derivations over all tokenisations. The goal is to provide equivalent generality to a character-level parser, but with signicantly improved performance in terms of both the space required to represent the derivations and the time required to construct them. In addition, lexical-level disambiguation constructs are provided which may beused to model traditional lexical disambiguation strategies under the new framework.The thesis will also discuss the translation of derivation trees into abstract syntax forms suitable for use in structural semantics, by annotating a grammar with a small set of local operations called GIFT operators. Some preliminary work on how to specify ambiguity reduction at syntactic level shall also be described.The thesis concludes with two case studies. One demonstrating the application of this new form of lexical analysis to the C# 2.0 language specication. The other drawn from the PLanCompS project, which shows how to generate derivations in an appropriate abstract syntax for the C# 1.2 language, from a parser which uses the concrete grammar from the C# 1.2 language standard specication.

KW - parsing

KW - lexical analysis

KW - PhD

KW - disambiguation

M3 - Doctoral Thesis

ER -