Python Library · Compiler Design

compilerdesign

A pure-Python teaching library that covers the whole compiler-design syllabus: lexer, grammar transforms, FIRST/FOLLOW sets, LL(1) parsing, shift-reduce parsing, LR(0) items, symbol tables, three-address code, DAG construction, and expression conversion. Every result has a matching show_* pretty-printer for clean tabular output. Import it as cd. Zero dependencies.

pip
pip install compilerdesign
Python ≥ 3.8 · No dependencies · MIT License · 11 modules · 30+ functions · 12 show_* printers
Contents

00 Quick Start · Installation & first usage
01 Lexical Analyzer · cd.lexical_analyzer(code)
02 Symbol Table · cd.build_symbol_table(code)
03 Grammar Transforms · cd.eliminate_left_recursion(g)
04 FIRST & FOLLOW · cd.compute_first_follow(g)
05 LL(1) Table · cd.build_ll1_table(g)
06 Shift-Reduce · cd.shift_reduce_parse(p, t)
07 Leading & Trailing · cd.compute_leading_trailing(g)
08 LR(0) Items · cd.compute_lr0_items(p, start)
09 Expression Conversion · cd.infix_to_postfix(expr)
10 Three-Address Code · cd.generate_three_address_code(e)
11 DAG Construction · cd.build_dag(expr)
12 Pretty-Printers · cd.show_*(result)

00

Quick Start

Installation & first usage
  • 1

    Install the package

    Requires Python 3.8 or later. No external dependencies.

    bash
    pip install compilerdesign
  • 2

    Import as cd

    Use the conventional alias cd for brevity.

    python
    import compilerdesign as cd
  • 3

    Try your first analysis

    python
    import compilerdesign as cd
    
    # Tokenize C-style code
    result = cd.lexical_analyzer("int x = 10 + y;")
    print(result['summary'])
    
    # Convert expression notation
    print(cd.infix_to_postfix("a + b * c"))  # a b c * +
  • 4

    See the output as a clean table

    Every function has a matching show_* pretty-printer. It prints an ASCII table / trace and returns the same object so you can chain.

    python
    # Raw dict
    result = cd.lexical_analyzer("int x = 10;")
    
    # Pretty table (same dict, printed nicely)
    cd.show_lexical(result)
ℹ️
Grammar format: dict of non-terminal → list of production strings. Symbols are space-separated. Use ε or eps for epsilon.
💡
Two calls, two styles. cd.xxx(...) gives you the raw dict to inspect programmatically; cd.show_xxx(result) prints it as a teacher-style table. Jump straight to §12 Pretty-Printers for the full list.
01

Lexical Analyzer

Tokenize source code · count token types
cd.lexical_analyzer
lexical_analyzer(code: str) → dict
lexer tokenizer

Tokenizes C-style source code using regex-based pattern matching. Classifies each token as keyword, identifier, constant, operator, punctuation, string literal, or comment. Returns detailed counts and unique token lists.

python
code = """
int main() {
    int x = 10;
    float y = 3.14;
    return x + y;
}
"""

result = cd.lexical_analyzer(code)

print(result['summary'])
# {
#   'total_tokens': 24,
#   'total_characters': 52,
#   'total_lines': 7,
#   'keywords': 5,
#   'identifiers': 3,
#   'integer_constants': 1,
#   'float_constants': 1,
#   'operators': 3,
#   'punctuation': 8,
# }

print(result['tokens'])         # [('KEYWORD','int'), ('KEYWORD','main'), ...]
print(result['unique_identifiers']) # ['x', 'y']
print(result['keywords_used'])      # ['int', 'float', 'return', 'main']

↳ Return value keys

Key                  Type             Description
tokens               list of tuples   Each element is (token_type, value)
summary              dict             Count of each token category + lines + characters
unique_identifiers   list             Deduplicated list of all identifier tokens
keywords_used        list             Deduplicated list of all keyword tokens found
💡
Token types: KEYWORD, IDENTIFIER, INTEGER_CONSTANT, FLOAT_CONSTANT, OPERATOR, PUNCTUATION, STRING_LITERAL, COMMENT, UNKNOWN
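The mechanics behind a lexer like this are easy to sketch independently of the library: a master regex built from named groups, longest-match alternation, and a keyword filter on identifiers. This is a minimal illustration, not the package's own implementation (the keyword set and operator pattern here are assumptions).

```python
import re

# A hypothetical minimal tokenizer: patterns are tried in order, so
# FLOAT_CONSTANT must come before INTEGER_CONSTANT.
KEYWORDS = {"int", "float", "char", "double", "return", "if", "else", "while"}
TOKEN_SPEC = [
    ("FLOAT_CONSTANT",   r"\d+\.\d+"),
    ("INTEGER_CONSTANT", r"\d+"),
    ("IDENTIFIER",       r"[A-Za-z_]\w*"),
    ("OPERATOR",         r"[+\-*/=<>!]=?"),
    ("PUNCTUATION",      r"[(){};,]"),
    ("SKIP",             r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(code):
    tokens = []
    for m in MASTER.finditer(code):
        kind, value = m.lastgroup, m.group()
        if kind == "SKIP":
            continue                      # drop whitespace
        if kind == "IDENTIFIER" and value in KEYWORDS:
            kind = "KEYWORD"              # reclassify reserved words
        tokens.append((kind, value))
    return tokens

print(tokenize("int x = 10 + y;"))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'x'), ('OPERATOR', '='),
#  ('INTEGER_CONSTANT', '10'), ('OPERATOR', '+'), ('IDENTIFIER', 'y'),
#  ('PUNCTUATION', ';')]
```

The named-group trick (`m.lastgroup`) tells you which alternative matched without re-testing each pattern.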
02

Symbol Table

Scope-aware declaration tracking
cd.build_symbol_table
build_symbol_table(code: str) → dict
semantic scope

Scans C-like source code and records every declared function, parameter and variable with its type, kind, scope and line number. Intended for teaching — not a full type checker.

python
code = """
int globalVar;
int add(int a, int b) {
    int sum = a + b;
    return sum;
}
"""

table = cd.build_symbol_table(code)
cd.show_symbol_table(table)   # pretty table

print(table['total'])      # 5
print(table['entries'][0])
# {'name': 'globalVar', 'type': 'int', 'kind': 'variable', 'scope': 'global', 'line': 2}

↳ Return value keys

Key       Type           Description
entries   list of dict   Each entry has name, type, kind, scope, line
total     int            Number of unique symbols recorded
💡
kind is one of function, parameter, variable. Scope is either global, a function name, or block@<line>.
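To show the shape of the entries without the library's scope machinery, here is a rough, hypothetical scanner that records global variable declarations only; the regex and type list are assumptions, and it is far simpler than the package's tracker.

```python
import re

# Matches simple global declarations like "int globalVar;" at line start.
DECL = re.compile(r'^\s*(int|float|char|double)\s+(\w+)\s*;')

def scan_globals(code):
    entries = []
    for lineno, line in enumerate(code.splitlines(), start=1):
        m = DECL.match(line)
        if m:
            entries.append({'name': m.group(2), 'type': m.group(1),
                            'kind': 'variable', 'scope': 'global',
                            'line': lineno})
    return entries

print(scan_globals("int globalVar;\nfloat ratio;"))
# [{'name': 'globalVar', 'type': 'int', 'kind': 'variable',
#   'scope': 'global', 'line': 1},
#  {'name': 'ratio', 'type': 'float', 'kind': 'variable',
#   'scope': 'global', 'line': 2}]
```

A real symbol-table builder additionally tracks function signatures, parameters, and nested block scopes, as the section above describes.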
03

Grammar Transforms

Left recursion · Left factoring · Ambiguity check
📌
Grammar format: A dict mapping each non-terminal (string) to a list of production strings. Symbols are space-separated. Use ε or eps for epsilon.
cd.eliminate_left_recursion
eliminate_left_recursion(grammar: dict) → dict
grammar · LR

Eliminates immediate left recursion from each non-terminal. Handles A → Aα | β patterns by introducing A' non-terminals.

python
grammar = {
    'E': ['E + T', 'T'],
    'T': ['T * F', 'F'],
    'F': ['( E )', 'id']
}

new_g = cd.eliminate_left_recursion(grammar)
# {
#   'E':  ["T E'"],
#   "E'": ["+ T E'", 'ε'],
#   'T':  ["F T'"],
#   "T'": ["* F T'", 'ε'],
#   'F':  ['( E )', 'id']
# }
cd.left_factoring
left_factoring(grammar: dict) → dict
grammarLL

Applies left factoring to eliminate common prefixes in productions. Introduces new non-terminals with primes (') to factor out shared prefixes.

python
grammar = {'A': ['a b', 'a c', 'd']}

factored = cd.left_factoring(grammar)
# {'A': ["a A'", 'd'],  "A'": ['b', 'c']}
cd.check_ambiguity
check_ambiguity(grammar: dict) → dict
grammar · validation

Heuristic ambiguity detection. Flags duplicate productions and productions sharing the same leading symbol. Ambiguity is undecidable in general, so treat this as a screening aid, not a proof.

python
dup_grammar = {'A': ['a', 'a']}
result = cd.check_ambiguity(dup_grammar)
# {'ambiguous': True, 'issues': ["A has duplicate productions."]}

clean_grammar = {'A': ['a b', 'c']}
result = cd.check_ambiguity(clean_grammar)
# {'ambiguous': False, 'issues': []}
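The immediate-left-recursion transform (A → Aα | β becomes A → βA', A' → αA' | ε) can be sketched directly over the same space-separated grammar format; this is an illustrative standalone version, not the library's code, and it handles only immediate (not indirect) left recursion.

```python
def eliminate_left_recursion(grammar):
    out = {}
    for A, prods in grammar.items():
        # Split productions into A -> A α (recursive) and A -> β (base).
        rec  = [p.split()[1:] for p in prods if p.split()[0] == A]
        base = [p.split()     for p in prods if p.split()[0] != A]
        if not rec:
            out[A] = prods[:]             # nothing to do
            continue
        Ap = A + "'"                      # fresh primed non-terminal
        out[A]  = [" ".join(b + [Ap]) for b in base]
        out[Ap] = [" ".join(a + [Ap]) for a in rec] + ["ε"]
    return out

g = {'E': ['E + T', 'T'], 'T': ['T * F', 'F'], 'F': ['( E )', 'id']}
print(eliminate_left_recursion(g))
# {'E': ["T E'"], "E'": ["+ T E'", 'ε'],
#  'T': ["F T'"], "T'": ["* F T'", 'ε'], 'F': ['( E )', 'id']}
```

Indirect left recursion (A → Bα, B → Aβ) needs the full substitution-ordering algorithm on top of this step.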
04

FIRST & FOLLOW

Set computation for predictive parsing
cd.compute_first_follow
compute_first_follow(grammar: dict, start: str = None) → dict
sets · LL(1)

Computes both FIRST and FOLLOW sets in one call. The start symbol defaults to the first key in the grammar dict. Returns sorted lists for each non-terminal.

python
grammar = {
    'E': ['T R'],    'R': ['+ T R', 'ε'],
    'T': ['F Y'],    'Y': ['* F Y', 'ε'],
    'F': ['( E )', 'i']
}

result = cd.compute_first_follow(grammar, start='E')

print(result['FIRST'])
# {'E': ['(', 'i'], 'R': ['+', 'ε'], 'T': ['(', 'i'],
#  'Y': ['*', 'ε'], 'F': ['(', 'i']}

print(result['FOLLOW'])
# {'E': ['$', ')'], 'R': ['$', ')'], 'T': ['$', ')', '+'],
#  'Y': ['$', ')', '+'], 'F': ['$', ')', '*', '+']}

↳ Individual functions

python
grammar = {
    'E': ['T R'], 'R': ['+ T R', 'ε'],
    'T': ['F Y'], 'Y': ['* F Y', 'ε'],
    'F': ['( E )', 'i']
}
first = cd.compute_first(grammar)
follow = cd.compute_follow(grammar, start='E')
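FIRST is a fixed-point computation: keep propagating until no set grows. The sketch below works on the same grammar format (any key of the dict is a non-terminal; everything else, including ε, is treated as terminal) and is an independent illustration, not the library's implementation.

```python
def compute_first(grammar):
    first = {A: set() for A in grammar}
    changed = True
    while changed:                        # iterate to a fixed point
        changed = False
        for A, prods in grammar.items():
            for prod in prods:
                for X in prod.split():
                    if X not in grammar:  # terminal (or ε): add and stop
                        if X not in first[A]:
                            first[A].add(X)
                            changed = True
                        break
                    add = first[X] - {'ε'}
                    if not add <= first[A]:
                        first[A] |= add
                        changed = True
                    if 'ε' not in first[X]:
                        break             # X is not nullable
                else:                     # every symbol was nullable
                    if 'ε' not in first[A]:
                        first[A].add('ε')
                        changed = True
    return {A: sorted(s) for A, s in first.items()}

g = {'E': ['T R'], 'R': ['+ T R', 'ε'],
     'T': ['F Y'], 'Y': ['* F Y', 'ε'],
     'F': ['( E )', 'i']}
print(compute_first(g))
# {'E': ['(', 'i'], 'R': ['+', 'ε'], 'T': ['(', 'i'],
#  'Y': ['*', 'ε'], 'F': ['(', 'i']}
```

FOLLOW is computed the same way, seeding the start symbol with $ and propagating FIRST of whatever can appear after each non-terminal.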
05

LL(1) Predictive Parsing

Table construction · top-down parsing
cd.build_ll1_table
build_ll1_table(grammar: dict, start: str = None) → dict
LL(1) · table

Builds the LL(1) predictive parsing table from the grammar. Detects conflicts and reports whether the grammar is LL(1).

python
grammar = {
    'E': ['T R'], 'R': ['+ T R', 'ε'],
    'T': ['F Y'], 'Y': ['* F Y', 'ε'],
    'F': ['( E )', 'i']
}
r = cd.build_ll1_table(grammar, start='E')

print(r['is_ll1'])      # True
print(r['conflicts'])  # [] (empty if LL(1))
print(r['table'])      # {('E','('): 'T R', ('E','i'): 'T R', ...}
cd.ll1_parse
ll1_parse(grammar: dict, tokens: list, start: str = None) → dict
LL(1) · parser

Parses a list of input tokens using the LL(1) table. Returns step-by-step trace of the parse including stack state, remaining input, and action taken at each step.

python
grammar = {
    'E': ['T R'], 'R': ['+ T R', 'ε'],
    'T': ['F Y'], 'Y': ['* F Y', 'ε'],
    'F': ['( E )', 'i']
}
result = cd.ll1_parse(
    grammar,
    tokens=['i', '+', 'i'],
    start='E'
)

print(result['accepted'])  # True

for step in result['steps']:
    print(step['stack'], '|', step['input'], '|', step['action'])

↳ Each step dict contains

Key      Type   Description
stack    list   Current parse stack (top-first)
input    list   Remaining input tokens
action   str    Action taken: Match X, A → prod, ACCEPT, or ERROR
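The driver loop behind a table-driven LL(1) parser fits in a few lines. This sketch takes a prebuilt table mapping (non-terminal, lookahead) to a production; the tiny hand-built table below (using T → i directly) is a simplified assumption for illustration, not the library's table builder.

```python
def ll1_drive(table, start, tokens, nonterminals):
    stack, inp, steps = ['$', start], tokens + ['$'], []
    while stack:
        top, look = stack[-1], inp[0]
        if top == look == '$':
            steps.append('ACCEPT')
            return True, steps
        if top == look:                          # terminal on top: match it
            stack.pop()
            inp = inp[1:]
            steps.append(f'Match {look}')
        elif top in nonterminals:                # expand via the table
            prod = table.get((top, look))
            if prod is None:
                steps.append('ERROR')
                return False, steps
            stack.pop()
            for sym in reversed(prod):           # push RHS right-to-left
                if sym != 'ε':
                    stack.append(sym)
            steps.append(f"{top} → {' '.join(prod)}")
        else:
            steps.append('ERROR')
            return False, steps
    return False, steps

# Hypothetical table fragment for E -> T R, R -> + T R | ε, T -> i:
table = {('E', 'i'): ['T', 'R'], ('T', 'i'): ['i'],
         ('R', '+'): ['+', 'T', 'R'], ('R', '$'): ['ε']}
ok, trace = ll1_drive(table, 'E', ['i', '+', 'i'], {'E', 'R', 'T'})
print(ok)   # True
```

Pushing the production right-to-left keeps its leftmost symbol on top of the stack, which is what makes the parse leftmost-derivation order.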
06

Shift-Reduce Parsing

Bottom-up parsing simulation
cd.shift_reduce_parse
shift_reduce_parse(productions: list, tokens: list) → dict
LR · bottom-up

Simulates shift-reduce parsing using handle-finding. Productions are passed as a list of (lhs, rhs) tuples where rhs is a list of symbols. The first production's LHS is treated as the start symbol.

python
productions = [
    ('E', ['E', '+', 'T']),
    ('E', ['T']),
    ('T', ['T', '*', 'F']),
    ('T', ['F']),
    ('F', ['id'])
]

result = cd.shift_reduce_parse(productions, tokens=['id', '+', 'id'])

print(result['accepted'])  # True

for step in result['steps']:
    print(f"Stack: {step['stack']}  Input: {step['input']}  Action: {step['action']}")
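A naive shift-reduce loop can be sketched as: reduce whenever the top of the stack matches some right-hand side (preferring longer handles), otherwise shift. This standalone illustration works for the example above but is an assumption-laden simplification; without real handle detection it can mis-parse other grammars, and it assumes no cyclic unit productions.

```python
def shift_reduce(productions, tokens):
    start = productions[0][0]
    # Try longer right-hand sides first so E + T wins over a bare T.
    prods = sorted(productions, key=lambda p: -len(p[1]))
    stack, inp, steps = [], list(tokens), []
    while True:
        reduced = False
        for lhs, rhs in prods:
            if len(stack) >= len(rhs) and stack[-len(rhs):] == list(rhs):
                del stack[-len(rhs):]            # pop the handle
                stack.append(lhs)                # replace with its LHS
                steps.append(f"Reduce {lhs} -> {' '.join(rhs)}")
                reduced = True
                break
        if reduced:
            continue
        if not inp:
            break                                # nothing left to shift
        stack.append(inp.pop(0))
        steps.append(f"Shift {stack[-1]}")
    return stack == [start], steps

productions = [('E', ['E', '+', 'T']), ('E', ['T']),
               ('T', ['T', '*', 'F']), ('T', ['F']), ('F', ['id'])]
ok, steps = shift_reduce(productions, ['id', '+', 'id'])
print(ok)   # True
```

Real LR parsers avoid the guesswork by consulting an ACTION table built from the item-set automaton (section 08).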
07

Leading & Trailing

Operator precedence parsing sets
cd.compute_leading_trailing
compute_leading_trailing(grammar: dict) → dict
sets · operator precedence

Computes LEADING and TRAILING sets used in operator-precedence (Floyd) parsing. LEADING(A) = terminals that can appear leftmost in any derivation of A. TRAILING(A) = terminals that can appear rightmost.

python
grammar = {
    'E': ['T R'], 'R': ['+ T R', 'ε'],
    'T': ['F Y'], 'Y': ['* F Y', 'ε'],
    'F': ['( E )', 'i']
}
result = cd.compute_leading_trailing(grammar)

print(result['LEADING'])
# {'E': ['(', '+', '*', 'i'], 'R': ['+'], ...}

print(result['TRAILING'])
# {'E': [')', '+', '*', 'i'], ...}

# Individual functions:
leading  = cd.compute_leading(grammar)
trailing = cd.compute_trailing(grammar)
08

LR(0) Items

Canonical collection · closure · goto
cd.compute_lr0_items
compute_lr0_items(productions: list, start: str) → dict
LR(0) · automata · states

Constructs the canonical collection of LR(0) item sets. Automatically adds the augmented production S' → S. Returns all states with their items and the GOTO transition table between states.

python
productions = [
    ('E', ['E', '+', 'T']),
    ('E', ['T']),
    ('T', ['id'])
]

result = cd.compute_lr0_items(productions, start='E')

print(f"Total states: {result['num_states']}")  # Total states: 6

for state in result['states']:
    print(f"\nState {state['id']}:")
    for item in state['items']:
        print(f"  {item}")  # e.g. "E' -> . E"

for (frm, sym, to) in result['transitions']:
    print(f"  State {frm} --{sym}--> State {to}")

↳ Return value keys

Key               Type             Description
states            list of dicts    Each dict has id (int) and items (list of strings)
transitions       list of tuples   (from_state_id, symbol, to_state_id)
num_states        int              Total number of LR(0) states
augmented_start   str              The added augmented start symbol (e.g. E')
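The two primitives underneath this construction, CLOSURE and GOTO, are compact to write down. In this standalone sketch (not the library's representation) an item is a tuple (lhs, rhs_tuple, dot_position); the full canonical collection is then just repeated GOTO from the closure of the augmented start item.

```python
def closure(items, productions):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(items):
            if dot < len(rhs):
                B = rhs[dot]                      # symbol right of the dot
                for l, r in productions:
                    if l == B and (l, tuple(r), 0) not in items:
                        items.add((l, tuple(r), 0))
                        changed = True
    return items

def goto(items, X, productions):
    # Advance the dot over X in every item that allows it, then close.
    moved = {(l, r, d + 1) for (l, r, d) in items if d < len(r) and r[d] == X}
    return closure(moved, productions)

prods = [('E', ['E', '+', 'T']), ('E', ['T']), ('T', ['id'])]
aug = [("E'", ['E'])] + prods                    # augmented grammar
I0 = closure({("E'", ('E',), 0)}, aug)
print(len(I0))            # 4 items: E'->.E, E->.E+T, E->.T, T->.id
I1 = goto(I0, 'E', aug)
print(len(I1))            # 2 items: E'->E., E->E.+T
```

Collecting every distinct closure reachable this way yields the six states the library reports for this grammar.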
09

Expression Conversion

Infix ↔ Postfix ↔ Prefix
Expression Converters
All 6 pairwise conversions supported
ICG · postfix · prefix · infix

Convert arithmetic expressions between infix (standard), postfix (RPN), and prefix (Polish notation). Supports + - * / % ^ with correct operator precedence and right-associativity for ^.

python
# Infix → Postfix (Reverse Polish Notation)
cd.infix_to_postfix("a + b * c")     # "a b c * +"
cd.infix_to_postfix("(a + b) * c")   # "a b + c *"
cd.infix_to_postfix("a ^ b ^ c")     # "a b c ^ ^"  (right-assoc)

# Infix → Prefix (Polish Notation)
cd.infix_to_prefix("a + b * c")      # "+ a * b c"

# Postfix → Infix
cd.postfix_to_infix("a b c * +")      # "(a + (b * c))"

# Prefix → Infix
cd.prefix_to_infix("+ a * b c")      # "(a + (b * c))"

# Postfix → Prefix
cd.postfix_to_prefix("a b c * +")    # "+ a * b c"

# Prefix → Postfix
cd.prefix_to_postfix("+ a * b c")    # "a b c * +"

# Universal converter
cd.convert_expression(
    "a + b * c",
    from_notation="infix",
    to_notation="postfix"
)  # "a b c * +"

↳ Operator precedence table

Operator   Precedence    Associativity
+  -       1 (lowest)    Left
*  /  %    2             Left
^          3 (highest)   Right
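The precedence table above maps directly onto the classic shunting-yard algorithm. A minimal sketch, independent of the library and assuming space-separated tokens:

```python
PREC  = {'+': 1, '-': 1, '*': 2, '/': 2, '%': 2, '^': 3}
RIGHT = {'^'}                                  # right-associative operators

def to_postfix(expr):
    out, ops = [], []
    for tok in expr.split():
        if tok in PREC:
            # Pop stronger operators; equal strength pops only if
            # the incoming operator is left-associative.
            while (ops and ops[-1] != '(' and
                   (PREC[ops[-1]] > PREC[tok] or
                    (PREC[ops[-1]] == PREC[tok] and tok not in RIGHT))):
                out.append(ops.pop())
            ops.append(tok)
        elif tok == '(':
            ops.append(tok)
        elif tok == ')':
            while ops[-1] != '(':
                out.append(ops.pop())
            ops.pop()                          # discard the '('
        else:                                  # operand
            out.append(tok)
    while ops:
        out.append(ops.pop())
    return ' '.join(out)

print(to_postfix("a + b * c"))      # a b c * +
print(to_postfix("a ^ b ^ c"))      # a b c ^ ^
```

The `RIGHT` set is exactly what makes `a ^ b ^ c` group as `a ^ (b ^ c)` rather than `(a ^ b) ^ c`.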
10

Three-Address Code

Quadruples from an infix expression
cd.generate_three_address_code
generate_three_address_code(expr: str) → dict
IR · 3AC

Converts an infix arithmetic expression into a linear sequence of quadruples. Each quadruple has an operator and at most three addresses: result = arg1 op arg2. Temporaries are named t1, t2, ….

python
tac = cd.generate_three_address_code("a + b * c - d")
cd.show_three_address_code(tac)   # pretty table

for line in tac['pretty']:
    print(line)
# t1 = b * c
# t2 = a + t1
# t3 = t2 - d

print(tac['final'])   # 't3' (where the result lives)

↳ Return value keys

Key          Type           Description
quadruples   list of dict   Each has op, arg1, arg2, result
pretty       list of str    Ready-to-print t_k = x op y lines
final        str            Name of the final temporary holding the result
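One common way to emit such code is to evaluate a postfix form with a stack, producing one temporary per operator. This sketch consumes a space-separated postfix string (an assumption; it is not the library's own pipeline, which takes infix directly).

```python
def postfix_to_tac(postfix):
    stack, lines, n = [], [], 0
    for tok in postfix.split():
        if tok in '+-*/%^':
            b, a = stack.pop(), stack.pop()    # right operand popped first
            n += 1
            temp = f"t{n}"
            lines.append(f"{temp} = {a} {tok} {b}")
            stack.append(temp)                 # temporary becomes an operand
        else:
            stack.append(tok)
    return lines, stack[-1]

# "a b c * + d -" is the postfix form of a + b * c - d
lines, final = postfix_to_tac("a b c * + d -")
for line in lines:
    print(line)
# t1 = b * c
# t2 = a + t1
# t3 = t2 - d
print(final)   # t3
```

Popping the right operand first is what keeps non-commutative operators like `-` in the correct order.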
11

DAG Construction

Directed Acyclic Graph of an expression
cd.build_dag
build_dag(expr: str) → dict
optimization CSE

Builds a DAG for an arithmetic expression. Common sub-expressions share a single node automatically, making redundant computations easy to spot.

python
dag = cd.build_dag("a + a * b")
cd.show_dag(dag)

# 'a' appears only once in the node list — shared leaf
print([n['label'] for n in dag['nodes']])
# ['a', 'b', '*', '+']

↳ Return value keys

Key     Type           Description
nodes   list of dict   Each node: id, label, children (indices)
root    int            Index of the root node (whole expression)
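The sharing comes from value numbering: every (op, left, right) triple and every leaf is interned in a dictionary, so a repeated sub-expression maps to an existing node instead of a new one. A standalone sketch over a space-separated postfix string (an assumption; not the library's internals):

```python
def dag_from_postfix(postfix):
    nodes, index = [], {}                  # index: structural key -> node id

    def intern(key, label, children):
        if key not in index:               # only create a node once per key
            index[key] = len(nodes)
            nodes.append({'id': len(nodes), 'label': label,
                          'children': children})
        return index[key]

    stack = []
    for tok in postfix.split():
        if tok in '+-*/':
            r, l = stack.pop(), stack.pop()
            stack.append(intern(('op', tok, l, r), tok, [l, r]))
        else:
            stack.append(intern(('leaf', tok), tok, []))
    return {'nodes': nodes, 'root': stack[-1]}

# "a a b * +" is the postfix form of a + a * b
dag = dag_from_postfix("a a b * +")
print([n['label'] for n in dag['nodes']])   # ['a', 'b', '*', '+'] — 'a' shared
```

Extending the key to canonicalize commutative operators (sorting children of `+` and `*`) would catch even more common sub-expressions.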
12

Pretty-Printers

Clean, tabular output for every result
ℹ️
Every core function has a matching show_* helper that prints an ASCII table / parse trace and returns the same object it received — so you can chain calls naturally: cd.show_lexical(cd.lexical_analyzer(code)).
Pretty-printer cheat sheet
show_*(result) → same object
output teaching
Topic                Core function                               Pretty-printer
Lexical              lexical_analyzer                            show_lexical
Symbol table         build_symbol_table                          show_symbol_table
Grammar transforms   eliminate_left_recursion / left_factoring   show_grammar
Ambiguity            check_ambiguity                             show_ambiguity
FIRST / FOLLOW       compute_first_follow                        show_first_follow
LEADING / TRAILING   compute_leading_trailing                    show_leading_trailing
LL(1) table          build_ll1_table                             show_ll1_table
Parse trace          ll1_parse / shift_reduce_parse              show_parse_trace
LR(0) items          compute_lr0_items                           show_lr0
Three-address code   generate_three_address_code                 show_three_address_code
DAG                  build_dag                                   show_dag
python
# End-to-end example — every step shows a clean table
import compilerdesign as cd

src = "int add(int a, int b) { int c = a + b * 2; return c; }"

cd.show_lexical(cd.lexical_analyzer(src))
cd.show_symbol_table(cd.build_symbol_table(src))
cd.show_three_address_code(cd.generate_three_address_code("a + b * 2"))
cd.show_dag(cd.build_dag("a + b * 2"))

g = {'E': ['E + T', 'T'], 'T': ['T * F', 'F'], 'F': ['( E )', 'id']}
g = cd.eliminate_left_recursion(g)
cd.show_grammar(g, "NO LEFT RECURSION")
cd.show_first_follow(cd.compute_first_follow(g, start='E'))
cd.show_ll1_table(cd.build_ll1_table(g, start='E'))
cd.show_parse_trace(cd.ll1_parse(g, ['id', '+', 'id'], start='E'))