## Quick Start

Installation & first usage
**1. Install the package**

Requires Python 3.8 or later. No external dependencies.

```bash
pip install compilerdesign
```
**2. Import as `cd`**

Use the conventional alias `cd` for brevity.

```python
import compilerdesign as cd
```
**3. Try your first analysis**

```python
import compilerdesign as cd

# Tokenize C-style code
result = cd.lexical_analyzer("int x = 10 + y;")
print(result['summary'])

# Convert expression notation
print(cd.infix_to_postfix("a + b * c"))  # a b c * +
```
**4. See the output as a clean table**

Every function has a matching `show_*` pretty-printer. It prints an ASCII table / trace and returns the same object, so you can chain calls.

```python
# Raw dict
result = cd.lexical_analyzer("int x = 10;")

# Pretty table (same dict, printed nicely)
cd.show_lexical(result)
```
`cd.xxx(...)` gives you the raw dict to inspect programmatically; `cd.show_xxx(result)` prints it as a teacher-style table. Jump straight to §12 Pretty-Printers for the full list.
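For example, because every `show_*` helper returns its input unchanged, you can pretty-print and keep working with the same dict in one expression. A minimal sketch, using only the `summary` keys documented below:

```python
import compilerdesign as cd

# show_lexical prints its table and returns the same dict,
# so the result can be used inline.
summary = cd.show_lexical(cd.lexical_analyzer("int x = 10;"))['summary']
print(summary['total_tokens'])
```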
## Lexical Analyzer

Tokenize source code · count token types

Tokenizes C-style source code using regex-based pattern matching. Classifies each token as keyword, identifier, constant, operator, punctuation, string literal, or comment. Returns detailed counts and unique token lists.
code = """
int main() {
int x = 10;
float y = 3.14;
return x + y;
}
"""
result = cd.lexical_analyzer(code)
print(result['summary'])
# {
# 'total_tokens': 24,
# 'total_characters': 52,
# 'total_lines': 7,
# 'keywords': 5,
# 'identifiers': 3,
# 'integer_constants': 1,
# 'float_constants': 1,
# 'operators': 3,
# 'punctuation': 8,
# }
print(result['tokens']) # [('KEYWORD','int'), ('KEYWORD','main'), ...]
print(result['unique_identifiers']) # ['x', 'y']
print(result['keywords_used']) # ['int', 'float', 'return', 'main']
↳ Return value keys
| Key | Type | Description |
|---|---|---|
| tokens | list of tuples | Each element is (token_type, value) |
| summary | dict | Count of each token category + lines + characters |
| unique_identifiers | list | Deduplicated list of all identifier tokens |
| keywords_used | list | Deduplicated list of all keyword tokens found |
Possible token types:

`KEYWORD`, `IDENTIFIER`, `INTEGER_CONSTANT`, `FLOAT_CONSTANT`, `OPERATOR`, `PUNCTUATION`, `STRING_LITERAL`, `COMMENT`, `UNKNOWN`
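Since `result['tokens']` is a plain list of `(token_type, value)` tuples, it is easy to post-process. A minimal sketch, assuming only the return shape documented above:

```python
from collections import Counter

import compilerdesign as cd

result = cd.lexical_analyzer("int x = 10 + y;")

# Tally tokens by type using the (token_type, value) tuples.
counts = Counter(token_type for token_type, _ in result['tokens'])
print(counts.get('IDENTIFIER', 0))

# Pull out just the operator lexemes.
operators = [value for token_type, value in result['tokens']
             if token_type == 'OPERATOR']
print(operators)  # expect something like ['=', '+']
```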
## Symbol Table

Scope-aware declaration tracking

Scans C-like source code and records every declared function, parameter, and variable with its type, kind, scope, and line number. Intended for teaching, not a full type checker.
code = """
int globalVar;
int add(int a, int b) {
int sum = a + b;
return sum;
}
"""
table = cd.build_symbol_table(code)
cd.show_symbol_table(table) # pretty table
print(table['total']) # 5
print(table['entries'][0])
# {'name': 'globalVar', 'type': 'int', 'kind': 'variable', 'scope': 'global', 'line': 2}
↳ Return value keys
| Key | Type | Description |
|---|---|---|
| entries | list of dict | Each entry has name, type, kind, scope, line |
| total | int | Number of unique symbols recorded |
`kind` is one of `function`, `parameter`, or `variable`. `scope` is either `global`, a function name, or `block@<line>`.
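As a quick illustration of how the `entries` list can be consumed, here is a sketch that groups symbols by their recorded scope. It relies only on the keys listed above; the exact scope labels in the output are an expectation, not a guarantee:

```python
from collections import defaultdict

import compilerdesign as cd

code = """
int globalVar;
int add(int a, int b) {
    int sum = a + b;
    return sum;
}
"""

table = cd.build_symbol_table(code)

# Group declared symbols by their recorded scope.
by_scope = defaultdict(list)
for entry in table['entries']:
    by_scope[entry['scope']].append(entry['name'])

for scope, names in by_scope.items():
    print(scope, '->', names)  # e.g. global -> ['globalVar', 'add']
```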
## Grammar Transforms

Left recursion · Left factoring · Ambiguity check

Grammars are plain dicts mapping each non-terminal (string) to a list of production strings. Symbols are space-separated. Use `ε` or `eps` for epsilon.

`eliminate_left_recursion` removes immediate left recursion from each non-terminal. It handles `A → Aα | β` patterns by introducing `A'` non-terminals.
```python
grammar = {
    'E': ['E + T', 'T'],
    'T': ['T * F', 'F'],
    'F': ['( E )', 'id']
}

new_g = cd.eliminate_left_recursion(grammar)
# {
#   'E': ["T E'"],
#   "E'": ["+ T E'", 'ε'],
#   'T': ["F T'"],
#   "T'": ["* F T'", 'ε'],
#   'F': ['( E )', 'id']
# }
```
`left_factoring` eliminates common prefixes in productions, introducing new primed (`'`) non-terminals to factor out the shared prefixes.
```python
grammar = {'A': ['a b', 'a c', 'd']}
factored = cd.left_factoring(grammar)
# {'A': ["a A'", 'd'], "A'": ['b', 'c']}
```
`check_ambiguity` performs heuristic ambiguity detection: it flags duplicate productions and productions that share the same leading symbol. It is not a complete check (ambiguity is undecidable in general).
```python
dup_grammar = {'A': ['a', 'a']}
result = cd.check_ambiguity(dup_grammar)
# {'ambiguous': True, 'issues': ["A has duplicate productions."]}

clean_grammar = {'A': ['a b', 'c']}
result = cd.check_ambiguity(clean_grammar)
# {'ambiguous': False, 'issues': []}
```
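The other heuristic, productions sharing a leading symbol, can be exercised the same way. A sketch; the exact wording of the issue string may differ:

```python
import compilerdesign as cd

# 'A' has two productions starting with the same symbol 'a',
# which the heuristic treats as a potential ambiguity signal.
prefix_grammar = {'A': ['a b', 'a c']}
result = cd.check_ambiguity(prefix_grammar)
print(result['ambiguous'])  # True
print(result['issues'])     # message about the shared leading symbol 'a'
```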
FIRST & FOLLOW
Set computation for predictive parsingComputes both FIRST and FOLLOW sets in one call. The start symbol defaults to the first key in the grammar dict. Returns sorted lists for each non-terminal.
```python
grammar = {
    'E': ['T R'], 'R': ['+ T R', 'ε'],
    'T': ['F Y'], 'Y': ['* F Y', 'ε'],
    'F': ['( E )', 'i']
}

result = cd.compute_first_follow(grammar, start='E')
print(result['FIRST'])
# {'E': ['(', 'i'], 'R': ['+', 'ε'], 'T': ['(', 'i'],
#  'Y': ['*', 'ε'], 'F': ['(', 'i']}
print(result['FOLLOW'])
# {'E': ['$', ')'], 'R': ['$', ')'], 'T': ['$', ')', '+'],
#  'Y': ['$', ')', '+'], 'F': ['$', ')', '*', '+']}
```
↳ Individual functions
```python
grammar = {
    'E': ['T R'], 'R': ['+ T R', 'ε'],
    'T': ['F Y'], 'Y': ['* F Y', 'ε'],
    'F': ['( E )', 'i']
}

first = cd.compute_first(grammar)
follow = cd.compute_follow(grammar, start='E')
```
## LL(1) Predictive Parsing

Table construction · top-down parsing

`build_ll1_table` builds the LL(1) predictive parsing table from the grammar. It detects conflicts and reports whether the grammar is LL(1).
```python
grammar = {
    'E': ['T R'], 'R': ['+ T R', 'ε'],
    'T': ['F Y'], 'Y': ['* F Y', 'ε'],
    'F': ['( E )', 'i']
}

r = cd.build_ll1_table(grammar, start='E')
print(r['is_ll1'])     # True
print(r['conflicts'])  # [] (empty if LL(1))
print(r['table'])      # {('E','('): 'T R', ('E','i'): 'T R', ...}
```
`ll1_parse` parses a list of input tokens using the LL(1) table. It returns a step-by-step trace of the parse: the stack state, the remaining input, and the action taken at each step.
```python
grammar = {
    'E': ['T R'], 'R': ['+ T R', 'ε'],
    'T': ['F Y'], 'Y': ['* F Y', 'ε'],
    'F': ['( E )', 'i']
}

result = cd.ll1_parse(
    grammar,
    tokens=['i', '+', 'i'],
    start='E'
)

print(result['accepted'])  # True
for step in result['steps']:
    print(step['stack'], '|', step['input'], '|', step['action'])
```
↳ Each step dict contains

| Key | Type | Description |
|---|---|---|
| stack | list | Current parse stack (top first) |
| input | list | Remaining input tokens |
| action | str | Action taken: `Match X`, `A → prod`, `ACCEPT`, or `ERROR` |
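The trace pairs naturally with the `show_parse_trace` pretty-printer listed in §12. A sketch that also checks a rejected input, using only the keys documented above:

```python
import compilerdesign as cd

grammar = {
    'E': ['T R'], 'R': ['+ T R', 'ε'],
    'T': ['F Y'], 'Y': ['* F Y', 'ε'],
    'F': ['( E )', 'i']
}

# Malformed input: the trailing '+' has no right operand.
result = cd.show_parse_trace(
    cd.ll1_parse(grammar, tokens=['i', '+'], start='E')
)
if not result['accepted']:
    # The last step records the action where parsing stopped.
    print("rejected at:", result['steps'][-1]['action'])
```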
## Shift-Reduce Parsing

Bottom-up parsing simulation

`shift_reduce_parse` simulates shift-reduce parsing using handle finding. Productions are passed as a list of `(lhs, rhs)` tuples, where `rhs` is a list of symbols. The first production's LHS is treated as the start symbol.
```python
productions = [
    ('E', ['E', '+', 'T']),
    ('E', ['T']),
    ('T', ['T', '*', 'F']),
    ('T', ['F']),
    ('F', ['id'])
]

result = cd.shift_reduce_parse(productions, tokens=['id', '+', 'id'])
print(result['accepted'])  # True
for step in result['steps']:
    print(f"Stack: {step['stack']}  Input: {step['input']}  Action: {step['action']}")
```
Leading & Trailing
Operator precedence parsing setsComputes LEADING and TRAILING sets used in operator-precedence (Floyd) parsing. LEADING(A) = terminals that can appear leftmost in any derivation of A. TRAILING(A) = terminals that can appear rightmost.
```python
grammar = {
    'E': ['T R'], 'R': ['+ T R', 'ε'],
    'T': ['F Y'], 'Y': ['* F Y', 'ε'],
    'F': ['( E )', 'i']
}

result = cd.compute_leading_trailing(grammar)
print(result['LEADING'])
# {'E': ['(', '+', '*', 'i'], 'R': ['+'], ...}
print(result['TRAILING'])
# {'E': [')', '+', '*', 'i'], ...}

# Individual functions:
leading = cd.compute_leading(grammar)
trailing = cd.compute_trailing(grammar)
```
## LR(0) Items

Canonical collection · closure · goto

`compute_lr0_items` constructs the canonical collection of LR(0) item sets. It automatically adds the augmented production `S' → S` and returns all states with their items, plus the GOTO transition table between states.
```python
productions = [
    ('E', ['E', '+', 'T']),
    ('E', ['T']),
    ('T', ['id'])
]

result = cd.compute_lr0_items(productions, start='E')
print(f"Total states: {result['num_states']}")  # Total states: 6

for state in result['states']:
    print(f"\nState {state['id']}:")
    for item in state['items']:
        print(f"  {item}")  # e.g. "E' -> . E"

for (frm, sym, to) in result['transitions']:
    print(f"  State {frm} --{sym}--> State {to}")
```
↳ Return value keys
| Key | Type | Description |
|---|---|---|
| states | list of dicts | Each dict has id (int) and items (list of strings) |
| transitions | list of tuples | (from_state_id, symbol, to_state_id) |
| num_states | int | Total number of LR(0) states |
| augmented_start | str | The added augmented start symbol (e.g. E') |
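Since `transitions` is a flat list of `(from_state, symbol, to_state)` tuples, turning it into a GOTO lookup table takes one dict comprehension. A sketch using only the documented return keys, and assuming state 0 is the initial state:

```python
import compilerdesign as cd

productions = [('E', ['E', '+', 'T']), ('E', ['T']), ('T', ['id'])]
result = cd.compute_lr0_items(productions, start='E')

# GOTO as a dict keyed by (state, symbol).
goto = {(frm, sym): to for frm, sym, to in result['transitions']}

# Follow the chain of states for the viable prefix 'E +'.
state = 0
for symbol in ['E', '+']:
    state = goto[(state, symbol)]
print(state)  # id of the state reached after E, then +
```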
## Expression Conversion

Infix ↔ Postfix ↔ Prefix

Convert arithmetic expressions between infix (standard), postfix (RPN), and prefix (Polish notation). Supports `+ - * / % ^` with correct operator precedence and right-associativity for `^`.
```python
# Infix → Postfix (Reverse Polish Notation)
cd.infix_to_postfix("a + b * c")    # "a b c * +"
cd.infix_to_postfix("(a + b) * c")  # "a b + c *"
cd.infix_to_postfix("a ^ b ^ c")    # "a b c ^ ^" (right-assoc)

# Infix → Prefix (Polish Notation)
cd.infix_to_prefix("a + b * c")     # "+ a * b c"

# Postfix → Infix
cd.postfix_to_infix("a b c * +")    # "(a + (b * c))"

# Prefix → Infix
cd.prefix_to_infix("+ a * b c")     # "(a + (b * c))"

# Postfix → Prefix
cd.postfix_to_prefix("a b c * +")   # "+ a * b c"

# Prefix → Postfix
cd.prefix_to_postfix("+ a * b c")   # "a b c * +"

# Universal converter
cd.convert_expression(
    "a + b * c",
    from_notation="infix",
    to_notation="postfix"
)  # "a b c * +"
```
↳ Operator precedence table
| Operator | Precedence | Associativity |
|---|---|---|
| + - | 1 (lowest) | Left |
| * / % | 2 | Left |
| ^ | 3 (highest) | Right |
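To see the table in action, compare a left-associative chain with a right-associative one. A sketch; the commented outputs follow from the precedence rules above:

```python
import compilerdesign as cd

# Left-associative '-': a - b - c parses as (a - b) - c.
print(cd.infix_to_postfix("a - b - c"))  # expected: "a b - c -"

# Right-associative '^': a ^ b ^ c parses as a ^ (b ^ c).
print(cd.infix_to_postfix("a ^ b ^ c"))  # expected: "a b c ^ ^"

# Mixed precedence: '%' binds tighter than '+'.
print(cd.infix_to_postfix("a + b % c"))  # expected: "a b c % +"
```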
## Three-Address Code

Quadruples from an infix expression

`generate_three_address_code` converts an infix arithmetic expression into a linear sequence of quadruples. Each quadruple has an operator and at most three addresses: `result = arg1 op arg2`. Temporaries are named `t1, t2, …`.
```python
tac = cd.generate_three_address_code("a + b * c - d")
cd.show_three_address_code(tac)  # pretty table

for line in tac['pretty']:
    print(line)
# t1 = b * c
# t2 = a + t1
# t3 = t2 - d

print(tac['final'])  # 't3' (where the result lives)
```
↳ Return value keys
| Key | Type | Description |
|---|---|---|
| quadruples | list of dict | Each has op, arg1, arg2, result |
| pretty | list of str | Ready-to-print t_k = x op y lines |
| final | str | Name of the final temporary holding the result |
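Each quadruple is a plain dict, so the intermediate form can be transformed directly. A minimal sketch, assuming only the `op`/`arg1`/`arg2`/`result` keys listed above:

```python
import compilerdesign as cd

tac = cd.generate_three_address_code("a + b * c - d")

# Re-render each quadruple from its fields, mirroring tac['pretty'].
for quad in tac['quadruples']:
    print(f"{quad['result']} = {quad['arg1']} {quad['op']} {quad['arg2']}")

# Count how many temporaries the expression needed.
temporaries = {q['result'] for q in tac['quadruples']}
print(len(temporaries))  # 3 for this expression
```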
## DAG Construction

Directed Acyclic Graph of an expression

`build_dag` builds a DAG for an arithmetic expression. Common sub-expressions share a single node automatically, making redundant computations easy to spot.
```python
dag = cd.build_dag("a + a * b")
cd.show_dag(dag)

# 'a' appears only once in the node list (shared leaf)
print([n['label'] for n in dag['nodes']])
# ['a', 'b', '*', '+']
```
↳ Return value keys
| Key | Type | Description |
|---|---|---|
| nodes | list of dict | Each node: id, label, children (indices) |
| root | int | Index of the root node (whole expression) |
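Because `children` holds node indices, walking the DAG from `root` is a short recursion. A sketch that reconstructs the expression, assuming interior nodes are binary operators with children in left-to-right order (shared nodes are reached through the same index):

```python
import compilerdesign as cd

dag = cd.build_dag("a + a * b")

def to_expr(dag, idx):
    """Rebuild a parenthesized expression from a DAG node index."""
    node = dag['nodes'][idx]
    if not node['children']:        # leaf: identifier or constant
        return node['label']
    left, right = node['children']  # interior nodes are binary operators
    return f"({to_expr(dag, left)} {node['label']} {to_expr(dag, right)})"

print(to_expr(dag, dag['root']))  # (a + (a * b))
```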
## Pretty-Printers

Clean, tabular output for every result

Every core function has a `show_*` helper that prints an ASCII table or parse trace and returns the same object it received, so you can chain calls naturally: `cd.show_lexical(cd.lexical_analyzer(code))`.
| Topic | Core function | Pretty-printer |
|---|---|---|
| Lexical | lexical_analyzer | show_lexical |
| Symbol table | build_symbol_table | show_symbol_table |
| Grammar transforms | eliminate_left_recursion / left_factoring | show_grammar |
| Ambiguity | check_ambiguity | show_ambiguity |
| FIRST / FOLLOW | compute_first_follow | show_first_follow |
| LEADING / TRAILING | compute_leading_trailing | show_leading_trailing |
| LL(1) table | build_ll1_table | show_ll1_table |
| Parse trace | ll1_parse / shift_reduce_parse | show_parse_trace |
| LR(0) items | compute_lr0_items | show_lr0 |
| Three-address code | generate_three_address_code | show_three_address_code |
| DAG | build_dag | show_dag |
```python
# End-to-end example: every step shows a clean table
import compilerdesign as cd

src = "int add(int a, int b) { int c = a + b * 2; return c; }"

cd.show_lexical(cd.lexical_analyzer(src))
cd.show_symbol_table(cd.build_symbol_table(src))
cd.show_three_address_code(cd.generate_three_address_code("a + b * 2"))
cd.show_dag(cd.build_dag("a + b * 2"))

g = {'E': ['E + T', 'T'], 'T': ['T * F', 'F'], 'F': ['( E )', 'id']}
g = cd.eliminate_left_recursion(g)
cd.show_grammar(g, "NO LEFT RECURSION")
cd.show_first_follow(cd.compute_first_follow(g, start='E'))
cd.show_ll1_table(cd.build_ll1_table(g, start='E'))
cd.show_parse_trace(cd.ll1_parse(g, ['id', '+', 'id'], start='E'))
```