pythoncontext-free-grammarformal-languagesautomata-theory

How to prevent duplications when generating strings from context free grammar


I need to define context free grammar and generate strings from this language.

Grammar for a language that contains all strings with matched parentheses:

s->ss
s->(s)
s->()

Examples: (), ()(), (()),…

For generating strings I use usual Breadth-first search. Root is s. Production rules are the edges. Each terminal node is a string.

Problem: At some point there exist node: sss. From this node there exists 3 different edges to get the same node ssss:

[s->ss]ss = s[s->ss]s=ss[s->ss]

What is the right approach to prune duplication edges?

  1. Disambiguate the grammar. How to do it for this language?
  2. Change generating procedure (BFS)?

Code:

from collections import deque

# rules
rules1 = [('s', 'ss'), 
         ('s', '(s)'),
          ('s', '()')]

# tree
def get_edges(node, rules):
  edges = []
  for count, item in enumerate(node):
    for rule in rules:
      if item == rule[0]:
        edges.append((count, rule[1], rule[0]))

  return edges

def get_nodes(node, edges, rules):
  nodes = map(lambda edge: node[:edge[0]] + edge[1] + node[edge[0]+len(edge[2]):], edges)
  
  return list(nodes)

def get_adjacent_nodes(node, rules):
  edges = get_edges(node, rules)
  nodes = get_nodes(node, edges, rules) 

  return nodes

# search
def BFS_simple(G, root, goal, max_nodes_visited=10 ** 4):
  Q = deque()
  explored = set()

  explored.add(root)
  Q.append(root)

  log_total_nodes_visited = 0
  log_number_of_duplications = 0

  while len(Q) > 0:
    v = Q.popleft()

    if v == goal:
      print('total nodes visited = ', log_total_nodes_visited, 'duplications = ', log_number_of_duplications)
      return v

    for node in G(v):
      if node not in explored:
        explored.add(node)
        Q.append(node)

      else:
        log_number_of_duplications += 1

    log_total_nodes_visited += 1
    if log_total_nodes_visited > max_nodes_visited:
      return False

  return False

BFS_simple(lambda node: get_adjacent_nodes(node, rules1), 's', '(()()())')

Result: total nodes visited = 1244 duplications = 6527


Solution

  • The Wikipedia entry on BFS says: (emphasis added)

    Breadth-first search can be generalized to graphs, when the start node (sometimes referred to as a 'search key') is explicitly given, and precautions are taken against following a vertex twice.

    and the pseudocode implementation includes such a precaution (see lines 10 and 11 reproduced below):

      9  for all edges from v to w in G.adjacentEdges(v) do
      10   if w is not labeled as explored then
      11     label w as explored
      12     Q.enqueue(w)
    

    It does not mention how to "label w as explored"; two common strategies are to use an additional data structure which maps nodes to a Boolean (either a dictionary or an array, depending on how nodes are identified) or to augment the node data structure with an additional Boolean field.

    That will work for any grammar, but it is not necessary fo unambiguous grammars. Since it is trivial to find an unambiguous grammar for a Dyck language, you could save some time by using it instead of rejecting duplicate results. But I would suggest writing the graph-oriented BFS any way, because the overhead is slight and it allows a wider range of possible grammars.