In a standard MS Access SQL query output that does not have any aliases, how do I replace the full names by their "first-letters" aliases?

I have a lot of queries from a bunch of MS Access databases. I read them out and split them into their SQL-blocks by the standard SQL keywords with:

How do I get the full sql code for all queries in all MS Access databases of a folder?
"The big SQL RegEx": How do I RegEx split an easy SQL query (SELECT ... INTO ... FROM ... WHERE ... GROUP BY ... HAVING ... ORDER BY)?

Thus, you need to put the objects from the MS Access databases in an Excel file of that same format to go on with answering this question. You might say that this is too detailed, and yes, it is, but without doing so, you will not get the split of the FROM-block from the rest of the query, which is needed to answer the question at all.

The columns of that input Excel file:

Or altogether:

ID Datenbank Objektname LastUpdated Objekttyp Objektart SourceTableName Abfrage_SQL Fehler Select Into From Where Group_By Having Order_By New SQL Codes Mapping

The main SQL query is in column H (Abfrage_SQL, "Abfrage" means "Query"), while the split into blocks ranges from columns J to P. In the code, you will need column H and column L as the input.

You might get to an answer without this Excel input file and with other code, but you will not get around splitting the Code with some Regex, and I did not want to reinvent the wheel, therefore check the link how to get there. The queries from the MS Access databases at hand do not have aliases.

Mind that queries in your MS Access databases might have aliases. If you have them all the time, you do not need this question. But if they are there only sometimes, you need to change the code of the answer.

Task

I do not want to put the aliases with Regex or even by hand and in many steps. Instead, I want to run Python code on it that does it all in one go.

I need to replace the full table and view names with their standardized aliases. The alias shall be built with the first letters of each name that is split by "_" so that "my_random_table" becomes "mrt". If more than one full name is assigned to an abbreviation, the repeated abbreviation must be given an ascending number.

The full query from MS Access might look like this:

select my_random_table.* from my_random_table

This shall be shortened by aliases like this:

select mrt.* from my_random_table mrt

The Excel input file affords a list of 100+ queries in column H and their FROM-block in column L.

In a standard MS Access SQL query output that does not have any aliases, how do I replace the full names by their "first-letters" aliases? This shall be done with Python code that is run on an Excel input file. This Excel input file can be built with the help of the links listed above.

PS

The tags "excel" and "ms-access" are not the core of the question, they are not even needed. The answer can help in any other SQL settings as well. I make this clear since the answer takes up the output from MS Access and MS Excel, but you will get around that by re-writing the code for another software setting.

Solution

The Python code to get two new columns (New SQL Codes and Mapping) at at the right end of the input Excel file and output that as a new output file, with the first two filled queries (processed_count < 2) as debugging printouts, is:

import pandas as pd
import re

# Define SQL keywords
SQL_KEYWORDS = {
    'SELECT', 'FROM', 'WHERE', 'JOIN', 'INNER', 'OUTER', 'LEFT', 'RIGHT', 
    'ON', 'AND', 'OR', 'GROUP', 'BY', 'ORDER', 'DESC', 'ASC', 'HAVING', 
    'INSERT', 'UPDATE', 'DELETE', 'VALUES', 'SET', 'CREATE', 'DROP', 
    'ALTER', 'TABLE', 'VIEW', 'AS', 'DISTINCT', 'LIKE', 'IN', 'BETWEEN', 
    'EXISTS', 'NULL', 'IS', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
    'UNION', 'ALL', 'LIMIT', 'OFFSET', 'SELECTED', 'EXPLAIN', 'WITH', 
    'INTO'
}

def is_valid_object_name(full_name):
    """Checks if the object name is a table or view."""
    return full_name.upper() not in SQL_KEYWORDS

def generate_short_name(full_name, existing_aliases):
    """Generates a short name from the full table name and ensures it is unique."""
    parts = re.split(r'[_-]+', full_name.strip('[]'))
    short_name = ''.join(part[0] for part in parts if part)  # Generate abbreviation
    short_name = short_name.lower()

    # Ensure uniqueness
    if short_name in existing_aliases:
        count = 2
        new_short_name = f"{short_name}{count}"
        while new_short_name in existing_aliases:
            count += 1
            new_short_name = f"{short_name}{count}"
        short_name = new_short_name

    existing_aliases.add(short_name)
    return short_name

def process_queries(queries, from_clauses):
    """Processes SQL queries and replaces table names with aliases."""
    all_output_queries = []
    all_short_names = []

    processed_count = 0  # To keep track of the number of processed queries

    for query, from_clause in zip(queries, from_clauses):
        if query:  # Debugging for each filled query
            if processed_count < 2:  # Only for the first two filled queries
                print("\n=== Debugging Output for Query ===")
                print("Original Query:", query)
                print("FROM Clause:", from_clause)

        short_names = {}
        existing_aliases = set()  # Collection for already used aliases
        output_query = query.strip()
        found_tables = re.findall(r'(?:^(\w+)$|((?:\[)?\b\w+(?:[-_]\w+)?\b(?:\])?)(?=\s*(?:INNER|JOIN|LEFT|RIGHT|OUTER|ON|FROM)))', from_clause)

        # found_tables = [name[0] if name[0] else name[1] if name[1] else '' for name in table_names]  # Extracting only the first group
        found_tables = [name[0] if name[0] else name[1] if name[1] else '' for name in found_tables]

        if processed_count < 2:  # Debug information for the first two queries
            print("Found Tables:", found_tables)  # Output of found tables

        for full_name in found_tables:
            if is_valid_object_name(full_name) and full_name not in short_names:
                short_name = generate_short_name(full_name, existing_aliases)
                short_names[full_name] = short_name

        if processed_count < 2:  # Debug information for the first two queries
            print("Generated Aliases:", short_names)  # Output of generated aliases

        # Inserting aliases in the FROM block
        for full_name, short_name in short_names.items():
            if full_name.startswith('[') and full_name.endswith(']'):
                # Remove the square brackets for the substitution
                pattern = re.escape(full_name).replace(r'\[', r'\[\b').replace(r'\]', r'\b\]')
            else:
                pattern = r'\b' + re.escape(full_name) + r'\b'

            from_clause = re.sub(
                pattern,
                f'{full_name} {short_name}',  # Full name and alias
                from_clause, count=1
            )
            
            if processed_count < 2:  # Debug information only for the first two queries
                print(f"Inserted: {full_name} with Alias {short_name}")

        # Replacing table names with aliases in the SELECT query
        for full_name, short_name in short_names.items():
            if full_name.startswith('[') and full_name.endswith(']'):
                # Remove the square brackets for the substitution
                pattern = re.escape(full_name).replace(r'\[', r'\[\b').replace(r'\]', r'\b\]')

                # if '[' in full_name:  # Debug information for the first two queries
                #     print("pattern:", pattern)  # Output of pattern
            else:
                pattern = r'\b' + re.escape(full_name) + r'\b'
            output_query = re.sub(pattern, short_name, output_query)

        if processed_count < 2:  # Debug information for the first two queries
            print("Query after Alias Replacements:", output_query)

        # Inserting the FROM block into the complete query
        output_query = re.sub(r'\bFROM\b.*', f'FROM {from_clause}', output_query, count=1)

        if processed_count < 2:  # Debug information for the first two queries
            print("Query after Adding the FROM Clause:", output_query)

        # Column names prefixed with table names should not be changed
        for full_name, short_name in short_names.items():
            if full_name.startswith('[') and full_name.endswith(']'):
                # Remove the square brackets for the substitution
                pattern = re.escape(full_name).replace(r'\[', r'\[\b').replace(r'\]', r'\b\]\.')
            else:
                pattern = r'\b' + re.escape(full_name) + r'\.'
            output_query = re.sub(pattern, f'{short_name}.', output_query)

        if processed_count < 2:  # Debug information for the first two queries
            print("Processed Query:", output_query)

        all_output_queries.append(output_query)  # Add the processed query
        all_short_names.append(short_names)       # Add the abbreviations
        processed_count += 1  # Increment the processed count

    return all_output_queries, all_short_names
    
# Read the Excel file
input_file = 'input_file.xlsx'  # Path to the input file
output_file = 'output_file.xlsx'  # Path to the output file

# Load the Excel file and read relevant data
df = pd.read_excel(input_file)

# Extract SQL queries from column H (index 7) and the FROM clause from column L (index 11)
raw_queries = df.iloc[:, 7].dropna().tolist()  # Column H
from_clauses = df.iloc[:, 11].dropna().tolist()  # Column L

# Process queries
processed_queries, abbreviations = process_queries(raw_queries, from_clauses)

# Add new SQL codes and mappings to DataFrame
df['New SQL Codes'] = ''
df['Mapping'] = ''

processed_index = 0  # Index for processed queries
for i in range(len(df)):  # Loop through original queries
    query = df.at[i, 'Abfrage_SQL']
    if isinstance(query, str) and query:  # Check
        df.at[i, 'New SQL Codes'] = processed_queries[processed_index]

        # Create mapping only for table names, excluding column names
        mapping_str = ', '.join(f"{k}: {v}" for k, v in abbreviations[processed_index].items()
                                 if is_valid_object_name(k) and not '.' in k)  # Only use valid table names

        df.at[i, 'Mapping'] = mapping_str.strip()  # Prepare mapping

        processed_index += 1  # Only increment for processed queries

# Save the new data to an Excel file
with pd.ExcelWriter(output_file, engine='openpyxl') as writer:
    df.to_excel(writer, index=False, sheet_name='SQL_Queries')

print(f'Data has been saved in {output_file}.')

The New SQL Codes column lists all new "aliased" SQL queries. The Mapping column lists all tables and views with their aliases, so that:

SELECT my_random_table.*, my_raw_tbl.* 
FROM my_random_table 
INNER JOIN my_raw_tbl 
ON my_random_table.id = my_raw_tbl.random_tbl_id

..becomes:

SELECT mrt.*, mrt2.* 
FROM my_random_table mrt 
INNER JOIN my_raw_tbl mrt2 
ON mrt.id = mrt2.random_tbl_id

The code can also handle tables embedded in squared brackets, for example "[abc-cba]" will be understood as alias "ac".