pythonsqlsqlalchemygraphqlgraphene-python

Limiting SQL query to defined fields/columns in Graphene-SQLAlchemy


This question has been posted as a GH issues under https://github.com/graphql-python/graphene-sqlalchemy/issues/134 but I thought I'd post it here too to tap into the SO crowd.

A full working demo can be found under https://github.com/somada141/demo-graphql-sqlalchemy-falcon.

Consider the following SQLAlchemy ORM class:

class Author(Base, OrmBaseMixin):
    __tablename__ = "authors"

    author_id = sqlalchemy.Column(
        sqlalchemy.types.Integer(),
        primary_key=True,
    )

    name_first = sqlalchemy.Column(
        sqlalchemy.types.Unicode(length=80),
        nullable=False,
    )

    name_last = sqlalchemy.Column(
        sqlalchemy.types.Unicode(length=80),
        nullable=False,
    )

Simply wrapped in an SQLAlchemyObjectType as such:

class TypeAuthor(SQLAlchemyObjectType):
    class Meta:
        model = Author

and exposed through:

author = graphene.Field(
    TypeAuthor,
    author_id=graphene.Argument(type=graphene.Int, required=False),
    name_first=graphene.Argument(type=graphene.String, required=False),
    name_last=graphene.Argument(type=graphene.String, required=False),
)

@staticmethod
def resolve_author(
    args,
    info,
    author_id: Union[int, None] = None,
    name_first: Union[str, None] = None,
    name_last: Union[str, None] = None,
):
    query = TypeAuthor.get_query(info=info)

    if author_id:
        query = query.filter(Author.author_id == author_id)

    if name_first:
        query = query.filter(Author.name_first == name_first)

    if name_last:
        query = query.filter(Author.name_last == name_last)

    author = query.first()

    return author

A GraphQL query such as:

query GetAuthor{
  author(authorId: 1) {
    nameFirst
  }
}

will cause the following raw SQL to be emitted (taken from the echo logs of the SQLA engine):

SELECT authors.author_id AS authors_author_id, authors.name_first AS authors_name_first, authors.name_last AS authors_name_last
FROM authors
WHERE authors.author_id = ?
 LIMIT ? OFFSET ?
2018-05-24 16:23:03,669 INFO sqlalchemy.engine.base.Engine (1, 1, 0)

As one can see we may only want the nameFirst field, i.e., the name_first column but the entire row is fetched. Of course the GraphQL response only contains the requested fields, i.e.,

{
  "data": {
    "author": {
      "nameFirst": "Robert"
    }
  }
}

but we have still fetched the entire row, which becomes a major issue when dealing with wide tables.

Is there a way to automagically communicate which columns are needed to SQLAlchemy so as preclude this form of over-fetching?


Solution

  • My question was answered on the GitHub issue (https://github.com/graphql-python/graphene-sqlalchemy/issues/134).

    The idea is to identify the requested fields out of the info argument (of type graphql.execution.base.ResolveInfo) that gets passed to the resolver function through a get_field_names function such as the one below:

    def get_field_names(info):
        """
        Parses a query info into a list of composite field names.
        For example the following query:
            {
              carts {
                edges {
                  node {
                    id
                    name
                    ...cartInfo
                  }
                }
              }
            }
            fragment cartInfo on CartType { whatever }
    
        Will result in an array:
            [
                'carts',
                'carts.edges',
                'carts.edges.node',
                'carts.edges.node.id',
                'carts.edges.node.name',
                'carts.edges.node.whatever'
            ]
        """
    
        fragments = info.fragments
    
        def iterate_field_names(prefix, field):
            name = field.name.value
    
            if isinstance(field, FragmentSpread):
                _results = []
                new_prefix = prefix
                sub_selection = fragments[field.name.value].selection_set.selections
            else:
                _results = [prefix + name]
                new_prefix = prefix + name + "."
                if field.selection_set:
                    sub_selection = field.selection_set.selections
                else:
                    sub_selection = []
    
            for sub_field in sub_selection:
                _results += iterate_field_names(new_prefix, sub_field)
    
            return _results
    
        results = iterate_field_names('', info.field_asts[0])
    
        return results
    

    The above function was taken from https://github.com/graphql-python/graphene/issues/348#issuecomment-267717809. That issue contains other versions of this function but I felt this was the most complete.

    and using the identified fields to limit the retrieved fields in the SQLAlchemy query as such:

    fields = get_field_names(info=info)
    query = TypeAuthor.get_query(info=info).options(load_only(*relation_fields))
    

    When applied to the above example query:

    query GetAuthor{
      author(authorId: 1) {
        nameFirst
      }
    }
    

    The get_field_names function would return ['author', 'author.nameFirst']. However, as the 'original' SQLAlchemy ORM fields are snake-cased the get_field_names query needs to be updated to remove the author prefix and convert the fieldnames via the graphene.utils.str_converters.to_snake_case function.

    Long story short, the above approach yields a raw SQL query like this:

    INFO:sqlalchemy.engine.base.Engine:SELECT authors.author_id AS authors_author_id, authors.name_first AS authors_name_first
    FROM authors
    WHERE authors.author_id = ?
     LIMIT ? OFFSET ?
    2018-06-09 13:22:16,396 INFO sqlalchemy.engine.base.Engine (1, 1, 0)
    

    Update

    Should anyone land here wondering about implementation I got around to implementing my own version of the get_query_fields function as such:

    from typing import List, Dict, Union, Type
    
    import graphql
    from graphql.language.ast import FragmentSpread
    from graphql.language.ast import Field
    from graphene.utils.str_converters import to_snake_case
    import sqlalchemy.orm
    
    from demo.orm_base import OrmBaseMixin
    
    def extract_requested_fields(
        info: graphql.execution.base.ResolveInfo,
        fields: List[Union[Field, FragmentSpread]],
        do_convert_to_snake_case: bool = True,
    ) -> Dict:
        """Extracts the fields requested in a GraphQL query by processing the AST
        and returns a nested dictionary representing the requested fields.
    
        Note:
            This function should support arbitrarily nested field structures
            including fragments.
    
        Example:
            Consider the following query passed to a resolver and running this
            function with the `ResolveInfo` object passed to the resolver.
    
            >>> query = "query getAuthor{author(authorId: 1){nameFirst, nameLast}}"
            >>> extract_requested_fields(info, info.field_asts, True)
            {'author': {'name_first': None, 'name_last': None}}
    
        Args:
            info (graphql.execution.base.ResolveInfo): The GraphQL query info passed
                to the resolver function.
            fields (List[Union[Field, FragmentSpread]]): The list of `Field` or
                `FragmentSpread` objects parsed out of the GraphQL query and stored
                in the AST.
            do_convert_to_snake_case (bool): Whether to convert the fields as they
                appear in the GraphQL query (typically in camel-case) back to
                snake-case (which is how they typically appear in ORM classes).
    
        Returns:
            Dict: The nested dictionary containing all the requested fields.
        """
    
        result = {}
        for field in fields:
    
            # Set the `key` as the field name.
            key = field.name.value
    
            # Convert the key from camel-case to snake-case (if required).
            if do_convert_to_snake_case:
                key = to_snake_case(name=key)
    
            # Initialize `val` to `None`. Fields without nested-fields under them
            # will have a dictionary value of `None`.
            val = None
    
            # If the field is of type `Field` then extract the nested fields under
            # the `selection_set` (if defined). These nested fields will be
            # extracted recursively and placed in a dictionary under the field
            # name in the `result` dictionary.
            if isinstance(field, Field):
                if (
                    hasattr(field, "selection_set") and
                    field.selection_set is not None
                ):
                    # Extract field names out of the field selections.
                    val = extract_requested_fields(
                        info=info,
                        fields=field.selection_set.selections,
                    )
                result[key] = val
            # If the field is of type `FragmentSpread` then retrieve the fragment
            # from `info.fragments` and recursively extract the nested fields but
            # as we don't want the name of the fragment appearing in the result
            # dictionary (since it does not match anything in the ORM classes) the
            # result will simply be result of the extraction.
            elif isinstance(field, FragmentSpread):
                # Retrieve referened fragment.
                fragment = info.fragments[field.name.value]
                # Extract field names out of the fragment selections.
                val = extract_requested_fields(
                    info=info,
                    fields=fragment.selection_set.selections,
                )
                result = val
    
        return result
    

    which parses the AST into a dict preserving the structure of the query and (hopefully) matching the structure of the ORM.

    Running the info object of a query like:

    query getAuthor{
      author(authorId: 1) {
        nameFirst,
        nameLast
      }
    }
    

    produces

    {'author': {'name_first': None, 'name_last': None}}
    

    while a more complex query like this:

    query getAuthor{
      author(nameFirst: "Brandon") {
        ...authorFields
        books {
          ...bookFields
        }
      }
    }
    
    fragment authorFields on TypeAuthor {
      nameFirst,
      nameLast
    }
    
    fragment bookFields on TypeBook {
      title,
      year
    }
    

    produces:

    {'author': {'books': {'title': None, 'year': None},
      'name_first': None,
      'name_last': None}}
    

    Now these dictionaries can be used to define what is a field on the primary-table (Author in this case) as they'll have a value of None such as name_first or a field on a relationship of that primary-table such as field title on the books relationship.

    A simplistic approach to auto-applying those fields can take the form of the following function:

    def apply_requested_fields(
        info: graphql.execution.base.ResolveInfo,
        query: sqlalchemy.orm.Query,
        orm_class: Type[OrmBaseMixin]
    ) -> sqlalchemy.orm.Query:
        """Updates the SQLAlchemy Query object by limiting the loaded fields of the
        table and its relationship to the ones explicitly requested in the GraphQL
        query.
    
        Note:
            This function is fairly simplistic in that it assumes that (1) the
            SQLAlchemy query only selects a single ORM class/table and that (2)
            relationship fields are only one level deep, i.e., that requestd fields
            are either table fields or fields of the table relationship, e.g., it
            does not support fields of relationship relationships.
    
        Args:
            info (graphql.execution.base.ResolveInfo): The GraphQL query info passed
                to the resolver function.
            query (sqlalchemy.orm.Query): The SQLAlchemy Query object to be updated.
            orm_class (Type[OrmBaseMixin]): The ORM class of the selected table.
    
        Returns:
            sqlalchemy.orm.Query: The updated SQLAlchemy Query object.
        """
    
        # Extract the fields requested in the GraphQL query.
        fields = extract_requested_fields(
            info=info,
            fields=info.field_asts,
            do_convert_to_snake_case=True,
        )
    
        # We assume that the top level of the `fields` dictionary only contains a
        # single key referring to the GraphQL resource being resolved.
        tl_key = list(fields.keys())[0]
        # We assume that any keys that have a value of `None` (as opposed to
        # dictionaries) are fields of the primary table.
        table_fields = [
            key for key, val in fields[tl_key].items()
            if val is None
        ]
    
        # We assume that any keys that have a value being a dictionary are
        # relationship attributes on the primary table with the keys in the
        # dictionary being fields on that relationship. Thus we create a list of
        # `[relatioship_name, relationship_fields]` lists to be used in the
        # `joinedload` definitions.
        relationship_fieldsets = [
            [key, val.keys()]
            for key, val in fields[tl_key].items()
            if isinstance(val, dict)
        ]
    
        # Assemble a list of `joinedload` definitions on the defined relationship
        # attribute name and the requested fields on that relationship.
        options_joinedloads = []
        for relationship_fieldset in relationship_fieldsets:
            relationship = relationship_fieldset[0]
            rel_fields = relationship_fieldset[1]
            options_joinedloads.append(
                sqlalchemy.orm.joinedload(
                    getattr(orm_class, relationship)
                ).load_only(*rel_fields)
            )
    
        # Update the SQLAlchemy query by limiting the loaded fields on the primary
        # table as well as by including the `joinedload` definitions.
        query = query.options(
            sqlalchemy.orm.load_only(*table_fields),
            *options_joinedloads
        )
    
        return query