Tags: python, palantir-foundry, foundry-data-connection, foundry-contour

How to parse the complete set of records for a dataset through an API call?


How can I get the full set of records of a dataset through a Foundry API call? I want to use the dataset in another Python application outside Foundry, but when I use requests, only the first 300 rows are returned. The API endpoint I am calling is the Contour dataset-preview endpoint.


Solution

  • There are several ways to query datasets in Foundry, depending on the dataset size and use case. Probably the easiest to start with is the data-proxy SQL query endpoint, because you don't have to worry about the underlying file format of the dataset.

    import requests
    import pandas as pd
    
    # Note: the tuple[list, list] annotation requires Python 3.9 or newer.
    def query_foundry_sql(query, token, branch='master', base_url='https://foundry-instance.com') -> tuple[list, list]:
        """
        Queries the dataproxy query API with Spark SQL.
        Example: query_foundry_sql("SELECT * FROM `/path/to/dataset` LIMIT 5000", "ey...")
        Args:
            query: the SQL query
            token: the bearer token used for authentication
            branch: the branch of the dataset / query
            base_url: the base URL of the Foundry instance
    
        Returns: (columns, data) tuple. data contains the data matrix, columns the list of column names.
        Can be converted to a pandas DataFrame:
        pd.DataFrame(data=data, columns=columns)
    
        """
        response = requests.post(f"{base_url}/foundry-data-proxy/api/dataproxy/queryWithFallbacks",
                                 headers={'Authorization': f'Bearer {token}'},
                                 params={'fallbackBranchIds': [branch]},
                                 json={'query': query})
    
        # Fail early on HTTP errors (e.g. 401 for an invalid token)
        response.raise_for_status()
        payload = response.json()
        # Column names come from the schema; the rows are plain lists of values
        columns = [e['name'] for e in payload['foundrySchema']['fieldSchemaList']]
        return columns, payload['rows']
    
    columns, data = query_foundry_sql("SELECT * FROM `/Global/Foundry Operations/Foundry Support/iris` LIMIT 5000",
                                      "ey...",
                                      base_url="https://foundry-instance.com")
    df = pd.DataFrame(data=data, columns=columns)
    df.head(5)
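
If the consuming application should not depend on pandas, the (columns, data) tuple returned above can also be turned into a list of dicts using only the standard library. This is a minimal sketch: `rows_to_records` is a hypothetical helper name, and the sample values below are made up to mimic the shape of the return value of `query_foundry_sql`; they are not real output from Foundry.

```python
def rows_to_records(columns, rows):
    """Pair each row's values with the column names (hypothetical helper)."""
    return [dict(zip(columns, row)) for row in rows]


# Made-up sample shaped like the (columns, data) tuple from query_foundry_sql
columns = ["sepal_length", "species"]
data = [[5.1, "setosa"], [7.0, "versicolor"]]

records = rows_to_records(columns, data)
print(records[0])  # {'sepal_length': 5.1, 'species': 'setosa'}
```

Each record can then be handled like a normal dict, for example `records[0]["species"]`.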