databricksazure-databricks

Using external links with Microsoft.Azure.Databricks.Client does not return all results


If I go into the Databricks sql editor and run a count against my table, it says there are more items than the external links say.

If I do this from the Azure Databricks webpage, it says there are over 9k rows.

select count(distinct oneColumn) from myTable

When I run via the .NET API like this, it says 6,504 rows.

var sql = "SELECT DISTINCT oneColumn from myTable";
    
var statement = SqlStatement.Create(sql, warehouseId);
statement.Disposition = SqlStatementDisposition.EXTERNAL_LINKS;

using var client = DatabricksClient.CreateClient(baseUrl, token);
var result = await client.SQL.StatementExecution.Execute(statement);

the result.Result.ExternalLinks is a list with just a single item. If it's just one link, then I'd expect the row count to match. Am I misunderstanding how to use external links?

I have to use the external links as the amount of data is too large to query directly.


Solution

  • The external_links result is of below format.

        "external_links": [
          {
            "chunk_index": 0,
            "row_offset": 0,
            "row_count": 100,
            "byte_count": 293,
        "next_chunk_index",
        "next_chunk_internal_link",
            "external_link": "https://someplace.cloud-provider.com/very/long/path/...",
            "expiration": "2023-01-30T22:23:23.140Z"
          }
        ]
    

    next_chunk_index give next chunk index and next_chunk_internal_link provides link to fetch the next chunk data, if both are absent then no more chunks left, you will get total row_count under chunks field also in external_link.

    The row counts you getting here will be different for your queries because

    First query gives you count of distinct rows and second gives you distinct rows.

    Let's say i have 500000 distinct rows and i will get row count as 500000 for your second query.

    The external link you got copy it.

    enter image description here

    make another GET request using that link.

    enter image description here

    Now save the file as csv, you will get entire data in that file having id column from 0 to 499999.

    When i query the first statement, the row count will be 1 and the value will be 500000.

    enter image description here

    and value in the downloaded file.

    enter image description here

    You download and check the value in the file, it will match the rows.