If I go into the Databricks SQL editor and run a count against my table, it reports more rows than the external links return.
If I do this from the Azure Databricks webpage, it says there are over 9k rows.
select count(distinct oneColumn) from myTable
When I run it via the .NET API like this, it says 6,504 rows.
var sql = "SELECT DISTINCT oneColumn from myTable";
var statement = SqlStatement.Create(sql, warehouseId);
statement.Disposition = SqlStatementDisposition.EXTERNAL_LINKS;
using var client = DatabricksClient.CreateClient(baseUrl, token);
var result = await client.SQL.StatementExecution.Execute(statement);
The result.Result.ExternalLinks is a list with just a single item. If it's just one link, then I'd expect the row count to match. Am I misunderstanding how to use external links?
I have to use external links because the amount of data is too large to query directly.
The external_links result has the following format:
"external_links": [
{
"chunk_index": 0,
"row_offset": 0,
"row_count": 100,
"byte_count": 293,
"next_chunk_index",
"next_chunk_internal_link",
"external_link": "https://someplace.cloud-provider.com/very/long/path/...",
"expiration": "2023-01-30T22:23:23.140Z"
}
]
next_chunk_index gives the index of the next chunk, and next_chunk_internal_link provides the link to fetch that chunk's data. If both are absent, there are no more chunks left. You will also find the row_count of each chunk under the chunks field, in addition to each external_links entry.
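If you want to verify where the rest of your rows are, you can walk the chunks yourself by following next_chunk_internal_link. Below is a minimal sketch using HttpClient against the Statement Execution REST endpoints (the method name SumRowCountsAsync and the firstChunkPath parameter are illustrative; baseUrl and token are the same values you pass to DatabricksClient.CreateClient):

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text.Json;
using System.Threading.Tasks;

// Walks every chunk of a statement result by following next_chunk_internal_link
// and sums the row_count values, using the field names from the JSON above.
static async Task<long> SumRowCountsAsync(string baseUrl, string token, string firstChunkPath)
{
    using var api = new HttpClient { BaseAddress = new Uri(baseUrl) };
    api.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", token);

    long totalRows = 0;
    // e.g. "/api/2.0/sql/statements/<statement_id>/result/chunks/0"
    string? nextPath = firstChunkPath;

    while (nextPath != null)
    {
        var json = await api.GetStringAsync(nextPath);
        using var doc = JsonDocument.Parse(json);

        nextPath = null;
        foreach (var link in doc.RootElement.GetProperty("external_links").EnumerateArray())
        {
            totalRows += link.GetProperty("row_count").GetInt64();

            // Only present when another chunk follows this one.
            if (link.TryGetProperty("next_chunk_internal_link", out var next))
                nextPath = next.GetString();
        }
    }

    return totalRows;
}

If that total is larger than the row_count of the first link, the remaining rows are simply sitting in the later chunks.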
The row counts you are getting here differ because the two queries are not the same: the first query gives you the count of distinct rows, while the second gives you the distinct rows themselves.
Say there are 500,000 distinct rows; then the second query will report a row_count of 500,000.
Copy the external link you got and make another GET request using that link. Save the response as a CSV file; you will get the entire result set in that file, with the column values running from 0 to 499,999.
When you run the first statement, the row count will be 1 and the single value will be 500,000, which is also what you will see in the downloaded file. Download and check the file: the value will match the number of rows returned by the second query.
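For the download step, a plain GET with HttpClient is enough; here is a sketch (DownloadChunkAsync and the output path are illustrative names, and the URL is the external_link value from the response; since the link is pre-signed, no Databricks Authorization header is attached):

using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Downloads the data behind one pre-signed external_link and writes it to a local file.
static async Task DownloadChunkAsync(string externalLink, string outputPath)
{
    // The URL itself carries temporary credentials, so fetch it before the
    // "expiration" timestamp in the response passes.
    using var http = new HttpClient();
    await using var source = await http.GetStreamAsync(externalLink);
    await using var target = File.Create(outputPath);
    await source.CopyToAsync(target);
}

// Usage (the ExternalLink property name is assumed from the response field external_link):
// await DownloadChunkAsync(result.Result.ExternalLinks[0].ExternalLink, "chunk0.csv");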