I have some coordinate data; some of it is high precision and some of it is low precision, thanks to multiple data sources and other operational realities. I want a column that indicates the relative precision of the coordinates. So far, what I want is essentially to count the digits after the decimal point; in my case, more digits indicates higher-precision data. I usually get data like that in the example: it either comes with five to six digits of precision or just one digit. Both are useful, but we can do more analysis on higher-precision data, as you may imagine.
This code does what I want, but it seems ... wordy and inelegant, as if I'm being paid by the line of code. Is there a simpler way to do this?
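import polars as pl
df = pl.DataFrame({
    "lat": [ 43.6425047, 43.6, 40.688966, 40.6],
    "lng": [-79.3861057, -79.3, -74.044438, -74.0],
})
df = (df.with_columns(
        pl.col("lat").cast(pl.String)
        .str.split_exact(".", 1)
        .struct.rename_fields(["lat_major", "lat_minor"])
        .alias("lat_fields"))
    .unnest("lat_fields")
    .with_columns(pl.col("lat_minor").str.len_chars().alias("lat_precision"))
    .drop("lat_major", "lat_minor")
    .with_columns(
        pl.col("lng").cast(pl.String)
        .str.split_exact(".", 1)
        .struct.rename_fields(["lng_major", "lng_minor"])
        .alias("lng_fields"))
    .unnest("lng_fields")
    .with_columns(pl.col("lng_minor").str.len_chars().alias("lng_precision"))
    .drop("lng_major", "lng_minor")
    .with_columns((pl.col("lat_precision") + pl.col("lng_precision")).alias("precision"))
    .drop("lat_precision", "lng_precision")
)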
results in
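shape: (4, 3)
┌───────────┬────────────┬───────────┐
│ lat       ┆ lng        ┆ precision │
│ ---       ┆ ---        ┆ ---       │
│ f64       ┆ f64        ┆ u32       │
╞═══════════╪════════════╪═══════════╡
│ 43.642505 ┆ -79.386106 ┆ 14        │
│ 43.6      ┆ -79.3      ┆ 2         │
│ 40.688966 ┆ -74.044438 ┆ 12        │
│ 40.6      ┆ -74.0      ┆ 2         │
└───────────┴────────────┴───────────┘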
Later I might pull out records with a precision over 5, for instance, as my source data tends to have either one decimal place or four or more decimal places of precision per coordinate.
You can extract the minor fields directly without the need for temp columns and unnesting.
df.with_columns(
    pl.col("lat", "lng").cast(pl.String)
    .str.split_exact(".", 1).struct.field("field_1").str.len_chars().name.suffix("_minor")
)
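shape: (4, 4)
┌───────────┬────────────┬───────────┬───────────┐
│ lat       ┆ lng        ┆ lat_minor ┆ lng_minor │
│ ---       ┆ ---        ┆ ---       ┆ ---       │
│ f64       ┆ f64        ┆ u32       ┆ u32       │
╞═══════════╪════════════╪═══════════╪═══════════╡
│ 43.642505 ┆ -79.386106 ┆ 7         ┆ 7         │
│ 43.6      ┆ -79.3      ┆ 1         ┆ 1         │
│ 40.688966 ┆ -74.044438 ┆ 6         ┆ 6         │
│ 40.6      ┆ -74.0      ┆ 1         ┆ 1         │
└───────────┴────────────┴───────────┴───────────┘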
We're using a single pl.col("lat", "lng") call here, which goes through an "expansion" step: pl.col("lat", "lng").foo().bar() is expanded into individual expressions, one per column, equivalent to writing pl.col("lat").foo().bar() and pl.col("lng").foo().bar() separately.
pl.sum_horizontal() can be used if you just want the totals.
df.with_columns(
    pl.sum_horizontal(pl.col("lat", "lng").cast(pl.String)
        .str.split_exact(".", 1).struct.field("field_1").str.len_chars()).alias("precision")
)
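shape: (4, 3)
┌───────────┬────────────┬───────────┐
│ lat       ┆ lng        ┆ precision │
│ ---       ┆ ---        ┆ ---       │
│ f64       ┆ f64        ┆ u32       │
╞═══════════╪════════════╪═══════════╡
│ 43.642505 ┆ -79.386106 ┆ 14        │
│ 43.6      ┆ -79.3      ┆ 2         │
│ 40.688966 ┆ -74.044438 ┆ 12        │
│ 40.6      ┆ -74.0      ┆ 2         │
└───────────┴────────────┴───────────┘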