I have some coordinate data; some of it is high precision and some low precision, thanks to multiple data sources and other operational realities. I want a column that indicates the relative precision of the coordinates. So far, what I want is essentially to count the digits after the decimal; in my case, more digits indicates higher-precision data. I usually get data like the example below: it either comes with five to six digits of precision or just one digit. Both are useful, but we can do more analysis on the higher-precision data, as you may imagine.
This code does what I want, but it seems ... wordy, inelegant, as if I'm being paid by the line of code. Is there a simpler way to do this?
import polars as pl

df = pl.DataFrame(
    {
        "lat": [43.6425047, 43.6, 40.688966, 40.6],
        "lng": [-79.3861057, -79.3, -74.044438, -74.0],
    }
)
df = (
    df.with_columns(
        # split lat on the decimal point into a two-field struct
        pl.col("lat").cast(pl.String)
        .str.split_exact(".", 1)
        .struct.rename_fields(["lat_major", "lat_minor"])
        .alias("lat_temp"))
    .unnest("lat_temp")
    .with_columns(
        # count the characters after the decimal point
        pl.col("lat_minor")
        .str.len_chars()
        .alias("lat_precision"))
    .drop("lat_major", "lat_minor")
    .with_columns(
        # repeat the same steps for lng
        pl.col("lng").cast(pl.String)
        .str.split_exact(".", 1)
        .struct.rename_fields(["lng_major", "lng_minor"])
        .alias("lng_temp"))
    .unnest("lng_temp")
    .with_columns(
        pl.col("lng_minor")
        .str.len_chars()
        .alias("lng_precision"))
    .drop("lng_major", "lng_minor")
    .with_columns(
        # combined decimal-digit count across both coordinates
        pl.col("lat_precision")
        .add(pl.col("lng_precision"))
        .alias("precision"))
    .drop("lat_precision", "lng_precision")
)
df.head()
results in
shape: (4, 3)
┌───────────┬────────────┬───────────┐
│ lat ┆ lng ┆ precision │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ u32 │
╞═══════════╪════════════╪═══════════╡
│ 43.642505 ┆ -79.386106 ┆ 14 │
│ 43.6 ┆ -79.3 ┆ 2 │
│ 40.688966 ┆ -74.044438 ┆ 12 │
│ 40.6 ┆ -74.0 ┆ 2 │
└───────────┴────────────┴───────────┘
Later I might pull out records with precision over 5, for instance, since my source data tends to come with either one decimal place or four-plus decimal places per coordinate.
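For reference, that later step would just be a filter on the combined column; a minimal sketch, assuming the precision column built above and the example threshold of 5:
# keep only rows whose combined decimal-digit count exceeds 5
high_precision = df.filter(pl.col("precision") > 5)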
You can extract the minor fields directly, without the need for temp columns and unnesting.
df.with_columns(
    pl.col("lat", "lng").cast(pl.String)
    .str.split_exact(".", 1)
    .struct.field("field_1")
    .str.len_chars()
    .name.suffix("_minor")
)
shape: (4, 4)
┌───────────┬────────────┬───────────┬───────────┐
│ lat ┆ lng ┆ lat_minor ┆ lng_minor │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ u32 ┆ u32 │
╞═══════════╪════════════╪═══════════╪═══════════╡
│ 43.642505 ┆ -79.386106 ┆ 7 ┆ 7 │
│ 43.6 ┆ -79.3 ┆ 1 ┆ 1 │
│ 40.688966 ┆ -74.044438 ┆ 6 ┆ 6 │
│ 40.6 ┆ -74.0 ┆ 1 ┆ 1 │
└───────────┴────────────┴───────────┴───────────┘
We're using a single pl.col("lat", "lng") call here, which goes through an "expansion" step, i.e.
pl.col("lat", "lng").foo().bar()
is expanded into the individual expressions
pl.col("lat").foo().bar(),
pl.col("lng").foo().bar()
pl.sum_horizontal() can be used if you just want the totals.
df.with_columns(
    pl.sum_horizontal(
        pl.col("lat", "lng").cast(pl.String)
        .str.split_exact(".", 1)
        .struct.field("field_1")
        .str.len_chars()
    )
    .alias("precision")
)
shape: (4, 3)
┌───────────┬────────────┬───────────┐
│ lat ┆ lng ┆ precision │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ u32 │
╞═══════════╪════════════╪═══════════╡
│ 43.642505 ┆ -79.386106 ┆ 14 │
│ 43.6 ┆ -79.3 ┆ 2 │
│ 40.688966 ┆ -74.044438 ┆ 12 │
│ 40.6 ┆ -74.0 ┆ 2 │
└───────────┴────────────┴───────────┘