Hi have a dataframe where I have a column called tags which is a json string.
I want to filter this dataframe on the tags column so it only contains rows where a certain tag key is present or where a tag has a particular value.
I guess I could do a string contains match but think it may be more robust to have the json convert into a dict first and using has_key etc ?
What would be the recommended way to do this in python polars ?
Thanks
Polars does not have a generic dictionary type. Instead, dictionaries are imported/mapped as structs. Each dictionary key is mapped to a struct 'field name', and the corresponding dictionary value becomes the value of this field.
However, there are some constraints for creating a Series of type struct. Two of them are:
In your description, you mention has_key
, which indicates that the dictionaries will not have the same keys. As such, creating a column of struct from your dictionaries will not work. (For more information, you can see this Stack Overflow response.)
json_path_match
I suggest using json_path_match
, which extracts values based on some simple JSONPath syntax. Using JSONPath syntax, you should be able to query whether a key exists, and retrieve it's value. (For simple unnested dictionaries, these are the same query.)
For example, let's start with this data:
import polars as pl
json_list = [
"""{"name": "Maria",
"position": "developer",
"office": "Seattle"}""",
"""{"name": "Josh",
"position": "analyst",
"termination_date": "2020-01-01"}""",
"""{"name": "Jorge",
"position": "architect",
"office": "",
"manager_st_dt": "2020-01-01"}""",
]
df = pl.DataFrame(
{
"tags": json_list,
}
).with_row_index("id", 1)
df
shape: (3, 2)
┌─────┬───────────────────────────────────────────┐
│ id ┆ tags │
│ --- ┆ --- │
│ u32 ┆ str │
╞═════╪═══════════════════════════════════════════╡
│ 1 ┆ {"name": "Maria", │
│ ┆ "position": "developer", │
│ ┆ "office": "Seattle"} │
│ 2 ┆ {"name": "Josh", │
│ ┆ "position": "analyst", │
│ ┆ "termination_date": "2020-01-01"} │
│ 3 ┆ {"name": "Jorge", │
│ ┆ "position": "architect", │
│ ┆ "office": "", │
│ ┆ "manager_st_dt": "2… │
└─────┴───────────────────────────────────────────┘
To query for values:
df.with_columns(
pl.col("tags").str.json_path_match(r"$.name").alias("name"),
pl.col("tags").str.json_path_match(r"$.office").alias("location"),
pl.col("tags").str.json_path_match(r"$.manager_st_dt").alias("manager start date"),
)
shape: (3, 5)
┌─────┬───────────────────────────────────────────┬───────┬──────────┬────────────────────┐
│ id ┆ tags ┆ name ┆ location ┆ manager start date │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str ┆ str │
╞═════╪═══════════════════════════════════════════╪═══════╪══════════╪════════════════════╡
│ 1 ┆ {"name": "Maria", ┆ Maria ┆ Seattle ┆ null │
│ ┆ "position": "developer", ┆ ┆ ┆ │
│ ┆ "office": "Seattle"} ┆ ┆ ┆ │
│ 2 ┆ {"name": "Josh", ┆ Josh ┆ null ┆ null │
│ ┆ "position": "analyst", ┆ ┆ ┆ │
│ ┆ "termination_date": "2020-01-01"} ┆ ┆ ┆ │
│ 3 ┆ {"name": "Jorge", ┆ Jorge ┆ ┆ 2020-01-01 │
│ ┆ "position": "architect", ┆ ┆ ┆ │
│ ┆ "office": "", ┆ ┆ ┆ │
│ ┆ "manager_st_dt": "2… ┆ ┆ ┆ │
└─────┴───────────────────────────────────────────┴───────┴──────────┴────────────────────┘
Notice the null
values. This is the return value when a key is not found. We'll use this fact for the has_key
functionality you mentioned.
Also, if we look at the "location" column, you'll see that json_path_match
does distinguish between an empty string "office":""
and a key not found..
To filter for the presence of a key, we simply filter for null
values.
df.filter(
pl.col("tags").str.json_path_match(r"$.manager_st_dt").is_not_null()
)
shape: (1, 2)
┌─────┬───────────────────┐
│ id ┆ tags │
│ --- ┆ --- │
│ u32 ┆ str │
╞═════╪═══════════════════╡
│ 3 ┆ {"name": "Jorge", │
│ ┆ "posit... │
└─────┴───────────────────┘
The json_path_match
will also work with nested structures. (See the Syntax page for details.)
One limitation, however: json_path_match
will only return the first match for a query, rather than a list of matches. If your JSON strings are not lists or nested dictionaries, this won't be a problem.