Polars: how to write a column of strings into a txt file without escaping?

I have a .ndjson files with millions of rows. Each row has a field html which contains html strings. I would like to write all such html into a .txt file. One html is into one line of the .txt file. I tried using polars.LazyFrame.sink_csv for its speed:

import polars as pl
import requests
from pathlib import Path

url = "https://raw.githubusercontent.com/leanhdung1994/files/main/processedStep1_enwiktionary_namespace_0_43.ndjson"
workingDir = r"E:\Personal Projects\tmp\tarFiles"
outNdjson = Path(workingDir, "wiktionary.ndjson")
outTxt = Path(workingDir, "wiktionary.txt")

# Download
resp = requests.get(url)
resp.raise_for_status()

# Save
with open(outNdjson, "wb") as f:
    f.write(resp.content)

# Read with Polars
df = pl.scan_ndjson(outNdjson)
print(df.select("html").collect())

df.select("html").sink_csv(outTxt, include_header=False)

The column html is

shape: (23, 1)
┌─────────────────────────────────┐
│ html                            │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ playabilities <link href="enWi… │
│ plecopterans <link href="enWik… │
│ pleiotropies <link href="enWik… │
│ pleochroisms <link href="enWik… │
│ plicometry <link href="enWikti… │
│ …                               │
│ pontil marks <link href="enWik… │
│ poringly <link href="enWiktion… │
│ pornaholics <link href="enWikt… │
│ geronimo <link href="enWiktion… │
│ uncage <link href="enWiktionar… │
└─────────────────────────────────┘

But the resulted .txt file contain escape quotations:

"playabilities <link href=""enWiktionary.css"" rel=""stylesheet"" type=""text/css""/> <script src=""enWiktionary.js"" type=""text/javascript""></script>

I tried the option quote_char='', but it returns an error:

TypeError: ord() expected string of length 1, but NoneType found

Could you explain how to do so?

Solution

The problem is that the data contains both double quotes " and embedded trailing newlines.

import polars as pl

url = "https://raw.githubusercontent.com/leanhdung1994/files/main/processedStep1_enwiktionary_namespace_0_43.ndjson"

df = pl.read_ndjson(url)

df.head(1).select("html").item()[-10:]
# 'html> </>\n'
#           ^^

You want to strip the trailing newlines, and disable quoting in the CSV writer.

(pl.scan_ndjson(url)
   .select(pl.col("html").str.strip_chars_end("\n"))
   .sink_csv(outTxt, include_header=False, line_terminator="\r\n", quote_style="never")
)

Disable quoting

import io
import polars as pl

df = pl.DataFrame({"x": ['"foo"\n']})

f = io.BytesIO()
df.write_csv(f, include_header=False, line_terminator="\r\n")

f.getvalue()
# b'"""foo""\n"\r\n'

quote_style="never" is how to disable quoting.

f = io.BytesIO()
df.write_csv(f, include_header=False, line_terminator="\r\n", quote_style="never")

f.getvalue()
# b'"foo"\n\r\n'

Strip trailing newlines

Note the trailing \n from the data is still there, so you end up with mixed line endings. (\n\r\n)

.strip_chars_end("\n") can be used to remove it before sinking.

f = io.BytesIO()

(df.select(pl.col("x").str.strip_chars_end("\n"))
   .write_csv(f, include_header=False, line_terminator="\r\n", quote_style="never"))

f.getvalue()
# b'"foo"\r\n'