I am trying to convert csv (with 10 columns) to vowpal wabbit input format txt file. Some csv columns has integer values and some have strings (like: com.12346.xyz). For example if my csv columns look like this:
loss weight SSD_id weight label imp feat_val
0.693147 0.693147 1 1.0 -1.0000 0.0000 com.12346.xyz
0.419189 0.145231 2 2.0 1.0000 -1.8559 com.12346.xyz
0.235457 0.051725 4 4.0 -1.0000 -2.7588 com.12356.xyz
6.371911 12.508365 8 8.0 -1.0000 -3.7784 com.12346.xyz
3.485084 0.598258 16 16.0 1.0000 -2.2767 com.12346.xyz
1.765249 0.045413 32 32.0 -1.0000 -2.8924 com.1236.xyz
1.017911 0.270573 64 64.0 -1.0000 -3.0438 com.12236.xyz
0.611419 0.204927 128 128.0 1.0000 -3.1539 com.16746.xyz
0.469127 0.326834 256 256.0 -1.0000 -1.6101 com.1946.xyz
0.403473 0.337820 512 512.0 1.0000 -2.8843 com.126.xyz
0.337348 0.271222 1024 1024.0 -1.0000 -2.5209 com.1346.xyz
0.328909 0.320471 2048 2048.0 1.0000 -2.0732 com.1234.xyz
0.309401 0.289892 4096 4096.0 1.0000 -2.7639 com.12396.xyz
and vowpal wabbit input format looks like this:
label weight |i imp SSD_id loss |c feat_val
and inside vowpal wabbit txt file values should be:
-1 0.051725 |i imp:-2.7588 SSD_id:4 loss:0.235457 |c feat_val=com.12356.xyz
1 0.598258 |i imp:-2.7588 SSD_id:4 loss:3.485034 |c feat_val=com.12346.xyz
... and so on... for all the rows values. I have huge number of rows in the csv file that I want to convert to the above format and save all of them in single txt file. I have started with this small function given below:
def to_new_format(document, label=None):
return str(label or '') + ' |i ' + ' '.join(re.findall('\w{3,}', document.lower())) + '\n'
to_new_format(str(text_train[1])
but I am completely lost now after many trials with dataframe, csv formatting and trying functions. Can someone give some direction to me how I can achieve this in minimal lines of code.
This is simpler to do than it might seem, because of some handy ways Pandas lets you treat sequences almost the same as you would single values in Python.
First we'll import your CSV file, treating all values as strings to make our formatting easier:
import pandas as pd
df = pd.read_csv('test_data.txt', dtype=pd.StringDtype())
Your label
column is recorded like 1.0000
in your file, but you don't want the decimal or zeroes in your output. We can fix that using Pandas' str.replace
method.
df.label = df.label.str.replace('.0000', '', regex=False)
And here's the magical part: we can concatenate these just like if they were individual strings!
formatted = (
df.label + ' ' + df.weight +
' |i imp:' + df.imp +
' SSD_id: ' + df.SSD_id +
' loss:' + df.loss +
' |c feat_val=' + df.feat_val +
'\n'
)
That code looks would it would create a string, but because of how it includes the dataframe's columns (each of which is a Pandas sequence), the result is a sequence too:
print(formatted)
0 -1 0.693147 |i imp:0.0000 SSD_id: 1 loss:0.693...
1 1 0.145231 |i imp:-1.8559 SSD_id: 2 loss:0.419...
2 -1 0.051725 |i imp:-2.7588 SSD_id: 4 loss:0.23...
3 -1 12.508365 |i imp:-3.7784 SSD_id: 8 loss:6.3...
4 1 0.598258 |i imp:-2.2767 SSD_id: 16 loss:3.48...
5 -1 0.045413 |i imp:-2.8924 SSD_id: 32 loss:1.7...
6 -1 0.270573 |i imp:-3.0438 SSD_id: 64 loss:1.0...
7 1 0.204927 |i imp:-3.1539 SSD_id: 128 loss:0.6...
8 -1 0.326834 |i imp:-1.6101 SSD_id: 256 loss:0....
9 1 0.337820 |i imp:-2.8843 SSD_id: 512 loss:0.4...
10 -1 0.271222 |i imp:-2.5209 SSD_id: 1024 loss:0...
11 1 0.320471 |i imp:-2.0732 SSD_id: 2048 loss:0....
12 1 0.289892 |i imp:-2.7639 SSD_id: 4096 loss:0....
Each line is truncated when printed like this, but it's all in there. For instance:
print(formatted[0])
-1 0.693147 |i imp:0.0000 SSD_id: 1 loss:0.693147 |c feat_val=com.12346.xyz
All that's left is to save it to a file:
with open('out.txt', 'w') as f:
f.writelines(formatted)