pythonvowpalwabbit

How to convert CSV columns into Vowpal Wabbit txt input file


I am trying to convert csv (with 10 columns) to vowpal wabbit input format txt file. Some csv columns has integer values and some have strings (like: com.12346.xyz). For example if my csv columns look like this:

loss       weight         SSD_id      weight    label      imp            feat_val
0.693147   0.693147            1         1.0  -1.0000   0.0000       com.12346.xyz
0.419189   0.145231            2         2.0   1.0000  -1.8559       com.12346.xyz
0.235457   0.051725            4         4.0  -1.0000  -2.7588       com.12356.xyz
6.371911   12.508365           8         8.0  -1.0000  -3.7784       com.12346.xyz
3.485084   0.598258           16        16.0   1.0000  -2.2767       com.12346.xyz
1.765249   0.045413           32        32.0  -1.0000  -2.8924       com.1236.xyz
1.017911   0.270573           64        64.0  -1.0000  -3.0438       com.12236.xyz
0.611419   0.204927          128       128.0   1.0000  -3.1539       com.16746.xyz
0.469127   0.326834          256       256.0  -1.0000  -1.6101       com.1946.xyz
0.403473   0.337820          512       512.0   1.0000  -2.8843       com.126.xyz
0.337348   0.271222         1024      1024.0  -1.0000  -2.5209       com.1346.xyz
0.328909   0.320471         2048      2048.0   1.0000  -2.0732       com.1234.xyz
0.309401   0.289892         4096      4096.0   1.0000  -2.7639       com.12396.xyz

and vowpal wabbit input format looks like this:

label weight |i imp SSD_id loss |c feat_val

and inside vowpal wabbit txt file values should be:

-1 0.051725 |i imp:-2.7588 SSD_id:4 loss:0.235457 |c feat_val=com.12356.xyz
 1 0.598258 |i imp:-2.7588 SSD_id:4 loss:3.485034 |c feat_val=com.12346.xyz

... and so on... for all the rows values. I have huge number of rows in the csv file that I want to convert to the above format and save all of them in single txt file. I have started with this small function given below:

def to_new_format(document, label=None):
      return str(label or '') + ' |i ' + ' '.join(re.findall('\w{3,}', document.lower())) + '\n'
to_new_format(str(text_train[1])

but I am completely lost now after many trials with dataframe, csv formatting and trying functions. Can someone give some direction to me how I can achieve this in minimal lines of code.


Solution

  • This is simpler to do than it might seem, because of some handy ways Pandas lets you treat sequences almost the same as you would single values in Python.

    First we'll import your CSV file, treating all values as strings to make our formatting easier:

    import pandas as pd
    
    df = pd.read_csv('test_data.txt', dtype=pd.StringDtype())
    

    Your label column is recorded like 1.0000 in your file, but you don't want the decimal or zeroes in your output. We can fix that using Pandas' str.replace method.

    df.label = df.label.str.replace('.0000', '', regex=False)
    

    And here's the magical part: we can concatenate these just like if they were individual strings!

    formatted = (
        df.label + ' ' + df.weight +
        ' |i imp:' + df.imp +
        ' SSD_id: ' + df.SSD_id +
        ' loss:' + df.loss +
        ' |c feat_val=' + df.feat_val +
        '\n'
    )
    

    That code looks would it would create a string, but because of how it includes the dataframe's columns (each of which is a Pandas sequence), the result is a sequence too:

    print(formatted)
    
    0     -1 0.693147 |i imp:0.0000 SSD_id: 1 loss:0.693...
    1     1 0.145231 |i imp:-1.8559 SSD_id: 2 loss:0.419...
    2     -1 0.051725 |i imp:-2.7588 SSD_id: 4 loss:0.23...
    3     -1 12.508365 |i imp:-3.7784 SSD_id: 8 loss:6.3...
    4     1 0.598258 |i imp:-2.2767 SSD_id: 16 loss:3.48...
    5     -1 0.045413 |i imp:-2.8924 SSD_id: 32 loss:1.7...
    6     -1 0.270573 |i imp:-3.0438 SSD_id: 64 loss:1.0...
    7     1 0.204927 |i imp:-3.1539 SSD_id: 128 loss:0.6...
    8     -1 0.326834 |i imp:-1.6101 SSD_id: 256 loss:0....
    9     1 0.337820 |i imp:-2.8843 SSD_id: 512 loss:0.4...
    10    -1 0.271222 |i imp:-2.5209 SSD_id: 1024 loss:0...
    11    1 0.320471 |i imp:-2.0732 SSD_id: 2048 loss:0....
    12    1 0.289892 |i imp:-2.7639 SSD_id: 4096 loss:0....
    

    Each line is truncated when printed like this, but it's all in there. For instance:

    print(formatted[0])
    
    -1 0.693147 |i imp:0.0000 SSD_id: 1 loss:0.693147 |c feat_val=com.12346.xyz
    

    All that's left is to save it to a file:

    with open('out.txt', 'w') as f:
        f.writelines(formatted)