csvprologprolog-findall

Prolog, read a csv file and make a predicate. findall


I am using SWI-Prolog.

I have a csv file where the top row are probes and then each row is a sample:

    1007_s_at   1053_at 117_at ...
GSM102447.CEL   1   0   0 ...
GSM102449.CEL   1   0   0 ...
GSM102451.CEL   1   0   0 ...
GSM102455.CEL   1   0   0 ...
GSM102507.CEL   1   0   1 ...
...

The actual file has over 20,000 columns ('probes') and no more than 150 rows ('samples').

I want to extract each relation and print them as facts in another file.

For example:

%probe_value_in_sample(Probe,Sample_Strip,ProbeValue).
probe_value_in_sample('1007_s_at','GSM102447',1).
etc

My code so far:

foreach(csv_read_file_row_list('GSE2109_BarCode.csv', List), assert(['samples'|List])).

probe_value_in_sample(Probe,Sample_Strip,ProbeValue):-
[samples|[samples,Empty|ProbeList]],Empty='', %the first value is empty
indexOf(ProbeList,Probe,IndexOfProbe),
[samples|[samples,Sample|SampleValues]],Sample\='',
nth0(IndexOfProbe,SampleValues,ProbeValue),
name(Sample, CharSample),
append(Char_Sample_Strip,".CEL",CharSample),
name(Sample_Strip,Char_Sample_Strip).

%IndexOf(MyList, MyElement, MyIndex).
indexOf([Element|_], Element, 0).
indexOf([_|Tail], Element, Index):-
indexOf(Tail, Element, Index1),
Index is Index1+1.

This seems to work well, but does not work or it is so slow to be unusable with a findall.

Any idea what could be the problem?

Thank you for any help.

Update

Thank you for your reply.

I have defined:

csv_read_file_row_list(File, List,Functor):-
csv_read_file_row(File,Row,[functor(Functor)]),Row=..List.

So I have an open file rather than a stream and the Functor variable is superfluous at the moment.

I am confused by how you have used maplist? And I can't quite get it to work.

I have tried:

:- dynamic samples/3.

csv_read_file_row_list(File, List,Functor):-
csv_read_file_row(File,Row,[functor(Functor)]),Row=..List.

prepare_db(File) :-
   ( nonvar(File) ; File = 'GSE2109_BarCode.csv' ),
   %open(File, read, S),
   csv_read_file_row_list(File,     ['thing',_Empty|ColKeys],'thing'),
 forall(csv_read_file_row_list(File,    ['thing',RowKeyDirty|Samples],'thing'),
    (   clean_rowkey(RowKeyDirty, RowKey),
        maplist(store_sample(RowKey), ColKeys, Samples)
    )).
%close(S).

store_sample(RowKey, ColKey, Sample) :-
  assertz(samples(RowKey, ColKey, Sample)).

clean_rowkey(RowKeyDirty, RowKey) :- append(RowKey, ".CEL", RowKeyDirty).

As well as:

:- dynamic samples/3.

csv_read_file_row_list(File, List,Functor):-
csv_read_file_row(File,Row,[functor(Functor)]),Row=..List.

prepare_db(File) :-
( nonvar(File) ; File = 'GSE2109_BarCode.csv' ),
%open(File, read, S),
csv_read_file_row_list(File, ['thing',_Empty|ColKeys],'thing'),
forall(csv_read_file_row_list(File, ['thing',RowKeyDirty|Samples],'thing'),
    (   clean_rowkey(RowKeyDirty, RowKey),
        maplist(store_sample,[RowKey], ColKeys, Samples)
    )).
%close(S).

store_sample(RowKey, ColKey, Sample) :-
assertz(samples(RowKey, ColKey, Sample)).

clean_rowkey(RowKeyDirty, RowKey) :- append(RowKey, ".CEL", RowKeyDirty).

But both fail.


Solution

  • You are not using assert/1 in proper way. Prolog has a fast and efficient in memory DB, but as any DB, must be properly indexed. And of course, as any language, avoid repeating the very same operation each time, but format the data once, while preparing the DB.

    :- dynamic samples/3.
    
    prepare_db(File) :-
        ( nonvar(File) ; File = 'GSE2109_BarCode.csv' ),
        open(File, read, S),
        read_row(S, [_Empty|ColKeys]),
        forall(read_row(S, [RowKeyDirty|Samples]),
            (   clean_rowkey(RowKeyDirty, RowKey),
                maplist(store_sample(RowKey), ColKeys, Samples)
            )),
        close(S).
    
    store_sample(RowKey, ColKey, Sample) :-
        assertz(samples(RowKey, ColKey, Sample)).
    
    clean_rowkey(RowKeyDirty, RowKey) :- append(RowKey, ".CEL", RowKeyDirty).
    

    This code assumes that the first row has the very same number of columns as all other rows.

    read_row/2 must fetch a row and split in a list of codes lists, I guess your csv_read_file_row_list/2 already does it, but I can't spot your definition in posted code.

    Indexing works better with atoms, instead of code lists. atom_codes/2 allows to switch between these representations.

    edit

    From your comment and additional posted code, I can see my answer wasn't very appropriate. Here is a modified and tested snippet

    :- [library(csv)].
    
    :- dynamic samples/3.
    :- dynamic column_keys/1.
    
    prepare_db(File) :-
        retractall(column_keys(_)),
        retractall(samples(_,_,_)),
        ( nonvar(File) ; File = '/tmp/test.csv' ),
        forall(read_row(File, Row), store_row(Row)).
    
    store_row(Row) :-
        Row =.. [row|Cols],
        (   column_keys(ColKeys)
        ->  Cols = [RowKeyDirty|Samples],
            clean_rowkey(RowKeyDirty, RowKey),
            maplist(store_sample(RowKey), ColKeys, Samples)
        ;   assertz(column_keys(Cols))
        ).
    
    store_sample(RowKey, ColKey, Sample) :-
        assertz(samples(RowKey, ColKey, Sample)).
    
    clean_rowkey(RowKeyDirty, RowKey) :-
        atom_concat(RowKey, '.CEL', RowKeyDirty).
    
    read_row(File, Row) :-
        csv_read_file_row(File, Row, [separator(0' ), strip(true), convert(true)]),
        writeln(read_row(Row)).
    

    that works for this test file

                     1007_s_at    1053_at   117_at
    GSM102447.CEL    1            0         0
    GSM102449.CEL    1            0         0
    GSM102451.CEL    1            0         0
    GSM102455.CEL    1            0         0
    GSM102507.CEL    1            0         1
    

    and yields

    ?- prepare_db(_).
    read_row(row(1007_s_at,1053_at,117_at))
    read_row(row(GSM102447.CEL,1,0,0))
    read_row(row(GSM102449.CEL,1,0,0))
    read_row(row(GSM102451.CEL,1,0,0))
    read_row(row(GSM102455.CEL,1,0,0))
    read_row(row(GSM102507.CEL,1,0,1))
    true.
    
    16 ?- samples(X,Y,Z).
    X = 'GSM102447',
    Y = '1007_s_at',
    Z = 1 ;
    X = 'GSM102447',
    Y = '1053_at',
    Z = 0 ;
    ...
    

    Of course, the display of read row is just for debug purpose