python csv pandas data-processing data-dump

Sentence extraction from the Stack Overflow dump using pandas


I am supposed to work with the Stack Overflow dump as part of my project. As a novice programmer, I am having trouble doing the following task with the pandas library.

I have a .csv file that looks like:

Id,ParentId,CreationDate,Score,Body,OwnerUserId,LastEditorUserId,LastEditDate,LastActivityDate,CommentCount,FLAG
127,126,01-08-08 16:13,51,"This has religious war potential, but it seems to me that if you're using a getter/setter, you should use it internally as well - using both will lead to maintenance problems down the road (e.g. somebody adds code to a setter that needs to run every time that property is set, and the property is being set internally w/o that setter being called).",35,35,01-08-08 16:32,01-08-08 16:32,2,
152,146,01-08-08 17:33,28,"The funny thing is i wrote a php media gallery for all my music 2 days ago. I had a similar problem.  Im using http://musicplayer.sourceforge.net/ for the player. and the playlis are built via php.  all music request go there a script called xfer.php?file=WHATEVER

$filename = base64_url_decode($_REQUEST['file']);
header(""Cache-Control: public"");
header(""Content-Description: File Transfer"");
header('Content-disposition: attachment; filename='.basename($filename));
header(""Content-Transfer-Encoding: binary"");
header('Content-Length: '. filesize($filename));

//  Put either file counting code here. either a db or static files

//

readfile($filename);  //and spit the user the file


function base64_url_decode($input) {
    return base64_decode(strtr($input, '-_,', '+/='));
}


And when you call files use something like 

function base64_url_encode($input) {
     return strtr(base64_encode($input), '+/=', '-_,');
}


http://us.php.net/manual/en/function.base64-encode.php

If you are using some javascript or a flash player (JW player for example) that requires the actual link to be an mp3 file or whatever, you can append the text ""&type=.mp3"" so the final linke becomes something like ""www.example.com/xfer.php?file=34842ffjfjxfh&type=.mp3"". That way it looks like it ends with an mp3 extension without affecting the file link.
",146637,30,10-08-08 12:16,10-08-08 12:16,4,

I want to obtain another .csv file in which Body is split into sentences, one per row, with the remaining columns repeated. It should look like:

Id,ParentId,CreationDate,Score,Body,OwnerUserId,LastEditorUserId,LastEditDate,LastActivityDate,CommentCount,FLAG
127,126,2008-08-01 16:13:48,51,"This has religious war potential, but it seems to me that if you're using a getter/setter, you should use it internally as well - using both will lead to maintenance problems down the road (e.g. somebody adds code to a setter that needs to run every time that property is set, and the property is being set internally w/o that setter being called).",35,35.0,2008-08-01 16:32:17,2008-08-01 16:32:17,2,
152,146,2008-08-01 17:33:59,28,"The funny thing is i wrote a php media gallery for all my music 2 days ago.",146637,30.0,2008-08-10 12:16:40,2008-08-10 12:16:40,4,
152,146,2008-08-01 17:33:59,28,"I had a similar problem.",146637,30.0,2008-08-10 12:16:40,2008-08-10 12:16:40,4,
152,146,2008-08-01 17:33:59,28,"Im using /musicplayer.sourceforge/ for the player. and the playlis are built via php. all music request go there a script called xfer.php?file=WHATEVER ",146637,30.0,2008-08-10 12:16:40,2008-08-10 12:16:40,4,
152,146,2008-08-01 17:33:59,28,"$filename = base64_url_decode($_REQUEST['file']); header(""Cache-Control: public""); header(""Content-Description: File Transfer""); header('Content-disposition: attachment; filename='.basename($filename)); header(""Content-Transfer-Encoding: binary""); header('Content-Length: '. filesize($filename)); //  Put either file counting code here. either a db or static files // readfile($filename);  //and spit the user the file function base64_url_decode($input) {    return base64_decode(strtr($input, '-_,', '+/='));}",146637,30.0,2008-08-10 12:16:40,2008-08-10 12:16:40,4,
152,146,2008-08-01 17:33:59,28,"And when you call files use something like function base64_url_encode($input) {     return strtr(base64_encode($input), '+/=', '-_,');}",146637,30.0,2008-08-10 12:16:40,2008-08-10 12:16:40,4,
152,146,2008-08-01 17:33:59,28,"http://us.php.net/manual/en/function.base64-encode.php",146637,30.0,2008-08-10 12:16:40,2008-08-10 12:16:40,4,
152,146,2008-08-01 17:33:59,28,"If you are using some javascript or a flash player (JW player for example) that requires the actual link to be an mp3 file or whatever, you can append the text ""&type=.mp3"" so the final linke becomes something like ""example/xfer.php?file=34842ffjfjxfh&type=.mp3"".",146637,30.0,2008-08-10 12:16:40,2008-08-10 12:16:40,4,
152,146,2008-08-01 17:33:59,28,"That way it looks like it ends with an mp3 extension without affecting the file link. ",146637,30.0,2008-08-10 12:16:40,2008-08-10 12:16:40,4,

Solution

  • After cleaning the input CSV file with re and lxml, the following code (using nltk) did the job:

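    import nltk
    import pandas as pd

    sentences = []
    for row in df.itertuples():
        # itertuples() puts the index at row[0], so row[1] is the first column;
        # row[10] must be the text column and row[11] the flag in the cleaned df
        for sentence in nltk.sent_tokenize(row[10]):
            sentences.append((row[1], sentence, row[11]))
    new_df = pd.DataFrame(sentences, columns=['POSTID', 'SENTENCE', 'FLAG'])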
    

    I found this snippet online; some tweaking was necessary, of course.
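
The snippet above keeps only three columns, whereas the desired output in the question repeats every column for each sentence. A minimal sketch of one way to get that with pandas is shown here. The helper names `split_sentences` and `explode_sentences`, the regex splitter, and the small demo frame are assumptions for illustration, not part of the original answer. The regex is a naive stand-in for `nltk.sent_tokenize`, and it will not keep code blocks together the way the hand-written example in the question does.

```python
import re

import pandas as pd


def split_sentences(text):
    """Naive stand-in for nltk.sent_tokenize (hypothetical helper).

    Splits after '.', '!' or '?' when followed by whitespace.
    """
    text = " ".join(str(text).split())  # collapse newlines and repeated spaces
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p for p in parts if p]


def explode_sentences(df, text_col="Body", tokenizer=split_sentences):
    """Return a copy of df with one row per sentence of text_col.

    All other columns are repeated for every sentence of the same post.
    """
    out = df.copy()
    out[text_col] = out[text_col].apply(tokenizer)  # each cell becomes a list
    out = out.explode(text_col)                     # one row per list element
    out = out.dropna(subset=[text_col])             # drop posts with no text
    return out.reset_index(drop=True)


# Tiny demo frame (made-up rows that mimic the layout in the question).
df = pd.DataFrame({
    "Id": [127, 152],
    "Body": [
        "One sentence only.",
        "I had a similar problem. Im using a player.\nIt works!",
    ],
    "FLAG": ["", ""],
})
new_df = explode_sentences(df)
print(new_df)
```

If nltk and its sentence tokenizer data are installed, passing `tokenizer=nltk.sent_tokenize` should give splits closer to the original answer. The result can then be written out with `new_df.to_csv(...)` using `index=False`.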