I need to work with the Stack Overflow data dump as part of my project. As a novice programmer, I am having trouble doing the following task with the pandas library.
I have a .csv file that looks like this:
Id,ParentId,CreationDate,Score,Body,OwnerUserId,LastEditorUserId,LastEditDate,LastActivityDate,CommentCount,FLAG
127,126,01-08-08 16:13,51,"This has religious war potential, but it seems to me that if you're using a getter/setter, you should use it internally as well - using both will lead to maintenance problems down the road (e.g. somebody adds code to a setter that needs to run every time that property is set, and the property is being set internally w/o that setter being called).",35,35,01-08-08 16:32,01-08-08 16:32,2,
152,146,01-08-08 17:33,28,"The funny thing is i wrote a php media gallery for all my music 2 days ago. I had a similar problem. Im using http://musicplayer.sourceforge.net/ for the player. and the playlis are built via php. all music request go there a script called xfer.php?file=WHATEVER
$filename = base64_url_decode($_REQUEST['file']);
header(""Cache-Control: public"");
header(""Content-Description: File Transfer"");
header('Content-disposition: attachment; filename='.basename($filename));
header(""Content-Transfer-Encoding: binary"");
header('Content-Length: '. filesize($filename));
// Put either file counting code here. either a db or static files
//
readfile($filename); //and spit the user the file
function base64_url_decode($input) {
return base64_decode(strtr($input, '-_,', '+/='));
}
And when you call files use something like
function base64_url_encode($input) {
return strtr(base64_encode($input), '+/=', '-_,');
}
http://us.php.net/manual/en/function.base64-encode.php
If you are using some javascript or a flash player (JW player for example) that requires the actual link to be an mp3 file or whatever, you can append the text ""&type=.mp3"" so the final linke becomes something like ""www.example.com/xfer.php?file=34842ffjfjxfh&type=.mp3"". That way it looks like it ends with an mp3 extension without affecting the file link.
",146637,30,10-08-08 12:16,10-08-08 12:16,4,
I want to produce another .csv file in which Body is split into one sentence per row and the other columns are repeated. It should look like this:
Id,ParentId,CreationDate,Score,Body,OwnerUserId,LastEditorUserId,LastEditDate,LastActivityDate,CommentCount,FLAG
127,126,2008-08-01 16:13:48,51,"This has religious war potential, but it seems to me that if you're using a getter/setter, you should use it internally as well - using both will lead to maintenance problems down the road (e.g. somebody adds code to a setter that needs to run every time that property is set, and the property is being set internally w/o that setter being called).",35,35.0,2008-08-01 16:32:17,2008-08-01 16:32:17,2,
152,146,2008-08-01 17:33:59,28,"The funny thing is i wrote a php media gallery for all my music 2 days ago.",146637,30.0,2008-08-10 12:16:40,2008-08-10 12:16:40,4,
152,146,2008-08-01 17:33:59,28,"I had a similar problem.",146637,30.0,2008-08-10 12:16:40,2008-08-10 12:16:40,4,
152,146,2008-08-01 17:33:59,28,"Im using /musicplayer.sourceforge/ for the player. and the playlis are built via php. all music request go there a script called xfer.php?file=WHATEVER ",146637,30.0,2008-08-10 12:16:40,2008-08-10 12:16:40,4,
152,146,2008-08-01 17:33:59,28,"$filename = base64_url_decode($_REQUEST['file']); header(""Cache-Control: public""); header(""Content-Description: File Transfer""); header('Content-disposition: attachment; filename='.basename($filename)); header(""Content-Transfer-Encoding: binary""); header('Content-Length: '. filesize($filename)); // Put either file counting code here. either a db or static files // readfile($filename); //and spit the user the file function base64_url_decode($input) { return base64_decode(strtr($input, '-_,', '+/='));}",146637,30.0,2008-08-10 12:16:40,2008-08-10 12:16:40,4,
152,146,2008-08-01 17:33:59,28,"And when you call files use something like function base64_url_encode($input) { return strtr(base64_encode($input), '+/=', '-_,');}",146637,30.0,2008-08-10 12:16:40,2008-08-10 12:16:40,4,
152,146,2008-08-01 17:33:59,28,"http://us.php.net/manual/en/function.base64-encode.php",146637,30.0,2008-08-10 12:16:40,2008-08-10 12:16:40,4,
152,146,2008-08-01 17:33:59,28,"If you are using some javascript or a flash player (JW player for example) that requires the actual link to be an mp3 file or whatever, you can append the text ""&type=.mp3"" so the final linke becomes something like ""example/xfer.php?file=34842ffjfjxfh&type=.mp3"".",146637,30.0,2008-08-10 12:16:40,2008-08-10 12:16:40,4,
152,146,2008-08-01 17:33:59,28,"That way it looks like it ends with an mp3 extension without affecting the file link. ",146637,30.0,2008-08-10 12:16:40,2008-08-10 12:16:40,4,
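The date columns also change format between the two samples. A minimal sketch of that conversion follows, assuming the input is day-month-year (inferred from 01-08-08 becoming 2008-08-01 and 10-08-08 becoming 2008-08-10). The input shown here has no seconds, so they come out as :00; values like :48 in the desired output would have to come from the original dump.

```python
import pandas as pd

# Two date strings copied from the input sample above.
dates = pd.Series(["01-08-08 16:13", "10-08-08 12:16"])

# Assumption: the format is day-month-year with a two-digit year.
parsed = pd.to_datetime(dates, format="%d-%m-%y %H:%M")

# Render in the style of the desired output; seconds default to 00.
formatted = parsed.dt.strftime("%Y-%m-%d %H:%M:%S")
print(formatted.tolist())
```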
After cleaning the input .csv file with re and lxml, the following code did the job (using nltk):
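import nltk
import pandas as pd

# itertuples() puts the index at row[0], so row[1] is the first column (Id).
# row[10] and row[11] are positions in the cleaned DataFrame (text and FLAG).
sentences = []
for row in df.itertuples():
    for sentence in nltk.sent_tokenize(row[10]):
        sentences.append((row[1], sentence, row[11]))
new_df = pd.DataFrame(sentences, columns=['POSTID', 'SENTENCE', 'FLAG'])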
I found this snippet online; some tweaking was necessary, of course.
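The snippet above keeps only three columns. If every column should be repeated for each sentence, as in the desired output, one possible approach is DataFrame.explode. The sketch below is only an illustration under a few assumptions: SAMPLE is a tiny made-up table (not data from the dump), and split_sentences is a hypothetical regex-based stand-in for nltk.sent_tokenize so the example runs without downloading NLTK data.

```python
import io
import re

import pandas as pd

# Tiny made-up sample standing in for the cleaned dump (not real data).
SAMPLE = '''Id,ParentId,Body,FLAG
1,10,"First sentence. Second one!
Third line here?",
2,20,"Only one sentence.",
'''


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    Hypothetical stand-in for nltk.sent_tokenize, which copes far better
    with abbreviations, URLs and code.
    """
    flat = " ".join(text.split())  # collapse newlines and runs of spaces
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


df = pd.read_csv(io.StringIO(SAMPLE))

# Turn each Body into a list of sentences, then give every sentence its own
# row; explode() repeats the remaining columns for each list element.
df["Body"] = df["Body"].map(split_sentences)
new_df = df.explode("Body").reset_index(drop=True)

print(new_df)
```

With the real data, swapping split_sentences for nltk.sent_tokenize and finishing with new_df.to_csv(..., index=False) should give a file shaped like the desired output, although code blocks inside Body will probably still need special handling.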