I am reading a Wikipedia XML file, from which I have to delete anything that marks a list item. For example, given the following string:
String text = ": definition list\n"
    + "** some list item\n"
    + "# another list item\n"
    + "[[Category:1918 births]]\n"
    + "[[Category:2005 deaths]]\n"
    + "[[Category:Scottish female singers]]\n"
    + "[[Category:Billy Cotton Band Show]]\n"
    + "[[Category:Deaths from Alzheimer's disease]]\n"
    + "[[Category:People from Glasgow]]";
Here, I want to delete the leading *, # and : markers, but not the : in the lines that say Category. The output should look like:
String outtext = "definition list\n"
    + "some list item\n"
    + "another list item\n"
    + "[[Category:1918 births]]\n"
    + "[[Category:2005 deaths]]\n"
    + "[[Category:Scottish female singers]]\n"
    + "[[Category:Billy Cotton Band Show]]\n"
    + "[[Category:Deaths from Alzheimer's disease]]\n"
    + "[[Category:People from Glasgow]]";
I am using the following code:
Pattern pattern = Pattern.compile("(^\\*+|#+|;|:)(.+)$");
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
String outtext = matcher.group(0);
outtext = outtext.replaceAll("(^\\*+|#+|;|:)\\s", "");
return(outtext);
}
This is not working. Can you please indicate how I should do it?
This should work:
text = text.replaceAll("(?m)^[*:#]+\\s*", "");
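The important part is (?m), which enables MULTILINE mode so that the ^ and $ anchors match at the start and end of every line instead of only at the start and end of the whole input. Note also that in your pattern the ^ belongs only to the first alternative, so #+, ; and : can match anywhere in a line, including the : in Category.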
OUTPUT:
definition list
some list item
another list item
[[Category:1918 births]]
[[Category:2005 deaths]]
[[Category:Scottish female singers]]
[[Category:Billy Cotton Band Show]]
[[Category:Deaths from Alzheimer's disease]]
[[Category:People from Glasgow]]
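
For completeness, here is a minimal self-contained sketch of the same approach. The class name ListMarkerStripper and the method name stripListMarkers are made up for illustration; the regex is the one from the answer, and Pattern.MULTILINE is the flag equivalent of the inline (?m).

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class ListMarkerStripper {

    // Same regex as in the answer, compiled once.
    // Pattern.MULTILINE makes ^ match at the start of every line.
    private static final Pattern LIST_MARKER =
            Pattern.compile("^[*:#]+\\s*", Pattern.MULTILINE);

    // Removes leading runs of *, : and # (plus following whitespace) from each line.
    static String stripListMarkers(String text) {
        return LIST_MARKER.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String text = ": definition list\n"
                + "** some list item\n"
                + "# another list item\n"
                + "[[Category:1918 births]]\n"
                + "[[Category:People from Glasgow]]";

        // Category lines start with '[', so the anchored pattern leaves them alone.
        System.out.println(stripListMarkers(text));
    }
}
```

One caveat: \\s also matches line breaks, so a line that contains only a marker (for example "*" followed by a newline) is removed together with its newline. If that is not wanted, [ \\t]* in place of \\s* limits the match to spaces and tabs on the same line.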