I am trying to read contents from an srt file and I am using Java's BufferedReader to read the file line by line. The content from the srt file is:
2
00:00:40,665 --> 00:00:44,806
<i>♪ Nants ingonyama ♪</i>
And the code snippet is as follows:
public void parseSubtitles(@NonNull final MultipartFile subtitleFile) throws IOException {
    InputStream is = subtitleFile.getInputStream();
    BufferedReader reader = new BufferedReader(new InputStreamReader(is));
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
}
While debugging through the code with breakpoints, I found that when reading the first line, 2, the byte array value is [-3, -1, -3, -1, 50, 0, 0, 0].
Then the next line is just a byte array with value [0].
The next line is then [0, 48, 0, 48, 0, 58, 0, 48, 0, 48, 0, 58, 0, 52, 0, 48, 0, 44, 0, 54, 0, 54, 0, 53, 0, 32, 0, 45, 0, 45, 0, 62, 0, 32, 0, 48, 0, 48, 0, 58, 0, 48, 0, 48, 0, 58, 0, 52, 0, 52, 0, 44, 0, 56, 0, 48, 0, 54, 0], which in this case is the time interval of the subtitle.
This does not happen with other subtitle files: their byte arrays contain no 0 values, and there are no garbage lines such as a byte array holding only a null value, [0].
Any idea on what might be causing this issue?
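My guess is that the file is encoded as UTF-16 and you should be reading it as UTF-16. The tell-tale sign is the null byte preceding each non-null one. Since new InputStreamReader(is) is created without a charset, it decodes with the default charset (typically UTF-8), so every zero byte comes through as a separate null character. UTF-16 uses two bytes even for ASCII characters, and one of them is always zero; that redundancy is why UTF-8 is used more, except for certain languages whose characters mostly fall outside ASCII.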
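As a rough sketch of that suggestion, assuming the file really is UTF-16 with a byte-order mark at the start, passing StandardCharsets.UTF_16 to the InputStreamReader should give clean lines. The class Utf16SrtDemo and the helpers readUtf16Lines and sampleUtf16LeWithBom are made-up names for illustration, and the sample bytes are built in memory rather than taken from the real upload:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Utf16SrtDemo {

    // Hypothetical helper: decodes the stream as UTF-16 and returns its lines.
    // StandardCharsets.UTF_16 uses the byte-order mark, if present, to pick the
    // byte order and does not pass the mark on to the decoded text.
    static List<String> readUtf16Lines(InputStream in) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_16))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Hypothetical helper: builds test data as little-endian UTF-16 with a
    // leading FF FE byte-order mark (an assumption about the real file).
    static byte[] sampleUtf16LeWithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] data = new byte[body.length + 2];
        data[0] = (byte) 0xFF;
        data[1] = (byte) 0xFE;
        System.arraycopy(body, 0, data, 2, body.length);
        return data;
    }

    public static void main(String[] args) {
        String srt = "2\r\n"
                + "00:00:40,665 --> 00:00:44,806\r\n"
                + "<i>\u266A Nants ingonyama \u266A</i>\r\n";
        byte[] data = sampleUtf16LeWithBom(srt);
        for (String line : readUtf16Lines(new ByteArrayInputStream(data))) {
            System.out.println(line);
        }
    }
}
```

In the method from the question, the equivalent change would be new InputStreamReader(is, StandardCharsets.UTF_16). If the other subtitle files are UTF-8, hard-coding UTF-16 would break them, so the encoding would have to be detected first, for example by checking the first two bytes for a byte-order mark.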