javascript regex node.js stream visual-sourcesafe-2005

Extract Multiple Values from Dynamic Multi-line String

I'm working on a small Node.js app that parses a running log file in order to extract key values and generate custom alerts based on the results. However, I've now run into an issue for which I can't seem to find a solution. In case it's relevant, the specific log being parsed is an MS SourceSafe 2005 journal file.

For clarity, here are three examples of possible journal entries (some details changed for privacy reasons, structure kept intact):

$/path/to/a/project/folder
Version: 84
User: User1           Date: 14/01/27  Time: 12:15p
testBanner.rb added
Comment: Style and content changes based on corporate branding
Remove detector column on sc600 page
Styling tweaks and bug fixes

$/path/to/a/project/file.java
Version: 22
User: User2           Date: 14/01/29  Time: 12:34p
Checked in
Comment: Added fw updates to help fix (xxx) as seen in (yyy):
Changes include:
1) Peak tuning (minimum peak distance, and percentage crosstalk peak)
2) Dynamic pulses adjusted in run time by the sensor for low temperature climate
s
3) Startup noise automatic resets
4) More faults

$/path/to/a/project/folder
Version: 29
User: User3           Date: 14/01/30  Time: 11:54a
Labeled v2.036
Comment: Added many changes at this point, see aaVersion.java for a more comple
te listing.

So far, the following points are known:

First entry line is always the relevant VSS database project or file path.
Second entry line is always the relevant version of the above project or file.
Third entry line always contains three values: User:, Date: and Time:.
Fourth entry line is always the associated action, which can be any one of the following:
- Checked in: {file}
- {file} added
- {folder} created
- {file or folder} deleted
- {file or folder} destroyed
- Labeled: {label}
Fifth entry line is an optional comment block, starting with Comment:. It may contain any type of string input, including new lines, file names, brackets, etc. Basically VSS does not restrict the comment contents at all.

I've found regex patterns to match everything except the "Comment:" section. Not knowing how many newline characters the comment may contain makes this really difficult for someone like me who doesn't speak regex very fluently.

So far, I've managed to get my app to watch the journal file for changes and catch only fresh data in a stream. My initial plan was to use .split('\n\n') on the stream output to catch each individual entry, but since comments may also contain any number of new lines at any position, this is not exactly a safe approach.

I found a module called regex-stream, which makes me think I don't need to collect the results in an array of strings before extracting details, but I don't really understand the given usage example. Alternatively, I have no problem with splitting and parsing individual strings, as long as I can find a reliable way to break the stream down into the individual entries.

In the end, I'm looking for an array of objects with the following entry structure for each update of the journal:

{
    path: "",
    version: "",
    user: "",
    date: "",
    time: "",
    action: "",
    comment: ""
}

Please note: if 100 files are checked in with one action, VSS will still log a separate entry for each file. To prevent notification spamming, I still need to perform additional validation and grouping before generating any notifications.

The current state of my app can be seen in this GitHub repo. Could someone please help point me in the right direction here?

Solution

There is no 100% foolproof way to parse the entries when the Comment section can contain anything. The next best choice is to rely on a heuristic and hope that no comment happens to look like the start of an entry.

If we can assume that two newlines followed by a path signify the start of an entry, then we can split on this regex (after you normalize all variants of line separators to \n):

/\n\n(?=\$\/[^\n]*\n)/

The look-ahead (?=pattern) checks that a path, \$\/[^\n]*\n, comes next, without consuming it, so the path stays at the start of the following entry.
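
As a rough sketch of how that split could be used (the helper name splitEntries and the sample text are made up for illustration, they are not part of your app):

```javascript
// Hypothetical helper: split raw journal text into one string per entry.
function splitEntries(text) {
  return text
    .replace(/\r\n?/g, '\n')          // normalize CRLF and lone CR to LF
    .split(/\n\n(?=\$\/[^\n]*\n)/)    // break only where a blank line precedes a path line
    .map(chunk => chunk.trim())
    .filter(chunk => chunk.length > 0);
}

// Made-up sample: the first comment contains a blank line of its own.
const sample = [
  '$/path/to/a/project/folder',
  'Version: 84',
  'User: User1           Date: 14/01/27  Time: 12:15p',
  'testBanner.rb added',
  'Comment: first line',
  '',
  'second paragraph of the same comment',
  '',
  '$/path/to/a/project/file.java',
  'Version: 22',
  'User: User2           Date: 14/01/29  Time: 12:34p',
  'Checked in',
  'Comment: short',
  ''
].join('\r\n');

const entries = splitEntries(sample);
console.log(entries.length); // 2
```

The blank line inside the first comment does not cause a split, because the text after it does not start with $/.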

To be extra sure, you can make the look-ahead also check that the version line follows the path:

/\n\n(?=\$\/[^\n]*\nVersion: \d+\n)/
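
Once the text is split, each entry can be picked apart line by line. Here is a minimal sketch that produces the object structure from the question. It assumes the first four lines always follow the layout you described and that everything after them is the comment. The names parseJournal, parseEntry and USER_LINE are hypothetical, and rejoining comment lines that VSS wrapped mid-word is not attempted.

```javascript
// Split point: blank line followed by a path line and then a version line.
const ENTRY_BREAK = /\n\n(?=\$\/[^\n]*\nVersion: \d+\n)/;
// Third line of an entry: "User: <name>   Date: <date>  Time: <time>".
const USER_LINE = /^User:\s*(.*?)\s+Date:\s*(\S+)\s+Time:\s*(\S+)\s*$/;

// Hypothetical parser: turns raw journal text into an array of entry objects.
function parseJournal(text) {
  return text
    .replace(/\r\n?/g, '\n')            // normalize line separators to \n
    .split(ENTRY_BREAK)
    .map(chunk => chunk.trim())
    .filter(chunk => chunk.length > 0)
    .map(parseEntry);
}

function parseEntry(chunk) {
  const lines = chunk.split('\n');
  const [path = '', versionLine = '', userLine = '', action = ''] = lines;
  const match = USER_LINE.exec(userLine) || [];
  // Assumption: everything after the action line belongs to the optional comment.
  const commentText = lines.slice(4).join('\n');
  return {
    path,
    version: versionLine.replace(/^Version:\s*/, ''),
    user: match[1] || '',
    date: match[2] || '',
    time: match[3] || '',
    action,
    comment: commentText.replace(/^Comment:\s*/, '')
  };
}

// Made-up sample: one entry with a multi-paragraph comment, one without a comment.
const journal = [
  '$/path/to/a/project/folder',
  'Version: 84',
  'User: User1           Date: 14/01/27  Time: 12:15p',
  'testBanner.rb added',
  'Comment: Style and content changes',
  '',
  'Styling tweaks and bug fixes',
  '',
  '$/path/to/a/project/folder',
  'Version: 29',
  'User: User3           Date: 14/01/30  Time: 11:54a',
  'Labeled v2.036'
].join('\n');

const parsed = parseJournal(journal);
console.log(parsed.length);     // 2
console.log(parsed[1].action);  // Labeled v2.036
```

A comment that itself contains a blank line followed by something that looks like a path and a version line would still be split incorrectly. That is the limit of the heuristic.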