dsl lines-of-code

Add language support to sloccount


Is there a way to tell sloccount that some files are not in any of the languages it already knows, but in some other language (a DSL, or a language sloccount doesn't support: Scala, Go, Rust, ...), identified not by file extension but by content (e.g. they contain specific keywords or a specific style of comments; I could provide a complete list of tokens to the tool, etc.)?

Is there a better (simple) tool for this specific task?

Thanks in advance.


Solution

  • OP writes: Is there a better (simple) tool for this specific task?

    What you want is a tool that knows something about a wide variety of languages, can use the file extension as a hint, and uses the file content as a sanity check, or as a classification when the extension isn't present.

    Semantic Designs' (my company) File Inventory tool scans a large set of files and classifies them this way. File extensions hint at content. When no file extension is present, a set of user-definable regexes is used to attempt a basic classification of the type of file. Once the file content is guessed, a second pass using language-accurate lexical scanners confirms that the content is what it claims to be, yielding confidence factors. (It works without the lexical scanners too; you just get the hinted type.)
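    The two-pass idea (regex hint, then a crude lexical confirmation) is easy to sketch yourself. This is a hypothetical simplification, not FileInventory's actual logic; the patterns and keyword sets are illustrative:

```python
import re

# First pass: user-definable regexes that hint at the language from content.
CONTENT_HINTS = {
    "go":    re.compile(r'^\s*package\s+\w+', re.M),
    "rust":  re.compile(r'\bfn\s+\w+\s*\(|#\[derive', re.M),
    "scala": re.compile(r'\bobject\s+\w+|\bval\s+\w+\s*=', re.M),
}

# Second pass: keywords the guessed language should contain,
# used to compute a rough confidence factor for the guess.
CONFIRM_KEYWORDS = {
    "go":    {"func", "package", "import"},
    "rust":  {"fn", "let", "impl"},
    "scala": {"def", "val", "object"},
}

def classify(text):
    """Return (language, confidence) for a file's text content."""
    for lang, pattern in CONTENT_HINTS.items():
        if pattern.search(text):
            words = set(re.findall(r'\b\w+\b', text))
            confidence = len(CONFIRM_KEYWORDS[lang] & words) / len(CONFIRM_KEYWORDS[lang])
            return lang, confidence
    return "unknown", 0.0

print(classify('package main\n\nimport "fmt"\n\nfunc main() {}\n'))
# → ('go', 1.0)
```

    A real classifier would weigh many more signals (shebang lines, comment styles, the OP's token lists), but the hint-then-confirm structure is the same.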

    FileInventory doesn't compute source code metrics by itself (it does compute file size and line counts for files that appear to contain text). But it does manufacture project files for the classified files to drive our Source Code Search Engine (SCSE), a tool for searching large code bases in multiple languages. A side effect of SCSE scanning the code base to index it for fast access is the computation of basic metrics: lines, SLOC, comments, Halstead and McCabe metrics (example output).

    [We have a special lexical analyzer called "Ad Hoc Text". This tries to model the generic programming language found in the zillion how-to computer books, so it knows about typical comments (/* ... */, -- ...), various kinds of quoted strings ("....", '....'), many kinds of numeric literals (decimal, float), and typical keywords ('function', 'if', 'do', etc.). Using this lexical analyzer, the SCSE can partially lex most randomly chosen programming languages, and that's good enough to compute not-terribly-inaccurate metrics. That's really handy for all the uncategorized source code one often finds in big, crufty code bases.]
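    To make the "ad hoc" lexing idea concrete, here is a minimal sketch in the same spirit (my own toy code, not the Ad Hoc Text analyzer): strip the generic comment styles and double-quoted strings, then count the lines that still carry code. It only handles double-quoted strings, for brevity:

```python
import re

# Generic comment and string patterns most languages roughly share.
BLOCK_COMMENT = re.compile(r'/\*.*?\*/', re.S)        # /* ... */, possibly multi-line
LINE_COMMENT  = re.compile(r'(//|--|#).*$', re.M)      # // ..., -- ..., # ...
STRING        = re.compile(r'"(?:\\.|[^"\\])*"')       # double-quoted, with escapes

def rough_metrics(text):
    """Approximate total lines and SLOC for an unknown language."""
    lines = len(text.splitlines())
    stripped = BLOCK_COMMENT.sub('', text)
    stripped = STRING.sub('""', stripped)   # so a '#' inside a string isn't a comment
    stripped = LINE_COMMENT.sub('', stripped)
    sloc = sum(1 for ln in stripped.splitlines() if ln.strip())
    return {"lines": lines, "sloc": sloc}
```

    For example, `rough_metrics('x = 1  # set x\n/* block\ncomment */\ny = "a#b"\n')` counts 4 lines but only 2 SLOC, since the block comment spans two lines and the trailing comment on line 1 doesn't disqualify it.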

    So the combination of FileInventory and the Source Code Search Engine seems to do what you want, at scale. These tools are not what I would call simple in terms of how they are implemented internally (anything that knows details about programming languages is actually pretty complicated), but they are very simple to configure and run.