
Is there a tool that shows a distribution of lines of code per file of a folder?


I want to know how big files are within my repository in terms of lines of code, to see the 'health' of a repository.

To answer this, I would like to see a distribution (visualised or not) of the number of files per range of line counts (the range width can be as small as 1):

#lines of code   #files   
 1-10             1
11-20             23
etc...

(A histogram of this would be nice)

Is there a quick way to get this, for example with cloc or any other (command line) tool?


Solution

  • So the goal was to get a histogram of the sizes (in lines of code) for all the files in a directory. Since our project is a React Native project, we are concerned with .ts and .tsx files. All the test files (also .ts and .tsx files) can be skipped.

    Also, show the 5 largest files, so we know where our attention is needed.

    What we basically did was traverse the directory recursively and, for every file we're interested in, 1) count its lines, 2) work out which 'bin'/'bar' the file belongs to and 3) add it to that bin. Meanwhile, we keep track of every file's size so that the 5 largest files can be displayed at the end.

    The following Python script worked perfectly for my use case:

    import os
    import matplotlib.pyplot as plt
    from heapq import nlargest
    
    
    # Directory path containing your code files
    directory = "./src"
    
    # Extensions we're interested in
    extensions = [".ts", ".tsx"]
    
    # Number of files per bin, keyed by the bin's lower bound
    line_counts = {}
    
    # (filepath, line_count) for every file; reduced to the 5 largest later
    largest_files = []
    
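    def count_lines(filepath):
        # Count lines lazily instead of loading the whole file into memory;
        # errors="replace" avoids a crash on files that are not valid UTF-8
        with open(filepath, "r", encoding="utf-8", errors="replace") as file:
            return sum(1 for _ in file)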
    
    
    for root, dirs, files in os.walk(directory):
        # Skip Jest test directories
        if "__tests__" in root:
            continue
    
        for filename in files:
            _, file_extension = os.path.splitext(filename)
            if file_extension not in extensions:
                continue
    
            filepath = os.path.join(root, filename)
            line_count = count_lines(filepath)
    
            # Lower bound of the file's bin: 0-9 lines -> 0, 10-19 lines -> 10, ...
            bin_index = (line_count // 10) * 10
    
            # Increment the number of files in this bin
            line_counts[bin_index] = line_counts.get(bin_index, 0) + 1
    
            # Record every file; the 5 largest are selected after the walk
            largest_files.append((filepath, line_count))
    
    # Extract x and y data for the histogram
    x = list(line_counts.keys())
    y = list(line_counts.values())
    
    # Keep only the 5 largest files (nlargest returns them in descending order)
    largest_files = nlargest(5, largest_files, key=lambda item: item[1])
    
    # Print the largest files
    print("Top 5 Largest Files:")
    for file, line_count in largest_files:
        print(f"{file} - {line_count} lines")
    
    # Plot the histogram
    plt.bar(x, y, align="edge", width=10)
    plt.xlabel("Number of Lines of Code")
    plt.ylabel("Number of Files")
    plt.title("Distribution of Lines of Code")
    plt.show()
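
Since the question also allowed a non-visual distribution, here is a minimal sketch of a text-only variant that prints a table with the same ranges the question used (1-10, 11-20, ...). The helper names `bin_start` and `text_histogram` are made up for this illustration, and the sample line counts are invented; in practice the list would come from the directory walk above.

```python
from collections import Counter


def bin_start(line_count, width=10):
    """Lower bound of the 1-based bin containing line_count (1-10 -> 1, 11-20 -> 11)."""
    if line_count <= 0:
        return 0  # empty files get their own "0" bin
    return ((line_count - 1) // width) * width + 1


def text_histogram(line_counts, width=10, max_bar=40):
    """Return one formatted row per non-empty bin: range, file count, bar."""
    bins = Counter(bin_start(n, width) for n in line_counts)
    if not bins:
        return []
    peak = max(bins.values())
    rows = []
    for low in sorted(bins):
        label = "0" if low == 0 else f"{low}-{low + width - 1}"
        # Scale bars so the fullest bin is max_bar characters wide
        bar = "#" * max(1, round(bins[low] * max_bar / peak))
        rows.append(f"{label:>9}  {bins[low]:>5}  {bar}")
    return rows


if __name__ == "__main__":
    # Invented line counts, standing in for the sizes collected during the walk
    sample = [3, 12, 15, 20, 21, 140]
    print("\n".join(text_histogram(sample)))
```

Note that these bins are 1-based (1-10, 11-20), whereas the script above groups files as 0-9, 10-19, and so on; either convention works as long as the labels match the arithmetic.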