ruby grep markdown hashtag

Extracting sections and their hashtags from a document in Ruby


I have a Markdown text document with several sections, each followed by a line of hashtags for that section. The hashtags have the form #oneword# or #multiple words hashtag#.

I need to extract the sections and their hashtags in Ruby.

Example

# Section 1

#hash1# #hash tag 2# #hashtag3#

Some text

# Section 2

#hash1# #hash tag 4# #hash tag2#


Some text too

I want to get

{"Section 1"=>["hash1", "hash tag 2", "hashtag3"],
 "Section 2"=>["hash1", "hash tag 4", "hash tag2"]}

Can we get this with grep?


Solution

  • When faced with a problem like this, I tend to prefer the builder pattern. It is a little verbose, but it is normally very readable and very flexible.

    The main idea is that a "reader" scans the input for "tokens" (in this case, lines). When it finds a token it recognizes, it tells the builder, and the builder builds another object from what the reader reports. Here is an example of a "DocumentBuilder" that takes input from a "MarkdownReader" and builds the Hash you are looking for.

    class MarkdownReader
        attr_reader :builder
    
        def initialize(builder)
            @builder = builder
        end
    
        def parse(lines)
            lines.each do |line|
                case line
                when /^#[^#]+$/   # section heading: a leading "#" and no other "#"
                    builder.convert_section(line)
                when /^#.+\#$/    # hashtag line: starts and ends with "#"
                    builder.convert_hashtag(line)
                end
            end
        end
    end
    
    class DocumentBuilder
        attr_reader :document
    
        def initialize
            @document = {}
        end
    
        def convert_section(line)
            line =~ /^#\s*(.+)$/
            @section_name = $1
            document[@section_name] = []
        end
        
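        def convert_hashtag(line)
            # Ignore hashtag lines that appear before the first section heading.
            return unless @section_name
            # Splitting on "#" leaves blank pieces between tags; strip and drop them.
            hashtags = line.split("#").map(&:strip).reject(&:empty?)
            document[@section_name].concat(hashtags)
        end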
    end
    
    lines = File.readlines("markdown.md")
    builder = DocumentBuilder.new 
    reader = MarkdownReader.new(builder)
    reader.parse(lines)
    p builder.document
    
        => {"Section 1"=>["hash1", "hash tag 2", "hashtag3"], "Section 2"=>["hash1", "hash tag 4", "hash tag2"]}
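
    To illustrate the flexibility mentioned above, here is a minimal sketch in which the reader stays the same and only the builder changes. TagIndexBuilder is a hypothetical name introduced for this example (it is not part of the code above); it builds the reverse mapping, from each hashtag to the sections that use it. The reader is repeated and the sample document is inlined only so that the sketch runs on its own.

```ruby
# Same reader as above, repeated so this sketch is self-contained.
class MarkdownReader
  attr_reader :builder

  def initialize(builder)
    @builder = builder
  end

  def parse(lines)
    lines.each do |line|
      case line
      when /^#[^#]+$/ then builder.convert_section(line)
      when /^#.+\#$/  then builder.convert_hashtag(line)
      end
    end
  end
end

# Hypothetical alternative builder: maps each hashtag to the sections using it.
class TagIndexBuilder
  attr_reader :index

  def initialize
    # Unknown tags start out with an empty list of sections.
    @index = Hash.new { |hash, tag| hash[tag] = [] }
  end

  def convert_section(line)
    # Remember the current section name; hashtags that follow belong to it.
    @section_name = line[/^#\s*(.+)$/, 1]
  end

  def convert_hashtag(line)
    line.split("#").map(&:strip).reject(&:empty?).each do |tag|
      @index[tag] << @section_name
    end
  end
end

# The example document from the question, inlined instead of read from a file.
text = <<~MARKDOWN
  # Section 1

  #hash1# #hash tag 2# #hashtag3#

  Some text

  # Section 2

  #hash1# #hash tag 4# #hash tag2#

  Some text too
MARKDOWN

builder = TagIndexBuilder.new
MarkdownReader.new(builder).parse(text.lines)
p builder.index
```

    With the example document, "hash1" ends up listed under both "Section 1" and "Section 2", while every other hashtag maps to a single section. The reader did not need to change at all, which is the main payoff of keeping the parsing and the building separate.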