
Search engine for a blog website (searching inside linked pages)


I created a very basic search option for my blog, and it returns results based on topics and keywords. In certain articles I add links to external websites, for example when I refer to someone else's blog for more information. What I am looking for is a way for my search to go through those linked pages as well and find results from them. Is that possible? I don't want to use GCSE (Google Custom Search Engine). Thanks in advance, it would be a great help.

Thanks again.


Solution

  • Yes, it is possible to write a bot that crawls the external websites your posts link to. I've built one that crawled 100K+ website URLs, so a crawler that follows the links from your blog is certainly feasible.

    To build a search engine, you'll need to know a little about how one works internally.

    Search Bots work like this:

    1. Crawler fetches pages. This step is pretty easy, as it can be done with curl.
    2. Parser splits the HTML into pieces, so that data can be extracted from the page. It has 2 sub-components:

      a. Extracts any data from the page that you want to capture & saves it into a database.

      b. Extracts links & places them back into the crawling queue. As long as new links keep turning up, the queue never empties, so your bot never stops crawling... (Unless someone else's malformed URL crashes it, which happens a lot. So be ready to fix it frequently.)

    3. Indexer creates lookup indexes, which map keywords to the web pages that contain them. It has 2 sub-components:

      a. Creates a Forward Index, which maps each document to the keywords inside that document.

      doc1 | bird, aviary, robin, dove, blue jay, cardinal
      doc2 | birds, bird watching, binoculars
      doc3 | cats, eat, birds
      doc4 | cats, generally, don't, like, water, nor, neighborhood, dogs
      doc5 | dog, shows, look, fun
      

      b. Creates an Inverted Index from the Forward Index by reversing the mapping, so that each keyword points to the documents that contain it. When a user searches by keyword, the search script looks the keyword up here & suggests the documents that the user may want to view. Like so...

      bird | doc1, doc2, doc3
      cat  | doc3, doc4
      dog  | doc4, doc5
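
    Going back to steps 1 and 2 of the bot, here is a minimal Python sketch of the fetch, extract-links, re-queue loop. It is only an illustration under some assumptions: the names `LinkExtractor`, `crawl` and `FAKE_WEB` are made up for this example, the `fetch` argument is a placeholder for whatever HTTP client you use (curl, `urllib.request`, etc.), and the page limit is there so the loop ends instead of running forever.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Make relative links absolute and drop any "#fragment" part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Skip mailto:, javascript: and other non-web links.
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl. `fetch(url)` must return HTML text or raise."""
    queue = deque([start_url])
    seen = {start_url}  # URLs already queued, so no page is fetched twice
    pages = {}          # url -> raw HTML, ready to hand to the indexer
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # one bad URL should not crash the whole crawl
        pages[url] = html
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical stand-in for real HTTP requests, so the sketch runs offline.
FAKE_WEB = {
    "https://blog.example/post": (
        '<a href="https://other.example/more#top">more</a> '
        '<a href="/about">about</a>'
    ),
    "https://other.example/more": '<a href="https://blog.example/post">back</a>',
    "https://blog.example/about": "<p>no links here</p>",
}

pages = crawl("https://blog.example/post", FAKE_WEB.__getitem__)
print(sorted(pages))
```

    For a blog, you would probably start the crawl from each of your own posts and keep `max_pages` small, or only follow links one hop away from your own site, since you only want the pages you actually reference. Checking each site's robots.txt before fetching is also good practice.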
      

    Search Forms work like this:

    1. Search Form shows the HTML input box to the user.
    2. Search Script will search the Inverted Index to find which document links to display in the Search Engine Results Page.
    3. Search Engine Results Page (yes, SERP is an actual industry acronym for Search Engine Results Page) displays the list of search result links. You can style it any way that you'd like & it doesn't have to look like Google's, Bing's or Yahoo's.

    Examples:

    Searching for:

    "bird" returns links to "doc1, doc2, doc3"
    "cat"  returns links to "doc3, doc4"
    "dog"  returns links to "doc4, doc5"
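
    The forward index, the inverted index and the lookups can be sketched in a few lines of Python. This is a rough illustration, not a production indexer: `normalize`, `build_inverted_index` and `search` are names invented for this example, and `normalize` is a crude stand-in for real stemming (it only lowercases and strips one trailing "s", so that "birds" and "bird" land on the same key).

```python
from collections import defaultdict

# Forward index from the answer: document -> keywords found in it.
forward_index = {
    "doc1": ["bird", "aviary", "robin", "dove", "blue jay", "cardinal"],
    "doc2": ["birds", "bird watching", "binoculars"],
    "doc3": ["cats", "eat", "birds"],
    "doc4": ["cats", "generally", "don't", "like", "water", "nor",
             "neighborhood", "dogs"],
    "doc5": ["dog", "shows", "look", "fun"],
}


def normalize(word):
    """Crude stand-in for stemming: lowercase, drop one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_inverted_index(forward):
    """Reverse the mapping: normalized word -> set of documents."""
    inverted = defaultdict(set)
    for doc_id, keywords in forward.items():
        for keyword in keywords:
            # Multi-word keywords such as "bird watching" are indexed per word.
            for word in keyword.split():
                inverted[normalize(word)].add(doc_id)
    return inverted


def search(inverted, query):
    """Return the documents that contain every word of the query."""
    terms = [normalize(word) for word in query.split()]
    if not terms:
        return []
    hits = set.intersection(*(inverted.get(term, set()) for term in terms))
    return sorted(hits)


inverted_index = build_inverted_index(forward_index)
for query in ("bird", "cat", "dog"):
    print(query, "->", search(inverted_index, query))
```

    With this normalization, "bird" matches doc1, doc2 and doc3, because doc3 contains "birds". In a real setup the inverted index would live in a database table rather than in memory, and the search script would turn the returned document IDs into links on the results page.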
    

    Good luck building your search engine for your blog!