python web-scraping web-crawler

Python : Web Scraping Specific Keywords


My question shouldn't be too hard to answer. I'm quite new to Python and I'm not sure how to scrape a website for specific keywords. I don't want to use Beautiful Soup or similar libraries; I'm using lxml and requests. I want to ask the user for a website URL, send a request to it, and grab all of the HTML, which I believe I've done using html.fromstring(site.content). The problem is the next step: I want to find any link or text ending in '.swf' and print it. Does anyone know a way of doing this?

import requests
from lxml import html

def ScrapeSwf():
    flashSite = raw_input('Please Provide Web URL : ')
    print 'Sending Requests...'
    flashReq = requests.get(flashSite)
    print 'Scraping...'
    flashTree = html.fromstring(flashReq.content)
    # Now I want to search the HTML for the .swf links
    # and display them using print, probably in a loop

Something like that. Any help is highly appreciated.


Solution

  • Here goes my attempt:

    import requests                       # [1]
    response = requests.get(flashSite)    # [2] flashSite is the URL from the question
    myPage = response.content             # [3]
    for line in myPage.splitlines():      # [4]
        if '.swf' in line:                # [5]
            start = line.find('http')     # [6]
            end = line.find('.swf') + 4   # [7]
            print line[start:end]         # [8]
    

    Explanation:

    1: Import the requests module. I couldn't figure out a way to get what I needed out of lxml, so I stuck with this.

    2: Send an HTTP GET request to the site that hosts the Flash file.

    3: Save its contents to a variable.

    Yes, I realize you could condense lines 2 and 3; I did it this way because it makes a bit more sense to me.

    4: Iterate through the page source line by line.

    5: Check whether '.swf' is in that line.

    Lines 6 through 8 demonstrate the string slicing method that @GazDavidson mentioned in his answer. find() returns the index where '.swf' starts, so I add 4 in line 7 ('.swf' is 4 characters long) to make the slice include the extension.

    This should (roughly) print the link to the SWF file, as long as the matching line contains a single absolute URL that starts with 'http'.
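
The slicing above picks up one link per line and expects it to start with 'http'. As a rough alternative, here is a minimal sketch that uses a regular expression instead, so it can pick up several links on the same line as well as relative paths. It is written for Python 3 (unlike the Python 2 snippets above) and uses only the standard library. The names find_swf_links and SWF_PATTERN, the sample HTML and the example.com URLs are all made up for illustration.

```python
import re
from urllib.parse import urljoin

# One or more URL-ish characters followed by ".swf" (case-insensitive).
# The character class stops at whitespace, quotes, angle brackets and parentheses.
SWF_PATTERN = re.compile(r"""[^\s"'<>()]+\.swf\b""", re.IGNORECASE)


def find_swf_links(page_source, base_url):
    """Return every distinct .swf reference in page_source as an absolute URL."""
    found = []
    for match in SWF_PATTERN.findall(page_source):
        url = urljoin(base_url, match)  # resolve relative paths against the page URL
        if url not in found:            # keep the first occurrence, preserve order
            found.append(url)
    return found


# Hypothetical page source, standing in for the text of a real response.
sample = """<html><body>
<embed src="/media/game.swf" width="400">
<a href="http://example.com/a.swf">A</a> <a href="http://example.com/b.swf">B</a>
<p>Mirror: http://cdn.example.com/movie.SWF</p>
</body></html>"""

for link in find_swf_links(sample, "http://example.com/page.html"):
    print(link)
```

For a live page you would pass requests.get(url).text and url instead of the sample. If you would rather stay with lxml, the same pattern could probably be applied to the attribute values returned by flashTree.xpath('//@*') instead of the raw source. Note that the pattern is only an approximation: it cuts the match off right after '.swf', so any query string is dropped.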