Word count from a web text document results in 0


I tried the Python code from Rasha Ashraf's article "Scraping EDGAR with Python". The article uses urllib2, which is not available in Python 3, so I switched to urllib.request.

I can fetch the EDGAR page below, but every word count comes out as 0 no matter how I change the code. Please help me fix this. FYI, I checked the page manually: "ADDRESS", "TYPE", and "transaction" occur 5, 9, and 49 times respectively. Nevertheless, my script reports 0 for all three words.

Here is Rasha Ashraf's code as amended by me (I changed only the urllib part and the URL). The original URL points to a very large document, so I replaced it with a simpler page.

import time
import csv
import sys

CIK = '0001018724'
Year= '2013'
string_match1= 'edgar/data/1018724/000112760220028651/0001127602-20-028651.txt'
url3= 'http://www.sec.gov/Archives/'+string_match1

import urllib.request
 
response3= urllib.request.urlopen(url3)
#output = response3.read()
#print(output)
words=  ['ADDRESS','TYPE', 'transaction']
count= {}
for elem in words:
    count[elem]= 0
    
for line in response3:
    elements= line.split()
    for word in words:
       count[word]= count[word] + elements.count(word)

print (CIK)
print (Year)
print (url3)
print (count)

=> The output of my code so far:

0001018724

2013

http://www.sec.gov/Archives/edgar/data/1018724/000112760220028651/0001127602-20-028651.txt

{'ADDRESS': 0, 'TYPE': 0, 'transaction': 0}

Solution

  • To get the correct count of how many times each of your 3 strings (not words!) appears in the filing, try something like this:

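    import requests
    
    url = "http://www.sec.gov/Archives/edgar/data/1018724/000112760220028651/0001127602-20-028651.txt"
    req = requests.get(url)
    
    # Lowercase the search strings and the text so the match is case-insensitive.
    words = ['address', 'type', 'transaction']
    filing = req.text.lower()
    for word in words:
        # str.count returns the number of non-overlapping substring occurrences.
        print(word, ': ', filing.count(word))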
    

    Output:

    address :  5
    type :  9
    transaction :  49
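
As for why the code in the question prints 0: in Python 3, iterating over the object returned by urllib.request.urlopen yields bytes, so line.split() produces bytes tokens, and a list of bytes never contains a str such as 'ADDRESS'. The sketch below reproduces this offline. The raw_lines sample is made up for illustration and is not the real filing, and count_tokens is a hypothetical helper that mirrors the loop in the question. It also shows that, even after decoding, exact token matching misses tokens such as "ADDRESS:", which is why counting strings and counting words give different numbers.

```python
import re

# Made-up stand-in for a few lines of a filing, as urlopen() would yield them:
# bytes objects, not str. This is NOT the real document.
raw_lines = [
    b"<TYPE>4\n",
    b"BUSINESS ADDRESS:\n",
    b"MAIL ADDRESS:\n",
    b"transaction transactions\n",
]
words = ["ADDRESS", "TYPE", "transaction"]


def count_tokens(lines, words):
    """Hypothetical helper mirroring the question's loop: exact whitespace-token matches."""
    counts = {w: 0 for w in words}
    for line in lines:
        tokens = line.split()
        for w in words:
            counts[w] += tokens.count(w)
    return counts


# 1) bytes tokens never compare equal to str, so every count stays 0.
bytes_counts = count_tokens(raw_lines, words)

# 2) After decoding, the comparison works, but only exact tokens match:
#    "ADDRESS:" and "<TYPE>4" are not equal to "ADDRESS" or "TYPE".
decoded = [line.decode("utf-8") for line in raw_lines]
token_counts = count_tokens(decoded, words)

# 3) Substring counting (the approach in the answer, here case-sensitive)
#    also counts "transaction" inside "transactions".
text = "".join(decoded)
substring_counts = {w: text.count(w) for w in words}

# 4) Whole-word counting with regex word boundaries, if that is what is wanted.
word_counts = {
    w: len(re.findall(r"\b" + re.escape(w) + r"\b", text)) for w in words
}

print(bytes_counts)
print(token_counts)
print(substring_counts)
print(word_counts)
```

Which of the last two counts is "correct" depends on whether a hit inside a longer word should count. The numbers 5, 9, and 49 above come from case-insensitive substring counting.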