python, urllib, urldecode

urldecode a list in Python


I'm attempting to open a CSV and decode the URL text, e.g. example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0, and then save the file. I can do this easily with a string, but I'm struggling to do it with rows in a CSV.

My attempt so far:

#reading
file1 = open('example.csv', 'r')
reader = csv.reader(file1)
url = []
for rows in reader:
    url.append = urllib.unquote(rows).decode('utf8')
    #also tried "url.append(urllib.unquote(rows).decode('utf8'))", but same error
file1.close() 

#writing
file2 = open('example.csv', 'w')
writer = csv.writer(file2)
writer.writerows(url)
file2.close()

The error I'm receiving:

AttributeError: 'list' object has no attribute 'split'


Solution

  • There are a few mistakes in your approach.

    1. You don't seem to have a CSV file, but a regular text file with one value per line. There is no benefit in using the csv module here; Python can read text files just fine. In fact, iterating over a file opened for reading yields it line by line by default.
    2. When you read or write a text file, you should declare its encoding when you open() it. Python does not detect text encodings. If you don't specify one, it falls back to a platform-dependent default, so reading the file may work on your machine and break on another.
    3. URLs are complex data structures; applying "urldecode" to the whole string is not good enough. You need to parse them. Luckily, a URL parser is built into Python. It gives you a ParseResult object that conveniently exposes the different parts of the URL as attributes.
    4. URLs consist of many parts, the query string is one of them.
    5. Query strings are complex data structures, too; applying "urldecode" to them is not good enough either. You need to parse them. Luckily, a query string parser is built into Python as well. It automatically decodes the values for you and gives you a dict that maps each key to a list of values (a key can occur more than once), which is why the code below uses q['title'][0].
    6. .append is a method. You can't assign to it (.append = '...'), you need to call it (.append('...')).
    7. Lastly, it's easier to use a with block to work with files, because with blocks close the file automatically.

    Compare:

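    from urllib.parse import urlparse, parse_qs
    
    titles = []
    
    # Declare the encoding explicitly; each iteration yields one line (one URL).
    with open('example.txt', 'r', encoding='utf-8') as file1:
        for line in file1:
            url = line.strip()  # drop the trailing line break
            parts = urlparse(url)
            # e.g. for 'http://example.com?title=%D0%BF...' (with a scheme):
            # -> ParseResult(
            #      scheme='http', netloc='example.com', path='', params='',
            #      query='title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0',
            #      fragment='')
    
            q = parse_qs(parts.query)
            # -> {'title': ['правовая защита']}
    
            if 'title' in q:
                titles.append(q['title'][0])
    
    # writelines() does not add line breaks, so append them explicitly.
    with open('titles.txt', 'w', encoding='utf-8') as file2:
        file2.writelines(title + '\n' for title in titles)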
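
    To try points 3 to 5 on single strings before touching any files, a small sketch like the following may help. The helper name extract_title is made up for this illustration and is not part of urllib. Note that the URL from the question has no scheme, so urlparse() puts "example.com" into path instead of netloc, but the query string is still split off as expected.

```python
from urllib.parse import urlparse, parse_qs


def extract_title(url):
    """Return the first decoded 'title' value of a URL, or None if there is none.

    Hypothetical helper, for illustration only.
    """
    query = urlparse(url.strip()).query
    values = parse_qs(query).get('title')  # a list of values, or None
    return values[0] if values else None


encoded = ('example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F'
           '+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0')

print(extract_title(encoded))                # правовая защита
print(parse_qs('title=a&title=b'))           # {'title': ['a', 'b']}
print(extract_title('http://example.com/'))  # None
```

    The second call shows why parse_qs() returns lists: the same key may appear several times in one query string.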
    

    Using list comprehensions and dropping the unnecessary comments, we can compress the above code quite a bit:

    from urllib.parse import urlparse, parse_qs
    
    with open('example.txt', 'r', encoding='utf-8') as file1:
        queries = [parse_qs(urlparse(line.strip()).query) for line in file1]
    
    titles = [q['title'][0] for q in queries if 'title' in q]
    
    with open('titles.txt', 'w', encoding='utf-8') as file2:
        file2.writelines(title + '\n' for title in titles)
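
    If your file does turn out to be a real CSV with several columns, the error from the question is worth a note: csv.reader yields every row as a list of strings, so passing the whole row to an unquote function hands it a list where it expects a string. Below is a minimal sketch of a per-cell variant, assuming every cell may contain percent-encoded text and that simply decoding the whole cell (rather than parsing it as a URL) is what you want. The names decode_rows and decode_csv are made up for this example.

```python
import csv
from urllib.parse import unquote_plus


def decode_rows(rows):
    """Percent-decode every cell of every row (each row is a list of strings)."""
    # unquote_plus() also turns '+' into a space, as used in query strings.
    return [[unquote_plus(cell) for cell in row] for row in rows]


def decode_csv(src_path, dst_path):
    """Read a CSV file, decode all cells, and write the result to a new file."""
    # newline='' is what the csv module documentation recommends for file objects.
    with open(src_path, 'r', encoding='utf-8', newline='') as src:
        decoded = decode_rows(csv.reader(src))

    with open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        csv.writer(dst).writerows(decoded)
```

    Calling decode_csv('example.csv', 'decoded.csv') (both file names are placeholders) writes to a separate output file, so the original data stays intact if something goes wrong.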