python json pandas list-comprehension set-comprehension

Reading lines from a file using a generator comprehension vs a list comprehension


The code below is from chapter 3 of Python Data Science Handbook by Jake VanderPlas. Each line in the file is a valid JSON document. I don't think the specifics of the file are critical to answering this question, but the file comes from https://github.com/fictivekin/openrecipes.

# read the entire file into a Python array
with open('recipeitems-latest.json', 'r') as f:
    # Extract each line
    data = (line.strip() for line in f)
    # Reformat so each line is the element of a list
    data_json = "[{0}]".format(','.join(data))
# read the result as a JSON
recipes = pd.read_json(data_json)

Two questions:

  1. Why is a generator comprehension used rather than a list comprehension in the second line of the code? Since the desired final data structure is a list, why not work with lists throughout instead of starting with a generator and then building a list?
  2. Is it possible to use a list comprehension instead?

Solution

  • You have two questions in here:

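    1. Why a generator expression? Because you do not know the size of the file in advance, so it is safer not to build an extra list holding every line. Note that ','.join(data) still assembles the whole file into one string, so the generator saves at most that intermediate list.
    2. Yes, it is possible to use a list comprehension. Just replace the parentheses with square brackets.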
    >>> f = open('things_which_i_should_know')
    >>> data = (line.strip() for line in f)  # lazy: no lines are read yet
    >>> type(data)
    <class 'generator'>
    >>> data = [line.strip() for line in f]  # reads every line into a list
    >>> type(data)
    <class 'list'>
    >>> f.close()
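
As a minimal sketch of that swap applied to the snippet from the question, the following uses a small in-memory sample in place of recipeitems-latest.json and checks that both forms produce the same JSON array string. The sample lines and the helper name join_lines are made up for illustration, and json.loads stands in for pd.read_json so the example needs only the standard library.

```python
import io
import json

# Hypothetical stand-in for the recipe file: one JSON object per line.
SAMPLE = '{"name": "soup", "minutes": 30}\n{"name": "bread", "minutes": 90}\n'


def join_lines(lines):
    """Wrap newline-delimited JSON objects in a single JSON array string."""
    return "[{0}]".format(",".join(lines))


# Generator expression, as in the question.
with io.StringIO(SAMPLE) as f:
    from_generator = join_lines(line.strip() for line in f)

# List comprehension: only the brackets change.
with io.StringIO(SAMPLE) as f:
    from_list = join_lines([line.strip() for line in f])

# Both strings parse to the same list of records.
records = json.loads(from_list)
print(from_generator == from_list)  # True
print(records[1]["name"])           # bread
```

Either string could then be handed to pd.read_json as in the question. Newer pandas releases may warn when given a literal JSON string, in which case wrapping it as io.StringIO(data_json) is one option.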
    

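    Please see the official documentation (the Python Functional Programming HOWTO) for more information. It says the following, where stripped_list is the name used in its own example: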

    With a list comprehension, you get back a Python list; stripped_list is a list containing the resulting lines, not an iterator. Generator expressions return an iterator that computes the values as necessary, not needing to materialize all the values at once. This means that list comprehensions aren’t useful if you’re working with iterators that return an infinite stream or a very large amount of data. Generator expressions are preferable in these situations.
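
To make the quoted distinction concrete, here is a small sketch showing a generator expression drawing from an infinite iterator, something a list comprehension over the same source could never finish. The variable names and the use of itertools.count as the stream are illustrative choices, not part of the question.

```python
import itertools

# An infinite stream of squares: 0, 1, 4, 9, ...
squares = (n * n for n in itertools.count())

# Only the values that are requested get computed.
first_five = list(itertools.islice(squares, 5))
print(first_five)  # [0, 1, 4, 9, 16]

# The generator remembers its position and can be consumed only once.
next_value = next(squares)
print(next_value)  # 25

# By contrast, [n * n for n in itertools.count()] would never return,
# because it tries to build the complete list before handing it back.
```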