I have what I hope is an easy question. I am using the Google Cloud Storage client library for Python to loop over the blobs in a bucket. After I list the blobs once, I am unable to loop over the results again unless I re-run the listing call.
I read the documentation on page iterators, but I still don't quite understand why this sort of thing couldn't just be stored in memory like a normal variable in Python. Why is this ValueError being thrown when I try to loop over the object again? Does anyone have any suggestions on how to interact with this data better?
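Here is a minimal sketch of what I'm doing (the bucket name is a placeholder):

```python
from google.cloud import storage

client = storage.Client()
blobs = client.list_blobs("my-bucket")  # placeholder bucket name

for blob in blobs:
    print(blob.name)   # first pass works fine

for blob in blobs:     # second pass blows up with
    print(blob.name)   # ValueError: Iterator has already started
```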
For many sources of data, the number of returned items can be huge. While you may only have dozens or hundreds of objects in your bucket, there is absolutely nothing to prevent you from having millions (billions?) of objects. If you list a bucket, it would make no sense to return a million entries and try to keep all of their state in memory. Instead, Google says you should "page" or "iterate" through them. Each time you ask for a new page, you get the next set of data and are presumed to have dropped your reference to the previous set ... and hence your client holds only one set of data at a time.
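To make that concrete, here is a rough sketch of paging with the Python client (the bucket name is a placeholder). The iterator returned by `list_blobs` exposes a `pages` property, so you can consume the listing one page at a time:

```python
from google.cloud import storage

client = storage.Client()

# walk the listing one page at a time; only the current page's blobs
# need to be held in memory
for page in client.list_blobs("my-bucket").pages:
    for blob in page:
        print(blob.name)
```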
It is the back-end server that maintains your "window" into the data being returned. All you need do is say "give me more data ... my context is this token" and the next chunk of data is returned.
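In the Python client, that context is the page token. A sketch, assuming you want to fetch one batch now and resume the listing later (bucket name and batch size are placeholders):

```python
from google.cloud import storage

client = storage.Client()

# fetch at most 100 results in this request
iterator = client.list_blobs("my-bucket", max_results=100)
for blob in iterator:
    print(blob.name)

# the server-side context; None once the full listing has been returned
token = iterator.next_page_token

# later (even from a different process), resume where the listing left off
if token is not None:
    for blob in client.list_blobs("my-bucket", max_results=100, page_token=token):
        print(blob.name)
```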
If you want to walk through your data twice, then I would suggest asking for a second iteration. Be careful, though: the result of the first iteration may not be the same as the second. If new files are added or old ones removed between iterations, the results will differ.
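Concretely, each call to `list_blobs` issues a fresh listing request and returns a brand-new iterator, so a double pass looks like this (placeholder bucket name again):

```python
from google.cloud import storage

client = storage.Client()

# first pass: a fresh iterator from a fresh listing request
for blob in client.list_blobs("my-bucket"):
    print(blob.name)

# second pass: another fresh iterator; may see a different set of objects
for blob in client.list_blobs("my-bucket"):
    print(blob.name)
```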
If you really believe that you can hold the results in memory, then as you execute your first iteration, save the results, appending each new value as you page through them. This may work for specific use cases, but realize that you are likely setting yourself up for trouble if the number of items gets too large.
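A sketch of that approach (again with a placeholder bucket name): materialize the iterator into a plain list, which you can then traverse as many times as you like:

```python
from google.cloud import storage

client = storage.Client()

# pull the entire listing into an ordinary list; fine for a few thousand
# objects, risky for millions
blobs = list(client.list_blobs("my-bucket"))

for blob in blobs:
    print(blob.name)   # a list can be iterated any number of times
for blob in blobs:
    print(blob.size)
```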