python boto3

S3 URLs - get bucket name and path


I have a variable that holds an AWS S3 URL:

s3://bucket_name/folder1/folder2/file1.json

I want to get bucket_name in one variable and the rest, i.e. /folder1/folder2/file1.json, in another variable. I tried a regular expression and could get the bucket name as shown below, but I'm not sure whether there is a better way.

import re

m = re.search(r'(?<=s3://)[^/]+', 's3://bucket_name/folder1/folder2/file1.json')
print(m.group(0))  # bucket_name

How do I get the rest, i.e. folder1/folder2/file1.json?

I checked whether boto3 has a feature to extract the bucket name and key from the URL, but couldn't find one.


Solution

  • Since it's just a normal URL, you can use urlparse to get all the parts of the URL.

    >>> from urlparse import urlparse
    >>> o = urlparse('s3://bucket_name/folder1/folder2/file1.json', allow_fragments=False)
    >>> o
    ParseResult(scheme='s3', netloc='bucket_name', path='/folder1/folder2/file1.json', params='', query='', fragment='')
    >>> o.netloc
    'bucket_name'
    >>> o.path
    '/folder1/folder2/file1.json'
    

    The path keeps its leading slash, which is not part of the object key here, so you may have to strip it:

    o.path.lstrip('/')
    

    In Python 3, urlparse moved to urllib.parse, so use:

    from urllib.parse import urlparse
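
    Putting the steps above together for Python 3, a minimal sketch might look like the following. The function name split_s3_url is made up for illustration, and the sketch assumes the key contains no '?' (urlparse would split that part off into the query).

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Return (bucket, key) for a URL of the form s3://bucket/key."""
    parsed = urlparse(url, allow_fragments=False)
    # netloc is the bucket; the path minus its leading slash is the key.
    return parsed.netloc, parsed.path.lstrip('/')


bucket, key = split_s3_url('s3://bucket_name/folder1/folder2/file1.json')
print(bucket)  # bucket_name
print(key)     # folder1/folder2/file1.json
```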
    

    Here's a class that takes care of all the details, including keys that contain '?' or '#', and that works on both Python 2 and Python 3.

    try:
        from urlparse import urlparse
    except ImportError:
        from urllib.parse import urlparse
    
    
    class S3Url(object):
        """
        >>> s = S3Url("s3://bucket/hello/world")
        >>> s.bucket
        'bucket'
        >>> s.key
        'hello/world'
        >>> s.url
        's3://bucket/hello/world'
    
        >>> s = S3Url("s3://bucket/hello/world?qwe1=3#ddd")
        >>> s.bucket
        'bucket'
        >>> s.key
        'hello/world?qwe1=3#ddd'
        >>> s.url
        's3://bucket/hello/world?qwe1=3#ddd'
    
        >>> s = S3Url("s3://bucket/hello/world#foo?bar=2")
        >>> s.key
        'hello/world#foo?bar=2'
        >>> s.url
        's3://bucket/hello/world#foo?bar=2'
        """
    
        def __init__(self, url):
            self._parsed = urlparse(url, allow_fragments=False)
    
        @property
        def bucket(self):
            return self._parsed.netloc
    
        @property
        def key(self):
            if self._parsed.query:
                return self._parsed.path.lstrip('/') + '?' + self._parsed.query
            else:
                return self._parsed.path.lstrip('/')
    
        @property
        def url(self):
            return self._parsed.geturl()
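
    Since the question mentions boto3: the bucket and key are the values that boto3 client methods such as get_object take as Bucket and Key. Below is a rough sketch, assuming Python 3. The helper name s3_object_args is hypothetical and not part of boto3, and the boto3 call is left commented out because it needs credentials and a real object.

```python
from urllib.parse import urlparse


def s3_object_args(url):
    """Build Bucket/Key keyword arguments from an s3:// URL."""
    parsed = urlparse(url, allow_fragments=False)
    key = parsed.path.lstrip('/')
    # A '?' in the key ends up in parsed.query, so put it back.
    if parsed.query:
        key += '?' + parsed.query
    return {'Bucket': parsed.netloc, 'Key': key}


args = s3_object_args('s3://bucket_name/folder1/folder2/file1.json')
print(args)

# With boto3 installed and credentials configured, the arguments
# could be passed on like this:
# import boto3
# s3 = boto3.client('s3')
# response = s3.get_object(**args)
```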