
Extract the full link from a list in Google Colab


I'm trying to extract a column of links from a column whose rows look like this:

{'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q47099'}

To this: http://www.wikidata.org/entity/Q47099

Basically, I would like to extract different links like this one into a new column with pandas in Google Colab. After importing the CSV, I used this line of code (org is the column in my CSV file and links is the newly created column):

data['links']=data['org'].str.findall('http://www.wikidata.org/entity/')

Then I tried this one:

data[data['org'].str.contains('www.wikidata.org')]

But both gave me the same result:

Output from data.head(5).to_dict()

{'links': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
 'org': {0: "{'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q47099'}",
  1: "{'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q565020'}",
  2: "{'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q576490'}",
  3: "{'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q590897'}",
  4: "{'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q604034'}"},
 'orgLabel': {0: "{'xml:lang': 'en', 'type': 'literal', 'value': 'Grupo Televisa, owner of TelevisaUnivision'}",
  1: "{'xml:lang': 'en', 'type': 'literal', 'value': 'Cuponzote'}",
  2: "{'xml:lang': 'en', 'type': 'literal', 'value': 'Casas GEO'}",
  3: "{'xml:lang': 'en', 'type': 'literal', 'value': 'Empresas ICA'}",
  4: "{'xml:lang': 'en', 'type': 'literal', 'value': 'Atletica'}"}}

Solution

  • If your org column contains a real dict, use:

    data[data['org'].str['value'].str.contains('www.wikidata.org')]
    #               ^^^^^^^^^^^^^
    

    If you want to extract the link:

    data['links'] = data['org'].str['value']
    

    Update

    Your column looks like it holds dicts, but each cell is actually a string. Parse the strings first with ast.literal_eval:

    import ast
    
    data['org'] = data['org'].apply(ast.literal_eval)
    data['links'] = data['org'].str['value']
    print(data)
    
    # Output
                                        links                                                org                                           orgLabel
    0   http://www.wikidata.org/entity/Q47099  {'type': 'uri', 'value': 'http://www.wikidata....  {'xml:lang': 'en', 'type': 'literal', 'value':...
    1  http://www.wikidata.org/entity/Q565020  {'type': 'uri', 'value': 'http://www.wikidata....  {'xml:lang': 'en', 'type': 'literal', 'value':...
    2  http://www.wikidata.org/entity/Q576490  {'type': 'uri', 'value': 'http://www.wikidata....  {'xml:lang': 'en', 'type': 'literal', 'value':...
    3  http://www.wikidata.org/entity/Q590897  {'type': 'uri', 'value': 'http://www.wikidata....  {'xml:lang': 'en', 'type': 'literal', 'value':...
    4  http://www.wikidata.org/entity/Q604034  {'type': 'uri', 'value': 'http://www.wikidata....  {'xml:lang': 'en', 'type': 'literal', 'value':...
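
For anyone who wants to try the approach without the original CSV, here is a minimal self-contained sketch. The two-row DataFrame, the helper name parse_cell, and the links_regex column are assumptions made for illustration; they are not part of the question's data. The sketch parses each cell only when it is a string, so it should work whether the column holds strings or real dicts, and it also shows a regex-based alternative that skips parsing entirely.

```python
import ast

import pandas as pd

# Hypothetical sample mirroring the question's data: each cell is a *string*
# that merely looks like a dict, as happens after a CSV round trip.
data = pd.DataFrame({
    "org": [
        "{'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q47099'}",
        "{'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q565020'}",
    ]
})


def parse_cell(cell):
    """Parse a dict-like string; pass real dicts through unchanged."""
    return ast.literal_eval(cell) if isinstance(cell, str) else cell


# Option 1: parse each cell into a dict, then pull out the 'value' key.
parsed = data["org"].map(parse_cell)
data["links"] = parsed.str["value"]

# Option 2: leave the strings alone and capture the URL with a regex.
# This assumes the URL runs up to the closing single quote.
data["links_regex"] = data["org"].str.extract(r"(https?://[^']+)", expand=False)

print(data[["links", "links_regex"]])
```

Option 1 is the safer choice when other keys of the dict are needed later; Option 2 is probably enough when only the URL matters and every row follows the same layout.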