pandas numpy google-colaboratory data-extraction information-extraction

Extract full link from a list in Google Colab

I'm trying to convert a column whose rows look like this:

{'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q47099'}

To this: http://www.wikidata.org/entity/Q47099

Basically, I would like to extract the different links like this one into a new column with pandas in Google Colab. After importing the CSV, I used this line of code (org is the column in my CSV file and links is the newly created column):

data['links']=data['org'].str.findall('http://www.wikidata.org/entity/')

Then I tried with this other one:

data[data['org'].str.contains('www.wikidata.org')]

But both gave me the same result:

Output from data.head(5).to_dict()

{'links': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
 'org': {0: "{'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q47099'}",
  1: "{'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q565020'}",
  2: "{'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q576490'}",
  3: "{'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q590897'}",
  4: "{'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q604034'}"},
 'orgLabel': {0: "{'xml:lang': 'en', 'type': 'literal', 'value': 'Grupo Televisa, owner of TelevisaUnivision'}",
  1: "{'xml:lang': 'en', 'type': 'literal', 'value': 'Cuponzote'}",
  2: "{'xml:lang': 'en', 'type': 'literal', 'value': 'Casas GEO'}",
  3: "{'xml:lang': 'en', 'type': 'literal', 'value': 'Empresas ICA'}",
  4: "{'xml:lang': 'en', 'type': 'literal', 'value': 'Atletica'}"}}

Solution

If your org column contains a real dict, use:

data[data['org'].str['value'].str.contains('www.wikidata.org')]
#               ^^^^^^^^^^^^^

If you want to extract the link:

data['links'] = data['org'].str['value']
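
To illustrate why this works, here is a minimal sketch on a made-up two-row frame (the sample rows are copied from the question; the frame itself is only an assumption for demonstration). When the elements are real dict objects, .str['value'] looks up the key in each one:

```python
import pandas as pd

# Hypothetical frame whose 'org' column holds real dict objects, not strings
data = pd.DataFrame({
    "org": [
        {"type": "uri", "value": "http://www.wikidata.org/entity/Q47099"},
        {"type": "uri", "value": "http://www.wikidata.org/entity/Q565020"},
    ]
})

# .str[...] indexes into each element; for a dict that means a key lookup
data["links"] = data["org"].str["value"]
print(data["links"].tolist())
```

If the same line is run on a column of strings instead, there is no key to look up and the result is NaN for every row.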

Update

Your column looks like a dict, but each value is actually a string. You have to parse it first with ast.literal_eval:

import ast

data['org'] = data['org'].apply(ast.literal_eval)
data['links'] = data['org'].str['value']
print(data)

# Output
                                    links                                                org                                           orgLabel
0   http://www.wikidata.org/entity/Q47099  {'type': 'uri', 'value': 'http://www.wikidata....  {'xml:lang': 'en', 'type': 'literal', 'value':...
1  http://www.wikidata.org/entity/Q565020  {'type': 'uri', 'value': 'http://www.wikidata....  {'xml:lang': 'en', 'type': 'literal', 'value':...
2  http://www.wikidata.org/entity/Q576490  {'type': 'uri', 'value': 'http://www.wikidata....  {'xml:lang': 'en', 'type': 'literal', 'value':...
3  http://www.wikidata.org/entity/Q590897  {'type': 'uri', 'value': 'http://www.wikidata....  {'xml:lang': 'en', 'type': 'literal', 'value':...
4  http://www.wikidata.org/entity/Q604034  {'type': 'uri', 'value': 'http://www.wikidata....  {'xml:lang': 'en', 'type': 'literal', 'value':...
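
As an alternative that skips the parsing step, the link can probably be pulled straight out of the string with a regular expression. This is only a sketch: the pattern below is my own assumption and relies on the value never containing a single quote, which holds for the URLs shown but may not hold for a column like orgLabel. In that case ast.literal_eval is the safer route.

```python
import pandas as pd

# Sample rows copied from the question; here 'org' holds plain strings
data = pd.DataFrame({
    "org": [
        "{'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q47099'}",
        "{'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q565020'}",
    ]
})

# Assumed pattern: capture everything between the quotes that follow 'value':
pattern = r"'value':\s*'([^']*)'"

# expand=False returns a Series (one capture group) instead of a DataFrame
data["links"] = data["org"].str.extract(pattern, expand=False)
print(data["links"].tolist())
```

Rows where the pattern does not match come back as missing values, so it is worth checking data['links'].isna().sum() afterwards.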