beautifulsoup xml-parsing xmltocsv

How to extract XML tags with BeautifulSoup?


I am trying to extract the tags from this data:

[{"title":"Joshua Cohen","nid":"21706","type":"winner","changed":"1651960857","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"640"}]},"field_abbr_citation":{"und":[{"safe_value":"A mordant, linguistically deft historical novel about the ambiguities of the Jewish-American experience, presenting ideas and disputes as volatile as its tightly-wound plot."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Netanyahus: An Account of a Minor and Ultimately Even Negligible Episode in the History of a Very Famous Family"}]},"field_publisher":{"und":[{"safe_value":"New York Review Books"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/joshua-cohen"},{"title":"Louise Erdrich","nid":"21286","type":"winner","changed":"1623362816","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"632"}]},"field_abbr_citation":{"und":[{"safe_value":"A majestic, polyphonic novel about a community\u2019s efforts to halt the proposed displacement and elimination of several Native American tribes in the 1950s, rendered with dexterity and imagination."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Night Watchman"}]},"field_publisher":{"und":[{"safe_value":"Harper"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/louise-erdrich"},

But I cannot seem to get the tags; I am trying:

# Import BeautifulSoup
from bs4 import BeautifulSoup as bs
content = []
# Read the XML file
with open("file.xml", "r") as file:
    # Read each line in the file
    content = file.readlines()
    # Combine the lines in the list into a string
    content = "".join(content)
    bs_content = bs(content, "lxml")

result = bs_content.find_all("title")
print(result)

But I only get an empty list, []. I would appreciate any help!


Solution

  • This is not XML; it is a JSON-like structure (a list of dicts), so BeautifulSoup finds no tags in it. Simply iterate over the list of dicts:

    l = [{"title":"Joshua Cohen","nid":"21706","type":"winner","changed":"1651960857","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"640"}]},"field_abbr_citation":{"und":[{"safe_value":"A mordant, linguistically deft historical novel about the ambiguities of the Jewish-American experience, presenting ideas and disputes as volatile as its tightly-wound plot."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Netanyahus: An Account of a Minor and Ultimately Even Negligible Episode in the History of a Very Famous Family"}]},"field_publisher":{"und":[{"safe_value":"New York Review Books"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/joshua-cohen"},{"title":"Louise Erdrich","nid":"21286","type":"winner","changed":"1623362816","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"632"}]},"field_abbr_citation":{"und":[{"safe_value":"A majestic, polyphonic novel about a community\u2019s efforts to halt the proposed displacement and elimination of several Native American tribes in the 1950s, rendered with dexterity and imagination."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Night Watchman"}]},"field_publisher":{"und":[{"safe_value":"Harper"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/louise-erdrich"},]
    
    for d in l:
        print(d['title'])
    

    Or, if you have the data as a string, convert it first with json.loads():

    import json
    
    l = '[{"title":"Joshua Cohen","nid":"21706","type":"winner","changed":"1651960857","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"640"}]},"field_abbr_citation":{"und":[{"safe_value":"A mordant, linguistically deft historical novel about the ambiguities of the Jewish-American experience, presenting ideas and disputes as volatile as its tightly-wound plot."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Netanyahus: An Account of a Minor and Ultimately Even Negligible Episode in the History of a Very Famous Family"}]},"field_publisher":{"und":[{"safe_value":"New York Review Books"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/joshua-cohen"},{"title":"Louise Erdrich","nid":"21286","type":"winner","changed":"1623362816","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"632"}]},"field_abbr_citation":{"und":[{"safe_value":"A majestic, polyphonic novel about a community\u2019s efforts to halt the proposed displacement and elimination of several Native American tribes in the 1950s, rendered with dexterity and imagination."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Night Watchman"}]},"field_publisher":{"und":[{"safe_value":"Harper"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/louise-erdrich"}]'
    
    for d in json.loads(l):
        print(d['title'])
    

    Output:

    Joshua Cohen
    Louise Erdrich
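
If the values you actually need are the nested ones (publication, publisher), the same approach extends to them. What follows is a minimal sketch rather than a definitive implementation: the sample is abbreviated from the data in the question, and safe_value is a hypothetical helper name introduced here. It assumes that empty fields appear as [] instead of a dict, the way field_location_text does in the question's data.

```python
import json

# Abbreviated sample with the same shape as the data in the question.
raw = (
    '[{"title": "Joshua Cohen", '
    '"field_publication": {"und": [{"safe_value": "The Netanyahus"}]}, '
    '"field_publisher": {"und": [{"safe_value": "New York Review Books"}]}, '
    '"field_location_text": []}, '
    '{"title": "Louise Erdrich", '
    '"field_publication": {"und": [{"safe_value": "The Night Watchman"}]}, '
    '"field_publisher": {"und": [{"safe_value": "Harper"}]}, '
    '"field_location_text": []}]'
)


def safe_value(record, field):
    """Return record[field]["und"][0]["safe_value"], or None if it is absent."""
    value = record.get(field)
    # Assumption: an empty field is stored as [] rather than as a dict.
    if not isinstance(value, dict):
        return None
    entries = value.get("und") or [{}]
    return entries[0].get("safe_value")


# When the data is in a file, json.load(file_object) parses it directly,
# provided the file contains complete, valid JSON.
records = json.loads(raw)

rows = [
    (r["title"], safe_value(r, "field_publication"), safe_value(r, "field_publisher"))
    for r in records
]

for row in rows:
    print(row)
```

Returning None for missing or empty fields keeps the loop from raising a KeyError or TypeError on records where a field was left blank.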