python web-scraping beautifulsoup find

Python - Scraping text within p → font


I am trying to scrape the information contained in this page: https://web.archive.org/web/20190718200413/https://public.era.nih.gov/pubroster/jsp/preRosIndex.jsp?CID=102353&AGENDA=365050

Basically, I want to build a table with one column each for Name, Profession, etc. (I know I will have to handle the fact that some individuals have more "lines" than others). So far, I am doing:

sep = soup.find_all("p")[1:]

and then I was thinking about something like this (not very elegant, but probably could do the job):

for bullet in sep:
    if len(bullet.find_all("br")) == 9:
        person = {}
        person["NAME"] = bullet.contents[0].strip()
        person["PROFESSION"] = bullet.contents[2].strip()
        person["DEPARTMENT"] = bullet.contents[6].strip() + " " + bullet.contents[8].strip()
        person["INSTITUTION"] = bullet.contents[12].strip()
        person["LOCATION"] = bullet.contents[14].strip()

(I would have to adjust the indices and add as many cases as needed for len(), but that is the idea.) However, when I test this code, bullet.contents[0].strip() only returns an empty string (for instance, sep[1].contents[0].strip() gives "").

Any idea where this comes from and how I could fix it?

Thanks!


Solution

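  • .contents lists every child node, including whitespace-only text nodes such as the newline before a tag or after a <br/>. Here the first child of each <p> appears to be such a whitespace node (the text itself sits inside a nested <font>), which is why contents[0].strip() returns an empty string.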
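
    To see this on a small scale, here is a minimal sketch. The HTML snippet is made up (an assumption that imitates the p → font nesting, not markup copied from the page), and it uses html.parser rather than lxml so it runs without extra dependencies:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the <p> -> <font> nesting; not taken from the real page.
html = """<p>
<font><font>DOE,&nbsp;JANE, PHD</font><br/>
PROFESSOR<br/>
DEPARTMENT OF EXAMPLES<br/>
</font>
</p>"""

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# The first child of <p> is the newline that precedes <font>,
# so stripping it leaves an empty string.
first_child = p.contents[0]
print(repr(first_child))
print(repr(first_child.strip()))

# stripped_strings walks all nested text nodes, strips each one,
# and skips those that are whitespace only.
lines = list(p.stripped_strings)
print(lines)
```

    If the real page is structured the same way, iterating over stripped_strings may be simpler than indexing into .contents, since the positions no longer depend on where the whitespace nodes fall.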

    Here is one way of getting that data - you will need to fiddle with locators though, to properly get location, profession etc:

    import requests
    from bs4 import BeautifulSoup as bs
    import pandas as pd
    import time as t
    
    url = 'https://web.archive.org/web/20190718200413/https://public.era.nih.gov/pubroster/jsp/preRosIndex.jsp?CID=102353&AGENDA=365050'
    
    headers = {
        'accept-language': 'en-US,en;q=0.9',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
    }
    
    s = requests.Session()
    s.headers.update(headers)
    
    s.get('https://web.archive.org/')
    t.sleep(1)
    r = s.get(url)
    soup = bs(r.text, 'lxml')
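    # All <p> elements that follow the first <font size="3"> element.
    people = soup.select_one('font[size="3"]').find_all_next('p')
    for p in people:
        # The outer <font> wraps the whole entry; the nested <font> is read as the name.
        elem_holding_the_data = p.select_one('font')
        person = {}
        person["NAME"] = elem_holding_the_data.select_one('font').get_text(strip=True, separator=' ')
        # Keep child nodes longer than 5 that contain no '*';
        # this drops <br/> tags and short or whitespace-only strings.
        extra_data = [x.strip() for x in elem_holding_the_data.contents if len(x) > 5 and '*' not in x]
        person["PROFESSION"] = extra_data[0]
        person["DEPARTMENT"] = extra_data[1]
        person["INSTITUTION"] = extra_data[2]
        # Entries with fewer lines have no fourth item.
        try:
            person["LOCATION"] = extra_data[3]
        except IndexError:
            person["LOCATION"] = None
        print(person)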
    

    Result in terminal:

    {'NAME': 'BOTTINI,\xa0NUNZIO, MD, PHD', 'PROFESSION': 'PROFESSOR OF MEDICINE', 'DEPARTMENT': 'DIVISION OF RHEUMATOLOGY', 'INSTITUTION': 'DEPARTMENT OF MEDICINE', 'LOCATION': 'UNIVERSITY OF CALIFORNIA, SAN DIEGO'}
    {'NAME': 'ATAMAS,\xa0SERGEI\xa0P, MD, PHD', 'PROFESSION': 'EXECUTIVE DIRECTOR, RESEARCH', 'DEPARTMENT': 'CORBUS PHARMACEUTICALS, INC.', 'INSTITUTION': 'NORWOOD,\xa0\n\n\n    MA,\xa0\n\n\n    02062', 'LOCATION': None}
    {'NAME': 'BAIRD,\xa0ANDREW, PHD', 'PROFESSION': 'PROFESSOR/VICE CHAIR', 'DEPARTMENT': 'DEPARTMENT OF SURGERY', 'INSTITUTION': 'SCHOOL OF MEDICINE', 'LOCATION': 'UNIVERSITY OF CALIFORNIA AT SAN DIEGO'}
    {'NAME': 'BRINCKERHOFF,\xa0CONSTANCE\xa0E, PHD', 'PROFESSION': 'PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF MEDICINE AND BIOCHEMISTRY', 'INSTITUTION': 'NORRIS COTTON CANCER CENTER', 'LOCATION': 'GEISEL SCHOOL OF MEDICINE AT DARTMOUTH'}
    {'NAME': 'CAMPBELL,\xa0DANIEL\xa0J, PHD', 'PROFESSION': 'MEMBER', 'DEPARTMENT': 'BENAROYA RESEARCH INSTITUTE AT VIRGINIA MASON', 'INSTITUTION': 'SEATTLE,\xa0\n\n\n    WA,\xa0\n\n\n    98101', 'LOCATION': None}
    {'NAME': 'CHUONG,\xa0CHENG-MING, MD, PHD', 'PROFESSION': 'PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF PATHOLOGY', 'INSTITUTION': 'KECK SCHOOL OF MEDICINE', 'LOCATION': 'UNIVERSITY OF SOUTHERN CALIFORNIA'}
    {'NAME': 'CLARK,\xa0RACHAEL\xa0ANN, MD, PHD', 'PROFESSION': 'ASSOCIATE PROFESSOR', 'DEPARTMENT': 'BRIGHAM AND WOMENS HOSPITAL AND', 'INSTITUTION': 'HARVARD MEDICAL SCHOOL', 'LOCATION': 'BOSTON,\xa0\n\n\n    MA,\xa0\n\n\n    02115'}
    {'NAME': 'COHEN,\xa0PHILIP\xa0L, MD', 'PROFESSION': 'PROFESSOR EMERITUS', 'DEPARTMENT': 'DEPARTMENT OF MICROBIOLOGY AND IMMUNOLOGY', 'INSTITUTION': 'LEWIS KATZ SCHOOL OF MEDICINE', 'LOCATION': 'TEMPLE UNIVERSITY'}
    {'NAME': 'CRAFT,\xa0JOSEPH\xa0EDGAR, MD', 'PROFESSION': 'PROFESSOR', 'DEPARTMENT': 'DEPARTMENTS OF MEDICINE AND IMMUNOBIOLOGY', 'INSTITUTION': 'SCHOOL OF MEDICINE', 'LOCATION': 'YALE UNIVERSITY'}
    {'NAME': 'CUI,\xa0RUTAO, MD', 'PROFESSION': 'PROFESSOR', 'DEPARTMENT': 'VICE CHAIR OF LABORATORY ADMINISTRATION', 'INSTITUTION': 'DIRECTOR, LABORATORY OF MELANOMA BIOLOGY', 'LOCATION': 'DEPT OF PHARMACOLOGY AND EXPERIMENTAL THERAPEUTICS'}
    {'NAME': "D'ORAZIO,\xa0JOHN\xa0A, MD, PHD", 'PROFESSION': 'PROFESSOR', 'DEPARTMENT': 'DIVISION OF HEMATOLOGY AND ONCOLOGY', 'INSTITUTION': 'DEPARTMENT OF PEDIATRICS', 'LOCATION': 'COLLEGE OF MEDICINE'}
    {'NAME': 'DEMIRCI,\xa0F YESIM, MD', 'PROFESSION': 'ASSOCIATE PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF HUMAN GENETICS', 'INSTITUTION': 'UNIVERSITY OF PITTSBURGH', 'LOCATION': 'PITTSBURGH,\xa0\n\n\n    PA,\xa0\n\n\n    15260'}
    {'NAME': 'ECHEVERRI,\xa0KAREN, PHD', 'PROFESSION': 'ASSISTANT PROFESSOR', 'DEPARTMENT': 'EUGENE BELL CENTER FOR REGENERATIVE BIOLOGY', 'INSTITUTION': 'AND TISSUE ENGINEERING', 'LOCATION': 'MARINE BIOLOGICAL LABORATORY'}
    {'NAME': 'EISENBERG,\xa0ROBERT\xa0A, MD', 'PROFESSION': 'EMERITUS PROFESSOR OF MEDICINE', 'DEPARTMENT': 'DIVISION OF RHEUMATOLOGY', 'INSTITUTION': 'UNIVERSITY OF PENNSYLVANIA', 'LOCATION': 'PHILADELPHIA,\xa0\n\n\n    PA,\xa0\n\n\n    19104'}
    {'NAME': 'EZHKOVA,\xa0ELENA, PHD', 'PROFESSION': 'ASSOCIATE PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF CELL, DEVELOPMENTAL,', 'INSTITUTION': 'AND REGENERATIVE BIOLOGY', 'LOCATION': 'ICAHN SCHOOL OF MEDICINE AT'}
    {'NAME': 'GALLAGHER,\xa0KATHERINE\xa0ANN, MD', 'PROFESSION': 'ASSOCIATE PROFESSOR', 'DEPARTMENT': 'DEPARTMENTS OF SURGERY AND MICROBIOLOGY', 'INSTITUTION': 'AND IMMUNOLOGY', 'LOCATION': 'UNIVERSITY OF MICHIGAN'}
    {'NAME': 'GOLEVA,\xa0ELENA, PHD', 'PROFESSION': 'ASSOCIATE PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF PEDIATRICS', 'INSTITUTION': 'NATIONAL JEWISH HEALTH', 'LOCATION': 'DENVER,\xa0\n\n\n    CO,\xa0\n\n\n    80220'}
    {'NAME': 'HE,\xa0YU-YING, PHD', 'PROFESSION': 'ASSOCIATE PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF MEDICINE', 'INSTITUTION': 'SECTION OF DERMATOLOGY', 'LOCATION': 'CANCER RESEARCH CENTER'}
    {'NAME': 'HORSLEY,\xa0VALERIE, PHD', 'PROFESSION': 'ASSOCIATE PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF MOLECULAR, CELLULAR', 'INSTITUTION': 'AND DEVELOPMENTAL BIOLOGY', 'LOCATION': 'YALE UNIVERSITY'}
    {'NAME': 'JAMESON,\xa0JULIE\xa0M, PHD', 'PROFESSION': 'ASSOCIATE PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF BIOLOGY', 'INSTITUTION': 'CALIFORNIA STATE UNIVERSITY SAN MARCOS', 'LOCATION': 'SAN MARCOS,\xa0\n\n\n    CA,\xa0\n\n\n    92096'}
    {'NAME': 'JONES,\xa0LAMONT, MD, MBA', 'PROFESSION': 'VICE CHAIR AND OTOLARYNGOLOGY SERVICE CHEF', 'DEPARTMENT': 'DEPARTMENT OF OTOLARYNGOLOGY HNS', 'INSTITUTION': 'HENRY FORD HOSPITAL', 'LOCATION': 'DETROIT,\xa0\n\n\n    MI,\xa0\n\n\n    48202'}
    {'NAME': 'KESWANI,\xa0SUNDEEP\xa0G, MD', 'PROFESSION': 'ASSOCIATE PROFESSOR', 'DEPARTMENT': 'DIVISION OF PEDIATRIC, THORACIC AND FETAL SURGERY', 'INSTITUTION': 'TEXAS CHILDREN?S HOSPITAL', 'LOCATION': 'BAYLOR COLLEGE OF MEDICINE'}
    {'NAME': 'LECHLER,\xa0TERRY\xa0H, PHD', 'PROFESSION': 'ASSOCIATE PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF DERMATOLOGY AND CELL BIOLOGY', 'INSTITUTION': 'DUKE UNIVERSITY MEDICAL CENTER', 'LOCATION': 'DURHAM,\xa0\n\n\n    NC,\xa0\n\n\n    27710'}
    {'NAME': 'LIAO,\xa0WILSON, MD', 'PROFESSION': 'ASSOCIATE PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF DERMATOLOGY', 'INSTITUTION': 'UNIVERSITY OF CALIFORNIA, SAN FRANCISCO', 'LOCATION': 'SAN FRANCISCO,\xa0\n\n\n    CA,\xa0\n\n\n    94143'}
    {'NAME': 'LIU,\xa0PENG, MD, PHD', 'PROFESSION': 'ASSOCIATE PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF MEDICINE', 'INSTITUTION': 'THURSTON ARTHRITIS RESEARCH CENTER', 'LOCATION': 'UNIVERISTY OF NORTH CAROLINA AT CHAPEL HILL'}
    {'NAME': 'MARSHAK-ROTHSTEIN,\xa0ANN, PHD', 'PROFESSION': 'PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF MEDICINE / RHEUMATOLOGY', 'INSTITUTION': 'UNIVERSITY OF MASSACHUSETTS MEDICAL SCHOOL', 'LOCATION': 'WORCESTER,\xa0\n\n\n    MA,\xa0\n\n\n    01605'}
    {'NAME': 'MCCORMICK,\xa0THOMAS\xa0S, PHD', 'PROFESSION': 'ASSOCIATE PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF DERMATOLOGY', 'INSTITUTION': 'CASE WESTERN RESERVE UNIVERSITY', 'LOCATION': 'CLEVELAND,\xa0\n\n\n    OH,\xa0\n\n\n    44106'}
    {'NAME': 'MORGAN,\xa0BRUCE\xa0A, PHD', 'PROFESSION': 'ASSOCIATE PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF DERMATOLOGY', 'INSTITUTION': 'CUTANEOUS BIOLOGY RESEARCH CENTER', 'LOCATION': 'MASSACHUSETTS GENERAL HOSPITAL'}
    {'NAME': 'NARENDRAN,\xa0RAJESH, MBBS, MD', 'PROFESSION': 'ASSOCIATE  PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF RADIOLOGY', 'INSTITUTION': 'SCHOOL OF MEDICINE', 'LOCATION': 'UNIVERSITY OF PITTSBURGH'}
    {'NAME': 'NARMONEVA,\xa0DARIA\xa0A, PHD', 'PROFESSION': 'ASSOCIATE PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF BIOMEDICAL ENGINEERING', 'INSTITUTION': 'COLLEGE OF ENGINEERING & APPLIED SCIENCE', 'LOCATION': 'UNIVERSITY OF CINCINNATI'}
    {'NAME': 'NATH,\xa0SWAPAN\xa0K, PHD', 'PROFESSION': 'ADJUNCT PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF ARTHRITIS/IMMUNOLOGY', 'INSTITUTION': 'OKLAHOMA MEDICAL RESEARCH FOUNDATION', 'LOCATION': 'OKLAHOMA CITY,\xa0\n\n\n    OK,\xa0\n\n\n    73104'}
    {'NAME': 'NIEWOLD,\xa0TIMOTHY\xa0B, MD', 'PROFESSION': 'JUDITH AND STEWART COLTON PROFESSOR OF MEDICINE AND PATHOLOGY', 'DEPARTMENT': 'DIRECTOR, COLTON CENTER FOR AUTOIMMUNITY', 'INSTITUTION': 'DEPARTMENT OF MEDICINE', 'LOCATION': 'NEW YORK UNIVERSITY'}
    {'NAME': 'OH,\xa0JULIA\xa0S, PHD', 'PROFESSION': 'ASSISTANT  PROFESSOR', 'DEPARTMENT': 'THE JACKSON LABORATORY FOR GENOMIC MEDICINE', 'INSTITUTION': 'FARMINGTON,\xa0\n\n\n    CT,\xa0\n\n\n    06032', 'LOCATION': None}
    {'NAME': 'ORMSETH,\xa0MICHELLE\xa0JANE, MD', 'PROFESSION': 'ASSISTANT PROFESSOR', 'DEPARTMENT': 'DIVISION OF RHEUMATOLOGY AND IMMUNOLOGY', 'INSTITUTION': 'DEPARTMENT OF MEDICINE', 'LOCATION': 'VANDERBILT UNIVERSITY MEDICAL CENTER'}
    {'NAME': 'PERL,\xa0ANDRAS, MD, PHD', 'PROFESSION': 'PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF MEDICINE', 'INSTITUTION': 'STATE UNIVERSITY OF NEW YORK', 'LOCATION': 'SYRACUSE,\xa0\n\n\n    NY,\xa0\n\n\n    13210'}
    {'NAME': 'POPE,\xa0RICHARD\xa0M, MD', 'PROFESSION': 'SOLOVY/ARTHRITIS RESEARCH SOCIETY PROFESSOR', 'DEPARTMENT': 'DIVISION OF RHEUMATOLOGY', 'INSTITUTION': 'DEPARTMENT OF MEDICINE', 'LOCATION': 'FEINBERG SCHOOL OF MEDICINE'}
    {'NAME': 'QUINN,\xa0KYLE\xa0PATRICK, PHD', 'PROFESSION': 'ASSISTANT PROFESSOR', 'DEPARTMENT': 'COLLEGE OF ENGINEERING', 'INSTITUTION': 'DEPARTMENT OF BIOMEDICAL ENGINEERING', 'LOCATION': 'UNIVERSITY OF ARKANSAS'}
    {'NAME': 'SIMPSON,\xa0DAVID\xa0G, PHD', 'PROFESSION': 'ASSOCIATE PROFESSOR', 'DEPARTMENT': 'ANATOMY AND NEUROBIOLOGY DEPARTMENT', 'INSTITUTION': 'VIRGINIA COMMONWEALTH UNIVERSITY', 'LOCATION': 'RICHMOND,\xa0\n\n\n    VA,\xa0\n\n\n    23298'}
    {'NAME': 'STRONG,\xa0CRISTINA\xa0DE GUZMAN, PHD', 'PROFESSION': 'ASSISTANT PROFESSOR', 'DEPARTMENT': 'DIVISION OF DERMATOLOGY', 'INSTITUTION': 'DEPARTMENT OF INTERNAL MEDICINE', 'LOCATION': 'CENTER OF THE STUDY OF ITCH'}
    {'NAME': 'TOMIC-CANIC,\xa0MARJANA, PHD', 'PROFESSION': 'VICE CHAIR OF RESEARCH', 'DEPARTMENT': 'DEPARTMENT OF DERMATOLOGY AND CUTANEOUS SURGERY', 'INSTITUTION': 'DIRECTOR, WOUND HEALING AND REGENERATIVE', 'LOCATION': 'MEDICINE RESEARCH PROGRAM'}
    {'NAME': 'TUMBAR,\xa0TUDORITA, PHD', 'PROFESSION': 'PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF MOLECULAR BIOLOGY', 'INSTITUTION': 'AND GENETICS', 'LOCATION': 'CORNELL UNIVERSITY'}
    {'NAME': 'WILGUS,\xa0TRACI\xa0A, PHD', 'PROFESSION': 'ASSOCIATE PROFESSOR', 'DEPARTMENT': 'DEPARTMENT OF PATHOLOGY', 'INSTITUTION': 'THE OHIO STATE UNIVERSITY', 'LOCATION': 'COLUMBUS,\xa0\n\n\n    OH,\xa0\n\n\n    43210'}
    {'NAME': 'GERSCH,\xa0ROBERT, PHD', 'PROFESSION': 'SCIENTIFIC REVIEW OFFICER', 'DEPARTMENT': 'CENTER FOR SCIENTIFIC REVIEW', 'INSTITUTION': 'NATIONAL INSTITUTES OF HEALTH', 'LOCATION': 'BETHESDA,\xa0\n\n\n    MD,\xa0\n\n\n    20817'}
    {'NAME': 'CARTER,\xa0LATONYA\xa0A', 'PROFESSION': 'EXTRAMURAL SUPPORT ASSISTANT', 'DEPARTMENT': 'CENTER FOR SCIENTIFIC REVIEW', 'INSTITUTION': 'NATIONAL INSTITUTES OF HEALTH', 'LOCATION': 'BETHESDA,\xa0\n\n\n    MD,\xa0\n\n\n    20892'}
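
    The output above still contains non-breaking spaces and embedded newlines, and the city/state/zip line lands under different keys depending on how many lines an entry has. One possible way to tidy this up is sketched below. The helper names (clean, to_record), the AFFILIATION key, and the rule that the last line is the location when it looks like "CITY, ST, 12345" are all assumptions made for illustration, not something the page guarantees:

```python
import re

# Assumed shape of a location line after cleaning, e.g. "NORWOOD, MA, 02062".
LOCATION_RE = re.compile(r"^.+, [A-Z]{2}, \d{5}$")


def clean(text):
    """Collapse whitespace runs (including non-breaking spaces) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def to_record(name, lines):
    """Map a name and a variable number of text lines onto fixed keys."""
    lines = [clean(x) for x in lines if clean(x)]
    record = {"NAME": clean(name), "PROFESSION": None,
              "AFFILIATION": None, "LOCATION": None}
    # Treat the last line as the location only if it matches the assumed pattern.
    if lines and LOCATION_RE.match(lines[-1]):
        record["LOCATION"] = lines.pop()
    # The first remaining line is taken as the profession.
    if lines:
        record["PROFESSION"] = lines.pop(0)
    # Whatever is left (department, institution, ...) is joined into one field.
    if lines:
        record["AFFILIATION"] = ", ".join(lines)
    return record


# Values copied from the second entry of the output above.
example = to_record(
    "ATAMAS,\xa0SERGEI\xa0P, MD, PHD",
    ["EXECUTIVE DIRECTOR, RESEARCH",
     "CORBUS PHARMACEUTICALS, INC.",
     "NORWOOD,\xa0\n\n\n    MA,\xa0\n\n\n    02062"],
)
print(example)
```

    A list of such dictionaries can then be passed to pandas.DataFrame(...) to get one column per key.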
    

    Lastly, see the BeautifulSoup documentation.