elasticsearchcurlelastic-stackelasticsearch-6

ElasticSearch: How to query by multiple conditions in different locations?


I've been trying to build this ElasticSearch Query on the Danish CVR database API so far without success. Basically I'm trying to find companies where

  1. The company has a relationship with "deltager" (participant) with "enhedsNummer" (ID) equal NUMBER
  2. The relationship is still active, i.e. the "end of period" field is null

How do I construct a query that has multiple conditions like this?

    'query': {
        'bool': {
            'must': [
                {
                    'term': {'Vrvirksomhed.deltagerRelation.deltager.enhedsNummer': NUMBER},
        AND            
                    'term': {'Vrvirksomhed.deltagerRelation.organisationer.attributter.vaerdier.periode.gyldigTil': null}
                },
            ],
        },
    },
}

FYI: database mapping may be found at http://distribution.virk.dk/cvr-permanent/_mapping


Solution

  • It doesn't appear like ElasticSearch Queries are as dynamic as I had wanted (or I don't know how use them). Instead, it appears that the Python code below is the best choice for generating the desired outcome:

    import requests
    import pandas as pd
    
    # creation of empty lists:
    virksomhedsnavne = []
    virksomhedscvr = []
    relation_fra = []
    relation_til = []
    
    # Pulling data (apparently limited to 3000 elements at a time):
    for i in range(20):
    
        if i == 0:
            highestcvrnummer = 0
        else:
            highestcvrnummer = max(virksomhedscvr)
        
        headers = {
            'Content-Type': 'application/json',
        }
    
        json_data = {
    
            "_source": ["Vrvirksomhed.cvrNummer", "Vrvirksomhed.navne", "Vrvirksomhed.virksomhedMetadata.nyesteNavn.navn", "Vrvirksomhed.deltagerRelation"],
    
            "sort" : [{"Vrvirksomhed.cvrNummer" : {"order":"asc"}}],
          
            "query": {
                "bool": {
                    "must": [
                        {
                            "term": {
                                "Vrvirksomhed.deltagerRelation.deltager.enhedsNummer": "some_value"
                            }
                        },
                        {
                            "range":{
                                "Vrvirksomhed.cvrNummer": {
                                    "gt": highestcvrnummer
                                }
                            
                            }
                        }
                    ]
                }
            },
            'size': 3000
        }
    
        response = requests.post('http://distribution.virk.dk/cvr-permanent/virksomhed/_search', headers=headers, json=json_data, auth=('USERNAME', 'PASSWORD'))
        json_data = response.json()['hits']['hits']
    
        # Aggregate and format data neatly
        for data in json_data:
            virksomhed_data = data['_source']['Vrvirksomhed']
            virksomhedscvr.append(virksomhed_data['cvrNummer'])
            try:
                virksomhedsnavne.append(virksomhed_data['virksomhedMetadata']['nyesteNavn']['navn'])
            except:
                virksomhedsnavne.append(virksomhed_data['navne'][0]['navn'])
            
            # Loop through all "deltagere" and find match with value
            for relation in virksomhed_data['deltagerRelation']:
    
                # If match found
                if relation['deltager']['enhedsNummer'] == some_value:
                    
                    # Make sure most recent period is chosen
                    antalopdateringer = len(relation['organisationer'])-1
                    relation_gyldig = relation['organisationer'][antalopdateringer]['medlemsData'][0]['attributter'][0]['vaerdier'][0]['periode']
                    relation_fra.append(relation_gyldig['gyldigFra'])
                    relation_til.append(relation_gyldig['gyldigTil'])
                    break
    
    #export to excel
    dict = {'CVR nummer':virksomhedscvr, 'navn':virksomhedsnavne, 'Relation fra':relation_fra, 'Relation til':relation_til}
    df = pd.DataFrame(dict)
    df.to_excel("output.xlsx")
    

    If anyone else is working with the Danish CVR register's API, I hope this helps!

    Also, if you find a better solution, please let me know :)