python data-structures

Efficient data structure for matching items in two lists of data - Python


I have two lists: one is filled with IDs and the other with process names. Multiple process names can share an ID.
I want to be able to create a data structure where I can use a specific ID and get back the list of processes associated with that ID, and also use a specific process name and get back the list of IDs it's connected with.

I know I could create a dictionary for this, but the list of IDs will be very large and will continue to grow, while the process names will be fairly short (most likely fewer than 20 characters) and will not grow as much. I'm not sure how efficient a dictionary will be in the long run.

I want to be able to store the connections between the two lists locally on my computer so I can reload them into my program and add new connections for new IDs.

| IDs      | Process   |
| -------- | --------- |
| ID 1     | Process 1 |
| ID 2     | Process 2 |
| ...      | ...       |

The two lists are something along those lines, but the IDs list is significantly larger.

I tried imagining this as a matching section on an exam: there are two columns and you have to connect the items in one column to items in the other, and multiple items can share the same connection. I implemented this with two dictionaries that establish the relationships between 'column A' and 'column B' (where A is the ID and B is the process name). The structure works well, but I'm not sure it will stay efficient as column A grows rapidly, and I'm not sure how to store the data efficiently on my PC. I'm wondering whether this structure could be improved on, or whether there is another one that would be much better.
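
Roughly, the structure I have in mind looks like this (simplified, with made-up names):

```python
from collections import defaultdict

# One dictionary per direction of the many-to-many relationship.
id_to_procs = defaultdict(set)   # ID -> set of process names
proc_to_ids = defaultdict(set)   # process name -> set of IDs

def add_connection(id_, proc):
    """Record that an ID and a process name are connected."""
    id_to_procs[id_].add(proc)
    proc_to_ids[proc].add(id_)

add_connection('ID 1', 'Process 1')
add_connection('ID 1', 'Process 2')
add_connection('ID 2', 'Process 1')

print(id_to_procs['ID 1'])       # {'Process 1', 'Process 2'}
print(proc_to_ids['Process 1'])  # {'ID 1', 'ID 2'}
```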

EDIT: Sorry for the delayed response. I tested out my structure and used dictionaries and sets to make it work. The number of connections to the db will most likely be around 100 times a day and will not increase for the time being. I am using dictionaries to associate the IDs with processes and vice versa, and storing them separately as JSON files. I also have another JSON file tracking the IDs I "processed" (the IDs I matched with a process), so I can check the processed IDs and avoid processing them again. It seems to be working well now and looks like it will work for the long term. Thanks y'all for the suggestions! It helped me wrap my head around this problem.
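
For the storage part, the saving and loading is roughly like this (the file name and helper names are just placeholders); the only wrinkle is that JSON has no set type, so the sets get converted to lists on the way out and back to sets on the way in:

```python
import json

# Example mapping: ID -> set of process names (the reverse mapping is handled the same way).
id_to_procs = {'ID 1': {'Process 1', 'Process 2'}, 'ID 2': {'Process 1'}}

def save_mapping(mapping, path):
    """Dump a dict-of-sets to JSON, converting each set of values to a sorted list."""
    with open(path, 'w') as f:
        json.dump({key: sorted(values) for key, values in mapping.items()}, f)

def load_mapping(path):
    """Rebuild the dict-of-sets; start empty if the file doesn't exist yet."""
    try:
        with open(path) as f:
            return {key: set(values) for key, values in json.load(f).items()}
    except FileNotFoundError:
        return {}

save_mapping(id_to_procs, 'id_to_procs.json')
print(load_mapping('id_to_procs.json'))  # {'ID 1': {'Process 1', 'Process 2'}, 'ID 2': {'Process 1'}}
```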


Solution

  • If the worry is that your dictionary will repeat the same process names over and over, wasting space and processing time, you could use list indices in the dictionaries instead of the string values themselves. So, instead of adding to the dictionary (let's assume it's called ids2procs) with ids2procs['ID 1'] = ['Process 1', 'Process 2'], you would use ids2procs[0] = [0, 1], where 'ID 1' is the zeroth entry in the list of IDs, and 'Process 1' and 'Process 2' are the zeroth and first entries in the list of processes. You might also need reverse dictionaries to look up the list indices from the ID and process names (see the sketch below).

    But Python may already be doing something like this under the hood: storing the same string object in several dicts keeps multiple references to one object rather than copies, so this may not actually buy you anything. As @trincot said, you can test it out and see.
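
Here is a rough sketch of what that index-based layout could look like (illustrative only, not a drop-in implementation):

```python
# Master lists: each ID and process name is stored exactly once.
ids = ['ID 1', 'ID 2']
procs = ['Process 1', 'Process 2']

# Reverse dictionaries: name -> its index in the list above.
id_index = {name: i for i, name in enumerate(ids)}
proc_index = {name: i for i, name in enumerate(procs)}

# Connections expressed only with integer indices.
ids2procs = {0: [0, 1],  # 'ID 1' -> 'Process 1', 'Process 2'
             1: [0]}     # 'ID 2' -> 'Process 1'
procs2ids = {0: [0, 1],  # 'Process 1' -> 'ID 1', 'ID 2'
             1: [0]}     # 'Process 2' -> 'ID 1'

def processes_for(id_name):
    """Translate an ID back into its connected process names."""
    return [procs[j] for j in ids2procs[id_index[id_name]]]

def ids_for(proc_name):
    """Translate a process name back into its connected IDs."""
    return [ids[i] for i in procs2ids[proc_index[proc_name]]]

print(processes_for('ID 1'))  # ['Process 1', 'Process 2']
print(ids_for('Process 1'))   # ['ID 1', 'ID 2']
```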