phpalgorithmcsvlongest-prefix

Fastest way to find longest match on a 50K line CSV file in PHP


In a voice application I have to find the longest match of a prefix on an international phone number. I have a rates table that is 50K lines that is stored in CSV file that gets updated with new rates periodically (CSV column headers contain prefix, country rate etc). The application uses a REST API to show users the cost of calling a destination based on the phone they enter. Can't use a simple KVS as there are multiple matches and need the longest prefix match. The API gets hit ALOT so using a DB directly is too slow/heavy (using APC here but didn't seem to make that much difference). The best algorithm I can come up with is shown below but it still takes almost 1s to complete on decent machine. Any PHP algorithm gurus have a better method?

    function getRate($phoneNumber) { 

        if (!apc_fetch('ALL_RATES')){

            $all_rates = array_map('str_getcsv', file('/var/billing/rates.csv'));
            apc_store('ALL_RATES', $all_rates);

        } else {

            $all_rates = apc_fetch('ALL_RATES');
        } 

        $noOfCountries = sizeof($all_rates);    
        $bestMatch = 0;


        for ($n=1;$n<$noOfCountries;$n++) {

            $country = $all_rates[$n];
            $country_prefix = $country[0];

            $match = stripos($phoneNumber, $country_prefix);

            if ($match===0){

                if (strlen($country_prefix) > $bestMatch) {

                    $bestMatch = strlen($country_prefix);
                    $matchedCountry = $n;

                }

            }

        }

        $prefix = $all_rates[$matchedCountry][0];
        $country = $all_rates[$matchedCountry][1];
        $rate = $all_rates[$matchedCountry][2];

        return array($country,$prefix,$rate);

    }
}

Solution

  • Well, you might shove off 200-300ms seconds if you write your own stripos, as you only need to do prefix matching, instead of trying to match prefix on any position.

    Though, this is what I recommend:

    1) Ditch the CSV format and start using decent relational database, MySQL is good. Ps the statement "db is too slow/heavy" made no sense. If you set everything up correctly, matching the prefix through database will take 0 seconds(yes, you read it right, few milliseconds). SQL supports full-text scanning with prefixes. Store the length of each phone number, and index that too.

    2) Start caching the requests.

    As for your CSV solution, you can get nice performance boost if you have your phone numbers stored as prefixTree.csv, after that, you can get all the phone numbers fast that start with specific prefix. Ps, don't load csv file every time to memory when you get request. That's super slow! Cache it as static(does PHP have statics?)

    More information: http://phpir.com/tries-and-wildcards/