php arrays optimization

Filter an array if a value is encountered 3 or more times


Hey, so I built a script that filters combo lists. It outputs only the combos whose domain does not repeat more than 2 times, but it's super slow. Here is my script:

<?php
ini_set('max_execution_time', '-1');
ini_set('memory_limit', '-1');
$fileCombo = file("ww.txt", FILE_IGNORE_NEW_LINES);
$output = fopen("workpless.txt", "a") or die("Unable to open file!");

//all domains of the entire list
$domains = array();
//only domains that occur at most 2 times
$less = array();

//takes the combo list explode it to domain names
foreach ($fileCombo as $combo) {
    $pieces = explode(":", $combo);
    $email = explode("@", $pieces[0]);
    //import domains to array
    $domains[] = strtolower($email[1]);
}
//count each string in the array
$ac = array_count_values($domains);
//this foreach keeps only the domains that do not repeat more than 2 times
foreach ($ac as $email => $item) {
    if($item <= 2) {
        $less[] = $email;
    }
}

/* this foreach is the one that causes all the trouble:
it takes every domain kept by the previous foreach
and runs it one by one against the entire combo list to get
the actual combos */

foreach ($less as $find) {
    $matches = array_filter($fileCombo, function($var) use ($find) { return preg_match("/\b$find\b/i", $var); });
    foreach ($matches as $match) {
        $data = $match . PHP_EOL;
        fwrite($output, $data);
    }
}

fclose($output);
?>

Pseudocode (the best I can do):

file1:
exaple@example.com:password
exaple@example.com:password
exaple@example.com:password
exaple@example1.com:password
exaple@example2.com:password

load file1 into the array "fileCombo"
split each line by ":" to get [0] example@example.com, [1] password
split value [0] by "@" to get [0] example, [1] example.com
put value [1] into a new array called "domains"
count how many times each domain occurs
put all the domains that occur at most 2 times into a new array called "less"
run each domain in the "less" array, one by one, against the "fileCombo" array
if a "less" value is found inside a "fileCombo" value, then
write the entire line from "fileCombo" to a text file

This script is used for big files with 2~5M lines every time, which is why I need it optimized (it's fast when you run it on around 20k lines).


Solution

  • UPDATED: now writes all related lines for each kept domain, at the cost of about 5 more seconds for a 1M-line file

    Tested on 80,000 lines (40,000 unique lines) - 2.5 MB

    Memory Usage
    
    69,994,816 bytes
    70,246,808 bytes (process)
    71,827,456 bytes (process peak)
    
    Execution Time
    0.54409 seconds
    

    Tested on 1,000,000 lines (500,000 unique lines) - 33 MB

    Memory Usage
    
    864,805,152 bytes
    865,057,144 bytes (process)
    866,648,064 bytes (process peak)
    
    Execution Time
    8.9173 seconds
    

    My test machine: i7-3612QM (CPU Mark 6833), 4 GB RAM, SSD

    Sample from the 80,000-line file

    exaple@example.com:password
    exaple@example1.com:password
    exaple@example1.com:password
    exaple@example1.com:password
    exaple@example2.com:password
    exaple@example2.com:password
    exaple@example3.com:password
    exaple@example3.com:password
    

    Here is your new version :))

    <?php
    // System Start Time
    define('START_TIME', microtime(true));
    
    // System Start Memory
    define('START_MEMORY_USAGE', memory_get_usage());
    
    function show_current_stats() {
    ?>
        <b>Memory Usage</b>
        <pre>
        <?php print number_format(memory_get_usage() - START_MEMORY_USAGE); ?> bytes
        <?php print number_format(memory_get_usage()); ?> bytes (process)
        <?php print number_format(memory_get_peak_usage(TRUE)); ?> bytes (process peak)
        </pre>
    
        <b>Execution Time</b>
        <pre><?php print round((microtime(true) - START_TIME), 5); ?> seconds</pre>
    <?php
    }
    
    // Script start here
    
    $fileCombo = file("ww.txt", FILE_IGNORE_NEW_LINES);
    $output = fopen("workpless.txt", "a") or die("Unable to open file!");
    
    //all domains of the entire list
    $domains = array();
    //only domains that occur at most 2 times
    $less = array();
    //map each domain to the positions (keys) of its lines in $fileCombo
    $domains_keys = array();
    
    //split each combo line into its domain name
    foreach ($fileCombo as $key => $combo) {
        $pieces = explode(":", $combo);
        $email = explode("@", $pieces[0]);
        //skip malformed lines that have no "@" (avoids an undefined offset)
        if (!isset($email[1])) {
            continue;
        }
        $domain = strtolower($email[1]);
        //import domains to array
        $domains[] = $domain;
    
        //remember the position (key) of this line under its domain
        $domains_keys[$domain][] = $key;
    }
    //count each string in the array
    $ac = array_count_values($domains);
    //this foreach keeps only the domains that do not repeat more than 2 times
    foreach ($ac as $email => $item) {
        if($item <= 2) {
            $less[] = $email;
        }
    }
    
    //write every line that belongs to a kept domain, using the stored keys
    foreach ($less as $find) {
        foreach ($domains_keys[$find] as $domain_key) {
            fwrite($output, $fileCombo[$domain_key] . PHP_EOL);
        }
    }
    
    fclose($output);
    
    // uncomment to show stats (credit goes to micromvc)
    /* show_current_stats(); */
    

    Output

    exaple@example.com:password
    exaple@example2.com:password
    exaple@example2.com:password
    exaple@example3.com:password
    exaple@example3.com:password
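
The memory figures reported in this answer grow with the file size (about 865 MB for 1M lines), so a 2~5M-line file may not fit on a 4 GB machine. As a rough sketch of one possible alternative, not the method used in the script above: read the file twice with fgets() and keep only a domain => count map in memory. The names domainOf() and filterRareDomainsStreaming() are hypothetical, the sketch assumes PHP 7.1+ and "user@domain:password" lines, and it has not been benchmarked against the numbers above.

```php
<?php
// Sketch only: domainOf() and filterRareDomainsStreaming() are hypothetical
// helpers, not part of the script above. Assumes PHP 7.1+.

// Return the lowercased domain of "user@domain:password", or null if no "@".
function domainOf(string $line): ?string
{
    $user = explode(':', $line, 2)[0];
    $parts = explode('@', $user, 2);
    return isset($parts[1]) ? strtolower($parts[1]) : null;
}

// Copy to $outPath every line whose domain occurs at most $maxCount times.
// Returns the number of lines written.
function filterRareDomainsStreaming(string $inPath, string $outPath, int $maxCount = 2): int
{
    $in = fopen($inPath, 'r');
    $out = fopen($outPath, 'w');
    if ($in === false || $out === false) {
        throw new RuntimeException('Unable to open input or output file');
    }

    // Pass 1: count how often each domain occurs.
    $counts = [];
    while (($line = fgets($in)) !== false) {
        $domain = domainOf(rtrim($line, "\r\n"));
        if ($domain !== null) {
            $counts[$domain] = ($counts[$domain] ?? 0) + 1;
        }
    }

    // Pass 2: re-read the file and keep lines whose domain is rare enough.
    rewind($in);
    $written = 0;
    while (($line = fgets($in)) !== false) {
        $line = rtrim($line, "\r\n");
        $domain = domainOf($line);
        if ($domain !== null && $counts[$domain] <= $maxCount) {
            fwrite($out, $line . PHP_EOL);
            $written++;
        }
    }

    fclose($in);
    fclose($out);
    return $written;
}

// Demo on the 8-line sample from this answer, using temporary files.
$in = tempnam(sys_get_temp_dir(), 'combo');
$out = tempnam(sys_get_temp_dir(), 'combo');
file_put_contents($in, implode("\n", [
    'exaple@example.com:password',
    'exaple@example1.com:password',
    'exaple@example1.com:password',
    'exaple@example1.com:password',
    'exaple@example2.com:password',
    'exaple@example2.com:password',
    'exaple@example3.com:password',
    'exaple@example3.com:password',
]) . "\n");

$kept = filterRareDomainsStreaming($in, $out);
print "$kept lines kept" . PHP_EOL; // 5 lines kept
print file_get_contents($out);

unlink($in);
unlink($out);
```

Two differences from the script above are worth noting: this sketch writes lines in their original file order rather than grouped by domain, and it opens the output with "w" (overwrite) instead of "a" (append). The trade-off is a second read of the input in exchange for not holding every line in an array.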