php regex preg-match linkedin-api pdf-parsing

Check If Location Value Is Present In Array


I am writing a script to parse a LinkedIn CV. I am stuck at the work experience section. I can currently extract the work experience text from the PDF, but I have an issue with the location key, as it is optional.

Array
(
    [0] => Company 1
    [1] => Software Engineer
    [2] => July 2020 - Present  (1 month)   
    [3] => Pretoria, Gauteng, South Africa //this key is optional
    [4] => Company 2
    [5] => CTO
    [6] => September 2016 - Present     (3 years 11 months) 
    [7] => Pretoria Area, South Africa //this key is optional
)

The format above repeats for each position: company, title, date range, then an optional location.

I tried using array_chunk($array, 4), but that only works if the location is present for every position.

My other attempt was to search for the presence of a country in the entire array, but that is tricky because some company names contain a country, like MTN - South Africa.

My last attempt was to write a regex that checks for the location pattern. LinkedIn formats it as City, Province, Country for South Africa, but as City, Country for other countries. I have not been able to get this right. I tried preg_match('#\((,*?)\)#', $value, $match), where $value is the string for the current iteration.

I would like to have an array for each work experience which could either include location or not. For example:

Array
(
    [0] => Array
        (
            [0] => Company 1
            [1] => Software Engineer
            [2] => July 2020 - Present  (1 month)   
            [3] => Pretoria, Gauteng, South Africa
        )

    [1] => Array
        (
            [0] => Company 2
            [1] => CTO
            [2] => September 2016 - Present     (3 years 11 months) 
            [3] => Pretoria Area, South Africa
        )

)

I appreciate your help.

EDIT:

Main String (work experience)

$string = 'Company 1 Software Engineer July 2020 - Present  (1 month) Pretoria, Gauteng, South Africa Company 2 CTO September 2016 - Present  (3 years 11 months) Pretoria Area, South Africa';

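$array = splitNewLine($string);

// Normalise line endings, drop the leading/trailing newline, then split into lines.
function splitNewLine($text) {
    $code = preg_replace('/[\r\n]+/', "\n", $text); // collapse runs of CR/LF into a single "\n"
    $code = trim($code, "\n");                      // remove the newline at the start and end
    return explode("\n", $code);
}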

Solution

  • You could grab lines 4 at a time, then check the location with a proper regular expression, and then adjust the position of the next chunk accordingly:

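    function computeExperiences(array $lines): array
    {
      $experiences = [];
    
      $position = 0;
      // Look at up to 4 lines at a time; the loop ends once the slice is empty.
      while ($chunkLines = array_slice($lines, $position, 4)) {
        // The first 3 lines are always company, title and date range.
        $experience = array_slice($chunkLines, 0, 3);
        // Treat the 4th line as a location if it contains a "word, word" pattern.
        $locationIsPresent = isset($chunkLines[3]) && preg_match('/\w+,\s\w+(?:,\s\w+)?/', $chunkLines[3]);
        if ($locationIsPresent) {
          $experience[] = $chunkLines[3];
          $position += 4;
        } else {
          // No location: the 4th line belongs to the next experience.
          $position += 3;
        }
        $experiences[] = $experience;
      }
    
      return $experiences;
    }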
    

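The regex in the answer is not anchored, so it accepts any line that contains a "word, word" sequence anywhere. If that turns out to be too permissive, one possible tightening is to require the whole line to look like City, Country or City, Province, Country. The sketch below is an assumption-laden variant, not part of the original answer: looksLikeLocation and groupExperiences are hypothetical names, and the set of characters allowed in a place name (letters, spaces, dots, apostrophes, hyphens) is a guess. A company name such as "Smith, Jones" would still be mistaken for a location.

```php
<?php
// Hypothetical helper: the whole line must consist of 2 or 3
// comma-separated parts, each made of letters, spaces, dots,
// apostrophes or hyphens (an assumed character set).
function looksLikeLocation(string $line): bool
{
    $part = "[\\p{L}][\\p{L} .'-]*";
    return preg_match("/^{$part}(?:,\\s*{$part}){1,2}$/u", trim($line)) === 1;
}

// Hypothetical variant of the grouping loop: take 3 lines, then take a
// 4th one only if it passes the stricter location check.
function groupExperiences(array $lines): array
{
    $lines = array_values($lines); // make sure keys are 0..n-1
    $total = count($lines);
    $experiences = [];
    $position = 0;

    while ($position < $total) {
        $experience = array_slice($lines, $position, 3);
        $position += 3;
        if ($position < $total && looksLikeLocation($lines[$position])) {
            $experience[] = $lines[$position];
            $position++;
        }
        $experiences[] = $experience;
    }

    return $experiences;
}

// Sample input: the first position has no location, the second one does.
$lines = [
    'Company 1',
    'Software Engineer',
    'July 2020 - Present  (1 month)',
    'Company 2',
    'CTO',
    'September 2016 - Present  (3 years 11 months)',
    'Pretoria Area, South Africa',
];

$experiences = groupExperiences($lines);
print_r($experiences);
```

With this input the first experience should come back with three entries and the second with four, since 'Company 2' contains a digit and no comma and is therefore not treated as a location.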