php regex substring query-string text-extraction

Get querystring value of hyperlink in a scraped webpage


I am trying to extract a value from a URL query string. Here is a portion of the input text:

u0026amp;sw=0.1\u0026amp;t=vjVQa1PpcFMYuRsz10_H-1z41mWWe8d6ENEnBLE7gug%3D

I need to isolate the substring between t= and %3D to get:

vjVQa1PpcFMYuRsz10_H-1z41mWWe8d6ENEnBLE7gug

So far I am using [^(t=)]\S{42}, but it matches every string. How do I get it to match only the t value?


Solution

  • The page you link to doesn't appear to contain the string you are searching for. To match that string anywhere in the page, you would need:

    /t=\S{42}/
    

    I don't see any need for character classes [...] or parenthesised sub patterns...?
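
    To illustrate why the character class gets in the way, here is a minimal sketch that runs both patterns against the input fragment from the question. The variable names ($input, $classMatch, $literalMatch) are made up for this example. Inside [...] the characters "(", "t", "=" and ")" are just a set of single characters, so [^(t=)] means "one character that is none of these", not "the text t=".

    ```php
    <?php
    // Input fragment copied from the question (single quotes keep the backslash literal).
    $input = 'u0026amp;sw=0.1\u0026amp;t=vjVQa1PpcFMYuRsz10_H-1z41mWWe8d6ENEnBLE7gug%3D';

    // Original attempt: one char that is not "(", "t", "=" or ")",
    // followed by 42 non-space chars. Almost any position qualifies.
    preg_match('/[^(t=)]\S{42}/', $input, $classMatch, PREG_OFFSET_CAPTURE);

    // Literal "t=" followed by 42 non-space chars.
    preg_match('/t=\S{42}/', $input, $literalMatch, PREG_OFFSET_CAPTURE);

    // With PREG_OFFSET_CAPTURE each entry is [matched text, byte offset].
    echo 'class pattern starts at offset:   ', $classMatch[0][1], PHP_EOL;
    echo 'literal pattern starts at offset: ', $literalMatch[0][1], PHP_EOL;
    ```

    The character-class version matches right at the start of the input, while the literal version only matches where "t=" actually occurs.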

    EDIT#1

    However, if you are trying to extract that 42 char token then you will need a parenthesised sub pattern...

    /t=(\S{42})/
    

    EDIT#2

    An example of extracting the token. I've changed the length from 42 to 43 chars, since the tokens in your examples are all 43 chars long.

    // This is just some example text from which we want to extract the token...
    $text = <<<EOD
    SomeText=jkasdhHASGjajAHSKAK?asdjladljasdllkasdjllasdasdl
    asdjasiSTARTHERE;t=vjVQa1PpcFMYuRsz10_H-1z41mWWe8d6ENEnBLE7gug%3DENDHEREasdasd
    SomeMoreText;t=ThisIsTooShort%3Dklaksj
    EOD;
    
    if (preg_match('/;t=([a-zA-Z0-9_-]{43})%3D/',$text,$matches)) {
        // Match... vjVQa1PpcFMYuRsz10_H-1z41mWWe8d6ENEnBLE7gug
        echo 'TOKEN: '.$matches[1];
    } else {
        // No match
    }
    

    I've made the pattern more restrictive: instead of any non-space char, the token may now contain only letters, digits, underscores or hyphens. It must be followed by %3D, and a semicolon (";") must precede the "t=".
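
    If the token length turns out not to be fixed, the same idea can be sketched with a + quantifier instead of {43}. This is a variation, not something the question asks for: extractToken() is a hypothetical helper name, and dropping the length check is an assumption (it also means a short value such as "ThisIsTooShort" would now be accepted).

    ```php
    <?php
    // Hypothetical helper: returns the text between ";t=" and "%3D", or null if absent.
    function extractToken(string $text): ?string
    {
        // Assumption: the token may be any length >= 1, hence "+" instead of "{43}".
        if (preg_match('/;t=([A-Za-z0-9_-]+)%3D/', $text, $matches) === 1) {
            return $matches[1]; // first parenthesised sub pattern
        }
        return null;
    }

    // Input fragment copied from the question.
    $input = 'u0026amp;sw=0.1\u0026amp;t=vjVQa1PpcFMYuRsz10_H-1z41mWWe8d6ENEnBLE7gug%3D';
    $token = extractToken($input);

    echo $token === null ? 'No match' : 'TOKEN: ' . $token, PHP_EOL;
    ```

    Keeping the {43} quantifier is the stricter choice when every token is known to have that length.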