google-sheetsinstagram

Scraping Instagram data using google spreadsheet?


I need data like bio, and number of posts from public Instagram account using google spreadsheet. I'm able to extract number of followers and following. Can you help ?


Solution

  • This formula is going to look really complicated but really all it is - is an importxml formula to pull in the data from the "script" section which has the pieces you want... then using a bunch of regexreplace/extract functions I clean up the data into a readable format:

    take this public page for example: http://www.instagram.com/salesforce/

    Then in B1 or C1 enter this:

    =iferror(arrayformula(regexreplace({arrayformula(regexextract(transpose(split(regexreplace(regexreplace(concatenate(IMPORTXML(Sheet2!A1,"//script")),"\n",""),"(^.*""ProfilePage"": \[{""user"": {""username"": "")(.*)(nodes.*)","$2"),", """,false)),"(^.*)"": .*")),arrayformula(regexextract(transpose(split(regexreplace(regexreplace(concatenate(IMPORTXML(Sheet2!A1,"//script")),"\n",""),"(^.*""ProfilePage"": \[{""user"": {""username"": "")(.*)(nodes.*)","$2"),", """,false)),"^.*"": (.*)"))},"[""}{]","")))
    

    I ended up using a literal array so that I could effectively split the field names from the values , obviously you can format however you really want , but see the image here that demonstrates the fields it pulls:

    enter image description here

    ALso note that the followers, followed_by, and media: count are the fields your mentioned (e.g. # of posts is called media count) and then the biography of course is self explanatory

    Update: In answer to your comment - if you want to get the other 2 values out you can do it either in a single regexextract function like this:

    If you using the raw import data these regexes work:

    Media count:

    =REGEXEXTRACT(concatenate(IMPORTDATA(E1)),"""media: {""count"": (\d+)page_info: {")
    

    Biography:

    =REGEXEXTRACT(concatenate(IMPORTDATA(E1)),"biography: ""(.*)""full_name")
    

    If your using the importxml method these work:

    =REGEXEXTRACT(A1,"biography"": ""(.*)"", "".*""media"": {""count"": (\d+), ""page_info""")

    That creates 2 capture groups which automatically puts them into their own adjacent cells, or you can do them individually which is:

    and for biography:

    =REGEXEXTRACT(A1,"biography"": ""(.*)"", "".*""media")
    

    media count:

    =REGEXEXTRACT(A1,"media"": {""count"": (\d+), ""page_info""")