I have this figure, "Who Is Granted Asylum in the United States?", at this URL.
My goal is to extract the data (as a dataframe) from this figure.
I am not sure how to do that. Here is my attempt:
library(rvest)
html_url <- read_html("https://www.statista.com/chart/25619/asylum-grants-in-the-us-by-nationality/")
html_url %>% html_elements(xpath = "//*[contains(@class, 'image')]")
but I am getting nowhere.
As @Wimpel already said, extracting solid data from images, or from the text in them, is very difficult. In addition, how should the code know which kind of chart the figure represents? There are some digitization tools for scatter or point-based charts, like digitize (a quick sketch follows below), but in general it's better to mine the underlying data directly.
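For completeness, this is roughly how digitize is used. It is a minimal, purely illustrative sketch of the package's interactive workflow (you click the axis calibration points and then the data points yourself, and my_scatterplot.png is just a placeholder file), so it only suits point-based charts, not an infographic like yours:
# install.packages("digitize")
library(digitize)
# Interactive: click x1, x2 on the x-axis and y1, y2 on the y-axis, then click each data point,
# and enter the corresponding axis values when prompted
df_points <- digitize("my_scatterplot.png")  # placeholder image of a scatter plot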
Still, I built this code for your specific example.
library(tesseract)   # OCR
library(rvest)       # scrape the page for the image URL
library(tidyverse)   # piping and data wrangling (covers dplyr/tidyr)
library(magick)      # image preprocessing
# Read the webpage
html_url <- read_html("https://www.statista.com/chart/25619/asylum-grants-in-the-us-by-nationality/")
image_url <- html_url %>% html_elements("img") %>% html_attr("src")
graphics <- image_url[grepl("Infographic", image_url)]
# Download the image (graphics[1] should be https://cdn.statcdn.com/Infographic/images/normal/25619.jpeg)
download.file(graphics[1], destfile = "chart_image.jpeg", mode = "wb")
# Load and preprocess image
img <- image_read("chart_image.jpeg") %>%
image_resize("800x800") %>%
image_convert(colorspace = "gray")
# Save processed image and apply OCR
image_write(img, "processed_image.png")
text <- tesseract::ocr("processed_image.png")
text_to_asylum_df <- function(text) {
  # Split the OCR output into lines
  lines <- strsplit(text, "\n")[[1]]
  # Keep only lines that contain digits (drops empty lines and most header/footer text)
  data_lines <- lines[grepl("[0-9]", lines)]
  # Extract country and number from each line using regex
  asylum_data <- lapply(data_lines, function(line) {
    # Country: the leading run of letters and spaces
    country <- gsub("^([A-Za-z ]+).*$", "\\1", line)
    country <- trimws(country)
    # Number: strip everything except digits, commas and periods, then drop the separators
    number <- gsub("[^0-9,.]", "", line)
    number <- gsub(",", "", number)
    number <- gsub("\\.", "", number)
    number <- as.numeric(number)
    return(c(country = country, granted = number))
  })
  # Convert the list to a dataframe
  df <- as.data.frame(do.call(rbind, asylum_data))
  # granted comes back as character after rbind, so convert it to numeric
  df$granted <- as.numeric(as.character(df$granted))
  # Store the year of the chart as an attribute
  attr(df, "year") <- 2022
  return(df)
}
# Create the dataframe
asylum_df <- text_to_asylum_df(text)
# View the result
print(asylum_df)
As you can see, China and Venezuela are not even recognized by tesseract.
Output:
> print(asylum_df)
country granted
1 asylum in the U 2022
2 El Salvador S TS 2639
3 Guatemala 2329
4 india 22203
5 Honduras 1829
6 Afghanistan 1493
7 turkey 1228
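If you want to squeeze a bit more out of tesseract before switching approaches, you can pass a custom engine with tweaked options. This is only a sketch (the page segmentation mode is a guess) and may well not help with this particular infographic:
# Sketch: a tesseract engine with a different page segmentation mode
eng <- tesseract::tesseract(
  language = "eng",
  options = list(tessedit_pageseg_mode = "6")  # treat the image as a single uniform block of text
)
text_tuned <- tesseract::ocr("processed_image.png", engine = eng)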
Or, for a more solid approach, we can use Google's Gemini via its API in R. Please follow these steps.
In Google AI Studio, click the Create API Key button. Copy and save your API key for future reference. Please note that the Gemini API is currently available for free; in the future there may be a cost involved in using it. Check out the pricing page here.
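If you prefer not to paste the key into the console every time, you can store it as an environment variable, which the function below picks up via Sys.getenv("GEMINI_API_KEY") (the key value shown is a placeholder):
# Set the key for the current R session (placeholder value, use your own key)
Sys.setenv(GEMINI_API_KEY = "your-api-key-here")
# Or add the line GEMINI_API_KEY=your-api-key-here to your ~/.Renviron
# so it is loaded automatically in every session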
To install the required libraries, use the following code in R. When you run the analysis code there will be a prompt asking for your API key; please paste it into the console! We then hand the image over to gemini-1.5-flash-latest and ask it to analyze the chart and return only comma-separated data. We read that output with read.csv(textConnection(image_content_csv)) and voilà, there is our dataframe!
install.packages("httr")
install.packages("jsonlite")
Then use the following code to analyze your chart image:
# Load necessary libraries
library(httr)
library(base64enc)
library(jsonlite)
# The infographic URL we located earlier
figure_url <- "https://cdn.statcdn.com/Infographic/images/normal/25619.jpeg"
# Function to send a prompt plus an image to the Gemini API
gemini_vision <- function(prompt,
                          image,
                          temperature = 0.5,
                          max_output_tokens = 4096,
                          api_key = Sys.getenv("GEMINI_API_KEY"),
                          model = "gemini-1.5-flash-latest") {
  # Ask for the key interactively if it is not set in the environment
  if (nchar(api_key) < 1) {
    api_key <- readline("Paste your API key here: ")
    Sys.setenv(GEMINI_API_KEY = api_key)
  }
  model_query <- paste0(model, ":generateContent")
  response <- POST(
    url = paste0("https://generativelanguage.googleapis.com/v1beta/models/", model_query),
    query = list(key = api_key),
    content_type_json(),
    encode = "json",
    body = list(
      contents = list(
        parts = list(
          list(
            text = prompt
          ),
          list(
            inlineData = list(
              mimeType = "image/jpeg",  # the Statista infographic is a JPEG
              data = base64encode(image)
            )
          )
        )
      ),
      generationConfig = list(
        temperature = temperature,
        maxOutputTokens = max_output_tokens
      )
    )
  )
  if (response$status_code > 200) {
    stop(paste("Error - ", content(response)$error$message))
  }
  candidates <- content(response)$candidates
  outputs <- unlist(lapply(candidates, function(candidate) candidate$content$parts))
  return(outputs)
}
image_content_csv <- gemini_vision(prompt = "Can you analyze this chart and print out only a comma separated table of the data with headers, nothing else. Thanks!",
                                   image = figure_url)
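The model returned plain CSV here, but LLMs occasionally wrap their answer in markdown code fences; if read.csv() ever complains, stripping them first is a cheap, optional safeguard:
# Strip possible ``` fences from the model output (harmless if none are present)
image_content_csv <- gsub("```[a-z]*", "", image_content_csv)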
df_ai_response <- read.csv(textConnection(image_content_csv))
Which finally gives us:
> df_ai_response
Nationality Count
1 China 4589
2 Venezuela 3691
3 El Salvador 2639
4 Guatemala 2329
5 India 2203
6 Honduras 1829
7 Afghanistan 1493
8 Turkey 1228