python powershell pip python-tesseract poppler

How do I Download Poppler and Tesseract Programmatically with PowerShell?


In Python, two libraries are often used in tandem: pdf2image and pytesseract. Both need external downloads to function: Poppler and Tesseract, respectively. The general recommendation for Windows is to download these files separately from the pip install, and then set the path to them. This solution does not work for me, because they take up too much space in my project folder.

Right now, within my project folder, I have two folders, Poppler and Tesseract, which contain all the necessary files. I point the libraries at them like this:

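from pdf2image import convert_from_path
from pytesseract import pytesseract  # the submodule that holds tesseract_cmd

pytesseract.tesseract_cmd = path_to_tesseract
# and
convert_from_path(file_path, poppler_path=POPPLER_PATH)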

However, this doesn't work in production, because the two folders take up so much space. What I need is to download them both somewhere relative to the pip installs, so that I don't need to set a path for either.

Right now, I have a PowerShell script which pip installs everything I need. I should be able to download Tesseract and Poppler at the same time as the rest of my pip installs.

$libraries = @(
    "pdf2image", # turnIntoImage()
    "pytesseract"
)

foreach ($lib in $libraries) {
    Write-Host "Installing $lib..."
    pip install $lib
}

# Add code here which downloads Poppler and Tesseract

I've done a lot of research, and this is what I've tried:


Solution

  • This might give you a start on how to approach it programmatically. It isn't entirely straightforward, as one of the downloads requires web scraping.

    First, regarding the location relative to pip: my assumption is that you want to drop these downloads where pip installs all Python modules. In that case, this appears to work for getting that location (not sure if there is a better or easier way):

    $piplocation = (pip show pip | Select-String '(?<=^Location: ).+').Matches[0].Value
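As a possible alternative to parsing the output of `pip show`, the same directory can usually be read from Python's own `sysconfig` module. This is only a sketch, and it assumes the script runs under the same interpreter (or virtual environment) that pip installs into; `site_packages_dir` is a hypothetical helper name, not something from pip or the question.

```python
import sysconfig
from pathlib import Path


def site_packages_dir() -> Path:
    """Return the directory where pure-Python packages are installed.

    Assumption: this is the same interpreter / virtual environment that
    `pip install` targets; a different one would report a different path.
    """
    return Path(sysconfig.get_paths()["purelib"])


if __name__ == "__main__":
    print(site_packages_dir())
```

From PowerShell, the function could be called through `python -c` and the printed path captured into a variable, much like `$piplocation` above.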
    

    Then, use that location as the download target. For Tesseract, you can use the GitHub API to get the download link for the latest release:

    $req = Invoke-RestMethod https://api.github.com/repos/tesseract-ocr/tesseract/releases/latest
    
    # NOTE: `zipball_url` points to the source-code archive of the release.
    # Use `$req.tarball_url` if you want the `.tar.gz` instead of the `.zip`
    $downloadPath = Join-Path $piplocation "tesseract-ocr.$($req.name).zip"
    Invoke-WebRequest $req.zipball_url -OutFile $downloadPath
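The same release metadata can be inspected from Python. The sketch below fetches the JSON that the GitHub API returns and picks a download URL from it: it prefers an uploaded release asset whose name ends with a given suffix and falls back to `zipball_url` otherwise. Whether a particular release actually ships a Windows installer asset is an assumption, and the `.exe` default, `fetch_latest_release`, and `pick_download_url` are hypothetical names chosen for illustration.

```python
import json
import urllib.request

API_URL = "https://api.github.com/repos/tesseract-ocr/tesseract/releases/latest"


def fetch_latest_release(url: str = API_URL) -> dict:
    """Fetch the latest-release metadata as a dict (needs network access)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def pick_download_url(release: dict, suffix: str = ".exe") -> str:
    """Return the URL of the first asset ending in `suffix`.

    Falls back to the source zip (`zipball_url`) when no asset matches.
    The `.exe` default is an assumption about how installers are named.
    """
    for asset in release.get("assets", []):
        if asset["name"].lower().endswith(suffix):
            return asset["browser_download_url"]
    return release["zipball_url"]
```

A caller would run `pick_download_url(fetch_latest_release())` and hand the result to whatever does the download. Keeping the selection logic separate from the network call makes it easy to check against a saved JSON response.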
    

    EDIT: OP has found a much better and more reliable way to obtain the latest Poppler build using the GitHub API, see his answer.

    Then for Poppler, it looks like you need web scraping to get the link. This might work for now, but be aware that web scraping is never a robust solution to this kind of problem. You should check whether they offer an API for getting the latest download link.

    $latest = (Invoke-WebRequest https://poppler.freedesktop.org/).Links |
        Where-Object outerHtml -Match '(?<=a href=")poppler.+?\.tar\.xz(?=")' |
        Select-Object -ExpandProperty href -First 1
    
    # Pass a full file path to -OutFile: not every PowerShell version accepts a bare directory
    Invoke-WebRequest "https://poppler.freedesktop.org/$latest" -OutFile (Join-Path $piplocation $latest)
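For comparison, here is roughly what the link-scraping step looks like in Python, with the page fetch left out so the parsing can be checked on its own. It is a sketch with the same fragility as any scraping approach: it assumes the page links to the tarball with a relative `href` such as `poppler-<version>.tar.xz`, and that the first such link is the latest release. `latest_tarball_url` and `TARBALL_LINK` are hypothetical names.

```python
import re
from typing import Optional
from urllib.parse import urljoin

BASE_URL = "https://poppler.freedesktop.org/"

# Matches href="poppler-....tar.xz" and captures the file name only.
TARBALL_LINK = re.compile(r'href="(poppler-[^"]+?\.tar\.xz)"')


def latest_tarball_url(html: str, base_url: str = BASE_URL) -> Optional[str]:
    """Return the absolute URL of the first .tar.xz link, or None if absent.

    Assumption: the first matching link on the page is the latest release.
    """
    match = TARBALL_LINK.search(html)
    if match is None:
        return None
    return urljoin(base_url, match.group(1))
```

Returning `None` instead of raising lets the caller decide what to do when the page layout changes, which is the most likely way for this to break.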
    

    And that's it. In $piplocation you should now have both archives, ready to extract:

    PS ..\pwsh> Get-ChildItem $piplocation -File | Where-Object Name -Match 'tesseract-ocr|poppler'
    
        Directory: C:\Users\...\AppData\Local\Programs\Python\...\site-packages
    
    Mode                 LastWriteTime         Length Name
    ----                 -------------         ------ ----
    -a---          11/18/2025  5:16 PM        1988596 poppler-25.11.0.tar.xz
    -a---          11/18/2025  5:16 PM        2490329 tesseract-ocr.5.5.1.zip
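Extraction itself can be done with the Python standard library, which reads both `.zip` and `.tar.xz` files. The following is a minimal sketch under the assumption that the archives come from a trusted source; `extract_archive` is a hypothetical helper, and what to do with the extracted contents afterwards is outside its scope.

```python
import tarfile
import zipfile
from pathlib import Path


def extract_archive(archive: Path, destination: Path) -> Path:
    """Extract a .zip or tar-based archive into `destination` and return it.

    Assumption: the archive is trusted; no path sanitising is done here.
    """
    destination.mkdir(parents=True, exist_ok=True)
    if zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(destination)
    elif tarfile.is_tarfile(archive):
        # The default read mode detects the compression (gz, bz2, xz) itself.
        with tarfile.open(archive) as tf:
            tf.extractall(destination)
    else:
        raise ValueError(f"Unsupported archive format: {archive}")
    return destination
```

For example, `extract_archive(Path("poppler-25.11.0.tar.xz"), Path("poppler"))` would unpack the tarball listed above into a `poppler` folder next to it.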