Quantcast
Channel: Planet Python
Viewing all articles
Browse latest Browse all 22462

Chris Hager: Find broken hyperlinks in a PDF document with PDFx

$
0
0

PDFx is a free command-line tool to extract references, links and metadata from PDF files. You can also use it to find broken links in a PDF file, using pdfx -c:

PDFx Link Checker

For each URL and PDF reference, pdfx performs a HEAD request and checks the status code. It there are broken links, PDFx print the link with the page number where the link was found in the original pdf:

$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf
Document infos:
- CreationDate = D:20150821110623-04'00'
...

Summary of link checker:
33 working
1 broken (reason: 303)
  - http://www.nytimes.com/interactive/2013/11/23/us/politics/23nsa-sigint-strategy-document.html (page 13)
1 broken (reason: 403)
  - http://www.nytimes.com/interactive/2013/11/23/us/ (page 13)
1 broken (reason: 404)
  - https://github.com/bumptech/stud/blob/ (page 13)

Installing PDFx

You can simply install PDFx with easy_install or pip and run it like this:

$ sudo easy_install -U pdfx
...
$ pdfx <pdf-file-or-url>

Run pdfx -h to see the help output:

$ pdfx -h
usage: pdfx [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE]
            [--version]
            pdf

Extract metadata and references from a PDF, and optionally download all
referenced PDFs. Visit https://www.metachris.com/pdfx for more information.

positional arguments:
  pdf                   Filename or URL of a PDF file

optional arguments:
  -h, --help            show this help message and exit
  -d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY
                        Download all referenced PDFs into specified directory
  -c, --check-links     Check for broken links
  -j, --json            Output infos as JSON (instead of plain text)
  -v, --verbose         Print all references (instead of only PDFs)
  -t, --text            Only extract text (no metadata or references)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Output to specified file instead of console
  --version             show program's version number and exit

For more examples and infos, take a look at the PDFx project page. You can find the code on Github, the code is released under the Apache license.


Feedback, ideas and pull requests are welcome! You can also reach me on Twitter via @metachris.


Viewing all articles
Browse latest Browse all 22462

Trending Articles