PDF Citation Highlighter

From semantic-mediawiki.org
MediaWiki Users and Developers Conference Fall 2024
Talk details
Description: A lightweight Python application that identifies and highlights continuous text patterns (such as citation text) in PDF documents, whether uploaded to MediaWiki or from external sources.
Speaker(s): Anton Krom
Type: Lightning talk
Audience: Everyone
Event start: 2024/11/04 14:50:00
Event finish: 2024/11/04 15:00:00
Length: 10 minutes
Video: not available
Keywords: PDF, cargo, citations

A lightweight Python application that identifies and highlights continuous text patterns (such as citation text) in PDF documents, whether uploaded to MediaWiki or from external sources. The system uses MediaWiki templates and Cargo storage to catalog references and to protect the app against misuse, since only PDF URLs recorded in Cargo are accepted.

Academic research papers are often distributed in PDF format, which preserves the original page layout, including complex formatting and the author's exact text. A reader can verify the accuracy of a citation by comparing its text with the original document. In some cases, seeing the citation in the context of the entire work is essential to fully understanding its significance.

Surprisingly, searching a PDF for a continuous text pattern (such as a paragraph or an arbitrary sequence of sentences) does not work out of the box, most likely because such a passage spans several lines in the PDF's text layer. I found a Python library that exposes the rectangles (bounding boxes) of text on a PDF page and developed a Python web app that retrieves a PDF file via URL, locates the specified text pattern, and highlights it.
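
The matching step can be sketched roughly as follows. This is a minimal illustration, not the app's actual code. It assumes the PDF library returns each word with its bounding box as a tuple (x0, y0, x1, y1, text). The talk does not name the library; PyMuPDF, for instance, returns tuples whose first five items have this shape from `page.get_text("words")`. The function names and the 1-point line tolerance are hypothetical.

```python
import re
from typing import List, Tuple

Word = Tuple[float, float, float, float, str]  # x0, y0, x1, y1, text
Rect = Tuple[float, float, float, float]       # x0, y0, x1, y1


def _norm(token: str) -> str:
    """Lowercase and drop punctuation so layout noise does not block a match."""
    return re.sub(r"[^\w]", "", token).lower()


def _merge_by_line(matched: List[Word]) -> List[Rect]:
    """Merge consecutive word boxes that share a text line into one rectangle."""
    rects: List[Rect] = []
    for x0, y0, x1, y1, _ in matched:
        # Assumption: words on the same line have (almost) the same top edge.
        if rects and abs(rects[-1][1] - y0) < 1.0:
            px0, py0, px1, py1 = rects[-1]
            rects[-1] = (min(px0, x0), min(py0, y0), max(px1, x1), max(py1, y1))
        else:
            rects.append((x0, y0, x1, y1))
    return rects


def find_pattern_rects(words: List[Word], pattern: str) -> List[Rect]:
    """Return one rectangle per line for the first occurrence of `pattern`."""
    needle = [t for t in (_norm(p) for p in pattern.split()) if t]
    hay = [(t, w) for t, w in ((_norm(w[4]), w) for w in words) if t]
    n = len(needle)
    if n == 0:
        return []
    for start in range(len(hay) - n + 1):
        window = hay[start:start + n]
        if [t for t, _ in window] == needle:
            return _merge_by_line([w for _, w in window])
    return []


# Words in reading order; the pattern crosses a line break.
page_words = [
    (10, 100, 40, 110, "The"),
    (45, 100, 80, 110, "quick"),
    (10, 115, 40, 125, "brown"),
    (45, 115, 70, 125, "fox."),
]
highlight = find_pattern_rects(page_words, "quick brown fox")
```

Comparing word sequences rather than raw strings is what lets the match survive line breaks. Words hyphenated across lines are not handled in this sketch.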

For example, the app can be accessed with a GET request like:

GET /?url=https://example.com/pdf.pdf&page=6&search=<any continuous pattern>#page=6
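
In a real request the `url` and `search` values need to be percent-encoded, because they may contain spaces, `&`, or `#`. Below is a minimal sketch of building such a link with the Python standard library. The base address `https://highlighter.example.org/` and the function names are assumptions. The parameter names `url`, `page`, and `search` are taken from the example above. The trailing `#page=6` fragment is copied from the example, presumably so that the browser's PDF viewer opens on that page.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_highlight_url(app_base: str, pdf_url: str, page: int, search: str) -> str:
    """Build a link to the highlighter app with properly encoded parameters."""
    query = urlencode({"url": pdf_url, "page": page, "search": search})
    return f"{app_base}?{query}#page={page}"


def parse_highlight_url(link: str) -> dict:
    """Decode a link back into its parameters (useful for checking the encoding)."""
    parts = urlsplit(link)
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    params["fragment"] = parts.fragment
    return params


citation = "a cited passage, with spaces & symbols"
link = build_highlight_url(
    "https://highlighter.example.org/",  # hypothetical app address
    "https://example.com/pdf.pdf",
    6,
    citation,
)
decoded = parse_highlight_url(link)
```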

The app was built using the Flask framework. Additionally, there are MediaWiki templates designed to:

  • Set the URL, citation text, and page number, and store this data in a Cargo table.
  • Display the citation text and generate a link highlighting the text in the PDF (with the highlight color customizable via template parameter).

The app will display a user-friendly error message if the provided URL is not found in the Cargo database.
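
One way such a check could work is to ask the wiki's Cargo API (`action=cargoquery`) whether the URL is recorded before fetching anything. The sketch below only builds the query and interprets the JSON response. The table name `PdfCitations`, the field name `Url`, and the wiki address are hypothetical, since the talk does not give them, and the app itself may perform the lookup differently.

```python
import json
from urllib.parse import urlencode

# Hypothetical names: the talk does not state the Cargo table or field names.
CARGO_TABLE = "PdfCitations"
URL_FIELD = "Url"


def build_cargo_query(api_url: str, pdf_url: str) -> str:
    """Build a cargoquery request that looks up `pdf_url` in the Cargo table."""
    # Reject characters that could break out of the quoted WHERE value.
    if any(ch in pdf_url for ch in "'\"\\"):
        raise ValueError("unsupported character in URL")
    params = {
        "action": "cargoquery",
        "format": "json",
        "tables": CARGO_TABLE,
        "fields": URL_FIELD,
        "where": f"{URL_FIELD}='{pdf_url}'",
        "limit": 1,
    }
    return f"{api_url}?{urlencode(params)}"


def url_is_registered(response_body: str) -> bool:
    """Return True if the cargoquery JSON response contains at least one row."""
    data = json.loads(response_body)
    return len(data.get("cargoquery", [])) > 0


query = build_cargo_query(
    "https://wiki.example.org/w/api.php",  # hypothetical wiki API endpoint
    "https://example.com/pdf.pdf",
)
```

If `url_is_registered` returns False, the app would show its error message instead of downloading the file.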

Limitations: The app must be able to fetch the PDF file directly from the given URL.
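
In practice this means the URL has to return the PDF bytes themselves rather than a login page or an HTML viewer. A minimal sketch of such a check using only the standard library is shown below. The function names are hypothetical, and the check relies on PDF files starting with the `%PDF-` header marker.

```python
import urllib.request


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload begins with the PDF header marker."""
    return data.lstrip()[:5] == b"%PDF-"


def fetch_pdf(url: str, timeout: float = 10.0) -> bytes:
    """Download `url` and fail early if the response is not a PDF."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    if not looks_like_pdf(data):
        raise ValueError("URL did not return a PDF document")
    return data
```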