Python Script to Extract Text from PDFs (Online and Offline)

  • Job Duration: Less than a week
  • Project Level: Basic Level
  • Project Deadline: Expired

Project detail

We need a Python script that can help us extract the contents of a PDF for our article summarization application.

Requirements:
1. The script should work well on both offline and online PDFs (example: https://link.springer.com/content/pdf/10.1007/s42489-020-00057-w.pdf).

2. Most of our documents are research papers, so the script should be layout-aware and able to extract each individual section (heading + paragraphs) so that we can summarize each section separately.

For example, for this PDF (https://link.springer.com/content/pdf/10.1007/s42489-020-00057-w.pdf), the output should be something like:
{
  "title": "Trustworthy COVID‑19 Mapping: Geo‑spatial Data Literacy Aspects of Choropleth Maps",
  "authors": "Carsten Juergens",
  "pub_date": "Received: 6 August 2020 / Accepted: 7 October 2020 / Published online: 23 October 2020",
  "sections": [
    {
      "title": "Abstract",
      "text": "Since the COVID-19 (coronavirus disease 2019) pandemic is a global phenomenon, many scientists and research organizations create thematic maps to visualize and understand the spatial spread of the disease and to inform mankind. Nowadays, Geographic Information Systems (GIS) and web mapping technologies enable people to create digital maps on demand. This fosters the permanent update of COVID-19 map products, even by non-cartographers, and their publication in news, media and scientific publications. With the ease and speed of map-making, many map creators seem to forget about the fundamental principles of good and easy-to-read thematic choropleth maps, which requires geo-spatial data literacy. Geo-spatial data literacy is an important skill, to be able to judge the reliability of spatial data, and to create ingenuous thematic maps. This contribution intends to make people of disciplines other than those that are map-related aware of the power of thematic maps and how one can create trustworthy thematic maps instead of misleading thematic maps which could, in a worst case, lead to misinterpretation."
    },
    {
      "title": "1 Introduction",
      "text": "GIS and web-mapping technologies play an essential…"
    },
    {
      "title": "2 Geo‑spatial Data",
      "text": "Geo-spatial data is composed of descriptive/thematic content…"
    },
    {
      "title": "3 Geo‑spatial Data Literacy",
      "text": "The cartographer Monmonier became famous for…"
    },
    {
      "title": "4 Conclusion",
      "text": "Choropleth maps are used for effective…"
    }
  ]
}

3. It should also be able to extract metadata such as the title, author names and pub_date, as shown in the example above.

4. The script should be efficient and run within 1 second.
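
As a rough illustration of requirements 1 and 2, the sketch below shows one possible shape for the input and sectioning steps. It is not a complete solution: it assumes the PDF text has already been converted to a list of lines by a PDF library of your choice, and the names read_pdf_bytes, split_sections and HEADING are hypothetical. The heading rule (the word "Abstract", or a section number followed by a capitalized title with no period) is an assumption drawn from the example output above and would need tuning for other paper layouts.

```python
import re
import urllib.request
from pathlib import Path


def read_pdf_bytes(source: str, timeout: float = 10.0) -> bytes:
    """Return raw PDF bytes from an http(s) URL or a local file path."""
    if source.lower().startswith(("http://", "https://")):
        # Online PDF: download the whole file into memory.
        with urllib.request.urlopen(source, timeout=timeout) as resp:
            return resp.read()
    # Offline PDF: read it from disk.
    return Path(source).read_bytes()


# Assumed heading rule: "Abstract", or "<number> <Capitalized title>"
# with no period and at most ~80 characters (e.g. "1 Introduction").
HEADING = re.compile(r"^(Abstract|\d+(?:\.\d+)*\s+[A-Z][^.]{0,80})$")


def split_sections(lines):
    """Group already-extracted text lines into {"title", "text"} sections."""
    sections = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if HEADING.match(line):
            # A heading starts a new section.
            sections.append({"title": line, "text": []})
        elif sections:
            # Body line: attach it to the most recent heading.
            sections[-1]["text"].append(line)
        # Lines before the first heading (title, authors) are ignored here.
    return [
        {"title": s["title"], "text": " ".join(s["text"])} for s in sections
    ]
```

Lines that appear before the first heading are dropped by this sketch; in a fuller script they would be the natural place to look for the title, authors and pub_date. Body lines that happen to start with a number and a capital letter would be misread as headings, which is one reason a layout-aware approach (font size or weight from the PDF library) is likely to be more reliable than a text-only rule.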
