Data extraction by creating a new Python module
Project detail
I would like to have a similar package for table, image and text extraction, where tables, images and text are the sub-packages. Say we create a package named BLA: we should be able to extract the tables from a particular PDF with a single line of code, for example:
import BLA
from BLA.tables import txt
page_number would be a function in that module which extracts the tables from a particular page, or from all pages if left at its default (something like this):
BLA.tables.txt.page_number
random_name = BLA.tables.txt.page_number("name_of_the_pdf_file.pdf")
print(random_name)
This should give me the tables that are present on that particular page. The same should work for images and text.
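As a rough illustration of the "one page or all pages by default" behaviour, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `page` parameter, the `count_pages` and `extract_page` backend hooks, the dictionary return type, and the fake two-page document used in place of a real PDF. In a real package those hooks would wrap an existing library rather than be passed in by the caller.

```python
from typing import Callable, Dict, List, Optional

# A table is modelled here as a list of rows, each row a list of cell strings.
Table = List[List[str]]


def page_number(
    pdf_path: str,
    page: Optional[int] = None,
    *,
    count_pages: Callable[[str], int],
    extract_page: Callable[[str, int], List[Table]],
) -> Dict[int, List[Table]]:
    """Return the tables found in pdf_path, keyed by 1-based page number.

    page=None (the default) means "all pages". count_pages and extract_page
    are hypothetical backend hooks, not part of any existing library.
    """
    total = count_pages(pdf_path)
    if page is None:
        pages = list(range(1, total + 1))
    elif 1 <= page <= total:
        pages = [page]
    else:
        raise ValueError(f"page must be between 1 and {total}, got {page}")
    return {p: extract_page(pdf_path, p) for p in pages}


# Stand-in backend so the sketch runs without a real PDF: a fake two-page
# document with one table on page 1 and no tables on page 2.
_FAKE_DOC: Dict[int, List[Table]] = {
    1: [[["name", "qty"], ["bolt", "4"]]],
    2: [],
}


def fake_count_pages(pdf_path: str) -> int:
    return len(_FAKE_DOC)


def fake_extract_page(pdf_path: str, page: int) -> List[Table]:
    return _FAKE_DOC[page]


if __name__ == "__main__":
    random_name = page_number(
        "example.pdf",
        1,
        count_pages=fake_count_pages,
        extract_page=fake_extract_page,
    )
    print(random_name)
```

Keeping the backend behind two small functions is one possible way to let the tables, images and text sub-packages share the same page-selection logic while each uses a different extraction library.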
There are many open-source packages, such as Tabula-py for table extraction, PyPDF2 for text and, I think, images as well, pdfminer, which is used mainly for text extraction, and Camelot, again for tables. I hope this makes it reasonably clear what is to be done in the project.