Convert PDF files to TEXT files (OCR optical character recognition)
Project detail
I am looking for a person who has vast experience using OCR software (optical character recognition software) and who can help me choose a solution for my needs.
I have many PDF documents that need to be recognized (OCR) and converted into delimited text/ASCII format so that they can be loaded into a database on a Windows machine, in a Windows Server environment.
The PDF documents are lists of data (I have included an example which is explained below). The PDF files all have essentially the same format, with very small variations.
The final solution will be called from a management system (spawning a DOS box) and run either via an executable or a batch file, for example:
C:\> ocr.exe input.pdf output.txt
or
C:\> ocr.bat input.pdf output.txt
The input.pdf file is the PDF that needs to be recognized (OCR) and converted into text. The results are then written to the output.txt file.
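For bidders working in Python, the calling convention above could be sketched roughly as follows. This is only an illustration of the two-argument interface: the names `parse_args`, `convert`, and `main` are hypothetical, and `convert` is a placeholder where the real OCR steps would go.

```python
import argparse
import sys
from pathlib import Path


def parse_args(argv):
    """Parse the two positional arguments: input PDF and output text file."""
    parser = argparse.ArgumentParser(
        prog="ocr", description="Convert a scanned PDF to delimited text."
    )
    parser.add_argument("input_pdf", type=Path, help="scanned PDF to read")
    parser.add_argument("output_txt", type=Path, help="text file to write")
    return parser.parse_args(argv)


def convert(input_pdf, output_txt):
    """Placeholder (hypothetical): the actual OCR pipeline would go here."""
    raise NotImplementedError("OCR pipeline not implemented in this sketch")


def main(argv=None):
    """Return 0 on success and a non-zero code on failure."""
    args = parse_args(sys.argv[1:] if argv is None else argv)
    if not args.input_pdf.is_file():
        print(f"Input file not found: {args.input_pdf}", file=sys.stderr)
        return 2
    convert(args.input_pdf, args.output_txt)
    return 0

# When saved as a script, the last line would typically be:
#     sys.exit(main())
```

Returning a non-zero exit code on failure lets the management system that spawns the process detect that a conversion did not complete.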
Included in this project is a 3 page test PDF document which contains test data using the format I need interpreted.
The second page is slightly crooked, and this is intentional. The third page is a copy of the first page, but placed upside down, which is also intentional.
The actual data is printed on paper from an old legacy system and then scanned with a photocopier-scanner to create a PDF document.
Since the pages are manipulated by a human, placing them in the photocopier, some pages can be crooked, and some may be upside down.
This is why your program must recognize these anomalies and correct them.
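To illustrate what correcting a crooked page can involve, here is a minimal Python sketch of one common approach, a projection-profile skew estimate. It is an assumption about how a solution might work, not a requirement: the function name `estimate_skew` and its parameters are hypothetical, and the input is assumed to be a list of (x, y) coordinates of dark pixels already extracted from the page image. Upside-down pages are a separate problem that this sketch does not handle; they need an orientation check such as the orientation detection offered by Tesseract.

```python
import math
from collections import Counter


def estimate_skew(points, max_angle=5.0, step=0.5):
    """Estimate the rotation (in degrees) of a page from its ink points.

    For each candidate angle the points are rotated back by that angle and
    counted per row. Straight text lines concentrate the ink in few rows,
    so the angle with the highest sum of squared row counts wins.
    Angles follow the mathematical convention (counterclockwise positive).
    """
    best_angle, best_score = 0.0, -1
    steps = int(round(max_angle / step))
    for k in range(-steps, steps + 1):
        angle = k * step
        theta = math.radians(angle)
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        # Row position of each point after undoing a rotation by `angle`.
        rows = Counter(round(-x * sin_t + y * cos_t) for x, y in points)
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

The page would then be rotated by the negative of the returned angle before it is passed to the OCR engine.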
There are usually several hundred pages of names, and therefore many pages in the PDF files.
The data to be extracted from the PDF file is primarily the list of names with the various columns of data next to each name.
In the header section of the top of the page, there is some data that needs to be identified:
Date
Time
ESTABLISHMENT:
LOCKERS :
LOCK :
FILES / OPTI.:
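As a rough illustration of how these header fields might be picked out of the recognized text, here is a Python sketch using a regular expression. It assumes the OCR output keeps each label followed by a colon and a single value without spaces; the real layout in the test PDF may differ. The name `parse_header` is hypothetical, and Date and Time are left out because their printed format is not described here.

```python
import re

# Labels as listed above; spacing around "/" and before ":" may vary.
# LOCKERS is listed before LOCK so the longer label is tried first.
HEADER_PATTERN = re.compile(
    r"\b(ESTABLISHMENT|LOCKERS|LOCK|FILES\s*/\s*OPTI\.)\s*:\s*(\S+)"
)


def parse_header(text):
    """Return a dict of header label -> value found in the OCR text."""
    fields = {}
    for match in HEADER_PATTERN.finditer(text):
        # Normalize "FILES / OPTI." to "FILES/OPTI." by dropping whitespace.
        label = re.sub(r"\s+", "", match.group(1))
        fields[label] = match.group(2)
    return fields
```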
The resulting text file should have the data from the top of the page clearly identified so that the program reading the file can easily parse it. For example, you could put lines in the text output file like this:
ESTABLISHMENT=A7F
LOCKERS=AV
LOCK=8
FILES/OPTI.=0
Then each of the lines of name data can be put like this, with the pipe symbol (|) as the field delimiter:
NAMEINFO=6AAX|FAFORGE|NATASHA|4JA-01A01X-27|DF|A|1|0|1
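The output layout described above could be produced with something like the following Python sketch. The function name `format_page` and the shape of its inputs (a dict of header values and a list of rows, each a list of field strings) are assumptions made for illustration. Date and Time could be added to the key list in the same way.

```python
# Header keys in the order they should appear in the output file.
HEADER_KEYS = ["ESTABLISHMENT", "LOCKERS", "LOCK", "FILES/OPTI."]


def format_page(header, rows):
    """Build the KEY=value header lines and pipe-delimited NAMEINFO lines."""
    lines = [f"{key}={header[key]}" for key in HEADER_KEYS if key in header]
    for row in rows:
        # Strip stray whitespace that OCR often leaves around each field.
        lines.append("NAMEINFO=" + "|".join(field.strip() for field in row))
    return "\n".join(lines) + "\n"
```

For example, calling it with the header values and the single name row shown above:

```python
header = {"ESTABLISHMENT": "A7F", "LOCKERS": "AV", "LOCK": "8", "FILES/OPTI.": "0"}
rows = [["6AAX", "FAFORGE", "NATASHA", "4JA-01A01X-27", "DF", "A", "1", "0", "1"]]
print(format_page(header, rows), end="")
```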
Here is what I want you to do:
1) Look at the test PDF file provided and determine if you can do this work;
2) If you can do the work, then choose the correct OCR tool to do the job (example: Tesseract/Tensorflow/Open CV);
3) Create a program or script to execute the task (via .exe or .bat);
4) Provide the program or script to me to test;
5) If it works, you will then explain in a DETAILED document, how you made it work so that I can understand each step;
6) Then you get paid;
I understand that most OCR solutions require you to tweak or configure them so that they can better understand the content of the PDF file. You will need to explain this in your documentation.
PYTHON USERS: If you are using python and installing special packages, all of these packages MUST run on Windows Server. If you are creating an executable from all of them, you must explain all the steps showing how you did it.
Please read and understand the project details carefully. Your bid on this project is your final bid. If you are awarded the project you cannot ask for more money or a tip after the project is awarded. You will be paid what you bid. If you have any questions, please ask them before you bid. Your level of professionalism will determine if I do future work with you, as this is the first phase of a multi-phase project (There are 2 more, more complex, PDF formats I need interpreted, plus more stuff).
Thank you
Open CV
tensorflow
Tesseract
pytorch
caffe
keras