Home > Forum Home > Developing Business Administration Solutions > Importing Data from PDF Files Share

Importing Data from PDF Files

Excel Help for Importing Data From Pdf Files in Developing Business Administration Solutions


Forum TopicPost Reply Login

Importing Data From Pdf Files

Rate this:
(4.2/5 from 24 votes)
HappyBusiness Spreadsheets has developed a free Excel program to extract and import PDF data into Excel which can be downloaded and used without restriction.

There is a common need to extract and import specific data from PDF files into Excel. Since Excel does not natively support the reading of PDF content, utilities are needed to convert the PDF file content for the Excel format. Several commercial applications accomplish this; however it is often the case where only specific data is required to be imported from multiple PDF files into one structured format.

We created such an application by using VBA code in conjunction with an open source PDF to Text conversion utility, which can be found at Foolabs.

[Download the free PDF data import Excel program here]

Update: 19-Feb-2012
A new version also extracts multiple instances of the same data matching pattern from one or more PDF files.

The program relies on the conversion utility (included in the download) and all PDF files to reside in the same directory as the Excel application. Text or data to extract are defined in the Control sheet by specifying start text, end text and multiple replacements routines with wildcard support. This enables flexibility to obtain comparable data from multiple PDF files based on patterns independent of different PDF file structures.

As many extraction rules as required can be set in order to create a table of information imported by extraction rule and PDF file name. Information on how to set up rules is available within the Excel application with a help icon and cell comments. The VBA code is commented and open for modification.

Any improvements or new features to the code are welcome to be posted here so that we can update the download version to the benefit of everyone.
 Excel Business Forums Administrator
 Posted by on
 
Replies - Displaying 1 to 10 of 88Order Replies By: Most Recent | Chronological | Highest Rated
Confused
Rate this:
(4.5/5 from 4 votes)
Excellent Tool and very Handy, But I have one small issue here , How is it handled when we have source/destination text appears many times.

Ex:
  Date 07/20/2008 Depth :00 m 

   When I try to grab the Date text, B/W Date and Depth it gives me wrong data as I have Depth field some where else also on the same doc.

Any inputs will be appreciated 
 Posted by on
Oops
Rate this:
(4/5 from 4 votes)
At this stage, the program can find the last instance of the text match in the PDF files.  It could be modified to save all matches to memory and then output them by modifying the VBA code. 

To extract exact instances of the text you need to be explicit with the start and end text by providing as much unique text as possible so that the program does not confuse it with other text in the PDF file.  In this case, there may be other text after or before "Depth" that can be specified  in order to extract the correct instance.
 Excel Business Forums Administrator
 Posted by on
Confused
Rate this:
(3/5 from 1 vote)
Hi, this is very neat and quite useful. Nice work.

I was wondering, is there was someway to make this work with interactive Pdf Forms? The idea is to create a fillable form (I use Foxit Phantom) and distribute to users in the ubiquitous pdf format. They fill it and return it and then we extract the data into an excel sheet.

This would be a great addition if possible. Specially because the great thing about your tool is that it skips the need for the user to convert the file to a .xls manually. I cannot seem to find any way to do this, without actually having my end user first convert to xls and then import. When the forms number in the thousands, that's just impracticle.

Any help would be greatly appreciated. 
 "In all things, be men"
 Posted by on
Oops
Rate this:
(2/5 from 2 votes)
To recuperate interactive form data, the best solution is to use a script to load to a data connection.

Foxit Phantom PDF provides some type of ODBC connectivity via JavaScript.  Also, Adobe Acrobat provides a tool to extract XFA data into a spreadsheet.
 Excel Business Forums Administrator
 Posted by on
Confused
Rate this:
(4/5 from 2 votes)
Hi! Congrats on this great tool!

I'd like to know how to modify the code so one could get multiple data from a single PDF.
Example:
----
New Contact
----
[Start Text:Name] data collected  [End Text]
[Start Text:Phone] data collected  [End Text]
----
New Contact
----
[Start Text:Name] data collected  [End Text]
[Start Text:Email] data collected  [End Text]
[Start Text:Address] data collected  [End Text]

Note that each client may have different fields, but the current control page its already taking care of various fields.  It would be necessary to create a check for a separator ("New Contact" in this case)
Is this possible or it would be required to develop a new code?

Thanks from Brasil 
 Alexandre
 Posted by on
Oops
Rate this:
(3/5 from 1 vote)
To recuperate multiple instances of the same rule would require a modification to the code.  The only solution currently is to define unique rules for each instance.
 Excel Business Forums Administrator
 Posted by on
Confused
Rate this:
(3/5 from 1 vote)
and how would it look like, to take more repeatable data from one .pdf file?
which part of the code shold be changed?  maybe "For Each .... in .... loop" will help here, but which variable is  to be repeated ?


i need it, cause i got every day statement in .pdf with 120 pages and untill now i have coppied every ... 

Transaction Amount, 
Payment Details, 
Name/Address ... ect ... (around 400 copies and 400 paste-s :S)


x10 TNX
 Posted by on
Oops
Rate this:
(3/5 from 1 vote)
This is definitely something that would be good to add to the code as an option. At the moment, the last instance is returned as the result.  The code loops through each line of the text content of the PDF looking for the start pattern and end pattern and saving the part in between in memory. The prt of code in the Run_Extraction() routine that gets the text is right before the end of the loop as:

VBA Code:
'test if in text gathering blCont and append
If blCont = True And strTemp <> "" And Not strLine Like "*" & cs & "*" Then
        strTemp = strTemp & Chr(10) & Trim(strLine)
End If

This needs to be modified with another variable to hold what was previously found and appended to the overall result.
 Excel Business Forums Administrator
 Posted by on
Happy
Rate this:
(4/5 from 2 votes)
It seems that a frequently required solution is to extract multiple instances of the same text pattern from one or more PDF files.  The logic behind this is slightly different than the original setup which attempts to line up multiple data for the same pattern from multiple PDF files.

The new version, which can be downloaded from the original link above, adds a second table of results in the Output sheet which lists multiple instances of data matched within and across PDF files processed. This approach retains the benefit of the original consolidation approach while adding the support for multiple data instances within each file.

Another feature added is the ability to retain the text files generated for each PDF file to assists with pattern matching setup and for alternative use of the content. The new version has cell comments with detailed information on the logic.

Please post your feedback here so that we can continue to improve this free and open source solution for importing PDF data into Excel.
 Excel Business Forums Administrator
 Posted by on
Applaud
Rate this:
(4/5 from 2 votes)
Great addition!
Testing the new version i've discovered that another problem is now fixed. :)
Here in Brasil we use a comma as a decimal separator, in the old version that caused the script to end the field when a comma was found inside the text. Now its working fine.
The multiple instances of text is now found and allocated in the "Multiple Instances Data" table.
Is it possible get the results in a table like the Combined Last Instances, with the columns in order? 
 Alexandre
 Posted by on
 Displaying page 1 of 9 

Excel templates and solutions matched for Importing Data from PDF Files:

Solutions: Export MapPoint Waypoints Survey Data Analysis