Tables in PDF files
I have come across so many public data repositories that hold data in PDF format. Other websites have tables within documents such as annual reports etc., also in PDF format. A data source for PDFs or tables from PDFs would be awesome!
Thanks everyone for your feedback – We plan to release a PDF connector to import data from tables by the end of 2017 or early 2018. Stay tuned for more updates as we get closer.
I have the same need as everyone else, but I'm not sure this functionality should be built within Power Query. I have an overall need to get data from web pages (aka web scraping) behind logins and also to download and parse PDF tables. I currently use a third-party web scraping and PDF extraction service to do this, and it works. I think having a service by Microsoft with Power Query integration and Microsoft Flow integration would be beneficial.
Peter Schmidt commented
I have a customer who wishes to analyse their phone bill, but the "electronic" version their provider sends them is a 400-page PDF document!! You can use Excel as an intermediate step, but columns get transposed, so manual data wrangling is still required. This feature cannot come soon enough!
thomas jackson commented
Much needed to keep on top of industry publications, which are more commonly released as PDF.
Colin Miles commented
Many individual businesses are now sending PDF receipts via email. The ability to parse that data would be amazing for granular project visibility.
Any update on this, start
Max Gregson commented
This is huge for professional services firms too, even if the result had to be cleaned up afterwards. The biggest ask we get is to be able to extract the data held in tables within PDFs, which feels like it should be an easy (or at least easier) problem to solve.
Brian Spiller commented
Exactly as Ken Puls states...
I am looking at a bank statement that is 186 pages long (and sometimes statements are much, much bigger). I can use NitroPro to convert the document to a straight txt file and then bring it into Excel, either directly or through PQ.
But isn't cutting out that conversion step to txt one of the big reasons PQ exists?
Considering that PDFs are the only source for certain data, I am really surprised that PDF is not yet a valid source for PQ.
Ken Puls commented
Since this got merged from a different thread, I just want to clarify something as the topic is not quite the same...
What I'm looking for is the ability to read from a PDF. While extracting tables would be nice, my priority would be to read the PDF as a text file so that I can do my own parsing of any of the data inside. I.e. I don't want this restricted to only pulling in data that looks like a table.
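To illustrate what "doing my own parsing" could look like once a PDF has already been converted to plain text, here is a minimal Python sketch. It is not part of Power Query and not a proposed implementation. The sample text, the `parse_rows` name and the rule "columns are separated by two or more spaces" are all assumptions made for the example; real converter output may need a different rule.

```python
import re

# Hypothetical sample: text as it might come out of a PDF-to-text conversion.
SAMPLE_TEXT = """\
Statement of Account
Date        Description          Amount
01/02/2017  Coffee Shop          -4.50
03/02/2017  Salary               2500.00
Page 1 of 186
"""


def parse_rows(text, min_columns=3):
    """Keep lines that split into at least `min_columns` fields.

    Assumption: columns are separated by runs of two or more spaces,
    while single spaces occur only inside a field.
    """
    rows = []
    for line in text.splitlines():
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= min_columns:
            rows.append(fields)
    return rows


rows = parse_rows(SAMPLE_TEXT)
header, records = rows[0], rows[1:]
```

Lines such as the title and the page footer split into a single field and are dropped, which is why having the raw text available (not only detected tables) leaves the parsing rules in the user's hands.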
Anthony Newell commented
Here's my input on this idea:
1) Ability to extract from a document (PDF or Word): if you receive a data source on a regular basis in document format with a regular embedded table of data, you could extract it using PQ.
2) Convert a set of reports in PBI to a PDF document, so you can produce and distribute a hard-copy report pack by email or upload it to SharePoint. Sometimes the requirement is to have reports consumed in this way, so this gives greater flexibility and opens up more usage possibilities for PBI.
Mauro Gamberini commented
General Ledger commented
I can't believe having a PDF as a source file is not already included. PDFs are so common and are the most troublesome to work with.
Ken Puls commented
Please bring this to Excel as well. I get this question EVERY time I teach a course on using Power Query. It's a very big need!
Engaged User commented
Agree, sometimes you just don't have access to the nice-to-have CSV file. If the PDF was generated from an Excel file to begin with, reverting it back would be awesome.
It would be great if PBI Desktop could load PDF files - both physical and scanned.
Kamel S. Abou Saleh commented
I also vote for PDF
This would be super for government data sources. Example: http://www.dfw.state.or.us/MRP/salmon/Historical_Data/docs/TrollEffTable.pdf
Gerry Baerman commented
I'll add a third vote for this. As Gogula indicates, PDFs are the rule for a lot of public domain data on the Web, especially from the US Gov. Personally, I hate PDFs and my choice would be to simply make them illegal :) , but if we have to live with them, we're going to need a way to mine the data from that hideous file format.