How can we improve Power BI?

Tables in PDF files

I have come across so many public data repositories that hold data in PDF format. Other websites have tables within documents such as annual reports etc., also in PDF format. A data source for PDFs or tables from PDFs would be awesome!

2,851 votes

    Gogula Aryalingam shared this idea

    261 comments

      • Anonymous commented

        I have the same need as everyone else, but I'm not sure this functionality should be built into Power Query. I have an overall need to get data from web pages behind logins (i.e. web scraping) and also to download and parse PDF tables. I currently use a third-party web scraping and PDF extraction service to do this, and it works. I think a Microsoft service with Power Query and Microsoft Flow integration would be beneficial.

        Just imagine the mountains of data locked up in web pages and PDFs. Most tables are not simple to parse, though: there are JavaScript post-backs, badly coded websites, and PDFs that contain tables but offer no easy way to pull out all the data. A visual tool that helps you write the Power Query code in debug mode as you step through a website or PDF is what it would take to make this a strong offering instead of one that is just good enough.

      • Peter Schmidt commented

        I have a customer that wishes to analyse their phone bill, but the "electronic" version their provider sends them is a 400-page PDF document! You can use Excel as an intermediate step, but columns get transposed, so manual data wrangling is still required. This feature cannot come soon enough!

      • Colin Miles commented

        Many individual businesses are now sending PDF receipts via email. The ability to parse that data would be amazing for granular project visibility.

      • Max Gregson commented

        This is huge for professional services firms too, even if the result had to be cleaned up afterwards. The biggest ask we get is to be able to extract the data held in tables within PDFs, which feels like it should be an easy (or at least easier) problem to solve.

      • Brian Spiller commented

        Exactly as Ken Puls states...
        I am looking at a bank statement that is 186 pages long (and some statements are much, much bigger). I can use NitroPro to convert the document to a plain txt file and then bring it into Excel, either directly or through PQ.
        But isn't cutting out that conversion step to txt one of the big reasons PQ exists?
        Considering that PDFs are the only form in which certain data is disseminated, I am really surprised that PDF is not yet a valid source for PQ.

      • Ken Puls commented

        Since this got merged from a different thread, I just want to clarify something as the topic is not quite the same...

        What I'm looking for is the ability to read from a PDF. While extracting tables would be nice, my priority would be to read the PDF as a text file so that I can do my own parsing of any of the data inside. I.e. I don't want this restricted to only pulling in data that looks like a table.

      • Anthony Newell commented

        Here's my input on this idea:

        1) Ability to extract from a document (PDF or Word). If you receive a data source on a regular basis in document format with a consistently structured embedded table of data, you could extract it using PQ.

        2) Convert a set of reports in PBI to a PDF document, so that you can produce and distribute a hard-copy report pack by email or upload it to SharePoint. Sometimes the requirement is to have reports consumed in this way, so this would give greater flexibility and open up more usage possibilities for PBI.

      • General Ledger commented

        I can't believe having a PDF as a source file is not already included. PDFs are so common and are the most troublesome to work with.

      • Ken Puls commented

        Please bring this to Excel as well. I get this question EVERY time I teach a course on using Power Query. It's a very big need!

      • Engaged User commented

        Agree, sometimes you just don't have access to the nice-to-have CSV file. If the PDF was generated from an Excel file to begin with, reverting it back would be awesome.

      • Paul commented

        It would be great if PBI Desktop could load PDF files - both physical and scanned.

      • Gerry Baerman commented

        I'll add a third vote for this. As Gogula indicates, PDFs are the rule for a lot of public domain data on the Web, especially from the US Gov. Personally, I hate PDFs and my choice would be to simply make them illegal :) , but if we have to live with them, we're going to need a way to mine the data from that hideous file format.
