Dataflows DirectQuery
Dataflows would be far more useful if you could connect to them via DirectQuery. That would let us query enormous dataflows that are impractical to load into the data model, and let us build aggregate dataflows that we do load into the data model.
I'm unsure whether the architecture of a dataflow permits this, but it would be quite handy if it does.
I totally support this idea. Right now I have a dataflow with 55 million records. Every time I need to use this dataflow to create a new report, I am forced to download 5 to 10 GB of data to my desktop. The initial understanding was that datasets would establish a live connection to dataflows; I was disappointed to find that they don't!
James Wheeler commented
The reality is that there are two classes of Power BI users.
On one side are the software engineers and data experts who typically own the data sources that feed Power BI. These people have the expertise to write complex queries and do other calculations against the source data. You need to give them a way to put data from their source into the service efficiently, for use by the other type of user. For obvious reasons, they NEVER want to pull the same data from their source more than once.
The other user is the business analyst. This person typically does not have a background in SQL or programming. They need a graphical interface to efficiently transform the raw data that the data source owner has already brought into Power BI.
You can't expect the data source owner to do all the aggregation, filtering, and so on before bringing the data into the Power BI service. The whole point of Power BI, it would seem, is to enable people without software engineering or SQL backgrounds to analyze the data.
James Wheeler commented
This functionality is basically essential. All of the data intelligence tools need to work the following way:
1) Incremental refresh to pull large raw (or close-to-raw) data into the service efficiently. Incremental refresh needs to be much more intelligent and robust than it currently is.
2) Transform / query against the raw data already in the service. DO NOT copy the output someplace else!
3) MUST NOT require a Premium node.
At a high level, data ingest should be super efficient: NEVER ingest the same data twice. That means not only detecting new data based on a date column, but also giving the data source owner some mechanism to specify how to detect which data has been updated.
Once data is in the service, DO NOT copy it anywhere. Only make a copy if the owner of the transform explicitly indicates that the transform can run once and "cache" the output, because he/she is certain the results will never change or is OK with stale results. The default behavior should ALWAYS be something like a live connection to the data the owner imported.
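The "never ingest the same data twice" ingest described above, with both a date-column check for new rows and a separate mechanism for detecting updated rows, can be sketched generically. This is not a Power BI API; the `Row` shape, the `known_hashes` store, and the watermark parameter are all illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from hashlib import sha256

@dataclass
class Row:
    key: str              # business key in the source system (assumed)
    modified: datetime    # source-side "last modified" date column
    payload: str          # row contents, serialized for hashing

def rows_to_ingest(source_rows, watermark, known_hashes):
    """Decide what to load on a refresh: brand-new rows past the date
    watermark, plus already-seen rows whose content hash changed
    (in-place updates a date column alone can miss).
    Unchanged rows are never re-ingested."""
    # Cheap pre-filter on the date column; in real use this predicate
    # would be pushed down into the source query. Rows already tracked
    # are kept as candidates so the hash check can spot updates.
    candidates = [r for r in source_rows
                  if r.modified > watermark or r.key in known_hashes]
    picked = []
    for row in candidates:
        h = sha256(row.payload.encode()).hexdigest()
        if known_hashes.get(row.key) != h:   # new or updated content
            picked.append(row)
            known_hashes[row.key] = h        # remember what we loaded
    return picked
```

A second refresh over the same unchanged rows returns nothing, which is the behavior the comment is asking the service to guarantee by default.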
Ben f commented
This is a must-have.
Allow for mass data extraction, directly from the Power BI interface or via any other method an analyst could use to retrieve the records of a given query (emailing a link to a OneDrive share where the dataset is posted, fetching XML, etc.). The current limit of 30k is a significant limitation.
OK, the feature was announced at the summit event... but limited to the Premium side of Power BI.
So it may be better to go with Azure Data Explorer on top of the data lake; we'd get nearly the same feature at a far lower cost!
April Fu commented
Strongly agree to have this feature!
I agree, this feature is a must-have; otherwise we are just doubling our data processing and data movement.
Another component is being able to use the Analyze in Excel feature against a dataflow instead of a Power BI model. Sometimes I just need to pull a single table of a model into an Excel file.
Yes, a really important feature.
And please make this work in mixed (composite) models too.
Martin McMillan commented
Can't imagine a better improvement! I'm currently having to import large amounts of data, which makes the report sluggish. I can't use DirectQuery because we don't have a data warehouse.
David Cresp commented
This would certainly be very useful for us as well. Dataflows are a great way to put data in the cloud, and being able to DQ them would make them even better.
The other thing I'm finding is that the more I tinker with dataflows, the more important being able to DQ them becomes, for two reasons.
1) Scheduling refreshes has just become a lot more complex. To get a dataset refreshed with the latest data, the dataflow refresh must complete first, and only after that finishes can the dataset refresh start. The sequence of timings is going to be complex, and time-consuming too: my first dataflows take around an hour to process, and my datasets take a while after that. Waiting a few hours for the latest data feels like a step backwards compared to the refresh times I was getting with my existing models.
2) We can use dataflows to establish a 'Single Model of Truth' in the data preparation. However, since anyone can modify a dataflow's data in their own datasets, there is no certainty that the data's validity is maintained. A DQ connection would assure that the Single Model of Truth is preserved.
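The refresh-sequencing pain described in point 1 is essentially a chained orchestration problem: start the dataflow refresh, wait for it to succeed, only then kick off the dataset refresh. A minimal, generic sketch follows; the `start_refresh`, `poll_status`, and `next_step` callables are placeholders for whatever actually triggers and monitors the refreshes (for example, the Power BI REST API), and the status strings and polling parameters are assumptions.

```python
import time

def run_chained_refresh(start_refresh, poll_status, next_step,
                        poll_seconds=30, timeout_polls=120):
    """Kick off an upstream refresh (e.g. a dataflow), poll until it
    completes, then trigger the downstream step (e.g. a dataset
    refresh). Fails loudly if the upstream refresh fails or never
    finishes, so a stale dataset is never silently refreshed."""
    start_refresh()
    for _ in range(timeout_polls):
        status = poll_status()           # hypothetical status probe
        if status == "Success":
            return next_step()           # upstream done: refresh dataset
        if status == "Failed":
            raise RuntimeError("upstream refresh failed; downstream skipped")
        time.sleep(poll_seconds)         # still in progress: wait and re-poll
    raise TimeoutError("upstream refresh did not finish in time")
```

With DirectQuery over dataflows, this whole chain (and its hours of cumulative latency) would collapse to a single dataflow refresh.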