It is just not going to work. The main point to remember here is that Spark DataFrames* are not data containers. They are descriptions of transformations that will be applied to the data once the pipeline is executed. This means the result can be different every time you evaluate the data. The only meaningful question you can ask here is whether both DataFrames describe the same execution plan, which is obviously not useful in your case.
So how do you compare the data? There is really no universal answer here.
Testing
If it is part of a unit test, collecting the data and comparing local objects is the way to go (although please keep in mind that using sets can miss some subtle but common problems, such as differing duplicate counts).
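To illustrate the pitfall mentioned above, here is a minimal plain-Python sketch (the row contents are made up) showing how a set-based comparison of collected rows silently ignores duplicates, while an order-insensitive comparison that preserves multiplicity does not:

```python
# Rows collected from two hypothetical DataFrames (contents are illustrative).
rows_a = [("x", 1), ("x", 1), ("y", 2)]
rows_b = [("x", 1), ("y", 2), ("y", 2)]

# A set comparison ignores multiplicity, so it wrongly reports equality here:
# both collapse to {("x", 1), ("y", 2)}.
print(set(rows_a) == set(rows_b))      # True

# Comparing sorted lists (or collections.Counter objects) respects duplicates.
print(sorted(rows_a) == sorted(rows_b))  # False
```

In practice you would also need to handle unhashable row contents and floating-point tolerance, but the multiplicity issue is the one that most often slips through.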
Production
Outside unit tests you can try to check whether:
- the size of A is equal to the size of B
- A EXCEPT B IS ∅ AND B EXCEPT A IS ∅
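Expressed locally over collected rows, the two conditions above amount to the following check (a plain-Python sketch over lists of hashable rows; the function name is illustrative, and in production the check would of course run distributed rather than on collected data):

```python
def equal_by_size_and_except(a, b):
    """Local sketch of the check above: len(A) == len(B),
    A EXCEPT B is empty, and B EXCEPT A is empty."""
    return (len(a) == len(b)
            and not (set(a) - set(b))   # A EXCEPT B IS empty
            and not (set(b) - set(a)))  # B EXCEPT A IS empty

print(equal_by_size_and_except([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]))  # True
print(equal_by_size_and_except([(1, "a")], [(1, "a"), (2, "b")]))            # False
```

Note that SQL EXCEPT operates on distinct rows, which is why the size check is needed in addition to the two set differences.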
This, however, is very expensive and, even when feasible, might significantly increase the cost of the process. So in practice you might prefer methods which don't provide strict guarantees but have a better performance profile. These will differ depending on the input and output source as well as the failure model (for example, file-based sources are more reliable than ones using databases or message queues).
In the simplest case you can manually inspect basic invariants, like the number of rows read and written, using the Spark web UI. For more advanced monitoring you can implement your own Spark listeners (see for example Spark: how to get the number of written rows?), query listeners, or accumulators, but all these components are not exposed in sparklyr
and will require writing native (Scala or Java) code.
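As a minimal illustration of such an invariant check, here is a hypothetical helper (the function name and tolerance are made up; the counts themselves would come from the web UI, a listener, or an accumulator):

```python
def rows_invariant_holds(rows_read, rows_written,
                         expected_ratio=1.0, tolerance=0.01):
    """Hypothetical sanity check: flag runs where the written row count
    deviates from expected_ratio * rows_read by more than tolerance."""
    if rows_read == 0:
        return rows_written == 0
    return abs(rows_written / rows_read - expected_ratio) <= tolerance

print(rows_invariant_holds(1_000_000, 999_500))  # True  (within 1% of 1:1)
print(rows_invariant_holds(1_000_000, 500_000))  # False (half the rows lost)
```

This gives no strict correctness guarantee, but it is cheap and catches the gross failures (dropped partitions, partially failed writes) that matter most in practice.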
* I refer here to Spark, but using dplyr
with a database backend is not that different.