Are we doing PySpark DataFrame “union” the right way?

Mageswaran D
5 min read · Aug 25, 2020

Union is a simple operation: append the rows of two or more tables. What could go wrong? Hmm…

  • Schema mismatch: the number of columns or their datatypes differ
  • Rows getting appended in the wrong column order

In this post I am gonna show some pitfalls we faced in our project some time back, when we used union to add tables from ingestion drops, and how we managed them.

Join vs Union

It would be nice to recap the differences… just the basic ones!

  • In a join we combine two tables with different schemas based on a common key column. The new table can have columns from both tables, and the user chooses which columns appear in the joined table and under what names.
  • Union, on the other hand, appends the rows of one table to another table with the same schema (the same columns in the same order).

Sample Tables:

Note: the date columns are cast to DateType()

df:

+-------+----------+-------+-----+
|   name|      date|product|price|
+-------+----------+-------+-----+
|    Sri|2019-10-10|  Paint|   80|
| Subash|2019-04-02| Ladder|   20|
| Balaji|2019-06-22|  Stool|   20|
|Sarvesh|2019-12-09| Vacuum|   40|
|Chandru|2019-07-12| Bucket|    5|
+-------+----------+-------+-----+

df1:

+----------+-------+-----+-----+     
| date|product|price| name|
+----------+-------+-----+-----+
|2018-10-10| Paint| 80| John|
|2018-04-02| Ladder| 20| Alex|
|2018-06-22| Stool| 20|Varun|
|2018-12-09| Vacuum| 40|Mages|
|2018-07-12| Bucket| 5| Arun|
+----------+-------+-----+-----+

df2:

+----------+-------+-----+     
| date|product|price|
+----------+-------+-----+
|2018-10-10| Paint| 80|
|2018-04-02| Ladder| 20|
|2018-06-22| Stool| 20|
|2018-12-09| Vacuum| 40|
|2018-07-12| Bucket| 5|
+----------+-------+-----+

Union Cases

  • Case 1 : Tables with the same number of columns, in the same order
df.union(df).show()

+-------+----------+-------+-----+
|   name|      date|product|price|
+-------+----------+-------+-----+
|    Sri|2019-10-10|  Paint|   80|
| Subash|2019-04-02| Ladder|   20|
| Balaji|2019-06-22|  Stool|   20|
|Sarvesh|2019-12-09| Vacuum|   40|
|Chandru|2019-07-12| Bucket|    5|
|    Sri|2019-10-10|  Paint|   80|
| Subash|2019-04-02| Ladder|   20|
| Balaji|2019-06-22|  Stool|   20|
|Sarvesh|2019-12-09| Vacuum|   40|
|Chandru|2019-07-12| Bucket|    5|
+-------+----------+-------+-----+

Here we have unioned the table with itself (pretend the second table is a different one with different data). The output looks exactly like what we expect. Fine, good!

  • Case 2 : Tables with the same number of columns, but in a different order
df.union(df1).show()

+----------+----------+-------+-----+
|      name|      date|product|price|
+----------+----------+-------+-----+
|       Sri|2019-10-10|  Paint|   80|
|    Subash|2019-04-02| Ladder|   20|
|    Balaji|2019-06-22|  Stool|   20|
|   Sarvesh|2019-12-09| Vacuum|   40|
|   Chandru|2019-07-12| Bucket|    5|
|2018-10-10|     Paint|     80| John|
|2018-04-02|    Ladder|     20| Alex|
|2018-06-22|     Stool|     20|Varun|
|2018-12-09|    Vacuum|     40|Mages|
|2018-07-12|    Bucket|      5| Arun|
+----------+----------+-------+-----+

Shouldn’t PySpark take care of the column ordering when unioning the tables? And what happened to the datatypes of the columns — dates are now sitting in the name column?

Well, it seems union is purely positional: it considers neither the column names nor their datatypes!

  • Case 3 : Tables with a different number of columns or a different schema
df1.union(df2).show()

AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 4 columns and the second table has 3 columns;;

Huh — at least a mismatch in the number of columns fails loudly, as expected!

What’s the correct way to union tables with the same number of columns but a different order?

df.select(*sorted(df.columns)).union(df1.select(*sorted(df1.columns))).show(10)

+----------+-------+-----+-------+
|      date|   name|price|product|
+----------+-------+-----+-------+
|2019-10-10|    Sri|   80|  Paint|
|2019-04-02| Subash|   20| Ladder|
|2019-06-22| Balaji|   20|  Stool|
|2019-12-09|Sarvesh|   40| Vacuum|
|2019-07-12|Chandru|    5| Bucket|
|2018-10-10|   John|   80|  Paint|
|2018-04-02|   Alex|   20| Ladder|
|2018-06-22|  Varun|   20|  Stool|
|2018-12-09|  Mages|   40| Vacuum|
|2018-07-12|   Arun|    5| Bucket|
+----------+-------+-----+-------+

What we came up with was to sort the columns of both tables and select them in that sorted order before the union.

One drawback is that this changes the column order of the original table, but the original order is easy to capture beforehand and restore with a select after the union is done!

unordered_cols = df.columns
df1.select(*sorted(df1.columns)).union(df.select(*sorted(df.columns))).select(unordered_cols).show(10)
+-------+----------+-------+-----+
|   name|      date|product|price|
+-------+----------+-------+-----+
|   John|2018-10-10|  Paint|   80|
|   Alex|2018-04-02| Ladder|   20|
|  Varun|2018-06-22|  Stool|   20|
|  Mages|2018-12-09| Vacuum|   40|
|   Arun|2018-07-12| Bucket|    5|
|    Sri|2019-10-10|  Paint|   80|
| Subash|2019-04-02| Ladder|   20|
| Balaji|2019-06-22|  Stool|   20|
|Sarvesh|2019-12-09| Vacuum|   40|
|Chandru|2019-07-12| Bucket|    5|
+-------+----------+-------+-----+

Another way of doing it… sort the columns, check that they are the same, and select using the first table’s columns while doing the union.

df.select(*df.columns).union(df1.select(*df.columns)).show(10)

or

def col_based_union(df_1, df_2):
    assert sorted(df_1.columns) == sorted(df_2.columns)
    return df_1.select(*df_1.columns).union(df_2.select(*df_1.columns))

col_based_union(df, df1).show()
+-------+----------+-------+-----+
|   name|      date|product|price|
+-------+----------+-------+-----+
|    Sri|2019-10-10|  Paint|   80|
| Subash|2019-04-02| Ladder|   20|
| Balaji|2019-06-22|  Stool|   20|
|Sarvesh|2019-12-09| Vacuum|   40|
|Chandru|2019-07-12| Bucket|    5|
|   John|2018-10-10|  Paint|   80|
|   Alex|2018-04-02| Ladder|   20|
|  Varun|2018-06-22|  Stool|   20|
|  Mages|2018-12-09| Vacuum|   40|
|   Arun|2018-07-12| Bucket|    5|
+-------+----------+-------+-----+

Will update this blog if I come across a better way of doing unions in PySpark that also takes column datatypes into account.

Check the Databricks notebook for a live example @ https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8963851468310921/2010579741105096/5846184720595634/latest.html
