So today I was trying to use the handy sha1() function provided by Spark, and I needed to concatenate all my columns into a single one, since it did not support multiple ones. The solution seemed easy at first: use concat(). However, something odd was happening.

It turns out I had more than 600 columns to concatenate, and I wanted to leave some of them out of the operation. It should have been as easy as:

// some_condition is a placeholder: a predicate on column names to exclude
val cols = ds.columns.filterNot(some_condition).map(col)
ds.select(concat(cols: _*).as("concatenated"))

For most of the rows, that did the job. Despite the columns having multiple data types, Spark cast them to string successfully and returned the concatenation of all the desired columns. But a big issue arose: some rows had a null in the concatenated column even though most of their source columns held non-null values! And of course, I needed that column to always have a value, since it was going into a hash function.

After playing with some subsets of the columns and getting nowhere near a solution, I started exploring a simple ds.map(row => row.toString). That would have been enough if it hadn't been for the need to filter out some columns, and the inefficiency of adding a join later.

I was starting to lose hope when I noticed that the rows where the concatenation failed contained null values! That was odd: I just wanted those null values to not appear in the resulting string, but it turned out they invalidated the whole operation. Knowing what my issue was, I found a post about MySQL, where two similar functions behave differently when facing NULL values: CONCAT and CONCAT_WS.

The first one is the same one I was using in Spark: it concatenates all the columns you want. The latter also exists in Spark, with a nearly identical description; the only apparent difference is that concat_ws takes a separator. So, seemingly, no difference that mattered.

However, in MySQL there is a more important "feature": CONCAT_WS skips expressions that contain a NULL value. That was exactly what I needed. And indeed, that was the solution for me.
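The difference boils down to how the two functions treat nulls. Here is a plain-Scala sketch of those semantics (my own model for illustration, not Spark's actual implementation):

```scala
// concat-like: any null input makes the whole result null
def concatLike(values: Seq[String]): String =
  if (values.contains(null)) null else values.mkString

// concat_ws-like: a null separator yields null; otherwise null values are skipped
def concatWsLike(sep: String, values: Seq[String]): String =
  if (sep == null) null else values.filter(_ != null).mkString(sep)
```

So concatLike(Seq("2", null, "d")) is null, while concatWsLike("", Seq("2", null, "d")) is "2d".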

// assumes spark.implicits._ and org.apache.spark.sql.functions._ are in scope
val ds = Seq(
  (1, "sdw", "sre", BigDecimal("232.332323")),
  (2, null, "d", null)
).toDF("id", "str1", "str2", "decimal")

val cols = ds.columns.map(col)
ds.select(concat(cols: _*).as("normal"), concat_ws("", cols: _*).as("ws")).show(false)
+-----------------------------+-----------------------------+
|normal                       |ws                           |
+-----------------------------+-----------------------------+
|1sdwsre232.332323000000000000|1sdwsre232.332323000000000000|
|null                         |2d                           |
+-----------------------------+-----------------------------+
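With that, the original sha1 use case works, since the hash input is never null. Outside Spark, the end-to-end idea can be sketched in plain Scala with java.security.MessageDigest (a model of the pipeline, not Spark's own sha1):

```scala
import java.security.MessageDigest

// hex-encoded SHA-1 of a string, like Spark's sha1() over a string column
def sha1Hex(s: String): String =
  MessageDigest.getInstance("SHA-1")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString

// concat_ws-style: drop nulls before concatenating, so the hash input is always defined
val row  = Seq("2", null, "d", null)
val hash = sha1Hex(row.filter(_ != null).mkString)
// hash is the SHA-1 of "2d", a 40-character hex string
```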

I hope this helps!


NOTE: I later found out, in the commit that introduced these functions, that this behavior is explained in the code comments: Reynold Xin added "If any input is null, concat returns null." to the Concat case class, and the added ConcatWs states: "Returns null if the separator is null. Otherwise, concat_ws skips all null values."