Short introduction

This is part two of the Apache Spark gotchas series. If you want to read the first part, check it out here.


Another week, another brain teaser…

scala> case class Foo(number: BigDecimal)
defined class Foo

scala> val x = BigDecimal("11.11")
x: scala.math.BigDecimal = 11.11

scala> x.precision
res9: Int = 4

scala> x.scale
res10: Int = 2

scala> spark.createDataset(Seq(Foo(x))).printSchema()

Now, what are the precision and scale of the big decimal in the dataset we have just created? You may say:

Precision is four and scale is two, because that’s what’s printed out!

But I say:

Wrong answer!

And here’s the gotcha: in fact, the precision is going to be 38 and the scale is going to be 18.

scala> spark.createDataset(Seq(Foo(x))).printSchema()
root
 |-- number: decimal(38,18) (nullable = true)
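Spark’s encoder maps `scala.math.BigDecimal` to its default `decimal(38,18)` schema regardless of the value’s own precision and scale, which effectively rescales the value on the way in. A minimal plain-Scala sketch of that rescaling, no Spark required:

```scala
// Plain-Scala sketch of the rescaling implied by Spark's default
// decimal(38,18) schema: setScale pads the fraction with zeros,
// which also inflates the precision.
val x = BigDecimal("11.11")
val rescaled = x.setScale(18)

println(rescaled)           // 11.110000000000000000
println(rescaled.precision) // 20
println(rescaled.scale)     // 18
```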

You cannot enforce a specific precision and scale when encoding decimal types this way; there’s even a jira issue about it.

The workaround is very simple.

scala> val foos = spark.createDataset(Seq(Foo(x)))
foos: org.apache.spark.sql.Dataset[Foo] = [number: decimal(38,18)]

scala> import org.apache.spark.sql.types.DecimalType
import org.apache.spark.sql.types.DecimalType

scala> val superFoos = foos.toDF().withColumn("number", foos("number").cast(DecimalType(10, 2))).as[Foo]
superFoos: org.apache.spark.sql.Dataset[Foo] = [number: decimal(10,2)]
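One thing to keep in mind: casting down to a narrower decimal is lossy. A plain-Scala sketch of the numeric effect the cast to `DecimalType(10, 2)` has on the digits (assumption: Spark rounds half-up when reducing the scale, and values that overflow the target precision come back as null; verify against your Spark version):

```scala
import scala.math.BigDecimal.RoundingMode

// Plain-Scala approximation of casting decimal(38,18) down to
// decimal(10,2): the excess fractional digits are rounded away.
// (Assumed rounding mode: HALF_UP.)
val wide = BigDecimal("11.115000000000000000")
val narrowed = wide.setScale(2, RoundingMode.HALF_UP)

println(narrowed) // 11.12
```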


This concludes part two of the series. If you want to get notified about the next part, or you just like this blog, then follow me on Twitter or subscribe to the mailing list. See ya next time!