
Flatmap

Intro

Recall the map and flatten methods.

map eats a list a and a function f and returns a list b such that b_i = f(a_i).

flatten concatenates a list of lists into a single list.
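As a quick refresher, here is each method on its own (the values are made up for illustration):

val a = List(1, 2, 3)
println(a.map(_ * 2)) // List(2, 4, 6)

val b = List(List(1), List(2, 3))
println(b.flatten) // List(1, 2, 3)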

There are many instances where you may want to compose map and flatten. There is a method which accomplishes this for you called, creatively, flatMap.

Example

  val a = List(List(1, 2), List(3, 4))
  val b = a.map((x: List[Int]) => x.map(_ * 2))
  val c = b.flatten
  println(c) // List(2, 4, 6, 8)

We are taking a list of lists, multiplying each element by two, and then concatenating the inner lists into a single list.

Both of these operations can be merged into a single operation using the flatMap method.

  val x = List(List(1, 2), List(3, 4))
  val y = x.flatMap(xs => xs.map(_ * 2))
  println(y)  // List(2, 4, 6, 8)

Spark flatMap

A string is a sequence of chars. Recall that a char is, at bottom, an integer; the integer to which a given char is assigned depends on the underlying encoding.

val x: Char = 'a'
println(x) // a

Scala runs on the JVM, where a Char is a 16-bit UTF-16 code unit. The code point for 'a' is 97, the same as its ASCII value. You can reveal this by instantiating 'a' as a variable of type Int (note how this does not trigger a type mismatch error).

val x: Int = 'a'
println(x) // 97

If you try to "concatenate" two variables of type Char with +, you are actually performing ordinary arithmetic on the underlying integer code points. In fact, the addition automatically widens the result to a variable of type Int.

val x: Char = 'a'
val y = x + 'b'
println(y) // 195
println(y.toChar)   // Ã

('b' has code point 98, and 97 + 98 = 195.)

With that out of the way, let’s see how map behaves on a single word.

val s = "Hello"
val t = s.map(x => x)
println(t) // Hello

Ok. By applying the identity map to each character in the string, map returns the original string itself. So far so good.

val s = "Hello"
val t = s.map(x => x + 0)
println(t) // Vector(72, 101, 108, 108, 111)

In this case the addition of 0 widens each char to an int, and map returns a vector of the integer code points of each char in Hello (which, for these characters, coincide with their ASCII values).

You can recover the original string by using the toChar method.

val s = "Hello"
val t = s.map(x => (x +  0).toChar)
println(t)  // Hello

Ok, so if each char is widened to an int, whether by arithmetic or by some other means, map will return a vector of the integer code points unless each value is explicitly cast back to a char, using the toChar method, for example.
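The same widening happens with an explicit conversion. Here is a minimal variant of the example above using toInt instead of arithmetic:

val s = "Hello"
val t = s.map(x => x.toInt)
println(t) // Vector(72, 101, 108, 108, 111)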

Now let’s see how map behaves on a string composed of several words.

Regex review

Recall that + is a regex quantifier that matches one or more of the preceding token. Recall also that \\W is the negation of \\w (equivalent to [^\\w]), where \w matches any word character: a letter, a digit, or the underscore ([A-Za-z0-9_]).
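As a quick illustration (the input string here is made up), note that the underscore is a word character and therefore does not act as a split point:

val s = "foo_bar, baz!"
println(s.split("\\W+").toList) // List(foo_bar, baz)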

Ok. Suppose now that we want to split a string composed of several words into its constituent words.

val s: String = "Hello, world!"
val t: Array[String] = s.split("\\W+")
t.foreach(println) 
// Hello
// world 

Note that we are splitting at any run of one or more non-word characters.

In Spark this would be accomplished in the following manner. Suppose we have a text file data.txt that contains the single line "Hello, world!". Then,

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .master("local[*]")
      .appName("Word split")
      .getOrCreate()

    import spark.implicits._
    val sc: SparkContext = spark.sparkContext
    val data = sc.textFile("data.txt")
    val words = data.map(x => x.split("\\W+"))
    val df = words.toDF()
    df.show()

    sc.stop()

+--------------+
|         value|
+--------------+
|[Hello, world]|
+--------------+

Note that in this case data is a Resilient Distributed Dataset (RDD).
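It is worth making the types explicit here; the annotations below are a sketch of the same two lines from the snippet above:

import org.apache.spark.rdd.RDD

val data: RDD[String] = sc.textFile("data.txt")           // one element per line
val words: RDD[Array[String]] = data.map(_.split("\\W+")) // one array per line

Because map wraps each line's words in an array, the DataFrame above has a single array-valued row. flatMap, by contrast, flattens these arrays so that each word becomes its own element, as we see next.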

Finally, suppose I have a list of strings, and I would like to split each one into its constituent words and merge the words into a single list. This is a combination of the map and flatten operations, so we use flatMap.

val s: List[String] = List("Hello, world!", "Good afternoon!")
val t: List[String] = s.flatMap(x => x.split("\\W+"))
println(t) // List(Hello, world, Good, afternoon)

In Spark, we would accomplish this in a similar manner. Suppose that the data file data.txt contains two lines, "Hello, world!" and "Good afternoon!". Then,

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .master("local[*]")
      .appName("Word split")
      .getOrCreate()

    import spark.implicits._
    val sc: SparkContext = spark.sparkContext
    val data = sc.textFile("data.txt")
    val words = data.flatMap(x => x.split("\\W+"))
    val df = words.toDF()
    df.show()

    sc.stop()

+---------+
|    value|
+---------+
|    Hello|
|    world|
|     Good|
|afternoon|
+---------+
Published 30 Jun 2018