map eats a list and a function and returns a list such that .
flatten concatenates a list of lists into a single list.
There are many instances where you may want to compose
There is a method which accomplishes this for you called, creatively,
val a = List(List(1, 2), List(3, 4)) val b = a.map((x: List[Int]) => x.map(_ * 2)) val c = b.flatten println(c) // List(2,4,6,8)
We are taking a list of lists, scaling each list by two, and then concatenating the lists into a single list.
Both of these operations can be merged into a single operation using the
val x = List(List(1, 2), List(3, 4)) val y = x.flatMap(x => x.map(_ *2)) println(y) // List(2,4,6,8)
A string is a list of chars. Recall that a
char is at bottom an integer. The
integer to which a given
char is assigned depends on the underlying encoding.
val x: Char = 'a' println(x) // a
The default encoding in Scala is
ISO-8859-1. The integer corresponding to ‘a’
in this encoding is
97. You can reveal this by instantiated
'a' as a
variable of type
int (note how this does not trigger a
type mismatch error).
val x: Int = 'a' println(x) // 97
If you try to concatenate two variables of type
char, you are actually
performing ordinary arithmetic on the underlying integer encoding. In fact,
concatenation will automatically cast the result as a variable of type
val x: Char = 'a' val y = x + 'b' println(y) // 195 println(y.toChar) // Ã
'b' is encoded as
98 = 195 - 97)
With that out of the way, let’s see how
map behaves on a single word.
val s = "Hello" val t = s.map(x => x) println(t) // Hello
Ok. By applying the identity map to each character in the string,
the original string itself. So far so good.
val s = "Hello" val t = s.map(x => x + 0) println(t) // Vector(72, 101, 108, 108, 111)
In this case the addition of
0 casts each
char as an
a vector of the integers corresponding to each
Hello according to
ISO 8859-1 encoding.
You can recover the original string by using the
val s = "Hello" val t = s.map(x => (x + 0).toChar) println(t) // Hello
Ok, so if each
char is cast as an
int, whether by performing arithmetic or
by some other means, unless the
char is explicitly recast as such using the
toChar method, for example,
map will return a vector of the integer
Now let’s see how
map behaves on a string composed of several words.
+ is a regex quantifier that matches one or more of the
preceding character. Recall also that
\\W is shorthand for
is any alphanumeric character, including underscore (
Ok. Suppose now that we want to split a string composed of several words into its constituent words.
val s: String = "Hello, world!" val t: Array[String] = s.split("\\W+") t.foreach(println) // Hello // world
Note that we are splitting at any non-alphanumeric character.
In spark this would be accomplished in the following manner. Suppose we have a
data.txt that contains the single line “Hello, world!“. Then,
val spark = SparkSession.builder .master("local[*]") .appName("Word split") .getOrCreate(); import spark.implicits._ val sc: SparkContext = spark.sparkContext val data = sc.textFile("data.txt") val words = data.map(x => x.split("\\W+")) val df = words.toDF(); df.show() sc.stop()
|[ Hello, world ]|
Note that in this case
data is a Resilient Distributed Dataset (RDD).
Finally, suppose I have an array of strings, and I would like to split each
one into its constitutent words and merge the words into a single list. This
is a combination of the
flatten operations, therefore we use
val s: List[String] = List("Hello, world!", "Good afternoon!") val t: List[String] = s.flatMap(x => x.split("\\W+")) println(t)
In spark, we would accomplish this in a similar manner. Suppose that the data
data.txt contains two lines, “Hello, world!” and “Good afternoon!“.
val spark = SparkSession.builder .master("local[*]") .appName("Word split") .getOrCreate(); import spark.implicits._ val sc: SparkContext = spark.sparkContext val data = sc.textFile("data.txt") val words = data.flatMap(x => x.split("\\W+")) val df = words.toDF(); df.show() sc.stop()