Campaign management | Operational optimization |
---|---|
Improve marketing; strengthen subscriber acquisition; acquire competitors' market share; increase cross-sales; reduce churn; target high-value subscribers; A/B test campaign messages; compute a campaign's ROI | Drive field actions; improve network coverage; performance tracking; stock management; managed services; data commercialization |
Call Detail Record (CDR)
- Date — Time — Duration
- Billing — Recharge — Money transfer
- Social network
- Antenna geolocation
- Network information
- Data usage — URL — App
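As a sketch, the CDR fields listed above could be modeled as a Scala case class. The field names and types here are illustrative assumptions, not an actual operator schema:

```scala
// Hypothetical shape of one CDR row; real schemas vary by operator.
case class CDR(
  date: String,         // call date, e.g. "2015-03-14"
  time: String,         // call start time
  durationSec: Int,     // call duration in seconds
  billedAmount: Double, // billing / recharge / money-transfer amount
  calleeId: String,     // other party of the call (a social-network edge)
  antennaId: String,    // serving antenna, joinable with geolocation data
  bytesUp: Long,        // data usage, uplink
  bytesDown: Long       // data usage, downlink
)

val cdr = CDR("2015-03-14", "09:31:05", 42, 0.12, "B42", "ANT-7", 1024L, 4096L)
println(cdr.durationSec)
```

Each use case in the tables above consumes a different slice of this record: churn models look at duration and recharge patterns, coverage analysis joins on `antennaId`, and social-graph mining builds edges from caller/callee pairs.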
“MapReduce tasks must be written as acyclic dataflow programs, i.e. a stateless mapper followed by a stateless reducer, that are executed by a batch job scheduler.
This paradigm makes repeated querying of datasets difficult and imposes limitations that are felt in fields such as machine learning, where iterative algorithms that revisit a single working set multiple times are the norm.”
— Zaharia, Chowdhury, Franklin, Shenker, Stoica
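The acyclic dataflow the quote describes can be sketched in plain Scala as a word count: a stateless mapper followed by a stateless reducer. Any iterative algorithm would have to re-run this whole pipeline, re-reading its input each time, which is exactly the limitation the quote points at:

```scala
// Stateless mapper: one line in, (word, 1) pairs out
def mapper(line: String): Seq[(String, Int)] =
  line.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

// Stateless reducer: sums the counts collected for one key
def reducer(key: String, values: Seq[Int]): (String, Int) =
  (key, values.sum)

val input = Seq("spark beats mapreduce", "mapreduce is batch")

val counts = input
  .flatMap(mapper)        // map phase
  .groupBy(_._1)          // shuffle phase: group pairs by key
  .map { case (k, pairs) => reducer(k, pairs.map(_._2)) } // reduce phase
```

Here `counts("mapreduce")` is 2: the mapper and reducer hold no state between records, so the only way to "loop" is to schedule the job again from scratch.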
Transformations | Actions |
---|---|
map filter flatMap join union distinct groupByKey aggregateByKey sortByKey cogroup sample intersection ... |
reduce collect count first take takeSample takeOrdered countByKey foreach saveAsParquetFile saveAsTextFile saveAsObjectFile ... |
val file = spark.textFile("hdfs://...")
val errors = file.filter(line => line.contains("ERROR"))
errors.cache()
// Count errors mentioning MySQL
errors.filter(line => line.contains("MySQL")).count()
// Count errors mentioning Windows (faster now, since errors is cached)
errors.filter(line => line.contains("Windows")).count()
// Logistic regression: the working set stays cached across iterations
val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)
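The per-iteration update computed by the `map`/`reduce` pair above is the standard batch logistic-regression gradient step (with unit step size, as in the code):

```latex
w \leftarrow w \;-\; \sum_{p} \left( \frac{1}{1 + e^{-y_p \,(w \cdot x_p)}} - 1 \right) y_p \, x_p
```

Each term matches the code: the factor in parentheses is `1 / (1 + exp(-p.y * (w dot p.x))) - 1`, the sum over points is the `reduce(_ + _)`, and the subtraction is `w -= gradient`. Because `points` is cached, every iteration reuses the in-memory dataset instead of re-reading it, which is the advantage over the acyclic MapReduce model.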