| Campaign management | Operational optimization |
|---|---|
| Improve marketing | Drive field actions |
| Strengthen subscriber acquisition | Improve network coverage |
| Acquire competitor market share | Performance tracking |
| Increase cross-sales | Stock management |
| Reduce churn | Managed services |
| Target high-value subscribers | Data commercialization |
| A/B test campaign messages | |
| Compute campaign ROI | |
| Call Detail Record (CDR) |
|---|
| Date — Time — Duration |
| Billing — Recharge — Money transfer |
| Social network |
| Antenna geolocation |
| Network information |
| Data usage — URL — App |
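The fields above can be modelled as a plain record. A minimal sketch in Scala, assuming a semicolon-separated CDR export; the field names and the `parseCdr` helper are illustrative, not a real operator schema:

```scala
// Hypothetical CDR record: field names are illustrative, not a real schema
case class Cdr(date: String, time: String, durationSec: Int,
               callerId: String, antennaId: String)

// Parse one semicolon-separated line into a Cdr (assumed export format)
def parseCdr(line: String): Cdr = {
  val f = line.split(";")
  Cdr(f(0), f(1), f(2).toInt, f(3), f(4))
}
```

A function like this would typically be applied per line with `map(parseCdr)` over an RDD of raw text, as in the examples below.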
“MapReduce tasks must be written as acyclic dataflow programs, i.e. a stateless mapper followed by a stateless reducer, that are executed by a batch job scheduler.
This paradigm makes repeated querying of datasets difficult and imposes limitations that are felt in fields such as machine learning, where iterative algorithms that revisit a single working set multiple times are the norm.”
— Zaharia, Chowdhury, Franklin, Shenker, Stoica
| Transformations | Actions |
|---|---|
| `map` | `reduce` |
| `filter` | `collect` |
| `flatMap` | `count` |
| `join` | `first` |
| `union` | `take` |
| `distinct` | `takeSample` |
| `groupByKey` | `takeOrdered` |
| `aggregateByKey` | `countByKey` |
| `sortByKey` | `foreach` |
| `cogroup` | `saveAsParquetFile` |
| `sample` | `saveAsTextFile` |
| `intersection` | `saveAsObjectFile` |
| ... | ... |
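The key distinction is that transformations are lazy (they only build a lineage of pending work) while actions force the computation. A plain-Scala analogy using `LazyList` — not Spark itself, just a local sketch of the same lazy-vs-eager behaviour:

```scala
// Transformations in Spark are lazy; actions trigger the computation.
// Local analogy with LazyList (no Spark needed):
var evaluated = 0
val lines = LazyList("INFO boot ok", "ERROR disk full", "ERROR net down")

// "Transformation": builds a recipe, runs nothing yet
val errors = lines.filter { l => evaluated += 1; l.startsWith("ERROR") }
val evaluatedBeforeAction = evaluated // still 0: the filter has not run

// "Action": forces the whole pipeline to execute
val n = errors.size // predicate now runs on every element
```

In Spark the same laziness is what lets the scheduler fuse chained transformations into a single pass over the data before any action runs.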
```scala
// Load the log file from HDFS as an RDD of lines
val file = spark.textFile("hdfs://...")
// Keep only the lines containing "ERROR"
val errors = file.filter(line => line.contains("ERROR"))
// Cache the filtered RDD in cluster memory
errors.cache()
// Count errors mentioning MySQL
errors.filter(line => line.contains("MySQL")).count()
// Count errors mentioning Windows: should be much faster,
// since errors is now served from the in-memory cache
errors.filter(line => line.contains("Windows")).count()
```

```scala
// Load (x, y) points once and cache them: each iteration re-reads the set
val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
  // One gradient-descent step of logistic regression over the cached points
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)
```