Shell
- sc.textFile("LICENSE"),sc.parallelize(1 to 100)
- filter,map,count,flatMap ze splitem, foreach
- collect,first,take
- ScalaTest matchera zrobić
- sample, takeSample
- reduceByKey(f : (V,V) ⇒ V) : RDD[(K, V)] ⇒ RDD[(K, V)]
- union() : (RDD[T],RDD[T]) ⇒ RDD[T]
- join() : (RDD[(K, V)],RDD[(K, W)]) ⇒ RDD[(K, (V, W))]
- cogroup() : (RDD[(K, V)],RDD[(K, W)]) ⇒ RDD[(K, (Seq[V], Seq[W]))]
- crossProduct() : (RDD[T],RDD[U]) ⇒ RDD[(T, U)]
- mapValues(f : V ⇒ W) : RDD[(K, V)] ⇒ RDD[(K, W)] (Preserves partitioning)
- sort(c : Comparator[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]
- partitionBy(p : Partitioner[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]
- rdd.toDebugString
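A minimal shell sketch tying these operations together, using the LICENSE file from above (splitting on spaces is just an assumption):
val lines  = sc.textFile("LICENSE")
val words  = lines.flatMap(line => line.split(" "))       // flatMap with split
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)    // word count via reduceByKey
counts.filter { case (_, c) => c > 10 }.take(5)           // take a few frequent words
counts.toDebugString                                       // inspect the lineage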
Commands
val conf = new SparkConf().setMaster("local").setAppName("nauka")
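A minimal sketch of using this conf in a standalone app (in the shell, sc already exists):
import org.apache.spark.SparkContext

val sc = new SparkContext(conf)   // conf from the line above
// ... run RDD jobs on sc here ...
sc.stop()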
to remove "this" reference: class (val param:String){ def getMatchesNoReference(rdd: RDD[String]): RDD[String] = { // Safe: extract just the field we need into a local variable val param = this.param rdd.map(x => x.highOrdered(param)) } }
val rdd = sc.parallelize(List(1,2,3,4,5))
val (v, n) = rdd.aggregate((0, 0))(
  { case ((v, n), elem) => (v + elem, n + 1) },
  { case ((v1, n1), (v2, n2)) => (v1 + v2, n1 + n2) }
)
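The tuple holds the running sum and the element count, so the mean follows directly:
val mean = v.toDouble / n   // 15 / 5 = 3.0 for List(1, 2, 3, 4, 5)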
sc.sequenceFile[RecordID, RecordInfo]("hdfs://...")
  .partitionBy(new HashPartitioner(69))   // Create 69 partitions
  .persist()
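A sketch of why the persist after partitionBy pays off (the small Int-keyed RDDs and the names left/right are assumptions): a join against the pre-partitioned, persisted side reuses its partitioner, so only the other side gets shuffled.
import org.apache.spark.HashPartitioner

val left = sc.parallelize(Seq((1, "a"), (2, "b")))
  .partitionBy(new HashPartitioner(69))
  .persist()
val right  = sc.parallelize(Seq((1, "x"), (2, "y")))
val joined = left.join(right)   // left keeps its HashPartitioner; only right is shuffled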
SparkSubmit and submitting to a cluster
- creating an RDD yourself and loading one from a file
- cache and persist (see the sketch below)
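A minimal sketch of submitting the app to a cluster (class name, jar path, and master URL are assumptions):
spark-submit \
  --class com.example.Nauka \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  target/scala-2.12/nauka_2.12-0.1.jar
On cache vs persist: rdd.cache() is shorthand for rdd.persist(StorageLevel.MEMORY_ONLY); persist also accepts other storage levels such as MEMORY_AND_DISK.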
References
- RDD paper (NSDI 2012): https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
- Spark paper (HotCloud 2010): http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf