Shell
- sc.textFile("LICENSE"),sc.parallelize(1 to 100)
- filter,map,count,flatMap ze splitem, foreach
- collect,first,take
- ScalaTest matchera zrobić
- sample, takeSample
- reduceByKey(f : (V,V) ⇒ V) : RDD[(K, V)] ⇒ RDD[(K, V)]
- union() : (RDD[T],RDD[T]) ⇒ RDD[T]
- join() : (RDD[(K, V)],RDD[(K, W)]) ⇒ RDD[(K, (V, W))]
- cogroup() : (RDD[(K, V)],RDD[(K, W)]) ⇒ RDD[(K, (Seq[V], Seq[W]))]
- crossProduct() : (RDD[T],RDD[U]) ⇒ RDD[(T, U)]
- mapValues(f : V ⇒ W) : RDD[(K, V)] ⇒ RDD[(K, W)] (Preserves partitioning)
- sort(c : Comparator[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]
- partitionBy(p : Partitioner[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]
- rdd.toDebugString
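A minimal shell sketch tying these operations together, using the LICENSE file from above (splitting on spaces is just an assumption):
val lines  = sc.textFile("LICENSE")
val words  = lines.flatMap(line => line.split(" "))       // flatMap with split
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)    // word count via reduceByKey
counts.filter { case (_, c) => c > 10 }.take(5)           // take a few frequent words
counts.toDebugString                                       // inspect the lineage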
Commands
val conf = new SparkConf().setMaster("local").setAppName("nauka")
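A minimal sketch of using this conf in a standalone app (in the shell, sc already exists):
import org.apache.spark.SparkContext

val sc = new SparkContext(conf)   // conf from the line above
// ... run RDD jobs on sc here ...
sc.stop()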
to remove "this" reference: class (val param:String){ def getMatchesNoReference(rdd: RDD[String]): RDD[String] = { // Safe: extract just the field we need into a local variable val param = this.param rdd.map(x => x.highOrdered(param)) } }
val rdd = sc.parallelize(List(1,2,3,4,5))
val (v, n) = rdd.aggregate((0, 0))(
  { case ((v, n), elem) => (v + elem, n + 1) },
  { case ((v1, n1), (v2, n2)) => (v1 + v2, n1 + n2) }
)
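The tuple holds the running sum and the element count, so the mean follows directly:
val mean = v.toDouble / n   // 15 / 5 = 3.0 for List(1, 2, 3, 4, 5)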
sc.sequenceFile[RecordID, RecordInfo]("hdfs://...")
  .partitionBy(new HashPartitioner(69))   // Create 69 partitions
  .persist()
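A sketch of why the persist after partitionBy pays off (the small Int-keyed RDDs and the names left/right are assumptions): a join against the pre-partitioned, persisted side reuses its partitioner, so only the other side gets shuffled.
import org.apache.spark.HashPartitioner

val left = sc.parallelize(Seq((1, "a"), (2, "b")))
  .partitionBy(new HashPartitioner(69))
  .persist()
val right  = sc.parallelize(Seq((1, "x"), (2, "y")))
val joined = left.join(right)   // left keeps its HashPartitioner; only right is shuffled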
SparkSubmit and submitting to a cluster
- creating an RDD yourself and loading one from a file
- cache and persist (see the sketch below)
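A minimal sketch of submitting the app to a cluster (class name, jar path, and master URL are assumptions):
spark-submit \
  --class com.example.Nauka \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  target/scala-2.12/nauka_2.12-0.1.jar
On cache vs persist: rdd.cache() is shorthand for rdd.persist(StorageLevel.MEMORY_ONLY); persist also accepts other storage levels such as MEMORY_AND_DISK.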
References
- RDD paper (NSDI 2012): https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
- Spark paper (HotCloud 2010): http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf