Shell

  • sc.textFile("LICENSE"), sc.parallelize(1 to 100) (see the example sessions after this list)
  • filter, map, count, flatMap with split, foreach
  • collect, first, take
  • write a ScalaTest matcher
  • sample, takeSample
  • reduceByKey(f : (V,V) ⇒ V) : RDD[(K, V)] ⇒ RDD[(K, V)]
  • union() : (RDD[T],RDD[T]) ⇒ RDD[T]
  • join() : (RDD[(K, V)],RDD[(K, W)]) ⇒ RDD[(K, (V, W))]
  • cogroup() : (RDD[(K, V)],RDD[(K, W)]) ⇒ RDD[(K, (Seq[V], Seq[W]))]
  • crossProduct() : (RDD[T],RDD[U]) ⇒ RDD[(T, U)] (cartesian in the Spark API)
  • mapValues(f : V ⇒ W) : RDD[(K, V)] ⇒ RDD[(K, W)] (Preserves partitioning)
  • sort(c : Comparator[K]) : RDD[(K, V)] ⇒ RDD[(K, V)] (sortByKey in the Spark API)
  • partitionBy(p : Partitioner[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]
  • rdd.toDebugString
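
A minimal spark-shell session exercising the basic operations above (assumes a LICENSE text file in the working directory; the result comments are illustrative):

    val lines = sc.textFile("LICENSE")
    val nums  = sc.parallelize(1 to 100)

    nums.filter(_ % 2 == 0).count()                       // 50
    nums.map(_ * 2).take(5)                               // Array(2, 4, 6, 8, 10)
    lines.flatMap(_.split(" ")).count()                   // word count via flatMap with split
    lines.first()                                         // first line of the file
    nums.collect()                                        // pull the whole RDD to the driver
    nums.sample(withReplacement = false, 0.1).collect()   // random ~10% subset
    nums.takeSample(withReplacement = false, 5)           // exactly 5 random elements
    nums.foreach(println)                                 // runs on the executors, not the driver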
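
And a sketch of the pair-RDD operations from the list (tiny made-up datasets; result comments assume this data):

    val a = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val b = sc.parallelize(Seq(("a", "x"), ("c", "y")))

    a.reduceByKey(_ + _).collect()       // Array((a,4), (b,2))
    a.union(a).count()                   // 6
    a.join(b).collect()                  // Array((a,(1,x)), (a,(3,x)))
    a.cogroup(b).collect()               // per key: (values from a, values from b)
    a.cartesian(b).count()               // 3 * 2 = 6 pairs
    a.mapValues(_ + 1).collect()         // Array((a,2), (b,3), (a,4))
    a.sortByKey().collect()              // Array((a,1), (a,3), (b,2))
    a.partitionBy(new org.apache.spark.HashPartitioner(4))
    a.toDebugString                      // prints the lineage of the RDD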

Commands

    val conf = new SparkConf().setMaster("local").setAppName("nauka")
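
The configuration is then handed to a SparkContext (a minimal sketch; in spark-shell the context already exists as sc):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(conf)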

To avoid capturing the this reference in a closure (which would force Spark to serialize the whole object), copy the needed field into a local variable and close over that instead:

    import org.apache.spark.rdd.RDD

    class SearchFunctions(val param: String) {
      def getMatchesNoReference(rdd: RDD[String]): RDD[String] = {
        // Safe: extract just the field we need into a local variable
        val param = this.param
        rdd.filter(x => x.contains(param))
      }
    }
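
For contrast, a sketch of the unsafe variant as another method of the same class (the method name is made up here): referencing the field directly inside the closure captures this, so the entire SearchFunctions instance has to be serialized and shipped to the executors:

      // Unsafe: param below really means this.param,
      // so the closure captures the whole object
      def getMatchesFieldReference(rdd: RDD[String]): RDD[String] =
        rdd.filter(x => x.contains(param))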

    val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
    // aggregate a (sum, count) pair in one pass:
    // the first function folds an element into a partition's accumulator,
    // the second merges accumulators from different partitions
    val (v, n) = rdd.aggregate((0, 0))(
      { case ((v, n), elem) => (v + elem, n + 1) },
      { case ((v1, n1), (v2, n2)) => (v1 + v2, n1 + n2) }
    )
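
The pair then gives the average directly:

    val avg = v.toDouble / n   // 3.0 for the list above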
    import org.apache.spark.HashPartitioner

    val records = sc.sequenceFile[RecordID, RecordInfo]("hdfs://...")
      .partitionBy(new HashPartitioner(69)) // create 69 partitions
      .persist()                            // keep the partitioned result; otherwise it is recomputed on each use
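
Pre-partitioning pays off once records is reused in keyed operations: a join against a hash-partitioned, persisted RDD does not reshuffle it. A sketch, with events as a hypothetical RDD[(RecordID, String)]:

    // records keeps its HashPartitioner, so only events is shuffled
    val joined = records.join(events)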

spark-submit and submitting to a cluster

  • creating an RDD by hand and loading one from a file
  • cache and persist (see the sketch after this list)
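
A minimal sketch of a submittable application covering both points (object name, jar name, and master are placeholders):

    // Submitted with, for example:
    //   spark-submit --class Nauka --master local[2] nauka.jar
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object Nauka {
      def main(args: Array[String]): Unit = {
        // master is supplied by spark-submit, so only the app name is set here
        val sc = new SparkContext(new SparkConf().setAppName("nauka"))

        val byHand   = sc.parallelize(1 to 100)   // RDD created by hand
        val fromFile = sc.textFile("LICENSE")     // RDD loaded from a file

        fromFile.cache()                              // shorthand for persist(MEMORY_ONLY)
        byHand.persist(StorageLevel.MEMORY_AND_DISK)  // explicit storage level

        println(s"${byHand.sum()} ${fromFile.count()}")
        sc.stop()
      }
    }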

  • Resilient Distributed Datasets (NSDI '12): https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
  • Spark: Cluster Computing with Working Sets (HotCloud '10): http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf

DATA API
