3/18/2015 demo1 - Databricks
https://spark13.dev.databricks.com:34561/#shell/38299

demo1

Create a DataFrame from existing data in formats such as JSON, Avro, Parquet, etc.

> dbutils.fs.head("/home/michael/spark.json")
[Truncated to first 65536 bytes]
Out[75]: u'{"commitHash":"786808abfd6ca8c8d3a2331d1be49c1466006a46","parentHashes":"2483c1efb6429a7d8a20c96d18ce2fec93a1aff9","authorName":"Zhang, Liye","authorEmail":"liye.zhang@intel.com","authorDate":"2014-12-27 07:23:13.0","committerName":"Patrick Wendell","committerEmail":"pwendell@gmail.com","committerDate":"2014-12-27 07:24:22.0","encoding":"","subject":"[SPARK-4954][Core] add spark version infomation in log for standalone mode","body":"[SPARK-4954][Core] add spark version infomation in log for standalone mode\\n\\nThe master and worker spark version may be not the same with Driver spark version. That is because spark Jar file might be replaced for new application without restarting the spark cluster. So there shall log out the spark-version in both Mater and Worker log.\\n\\nAuthor: Zhang, Liye <liye.zhang@intel.com>\\n\\nCloses #3790 from liyezhang556520/version4Standalone and squashes the following commits:\\n\\ne05e1e3 [Zhang, Liye] add spark version infomation in log for standalone mode","changedFiles":["core/src/main/scala/org/apache/spark/deploy/master/Master.scala","core/src/main/scala/org/apache/spark/d
Command took 1.01s -- by admin at 3/18/2015, 1:15:29 PM on Michael Demo

Create a DataFrame using sqlContext.load()

> df = sqlContext.load("/home/michael/spark.json", "json")
Command took 0.78s -- by admin at 3/18/2015, 1:15:40 PM on Michael Demo

> df.printSchema()
root
 |-- authorDate: string (nullable = true)
 |-- authorEmail: string (nullable = true)
 |-- authorName: string (nullable = true)
 |-- body: string (nullable = true)
 |-- branches: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- changedFiles: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- commitHash: string (nullable = true)
 |-- committerDate: string (nullable = true)
 |-- committerEmail: string (nullable = true)
 |-- committerName: string (nullable = true)
 |-- encoding: string (nullable = true)
 |-- parentHashes: string (nullable = true)
 |-- subject: string (nullable = true)
Command took 0.02s -- by admin at 3/18/2015, 1:16:25 PM on Michael Demo

> df.filter(df.committerName == "Michael Armbrust").count()
Out[84]: 636L
Command took 0.63s -- by admin at 3/18/2015, 1:17:22 PM on Michael Demo

Interleave DataFrame operations and your own code using seamless UDF integration

> from pyspark.sql.functions import *
  import re
  re.search("@([^@]*)", "michael@databricks.com").group(1)
Out[85]: 'databricks.com'
Command took 0.04s -- by admin at 3/18/2015, 1:17:46 PM on Michael Demo

> get_domain = udf(lambda x: re.search("@([^@]*)", x + "@").group(1))
Command took 0.07s -- by admin at 3/18/2015, 1:18:09 PM on Michael Demo

> df.select(get_domain(df.committerEmail).alias("domain")).groupBy("domain").count().orderBy(desc("count")).take(5)
Out[89]:
[Row(domain=u'gmail.com', count=3557),
 Row(domain=u'databricks.com', count=2059),
 Row(domain=u'eecs.berkeley.edu', count=1708),
 Row(domain=u'apache.org', count=1045),
 Row(domain=u'cs.berkeley.edu', count=392)]
Command took 1.21s -- by admin at 3/18/2015, 1:19:06 PM on Michael Demo

Spark SQL DataFrames can easily be used with pandas

> import matplotlib.pyplot as plt
Command took 0.03s -- by admin at 3/18/2015, 1:19:35 PM on Michael Demo

> by_domain = df.select(get_domain(df.committerEmail).alias("domain")).groupBy("domain").count().orderBy(desc("count")).limit(10)
Command took 0.07s -- by admin at 3/18/2015, 1:19:35 PM on Michael Demo

> by_domain.toPandas().plot(kind="bar", x="domain")
Out[92]: <matplotlib.axes.AxesSubplot at 0x7f6928fee450>
Command took 1.22s -- by admin at 3/18/2015, 1:19:37 PM on Michael Demo

> plt.gcf().subplots_adjust(bottom=0.40)
Command took 0.04s -- by admin at 3/18/2015, 1:19:41 PM on Michael Demo

> display()
Command took 0.32s -- by admin at 3/18/2015, 1:19:41 PM on Michael Demo

> plt.clf()
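
The lambda passed to udf() in the notebook appends "@" to its input before searching. A minimal sketch of why that seems to help, in plain Python with no Spark required: the extra "@" guarantees that the pattern always matches, so .group(1) is never called on None. The name extract_domain is hypothetical and used only for illustration; the regular expression is the one from the notebook.

```python
import re


def extract_domain(email):
    """Plain-Python version of the lambda wrapped by udf() in the notebook."""
    # Appending "@" guarantees at least one "@" in the string, so re.search
    # always returns a match object and .group(1) cannot fail on None.
    return re.search("@([^@]*)", email + "@").group(1)


# A normal address yields the text after the first "@".
print(extract_domain("michael@databricks.com"))

# A value without any "@" yields an empty string instead of raising.
print(repr(extract_domain("no-at-sign")))
```

This guard only covers strings that lack an "@". A null committerEmail would still fail on the string concatenation, so an explicit None check would be needed if the data can contain nulls.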