3/18/2015
demo1 - Databricks
demo1
Create a DataFrame from existing data in formats such as JSON, Avro, Parquet, etc.
> dbutils.fs.head("/home/michael/spark.json")
[Truncated to first 65536 bytes]
Out[75]: u'{"commitHash":"786808abfd6ca8c8d3a2331d1be49c1466006a46","parentHashes":"2483c1efb6429a7d8a20c96d18ce2fec93a1aff9","authorName":"Zhang, Liye","authorEmail":"liye.zhang@intel.com","authorDate":"2014-12-27 07:23:13.0","committerName":"Patrick Wendell","committerEmail":"pwendell@gmail.com","committerDate":"2014-12-27 07:24:22.0","encoding":"","subject":"[SPARK-4954][Core] add spark version infomation in log for standalone mode","body":"[SPARK-4954][Core] add spark version infomation in log for standalone mode\\n\\nThe master and worker spark version may be not the same with Driver spark version. That is because spark Jar file might be replaced for new application without restarting the spark cluster. So there shall log out the spark-version in both Mater and Worker log.\\n\\nAuthor: Zhang, Liye <liye.zhang@intel.com>\\n\\nCloses #3790 from liyezhang556520/version4Standalone and squashes the following commits:\\n\\ne05e1e3 [Zhang, Liye] add spark version infomation in log for standalone mode","changedFiles":["core/src/main/scala/org/apache/spark/deploy/master/Master.scala","core/src/main/scala/org/apache/spark/d
Command took 1.01s -- by admin at 3/18/2015, 1:15:29 PM on Michael Demo
Create a DataFrame using sqlContext.load()
> df = sqlContext.load("/home/michael/spark.json", "json")
Command took 0.78s -- by admin at 3/18/2015, 1:15:40 PM on Michael Demo
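Spark's JSON data source reads files with one complete JSON object per line (the layout the `spark.json` sample above uses), not one large pretty-printed document. A minimal pure-Python sketch of that per-line parsing, using a small hypothetical sample in the same shape as the commit records:

```python
import json

# Hypothetical two-record sample in the one-object-per-line layout
# that Spark's JSON data source expects (a.k.a. JSON Lines).
raw = "\n".join([
    '{"commitHash": "abc123", "committerName": "Patrick Wendell"}',
    '{"commitHash": "def456", "committerName": "Michael Armbrust"}',
])

# Each line parses independently into one record/row.
records = [json.loads(line) for line in raw.splitlines()]
print(records[1]["committerName"])  # Michael Armbrust
```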
> df.printSchema()
root
|-- authorDate: string (nullable = true)
|-- authorEmail: string (nullable = true)
|-- authorName: string (nullable = true)
|-- body: string (nullable = true)
|-- branches: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- changedFiles: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- commitHash: string (nullable = true)
|-- committerDate: string (nullable = true)
|-- committerEmail: string (nullable = true)
|-- committerName: string (nullable = true)
|-- encoding: string (nullable = true)
|-- parentHashes: string (nullable = true)
|-- subject: string (nullable = true)
Command took 0.02s -- by admin at 3/18/2015, 1:16:25 PM on Michael Demo
> df.filter(df.committerName == "Michael Armbrust").count()
Out[84]: 636L
Command took 0.63s -- by admin at 3/18/2015, 1:17:22 PM on Michael Demo
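The expression `df.committerName == "Michael Armbrust"` builds a Column predicate that Spark evaluates across the cluster. Over a plain Python list of dicts, the same filter-and-count has this shape (hypothetical in-memory sample, not the real commit data):

```python
# Hypothetical stand-in for a few rows of the commits DataFrame.
commits = [
    {"committerName": "Michael Armbrust"},
    {"committerName": "Patrick Wendell"},
    {"committerName": "Michael Armbrust"},
]

# Equivalent of df.filter(df.committerName == "Michael Armbrust").count()
count = sum(1 for c in commits if c["committerName"] == "Michael Armbrust")
print(count)  # 2
```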
Interleave DataFrame operations and your own code using seamless UDF integration
> from pyspark.sql.functions import *
import re
re.search("@([^@]*)", "michael@databricks.com").group(1)
Out[85]: 'databricks.com'
Command took 0.04s -- by admin at 3/18/2015, 1:17:46 PM on Michael Demo
> get_domain = udf(lambda x: re.search("@([^@]*)", x + "@").group(1))
Command took 0.07s -- by admin at 3/18/2015, 1:18:09 PM on Michael Demo
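The `x + "@"` in the lambda is a guard: it guarantees `re.search` always finds a match, so the UDF never calls `.group(1)` on `None` when an email has no `@`. The same logic as a plain, testable function:

```python
import re

def get_domain(email):
    # Appending "@" guarantees re.search always matches;
    # for a string with no "@", group(1) is the empty string.
    return re.search("@([^@]*)", email + "@").group(1)

print(get_domain("michael@databricks.com"))  # databricks.com
print(get_domain("no-at-sign"))              # prints an empty line
```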
> df.select(get_domain(df.committerEmail).alias("domain")).groupBy("domain").count().orderBy(desc("count")).take(5)
Out[89]:
[Row(domain=u'gmail.com', count=3557),
Row(domain=u'databricks.com', count=2059),
Row(domain=u'eecs.berkeley.edu', count=1708),
Row(domain=u'apache.org', count=1045),
Row(domain=u'cs.berkeley.edu', count=392)]
Command took 1.21s -- by admin at 3/18/2015, 1:19:06 PM on Michael Demo
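The select/groupBy/count/orderBy chain above is a standard top-N frequency query. Over plain Python data the same aggregation can be sketched with `collections.Counter` (hypothetical sample emails, not the real commit log):

```python
from collections import Counter
import re

# Hypothetical sample of committer emails.
emails = [
    "a@gmail.com", "b@gmail.com", "c@databricks.com",
    "d@gmail.com", "e@databricks.com", "f@apache.org",
]

# Equivalent of select(domain).groupBy("domain").count().orderBy(desc("count"))
domains = Counter(re.search("@([^@]*)", e + "@").group(1) for e in emails)
print(domains.most_common(2))  # [('gmail.com', 3), ('databricks.com', 2)]
```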
Spark SQL DataFrames can easily be used with pandas
> import matplotlib.pyplot as plt
Command took 0.03s -- by admin at 3/18/2015, 1:19:35 PM on Michael Demo
> by_domain = df.select(get_domain(df.committerEmail).alias("domain")).groupBy("domain").count().orderBy(desc("count")).limit(10)
Command took 0.07s -- by admin at 3/18/2015, 1:19:35 PM on Michael Demo
> by_domain.toPandas().plot(kind="bar", x="domain")
Out[92]: <matplotlib.axes.AxesSubplot at 0x7f6928fee450>
Command took 1.22s -- by admin at 3/18/2015, 1:19:37 PM on Michael Demo
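`toPandas()` collects the (small, already-aggregated) result to the driver as an ordinary pandas DataFrame, which then plots like any local data. A standalone sketch of the same bar chart, seeded with the top domains from the query output above and using a non-interactive backend so it runs headless:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safe without a display
import matplotlib.pyplot as plt
import pandas as pd

# Top domains taken from the query output above.
by_domain = pd.DataFrame({
    "domain": ["gmail.com", "databricks.com", "eecs.berkeley.edu"],
    "count": [3557, 2059, 1708],
})

ax = by_domain.plot(kind="bar", x="domain", y="count", legend=False)
plt.gcf().subplots_adjust(bottom=0.40)  # leave room for rotated x labels
plt.savefig("by_domain.png")
```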
> plt.gcf().subplots_adjust(bottom=0.40)
Command took 0.04s -- by admin at 3/18/2015, 1:19:41 PM on Michael Demo
> display()
Command took 0.32s -- by admin at 3/18/2015, 1:19:41 PM on Michael Demo
> plt.clf()
https://spark13.dev.databricks.com:34561/#shell/38299