4/28/2018

Spark Cassandra Integration with spark-cassandra-connector

In this post, I show how to integrate a local single-node Cassandra database with standalone Spark using the spark-cassandra-connector.

Set up Cassandra, Spark, Scala & sbt (Scala Build Tool)

1. Download Cassandra, Spark, Scala and sbt. I am using Cassandra version 3.11.2, Spark version 2.2.1 and Scala version 2.11.8.

http://cassandra.apache.org/download/
https://spark.apache.org/releases/spark-release-2-2-1.html
https://www.scala-lang.org/download/2.11.8.html
https://www.scala-sbt.org/download.html

2. Set up the environment in ~/.profile

#cassandra setup
export CASSANDRA_HOME=/home/dhanuka/software/apache-cassandra-3.11.2

#spark, sbt and scala setup
export SPARK_HOME=/home/dhanuka/software/spark/spark-2.2.1-bin-hadoop2.7
export SBT_HOME=/home/dhanuka/software/spark/sbt-launcher-packaging-0.13.13
export SCALA_HOME=/home/dhanuka/software/scala-2.11.8

export PATH=$PATH:$JAVA_HOME/bin:$MAVEN_HOME/bin:$SPARK_HOME/bin:$SBT_HOME/bin:$SCALA_HOME/bin:$CASSANDRA_HOME/bin

3. Reload the profile so the changes take effect in the current shell

$ source ~/.profile
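
A mistyped path in the profile usually only shows up later as a "command not found" error, so a quick check after sourcing it can save time. The following is a minimal sketch, not part of the original setup: the function name check_home is hypothetical, and it only assumes the four *_HOME variables exported above, each of which should contain a bin directory.

```shell
# Hypothetical helper: $1 is the NAME of a variable (for example SPARK_HOME).
# It succeeds only if that variable is set and points at a directory with a bin/ subdirectory.
check_home() {
  eval "_dir=\${$1:-}"            # look up the value of the variable named by $1
  if [ -n "$_dir" ] && [ -d "$_dir/bin" ]; then
    echo "OK      $1=$_dir"
  else
    echo "MISSING $1 (value: '$_dir')"
    return 1
  fi
}

# Report on every variable instead of stopping at the first failure.
for v in CASSANDRA_HOME SPARK_HOME SBT_HOME SCALA_HOME; do
  check_home "$v" || true
done
```

Any line that starts with MISSING points at an export in ~/.profile that needs correcting.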

Create Cassandra Keyspace and Table

1. Start Cassandra in the foreground with the following command

$ cassandra -f

2. Start CQL shell

$ cqlsh

3. Create keyspace and a table

cqlsh> CREATE KEYSPACE people WITH replication = {'class': 'SimpleStrategy', 'replication_factor':1};

cqlsh> use people;

cqlsh:people> CREATE TABLE users(
          ... id varchar ,
          ... first_name varchar,
          ... last_name varchar,
          ... city varchar,
          ... emails varchar,
          ... PRIMARY KEY (id));

 

cqlsh:people> INSERT INTO users (id, first_name, last_name, city, emails) VALUES ('1', 'dhanuka', 'ranasinghe', 'colombo', 'dhanuka.priyanath@gmail.com');

cqlsh:people> SELECT * FROM users;

 id | city    | emails                      | first_name | last_name
----+---------+-----------------------------+------------+------------
  1 | colombo | dhanuka.priyanath@gmail.com |    dhanuka | ranasinghe
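
The same statements can also be kept in a file and applied in one go, which makes the setup repeatable. This is a sketch under a few assumptions: the file name people.cql is hypothetical, the IF NOT EXISTS clauses are added so the file can be re-run safely, and cqlsh is expected to reach the local node with its default connection settings.

```shell
# Write the statements from this section to a file (people.cql is a hypothetical name).
cat > people.cql <<'EOF'
CREATE KEYSPACE IF NOT EXISTS people
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS people.users (
  id varchar,
  first_name varchar,
  last_name varchar,
  city varchar,
  emails varchar,
  PRIMARY KEY (id)
);

INSERT INTO people.users (id, first_name, last_name, city, emails)
  VALUES ('1', 'dhanuka', 'ranasinghe', 'colombo', 'dhanuka.priyanath@gmail.com');
EOF

# Apply the file only if cqlsh is on the PATH; -f executes the statements in a file.
if command -v cqlsh >/dev/null 2>&1; then
  cqlsh -f people.cql || echo "cqlsh failed; is Cassandra running?" >&2
else
  echo "cqlsh not found; skipping" >&2
fi
```

Because the table name is qualified as people.users, no separate "use people" step is needed in the file.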




Build spark-cassandra-connector

1. Clone the project from its GitHub repository.

git clone https://github.com/datastax/spark-cassandra-connector.git 

cd spark-cassandra-connector

2. Build the project with Scala 2.11 and Cassandra 3.11.2

spark-cassandra-connector$ sbt -Dscala-2.11=true -Dtest.cassandra.version=3.11.2 assembly

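The build writes the assembly jar to the path below. The suffix after the version number (here 82-g0369a7b) depends on the revision you checked out, so yours may differ.

spark-cassandra-connector/spark-cassandra-connector/target/full/scala-2.11/spark-cassandra-connector-assembly-2.0.7-82-g0369a7b.jar

Rename it to something shorter (run this from the directory that contains the jar):

$ mv spark-cassandra-connector-assembly-2.0.7-82-g0369a7b.jar spark-cassandra-connector-assembly-2.0.7.jar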
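
Since the exact jar name changes from build to build, it can be easier to search for it than to type it. The sketch below is an illustration only: find_assembly_jar is a hypothetical helper, and it assumes the clone lives in a directory called spark-cassandra-connector under the current directory. It copies the jar instead of moving it, so the build output stays in place.

```shell
# Hypothetical helper: print the first connector assembly jar found under the directory $1.
find_assembly_jar() {
  find "$1" -type f -name 'spark-cassandra-connector-assembly-*.jar' | head -n 1
}

# Look inside the cloned project; stay quiet if the directory does not exist yet.
jar=$(find_assembly_jar spark-cassandra-connector 2>/dev/null || true)

if [ -n "$jar" ]; then
  # Copy under the short name used in the rest of this post.
  cp "$jar" spark-cassandra-connector-assembly-2.0.7.jar
  echo "copied $jar"
else
  echo "no assembly jar found; run the sbt assembly step first" >&2
fi
```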


Connect Spark to Cassandra through spark-shell

1. Copy spark-cassandra-connector-assembly-2.0.7.jar to the Spark jars directory

cp spark-cassandra-connector-assembly-2.0.7.jar $SPARK_HOME/jars

2. Start spark-shell

$ spark-shell --jars $SPARK_HOME/jars/spark-cassandra-connector-assembly-2.0.7.jar

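3. Stop the current Spark context. spark-shell creates one (sc) at startup, and it has to be stopped before a new context with the Cassandra settings can be created.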

scala> sc.stop

4. Read the Cassandra table from Spark

scala> import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf



scala> val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@2c2a7d53


scala> val sc = new SparkContext(conf)
sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@15914bb5


scala> val test_spark_rdd = sc.cassandraTable("people", "users")
test_spark_rdd: com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] = CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:19


scala> test_spark_rdd.first
res1: com.datastax.spark.connector.CassandraRow = CassandraRow{id: 1, city: colombo, emails: dhanuka.priyanath@gmail.com, first_name: dhanuka, last_name: ranasinghe}
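
An alternative to stopping and recreating the context is to pass the Cassandra host to spark-shell with --conf, so that the built-in sc already carries the setting. The following is a sketch of that approach, not the method used above: the script name read_users.scala is hypothetical, the Scala lines are fed to the shell on standard input, and the jar path assumes the copy step from this section.

```shell
# Scala commands for the shell to run (read_users.scala is a hypothetical file name).
cat > read_users.scala <<'EOF'
import com.datastax.spark.connector._

// sc is the context created by spark-shell; the host comes from --conf below.
val users = sc.cassandraTable("people", "users")
println(users.count)

// Fetch only two columns instead of whole rows.
users.select("first_name", "city").collect.foreach(println)
EOF

JAR="${SPARK_HOME:-}/jars/spark-cassandra-connector-assembly-2.0.7.jar"

# Run only when both spark-shell and the connector jar are available.
if command -v spark-shell >/dev/null 2>&1 && [ -f "$JAR" ]; then
  spark-shell --jars "$JAR" \
    --conf spark.cassandra.connection.host=localhost \
    < read_users.scala
else
  echo "spark-shell or connector jar not found; skipping" >&2
fi
```

With the host supplied this way, the sc.stop and new SparkContext(conf) steps are not needed, and the shell exits when it reaches the end of the script.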


References:

[1] https://www.datastax.com/dev/blog/kindling-an-introduction-to-spark-with-cassandra-part-1

[2] https://www.youtube.com/watch?v=jpEABn80OCU



