The purpose of this blog post is to set up a Hadoop cluster secured by Kerberos authentication.
Please note that this setup is for learning purposes only and is not suitable for production.
Prerequisites:
- Install docker and docker-compose
- Install Java 8
- Entries added to my /etc/hosts file (change the IP address to match your machine):
192.168.0.114 hadoop.docker.com
192.168.0.114 hive-metastore hadoop.docker.com kdc.kerberos.com kdc EXAMPLE.COM
Steps:
1. Clone the following 3 GitHub projects
git clone https://github.com/dhanuka84/docker-hadoop.git
git clone https://github.com/dhanuka84/docker-hive.git
git clone https://github.com/dhanuka84/docker-hadoop-secure.git
2. Build the docker-hadoop Docker image.
cd docker-hadoop/base
docker build -t dhanuka/hadoop:2.7.7 .
3. Build the docker-hive Docker image.
cd docker-hive
docker build -t dhanuka/hive:2.3.2 .
4. Replace my local machine IP with your local machine IP.
cd docker-hadoop-secure
find ./ -type f -exec sed -i 's/192.168.0.114/your_local_ip/g' {} \;
5. Start the Kerberos Key Distribution Center (KDC) service
cd docker-hadoop-secure
docker-compose -f docker-kdc.yml up -d
6. Create Kerberos principals and keytabs
Log in to the kdc Docker container.
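Assuming the KDC container is named kdc (its shell prompt below suggests the hostname is kdc), something along these lines opens a shell in the container and starts kadmin:
docker exec -it kdc bash
[root@kdc /]# kadmin.local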
Execute the commands below one by one to create the principals.
kadmin.local: addprinc -randkey hive/hive-metastore@EXAMPLE.COM
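The keytab commands in the next step also reference nn, dn, spnego, jhs, yarn, nm, and rm principals. Assuming the same naming convention, they would be created the same way (a sketch, not copied verbatim from the original setup):
kadmin.local: addprinc -randkey nn/hadoop.docker.com@EXAMPLE.COM
kadmin.local: addprinc -randkey dn/hadoop.docker.com@EXAMPLE.COM
kadmin.local: addprinc -randkey spnego/hadoop.docker.com@EXAMPLE.COM
kadmin.local: addprinc -randkey jhs/hadoop.docker.com@EXAMPLE.COM
kadmin.local: addprinc -randkey yarn/hadoop.docker.com@EXAMPLE.COM
kadmin.local: addprinc -randkey nm/hadoop.docker.com@EXAMPLE.COM
kadmin.local: addprinc -randkey rm/hadoop.docker.com@EXAMPLE.COM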
Execute the commands below one by one to create keytabs from the principals.
kadmin.local: ktadd -k /opt/nn.service.keytab nn/hadoop.docker.com
kadmin.local: ktadd -k /opt/dn.service.keytab dn/hadoop.docker.com
kadmin.local: ktadd -k /opt/spnego.service.keytab spnego/hadoop.docker.com
kadmin.local: ktadd -k /opt/jhs.service.keytab jhs/hadoop.docker.com
kadmin.local: ktadd -k /opt/yarn.service.keytab yarn/hadoop.docker.com
kadmin.local: ktadd -k /opt/nm.service.keytab nm/hadoop.docker.com
kadmin.local: ktadd -k /opt/rm.service.keytab rm/hadoop.docker.com
kadmin.local: ktadd -k /opt/hive.keytab hive/hive-metastore
kadmin.local: q
[root@kdc /]# exit
The keytabs can be found in the kdc-opt directory. Now we need to copy them to the keytabs directory.
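Assuming the KDC's /opt directory is mounted to ./kdc-opt on the host and the Hadoop/Hive containers mount ./keytabs (which the directory names above suggest), a plain copy on the host is enough:
cp kdc-opt/*.keytab keytabs/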
7. Start the HDFS services
docker-compose -f docker-hdfs.yml up -d
Please note the Hadoop configuration files in the docker-hadoop-secure project (especially core-site.xml and hdfs-site.xml).
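As a rough sketch only (the values below are assumptions matching the principals and keytab paths used in this post, not copied from the repository), the Kerberos-related settings in those files look like this:
core-site.xml:
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
hdfs-site.xml:
<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>nn/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>dfs.namenode.keytab.file</name>
  <value>/etc/security/keytabs/nn.service.keytab</value>
</property>
<property>
  <name>dfs.datanode.kerberos.principal</name>
  <value>dn/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>dfs.datanode.keytab.file</name>
  <value>/etc/security/keytabs/dn.service.keytab</value>
</property>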
8. Start the Hive Metastore service
docker-compose -f docker-hive.yml up -d
Please note that this will boot up the Hive Metastore, HiveServer, and a Postgres DB.
We don't actually need HiveServer, so we can stop that Docker container.
Log in to the Hive Metastore container and use the CLI tool to create an external table.
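Assuming the metastore container is named hive-metastore (matching the shell prompt shown below):
docker exec -it hive-metastore bash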
As you can see, it throws an error:
Caused by: java.net.ConnectException: Call From hive-metastore/192.168.0.114 to hadoop.docker.com:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Run the command below:
kinit hive/hive-metastore@EXAMPLE.COM -k -t /etc/security/keytabs/hive.keytab
root@hive-metastore:/opt# cd hive/bin
root@hive-metastore:/opt/hive/bin# ./hive
Then the Hive CLI will appear.
9. Create the HDFS folder structure and copy a sample data file into it
Log in to the HDFS Docker container.
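The container name isn't shown in this post, so look it up first; the exec command below uses a placeholder name:
docker ps                          # find the HDFS/NameNode container name
docker exec -it <hdfs-container> bash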
Authenticate
bash-4.1# kinit nn/hadoop.docker.com@EXAMPLE.COM -k -t /etc/security/keytabs/nn.service.keytab
Create an HDFS folder
bash-4.1# hdfs dfs -mkdir /data
List the HDFS folders
bash-4.1# hdfs dfs -ls /
Found 2 items
drwxr-xr-x - root root 0 2020-05-15 05:15 /data
drwxrwx--- - root root 0 2020-05-15 05:02 /tmp
Create a test data file with the content below
dhanu,colombo
kithnu,colombo
yuki,colombo
bash-4.1# vi test.dat
Copy the test.dat file to the HDFS folder /data
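The upload command itself isn't shown here; the standard approach is hdfs dfs -put (assuming test.dat is in the current working directory), after which the listing below should show it:
bash-4.1# hdfs dfs -put test.dat /data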
bash-4.1# hdfs dfs -ls /data
10. Create an external Hive table.
Execute the command below in the Hive CLI:
CREATE EXTERNAL TABLE data_t5 (name string, city string) ROW FORMAT DELIMITED FIELDS TERMINATED BY "," STORED AS TEXTFILE LOCATION "/data";
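Optionally, you can verify the table directly from the Hive CLI before moving on (standard HiveQL, not part of the original steps); it should return the three rows from test.dat:
SELECT * FROM data_t5;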
Use a Postgres client application to connect to the Postgres DB.
As you can see, the table metadata is stored in the Postgres metastore database.
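For example, with psql (the host, port, user, and database name below are assumptions; check the Postgres service definition in docker-hive for the actual values). The Hive metastore keeps table definitions in the TBLS table:
psql -h localhost -p 5432 -U hive -d metastore
metastore=# SELECT "TBL_ID", "TBL_NAME", "TBL_TYPE" FROM "TBLS";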
11. Setup Presto-Server and Presto-CLI
Follow the documentation below to set up Presto-Server on your local machine:
https://docs.starburstdata.com/latest/installation/deployment.html
Follow the documentation below to set up Presto-CLI on your local machine:
https://docs.starburstdata.com/latest/installation/cli.html
I have used a single-node Presto-Server, which means one node acts as both the coordinator and the worker.
My Presto-Server folder structure:
My Presto-CLI folder structure:
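Roughly, they look like this (the keytabs/ directory holds the copied keytabs referenced by hive.properties below; the presto-cli directory name is just an illustration):
presto-server-332/
├── bin/        # launcher scripts
├── lib/
├── plugin/     # connectors, including hive-hadoop2
├── etc/        # config.properties, jvm.config, catalog/hive.properties
└── keytabs/    # hive.keytab, nn.service.keytab copied from docker-hadoop-secure
presto-cli/
└── presto-cli  # executable CLI jar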
Configure Presto-Server: etc/config.properties
dhanuka@dhanuka:~/research/hdfs/prestor/presto-server-332$ vim etc/config.properties
node.id=presto-master
node.environment=test
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=1024MB
query.max-memory-per-node=2048MB
query.max-total-memory-per-node=2048MB
discovery-server.enabled=true
discovery.uri=http://localhost:8080
task.max-worker-threads=8
Configure the Hive catalog and JVM:
Replace the files under the Presto etc/ directory with the files from docker-hadoop-secure/presto/etc/.
dhanuka@dhanuka:~/research/hdfs/prestor/presto-server-332$ vim etc/catalog/hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://hive-metastore:9083
hive.metastore.authentication.type=KERBEROS
hive.metastore.service.principal=hive/hive-metastore@EXAMPLE.COM
hive.metastore.client.principal=hive/hive-metastore@EXAMPLE.COM
hive.metastore.client.keytab=/home/dhanuka/research/hdfs/prestor/presto-server-332/keytabs/hive.keytab
hive.hdfs.authentication.type=KERBEROS
hive.hdfs.presto.principal=nn/hadoop.docker.com@EXAMPLE.COM
hive.hdfs.presto.keytab=/home/dhanuka/research/hdfs/prestor/presto-server-332/keytabs/nn.service.keytab
hive.hdfs.impersonation.enabled=false
hive.config.resources=/home/dhanuka/research/hdfs/docker-hadoop-secure/hive/conf/hdfs-site.xml,/home/dhanuka/research/hdfs/docker-hadoop-secure/hive/conf/core-site.xml
Replace /home/dhanuka/research/hdfs/ with the corresponding location on your local machine.
dhanuka@dhanuka:~/research/hdfs/prestor/presto-server-332$ vim etc/jvm.config
-server
-Xmx3G
-XX:+UseG1GC
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-XX:ReservedCodeCacheSize=150M
-DHADOOP_USER_NAME=hive
-Duser.timezone=UTC
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
-Dpresto-temporarily-allow-java8=true
-Djava.security.krb5.conf=/home/dhanuka/research/hdfs/docker-hadoop-secure/config_files/krb5.conf
-Dsun.security.krb5.debug=true
-Dlog.enable-console=true
12. Start Presto-Server and execute SQL commands in Presto-CLI
Launch Presto-Server
./bin/launcher run
Launch Presto-CLI
./presto-cli --catalog hive --schema default
Run a SELECT query
select * from data_t5;
References:
https://github.com/Knappek/docker-hadoop-secure