If you run CDH, Cloudera’s distribution of Hadoop, we aim to provide first-class integration on Google Cloud, including the ability to use Cloud Storage from your CDH cluster.
In this post, we’ll help you get started deploying the Cloud Storage connector for your CDH clusters. The methods and steps we discuss here apply to both on-premises clusters and cloud-based clusters. Keep in mind that the Cloud Storage connector uses Java, so you’ll want to make sure that the appropriate Java 8 packages are installed on your CDH cluster. Java 8 should come pre-configured as your default Java Development Kit (JDK).
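To confirm the Java requirement on a node, a quick check along these lines may help. This is a minimal sketch, not part of the connector tooling: the helper name `java_major` is made up for illustration, and it assumes `java -version` reports a quoted version string such as "1.8.0_292".

```shell
# Hypothetical helper: reduce a Java version string to its major version.
# "1.8.0_292" -> 8, "11.0.2" -> 11
java_major() {
  _v="$1"
  case "$_v" in
    1.*) _v="${_v#1.}" ;;   # legacy "1.x" scheme used through Java 8
  esac
  # Drop everything from the first '.', '_' or '-' onward.
  printf '%s\n' "${_v%%[._-]*}"
}

if command -v java >/dev/null 2>&1; then
  # java -version writes to stderr, so redirect it before parsing.
  JAVA_VERSION="$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')"
  echo "Default Java major version: $(java_major "$JAVA_VERSION")"
else
  echo "java not found on PATH"
fi
```

If the reported major version is not 8, check which JDK your CDH hosts are configured to use before continuing.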
[Check out this post if you’re deciding how and when to use Cloud Storage over the Hadoop Distributed File System (HDFS).]
Here’s how to get started:
Distribute using the Cloudera parcel
If you’re running a large Hadoop cluster or more than one cluster, it can be hard to deploy libraries and configure Hadoop services to use those libraries without making mistakes. Fortunately, Cloudera Manager provides a way to install packages with parcels. A parcel is a binary distribution format that consists of a gzipped (compressed) tar archive file with metadata.
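To make the format concrete, here is a rough sketch of a parcel-shaped archive built by hand. It assumes the common parcel layout (a top-level `<name>-<version>` directory containing `meta/parcel.json`, archived as `<name>-<version>-<distro>.parcel`). The names, the placeholder payload, and the trimmed-down `parcel.json` are illustrative only; a real parcel needs a complete metadata file.

```shell
# Hypothetical values, chosen only for illustration.
PARCEL_NAME="gcsconnector"
PARCEL_VERSION="1.0.0"
DISTRO="el7"

WORK_DIR="$(mktemp -d)"
PARCEL_ROOT="${WORK_DIR}/${PARCEL_NAME}-${PARCEL_VERSION}"
mkdir -p "${PARCEL_ROOT}/meta" "${PARCEL_ROOT}/lib"

# Placeholder payload standing in for the real connector JAR.
echo "placeholder" > "${PARCEL_ROOT}/lib/example.jar"

# Trimmed-down metadata; NOT a complete parcel.json.
cat > "${PARCEL_ROOT}/meta/parcel.json" <<EOF
{
  "schema_version": 1,
  "name": "${PARCEL_NAME}",
  "version": "${PARCEL_VERSION}"
}
EOF

# A parcel is a gzipped tar archive of that directory.
PARCEL_FILE="${WORK_DIR}/${PARCEL_NAME}-${PARCEL_VERSION}-${DISTRO}.parcel"
tar -czf "${PARCEL_FILE}" -C "${WORK_DIR}" "${PARCEL_NAME}-${PARCEL_VERSION}"

# List the archive contents to inspect the layout.
tar -tzf "${PARCEL_FILE}"
```

Listing an existing parcel with `tar -tzf` in the same way is a quick means of seeing what it would install.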
We recommend using the CDH parcel to install the Cloud Storage connector. Compared with deploying and configuring the connector manually on your Hadoop cluster, a parcel has two big advantages:
- Self-contained distribution: All related libraries, scripts and metadata are packaged into a single parcel file. You can host it at an internal location that is accessible to the cluster or even upload it directly to the Cloudera Manager node.
- No need for sudo access or root: The parcel is not deployed under /usr or any other system directory. Cloudera Manager deploys it through its agents, so you don’t need sudo or root access to deploy it.
Create your own Cloud Storage connector parcel
To create the parcel for your clusters, download and use this script. You can do this on any machine with access to the internet.
This script will execute the following actions:
- Download Cloud Storage connector to a local drive
- Package the connector Java Archive (JAR) file into a parcel
- Place the parcel under the Cloudera Manager’s parcel repo directory
If you’re connecting an on-premises CDH cluster, or a cluster on a cloud provider other than Google Cloud Platform (GCP), follow the instructions from this page to create a service account and download its JSON key file.
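If you prefer the command line, the service account setup might look roughly like the sketch below, using the `gcloud` CLI. The project ID, account name, key file name, and the choice of the `roles/storage.objectAdmin` role are all assumptions made for illustration; choose values and permissions that match your own security requirements.

```shell
# Hypothetical values; replace with your own.
PROJECT_ID="my-project"
SA_NAME="cdh-gcs-connector"
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
KEY_FILE="./gcs-connector-key.json"

# Wrapped in a function so nothing runs until you call it explicitly.
create_connector_service_account() {
  # 1. Create the service account.
  gcloud iam service-accounts create "${SA_NAME}" \
    --project="${PROJECT_ID}" \
    --display-name="CDH Cloud Storage connector" &&
  # 2. Grant it access to Cloud Storage objects (role is an assumption).
  gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
    --member="serviceAccount:${SA_EMAIL}" \
    --role="roles/storage.objectAdmin" &&
  # 3. Download a JSON key file for the account.
  gcloud iam service-accounts keys create "${KEY_FILE}" \
    --iam-account="${SA_EMAIL}" \
    --project="${PROJECT_ID}"
}
```

Calling `create_connector_service_account` from an authenticated `gcloud` session would then leave the JSON key in `KEY_FILE`, ready to be placed next to the parcel script.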
Create the Cloud Storage parcel
Next, run the script to create the parcel file and its checksum file so that Cloudera Manager can find them. Follow these steps:
1. Place the service account JSON key file and the create_parcel.sh script under the same directory. Make sure that there are no other files under this directory.
2. Run the script, which will look something like this:
$ ./create_parcel.sh -f <parcel_name> -v <version> -o <os_distro_suffix>
- parcel_name is the name of the parcel as a single string without any spaces or special characters (e.g., gcsconnector)
- version is the version of the parcel in the format x.x.x (e.g., 1.0.0)
- os_distro_suffix is the distribution suffix of the parcel. Like RPM or deb packages, parcels follow a naming convention that includes the target distribution. A full list of possible distribution suffixes can be found here.
- -d is a flag you can use to deploy the parcel to the Cloudera Manager parcel repo folder. It’s optional; if it is not provided, the parcel file is created in the directory where the script ran.
3. Logs of the script can be found in /var/log/build_script.log
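As a sanity check before running the script, you could validate the arguments against the rules above. The sketch below is an illustration only: the helper names are made up, "no special characters" is interpreted as letters and digits only, and `el7` is used as an example distribution suffix.

```shell
# Hypothetical helpers; the accepted patterns are assumptions
# based on the argument descriptions above.
valid_parcel_name() {
  # Letters and digits only: no spaces or special characters.
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9]+$'
}

valid_version() {
  # Three dot-separated numbers, as in x.x.x.
  printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'
}

PARCEL_NAME="gcsconnector"
VERSION="1.0.0"
DISTRO="el7"   # example suffix; pick the one that matches your hosts

if valid_parcel_name "${PARCEL_NAME}" && valid_version "${VERSION}"; then
  # Print the command rather than running it, so you can review it first.
  echo "./create_parcel.sh -f ${PARCEL_NAME} -v ${VERSION} -o ${DISTRO}"
else
  echo "Invalid parcel name or version" >&2
fi
```

Append the optional deploy flag to the printed command if you want the script to place the parcel in the Cloudera Manager parcel repo folder for you.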
Distribute and activate the parcel
Once you’ve created the Cloud Storage parcel, Cloudera Manager has to recognize the parcel and install it on the cluster.
The script you ran generated a .parcel file and a .parcel.sha checksum file. Put these two files on the Cloudera Manager node in the /opt/cloudera/parcel-repo directory. If you already host Cloudera parcels somewhere, you can place these files there instead and add an entry for the new parcel to the manifest.json file.
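Before copying the files, it can be worth confirming that the checksum file still matches the parcel, since a mismatch can keep a parcel from being accepted. The sketch below assumes the .sha file holds a SHA-1 hash as its first field and that `sha1sum` is available; the function name is made up for illustration.

```shell
# Hypothetical helper: succeeds only if <parcel>.sha matches the parcel.
# Assumes the .sha file stores a SHA-1 hash as its first field.
verify_parcel_checksum() {
  _parcel="$1"
  _expected="$(awk '{print $1; exit}' "${_parcel}.sha")"
  _actual="$(sha1sum "${_parcel}" | awk '{print $1}')"
  [ "${_expected}" = "${_actual}" ]
}

# Example usage (file name is illustrative):
#   if verify_parcel_checksum gcsconnector-1.0.0-el7.parcel; then
#     cp gcsconnector-1.0.0-el7.parcel gcsconnector-1.0.0-el7.parcel.sha \
#        /opt/cloudera/parcel-repo/
#   fi
```

Copying only after the check passes avoids placing a truncated or corrupted parcel in the repo directory.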
In the Cloudera Manager interface, go to Hosts -> Parcels and click Check for New Parcels to refresh the list. The Cloud Storage connector parcel should then appear in the list.