Use DSBulk to load, unload and count data on Aiven service for Cassandra®
DSBulk is a highly configurable tool used to load, unload and count data in Apache Cassandra®. It has configurable consistency levels for loading and unloading and offers the most accurate way to count records in Cassandra.
Prerequisites
To install the latest release of DSBulk, download the .zip or .tar.gz file from the DSBulk GitHub repository.
Tip
You can read more about the different DSBulk use cases and its manual pages in the dedicated documentation.
Variables
These are the placeholders you need to replace in the code samples:
| Variable | Description |
|---|---|
| PASSWORD | Password of the avnadmin user |
| HOST | Host name for the connection |
| PORT | Port number to use for the Cassandra service |
| | Path of the CA Certificate for the Cassandra service |
| KEYSTORE_PASSWORD | Password to secure your keystore |
Tip
Most of the above variables and the CA Certificate file can be found in the Aiven Console, in the service detail page.
Preparation of the environment
For dsbulk to read the security certificate when it connects to Aiven service for Cassandra, the certificate must be imported into a truststore.
1. Download the certificate from the service overview page of your Aiven for Apache Cassandra service. Save the CA certificate in a file called cassandra-certificate.pem in a directory on the Linux system where dsbulk runs.
2. Run this command to create a truststore file and import the certificate into it:
keytool -import -v \
-trustcacerts \
-alias CARoot \
-file cassandra-certificate.pem \
-keystore client.truststore \
-storepass KEYSTORE_PASSWORD
A truststore file called client.truststore is created in the directory where the keytool command was launched.
The keytool command assumes the file cassandra-certificate.pem is in the same directory where you run keytool. If that is not the case, provide the full path to cassandra-certificate.pem.
The next step is to create a configuration file with the connection information.
With a configuration file, the dsbulk command line is more readable and does not show passwords in clear text. If you don't create a configuration file, every option must be provided explicitly on the command line.
Create a file that contains the connection configuration like the following:
datastax-java-driver {
  advanced {
    ssl-engine-factory {
      keystore-password = "cassandra"
      keystore-path = "/home/user1/client.truststore"
      class = DefaultSslEngineFactory
      truststore-password = "cassandra"
      truststore-path = "/home/user1/client.truststore"
    }
    auth-provider {
      username = avnadmin
      password = PASSWORD
    }
  }
}
The DSBulk configuration file can contain many different blocks for different configurations. In the above example, only the datastax-java-driver block is filled in. The ssl-engine-factory block contains the path of the truststore and the related password. The example uses "cassandra" as that password; in your file, use the value you passed to -storepass when you created the truststore.
Tip
The Cassandra documentation has both a full reference and templates for the application configuration file, and a full reference for the driver configuration file.
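As one possible way to produce this file, the following shell sketch writes it from environment variables and restricts its permissions, since it holds passwords. The file name dsbulk.conf, the CONF_FILE and TRUSTSTORE_PATH variable names, and the default values are assumptions made for illustration; the configuration keys are the ones from the example above.

```shell
#!/bin/sh
set -eu

# Hypothetical names and defaults, for illustration only.
CONF_FILE="${CONF_FILE:-./dsbulk.conf}"
TRUSTSTORE_PATH="${TRUSTSTORE_PATH:-/home/user1/client.truststore}"
KEYSTORE_PASSWORD="${KEYSTORE_PASSWORD:-KEYSTORE_PASSWORD}"
PASSWORD="${PASSWORD:-PASSWORD}"

# New files are created readable by the owner only.
umask 077

# Write the connection settings. Values that contain double quotes or
# backslashes would need escaping inside the quoted strings.
cat > "$CONF_FILE" <<EOF
datastax-java-driver {
  advanced {
    ssl-engine-factory {
      keystore-password = "$KEYSTORE_PASSWORD"
      keystore-path = "$TRUSTSTORE_PATH"
      class = DefaultSslEngineFactory
      truststore-password = "$KEYSTORE_PASSWORD"
      truststore-path = "$TRUSTSTORE_PATH"
    }
    auth-provider {
      username = avnadmin
      password = "$PASSWORD"
    }
  }
}
EOF

# Also tighten permissions if the file already existed before this run.
chmod 600 "$CONF_FILE"
```

Setting PASSWORD, KEYSTORE_PASSWORD and TRUSTSTORE_PATH in the environment before running the script replaces the placeholder defaults.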
Run a dsbulk command to count records in a Cassandra table
Once the configuration file is created, you can run dsbulk.
1. Navigate to the bin subdirectory of the downloaded dsbulk package.
2. Run the following command:
./dsbulk count \
-f /full/path/to/conf.file \
-k baselines \
-t keyvalue \
-h HOST \
-port PORT \
--log.verbosity 2
where:
- baselines and keyvalue are the names of the example keyspace and table in the Cassandra database.
- log.verbosity controls the amount of logging that is printed to the console when dsbulk runs. verbosity=2 is used only to troubleshoot problems. To reduce verbosity, reduce the number to 1 or remove the option altogether.
- -f specifies the path to the configuration file.
- -h and -port are the hostname and port number to connect to Cassandra.
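The -f, -h and -port options are connection settings rather than anything specific to counting, so they can be kept in one place and reused. A minimal sketch, assuming a hypothetical helper named run_dsbulk and hypothetical environment variables CONF_FILE, HOST, PORT and DSBULK_BIN (none of these are part of DSBulk):

```shell
#!/bin/sh
# run_dsbulk OPERATION [OPTIONS...]
# Hypothetical helper (not part of DSBulk): runs the given operation with the
# shared -f, -h and -port options, then appends any operation-specific options.
# CONF_FILE, HOST and PORT must be set; DSBULK_BIN defaults to ./dsbulk.
run_dsbulk() {
  operation="$1"
  shift
  "${DSBULK_BIN:-./dsbulk}" "$operation" \
    -f "${CONF_FILE:?set CONF_FILE to the configuration file path}" \
    -h "${HOST:?set HOST to the service host name}" \
    -port "${PORT:?set PORT to the service port}" \
    "$@"
}

# Example, equivalent to the count command above:
# run_dsbulk count -k baselines -t keyvalue --log.verbosity 2
```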
Extract data from a Cassandra table in CSV format
To extract the data from a table, you can use the following command:
./dsbulk unload \
-f /full/path/to/conf.file \
-k baselines \
-t keyvalue \
-h HOST \
-port PORT \
-url /directory_for_output
This command extracts all records from the table and writes them in CSV format to the directory specified in the -url parameter.
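To sanity-check an unload, the number of data rows in the output directory can be compared with the result of dsbulk count. A rough sketch, assuming the output consists of one or more .csv files that each begin with a single header line and that no field contains a line break; count_csv_rows is a hypothetical helper name:

```shell
#!/bin/sh
# count_csv_rows DIR
# Hypothetical helper: prints the number of data rows across all *.csv files
# in DIR, assuming one header line per file and no multi-line fields.
# The body runs in a subshell so its variables do not leak to the caller.
count_csv_rows() (
  total=0
  for file in "$1"/*.csv; do
    [ -e "$file" ] || continue             # no CSV files matched the pattern
    lines=$(awk 'END { print NR }' "$file")
    if [ "$lines" -gt 0 ]; then
      total=$((total + lines - 1))         # subtract the header line
    fi
  done
  echo "$total"
)

# Example: count_csv_rows /directory_for_output
```

If the files were written without a header line, or values contain embedded newlines, the result will not match the table's record count.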
Load data into a Cassandra table from a CSV file
To load data into a Cassandra table, the command line is very similar to the previous command:
./dsbulk load \
-f /full/path/to/conf.file \
-k baselines \
-t keyvalue \
-h HOST \
-port PORT \
-url data.csv
where data.csv is the file that contains the data to load into Cassandra.
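As an illustration of the input format, the sketch below writes a small data.csv and checks that every line has as many fields as the header before loading. It assumes a header line with the column names and a table with two columns named key and value, which are hypothetical; check_csv_fields is a hypothetical helper, and its naive comma split does not handle quoted fields that contain commas.

```shell
#!/bin/sh
set -eu

# Hypothetical sample data: assumes columns named "key" and "value".
# Note: this overwrites any data.csv in the current directory.
cat > data.csv <<'EOF'
key,value
1,alpha
2,beta
EOF

# check_csv_fields FILE
# Succeeds only if every line has the same number of comma-separated fields
# as the first (header) line; prints the offending lines otherwise.
check_csv_fields() {
  awk -F',' '
    NR == 1 { expected = NF }
    NF != expected {
      printf "line %d: %d fields, expected %d\n", NR, NF, expected
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}

check_csv_fields data.csv && echo "data.csv: field counts are consistent"
```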