The Hadoop/Spark project template includes sample code to connect to the following resources, with and without Kerberos authentication: Spark (via Livy and Sparkmagic), HDFS, Hive, and Impala.
Two environments are created in the editor session. anaconda50_hadoop contains the packages consistent with the Python 3.6 template plus additional packages to access Hadoop and Spark resources. The anaconda50_impyla environment contains the packages consistent with the Python 2.7 template plus additional packages to access Impala tables using the Impyla Python package.
To authenticate, open a terminal and run the kinit command with your Kerberos principal, the combination of your username and security domain, which was provided to you by your Administrator. Executing the command requires you to enter a password. If there is no error message, authentication has succeeded. You can verify this by issuing the klist command; if it lists any entries, you are authenticated.
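For example (the principal shown is a placeholder; use your own):

kinit jdoe@EXAMPLE.COM   # prompts for your password
klist                    # lists cached tickets; any entry here means you are authenticated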
You can also authenticate with a keytab. Upload the keytab file to the project and include a kinit command that uses it as part of the deployment command. Alternatively, the deployment can include a form that asks for user credentials and executes the kinit command.
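For example, a deployment command might include a line like this (the keytab file name and principal are placeholders):

kinit -kt jdoe.keytab jdoe@EXAMPLE.COM   # authenticate from the uploaded keytab without a password prompt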
The Hadoop/Spark project template includes Sparkmagic, but your Administrator must have configured Workbench to work with a Livy server. The following kernels are available in the two environments:

[anaconda50_hadoop] Python 3: PySpark, PySpark3, R, Spark, SparkR
[anaconda50_impyla] Python 2: PySpark. Do not use PySpark3.

To work with Livy and Scala, use Spark.
You can use Spark with Workbench in two ways:

- In a Sparkmagic kernel such as PySpark, code in each cell runs on the cluster through Livy, except for cells that begin with %%local, which run in the project session; see the sketch after this list.
- In a regular Python kernel, run %load_ext sparkmagic.magics. That command will enable a set of functions to run code on the cluster. See examples (external link).

Code that runs locally in the session can work with the results using pandas or other packages.
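For example, in a Sparkmagic PySpark notebook the two kinds of cells might look like this (a minimal sketch shown as two separate cells; the spark object is provided by the kernel and the values are placeholders):

# This cell has no magic, so it runs on the cluster through Livy.
df = spark.range(1000)
df.count()

%%local
# This cell begins with %%local, so it runs in the project session instead.
import pandas as pd
pd.DataFrame({"x": [1, 2, 3]}).describe()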
In the common case, the configuration provided for you in the Session will be correct and will not require modification. However, in other cases you may need to use sandbox or ad-hoc environments that require the modifications described below. The Sparkmagic configuration is stored in ~/.sparkmagic/conf.json. You may inspect this file, particularly the "session_configs" section, or you may refer to the example file, sparkmagic_conf.example.json, in the spark directory. Note that the example file has not been tailored to your specific cluster.
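For illustration only, a "session_configs" section might look like the following sketch. These options are passed to Livy when a session is created; the field names follow the Livy session API, and the values shown here are placeholders:

{
  "session_configs": {
    "driverMemory": "1000M",
    "executorCores": 2,
    "conf": {
      "spark.sql.shuffle.partitions": "200"
    }
  }
}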
In a Sparkmagic kernel such as PySpark, SparkR, or similar, you can change the
configuration with the magic %%configure
. This syntax is pure JSON, and the
values are passed directly to the driver application.
EXAMPLE: You can set spark.driver.python and spark.executor.python to choose which Python interpreter Spark uses; the paths you specify must exist on all compute nodes in your Spark cluster.
EXAMPLE:
If all nodes in your Spark cluster have Python 2 deployed at /opt/anaconda2
and Python 3 deployed at /opt/anaconda3
, then you can select Python 2 on all
execution nodes with this code:
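(A sketch using the spark.driver.python and spark.executor.python properties named above; it assumes the interpreter is at bin/python under the install prefix, and the -f flag restarts any existing Livy session so the new settings take effect.)

%%configure -f
{"conf": {"spark.driver.python": "/opt/anaconda2/bin/python",
          "spark.executor.python": "/opt/anaconda2/bin/python"}}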
Similarly, you can select Python 3 on all execution nodes with this code:
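(Again a sketch, under the same assumptions as above.)

%%configure -f
{"conf": {"spark.driver.python": "/opt/anaconda3/bin/python",
          "spark.executor.python": "/opt/anaconda3/bin/python"}}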
After you have run %load_ext sparkmagic.magics, you can use the %manage_spark command to set configuration options. The session options are in the “Create Session” pane under “Properties”.
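For example, in a regular Python 3 notebook you might run the following (a sketch shown as separate cells; the %%spark cell assumes a session has already been created through the %manage_spark interface):

%load_ext sparkmagic.magics
%manage_spark

%%spark
# Runs on the cluster in the Livy session created above.
spark.range(1000).count()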
Overriding session settings can be used to target multiple Python and R interpreters, including Python and R interpreters coming from different Anaconda parcels. In the examples, replace /opt/anaconda/ with the prefix of the name and location for the particular parcel or management pack.

To use alternate Kerberos and Sparkmagic settings, place a krb5.conf file and a sparkmagic_conf.json file in the project directory so they will be saved along with the project itself. An example Sparkmagic configuration is included,
sparkmagic_conf.example.json
, listing the fields that are typically set. The
"url"
and "auth"
keys in each of the kernel sections are especially
important.
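For illustration, a pared-down sketch of such a file is shown below; the Livy URL is a placeholder, and sparkmagic_conf.example.json remains the authoritative reference for the section and field names:

{
  "kernel_python_credentials": {
    "url": "http://livy.example.com:8998",
    "auth": "Kerberos"
  },
  "kernel_r_credentials": {
    "url": "http://livy.example.com:8998",
    "auth": "Kerberos"
  }
}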
The krb5.conf
file is normally copied from the Hadoop cluster, rather than
written manually, and may refer to additional configuration or certificate
files. These files must all be uploaded using the interface.
To use these alternate configuration files, set the default value of the KRB5_CONFIG variable to the full path of krb5.conf, and set the values of
SPARKMAGIC_CONF_DIR
and SPARKMAGIC_CONF_FILE
to point to the Sparkmagic
config file. You can set these either by using the Project pane on the left of
the interface, or by directly editing the anaconda-project.yml
file.
For example, the final file’s variables section may look like this:
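(A sketch; the paths shown are placeholders for wherever the uploaded files actually live inside your project.)

variables:
  KRB5_CONFIG:
    default: /opt/continuum/project/krb5.conf
  SPARKMAGIC_CONF_DIR:
    default: /opt/continuum/project
  SPARKMAGIC_CONF_FILE:
    default: sparkmagic_conf.json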
Remember to run kinit before starting any notebook or kernel. If there is a syntax error in the .json configuration file, all Sparkmagic kernels will fail to launch. You can test your Sparkmagic configuration by running the following Python command in an interactive shell: python -m json.tool sparkmagic_conf.json.
To work with HDFS from Python using the python-hdfs package, select the [anaconda50_hadoop] Python 3 environment. To use the hdfscli command line, configure the ~/.hdfscli.cfg file:
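(A sketch of such a file, assuming a WebHDFS endpoint; the alias name, host, port, and user are placeholders for your cluster's values.)

[global]
default.alias = dev

[dev.alias]
url = http://namenode.example.com:50070
user = jdoe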
You can then run HDFS commands by selecting the [anaconda50_hadoop] Python 3 environment and executing the hdfscli command. For example:
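(A sketch; the alias and paths are placeholders.)

hdfscli --alias=dev                                  # open an interactive shell with a connected client
hdfscli download --alias=dev /tmp/data.csv data.csv  # copy a file out of HDFS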
To access Hive tables using the pyhive package, select the [anaconda50_hadoop] Python 3 environment and run:
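(A minimal sketch assuming a Kerberos-secured HiveServer2 endpoint; the host, port, and service name are placeholders for your cluster's values.)

from pyhive import hive

# Connect to HiveServer2; auth and kerberos_service_name must match the cluster's setup.
conn = hive.connect(host='hiveserver.example.com', port=10000,
                    auth='KERBEROS', kerberos_service_name='hive')
cur = conn.cursor()
cur.execute('SHOW TABLES')
print(cur.fetchall())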
To access Impala tables using the impyla package (or implyr from R), select the [anaconda50_impyla] Python 2 environment and run:
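(A minimal sketch; the host is a placeholder, port 21050 is the usual Impala HiveServer2 port, and auth_mechanism='GSSAPI' applies only when Kerberos is in use.)

from impala.dbapi import connect

# Connect to an Impala daemon; drop auth_mechanism on an unsecured cluster.
conn = connect(host='impala.example.com', port=21050, auth_mechanism='GSSAPI')
cur = conn.cursor()
cur.execute('SHOW TABLES')
print(cur.fetchall())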