TrueFoundry Spark Notebooks provide a JupyterLab environment with a dedicated Spark Connect server running alongside it. This gives you an interactive PySpark and Scala environment backed by a fully managed Spark cluster on Kubernetes — with no external infrastructure to set up. Use Spark Notebooks when you need to:
  • Explore and transform large datasets interactively
  • Prototype Spark ETL pipelines before productionizing them
  • Run distributed computations without managing Spark infrastructure

Getting Started

To launch a Spark Notebook, select Jupyter Notebook with Spark as the workbench type in the deployment form and configure the Spark cluster settings.
1. Create a new Notebook

Navigate to your workspace and click New Notebook. Select the Jupyter Notebook with Spark type.

2. Choose a Spark image

Select the pre-built Spark image or provide a custom extended image.

3. Configure Spark cluster resources

Set the driver resources, executor count (or dynamic scaling), and executor resources.

4. Launch

Click deploy. The notebook and Spark Connect server will start together. A SparkSession is automatically available in every Python and Scala notebook cell.

Pre-built Image

Image | Spark | Python | Scala | Delta Lake | Databricks LTS
public.ecr.aws/truefoundrycloud/jupyter-spark:0.4.9-py3.11.14-sc2.12-spark3.5.7-sudo | 3.5.7 | 3.11 | 2.12 | 3.3.1 | 16.4
Support for additional Spark versions (Spark 4.0, more Python versions) and Databricks LTS compatibility is coming soon.
All Jupyter Spark images are available at https://gallery.ecr.aws/truefoundrycloud/jupyter-spark
The image includes:
  • PySpark with Spark Connect client
  • Delta Lake for ACID table operations
  • Scala kernel (Almond) pre-configured with Spark Connect JARs
  • Conda for managing multiple Python environments

Using Spark in the Notebook

Spark is preconfigured in the notebook and available via the spark variable.
# `spark` is already available — no setup needed
df = spark.range(1000000).toDF("id")
df.filter(df.id % 2 == 0).count()
The notebook connects to the Spark Connect server via the SPARK_CONNECT_URL environment variable, which is automatically set to point to the co-located Spark Connect server.
The startup script retries the connection up to 5 times (configurable via SPARK_INIT_RETRIES). If the Spark Connect server hasn’t started yet, the session will be created once it becomes available.
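The retry behavior described above can be sketched as follows. This is a minimal illustration, not the actual startup script: the helper names (connect_with_retries, create_session, init_spark) are hypothetical, and it assumes SPARK_INIT_RETRIES counts total attempts. Only the environment variable names and SparkSession.builder.remote come from the platform and PySpark.

```python
import os
import time


def connect_with_retries(connect, retries=5, delay=3.0, sleep=time.sleep):
    """Call connect() until it succeeds, making at most `retries` attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return connect()
        except Exception as exc:  # a real script would likely catch narrower errors
            last_error = exc
            if attempt < retries:
                sleep(delay)  # wait before the next attempt
    raise RuntimeError(
        f"Spark Connect not reachable after {retries} attempts"
    ) from last_error


def create_session():
    # Imported lazily so the retry helper itself does not require PySpark.
    from pyspark.sql import SparkSession

    url = os.environ["SPARK_CONNECT_URL"]  # set automatically by the platform
    return SparkSession.builder.remote(url).getOrCreate()


def init_spark():
    # Defaults mirror the documented values (5 retries, 3 seconds apart).
    retries = int(os.environ.get("SPARK_INIT_RETRIES", "5"))
    delay = float(os.environ.get("SPARK_INIT_RETRY_DELAY", "3"))
    return connect_with_retries(create_session, retries=retries, delay=delay)
```

If the preconfigured spark variable is ever missing (for example, after a kernel restart while the server is still starting), calling something like init_spark() from a cell is one way to recreate the session manually.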

Using Delta Lake

Delta Lake is pre-installed, enabling ACID transactions on your data lake:
df.write.format("delta").mode("overwrite").save("s3a://my-bucket/delta-table/")

delta_df = spark.read.format("delta").load("s3a://my-bucket/delta-table/")
delta_df.show()
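Because every Delta write creates a new table version, an earlier snapshot can usually be read back with the versionAsOf reader option. The helper below is a small sketch; the function name is hypothetical, the path is whatever you wrote to earlier, and it assumes the option is passed through unchanged by the Spark Connect server.

```python
def read_delta_version(spark, path, version):
    """Load the Delta table at `path` as of a specific commit version."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)  # Delta Lake time-travel option
        .load(path)
    )


# In a notebook cell, with the preconfigured `spark` session:
# first_version = read_delta_version(spark, "s3a://my-bucket/delta-table/", 0)
# first_version.show()
```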

Spark Cluster Configuration

The Spark cluster is configured through the Spark Cluster Config section in the deployment form.

Driver Resources

The Spark Connect server (driver) runs as a separate pod. Configure its resources based on the complexity of your query plans and the volume of data collected to the driver.
cpu_request (number, default: 1): Minimum CPU cores for the driver.
cpu_limit (number, default: 3): Maximum CPU cores for the driver.
memory_request (number, default: 4000): Minimum memory in MB for the driver.
memory_limit (number, default: 6000): Maximum memory in MB for the driver.

Executor Instances

Choose between Fixed and Dynamic executor scaling:
Fixed: a fixed number of executor pods is launched when the Spark cluster starts.
Parameter | Default | Description
count | 2 | Number of executor pods to start

Executor Resources

Each executor pod gets its own resource allocation:
cpu_request (number, default: 2): CPU cores per executor.
memory_request (number, default: 4000): Memory in MB per executor.
ephemeral_storage_request (number, default: 5000): Ephemeral disk in MB per executor (used for shuffle data).
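To estimate how much capacity the Spark cluster requests from Kubernetes, add the driver request to the per-executor request multiplied by the executor count. The sketch below uses the default values listed above; the PodRequest class and total_requests function are hypothetical helpers, and the notebook pod's own resources are not included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodRequest:
    cpu: float       # CPU cores requested
    memory_mb: int   # memory requested, in MB


def total_requests(driver, executor, executor_count):
    """Sum the driver request and `executor_count` executor requests."""
    return PodRequest(
        cpu=driver.cpu + executor.cpu * executor_count,
        memory_mb=driver.memory_mb + executor.memory_mb * executor_count,
    )


# Documented defaults: driver 1 CPU / 4000 MB, two executors at 2 CPU / 4000 MB.
driver = PodRequest(cpu=1, memory_mb=4000)
executor = PodRequest(cpu=2, memory_mb=4000)
total = total_requests(driver, executor, executor_count=2)
print(total.cpu, total.memory_mb)
```

With the defaults, the cluster requests 5 CPU cores and 12000 MB of memory in total, so the target node pool needs at least that much free capacity.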

Spark Configuration Properties

Pass additional Spark configuration as key-value pairs. These are applied to the Spark Connect server and executors.
spark.sql.adaptive.enabled = true
spark.sql.shuffle.partitions = 200
spark.jars.packages = io.delta:delta-spark_2.12:3.3.1
Some internal configuration (e.g., spark.jars.ivy, connection timeouts, spark.connect packages) is managed automatically. User-supplied spark.jars.packages values are merged with the internal ones.
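To confirm which values are in effect, you can read them back from the session inside the notebook. The helper below is a sketch with a hypothetical name; it relies only on spark.conf.get, and it assumes the Spark Connect server exposes the keys you ask for (some static settings may not be readable this way).

```python
def read_conf(spark, keys, unset="<unset>"):
    """Return {key: value} for each key, using `unset` where no value is set."""
    return {key: spark.conf.get(key, unset) for key in keys}


# In a notebook cell, with the preconfigured `spark` session:
# read_conf(spark, [
#     "spark.sql.adaptive.enabled",
#     "spark.sql.shuffle.partitions",
# ])
```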

Spark Image

By default, the Spark Connect server and executors use the apache/spark:3.5.7 image, which matches the pre-built notebook image. You can override this with a custom Spark image in the Advanced section of the Spark Cluster Config. The image must have Spark pre-installed and be compatible with the Kubernetes executor model.

Environment Variables

The following environment variables are automatically set or can be overridden:
Variable | Default | Description
SPARK_CONNECT_URL | Auto-generated | gRPC URL of the Spark Connect server
SPARK_INIT_RETRIES | 5 | Number of connection retries at startup
SPARK_INIT_RETRY_DELAY | 3 | Seconds between retries
You can also add custom environment variables (plain text or secret references) in the deployment form for your application code.

Service Account

If your Spark jobs need to access cloud storage (S3, GCS, ADLS) or other cloud services, assign a Kubernetes service account with the appropriate IAM role to the notebook. The Spark Connect server and executors inherit this service account for cloud access. Configure the service account in the Advanced section of the deployment form.

Custom Images

You can build custom Spark notebook images by extending the pre-built base images:
FROM public.ecr.aws/truefoundrycloud/jupyter-spark:0.4.9-py3.11.14-sc2.12-spark3.5.7-sudo

# Install additional pip packages
RUN python3 -m pip install --no-cache-dir \
    koalas \
    mlflow \
    scikit-learn

# Install apt packages
USER root
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends graphviz && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
USER jovyan
Do not overwrite the ENTRYPOINT or CMD instructions. These are built into the base images and are critical for correct operation.
Build and push the image to a registry integrated with TrueFoundry, then select it as a custom image when creating the notebook.
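After launching a notebook from a custom image, a quick check from a cell can confirm that the extra packages are importable. This is a sketch with a hypothetical helper name; note that it takes top-level import names, which can differ from pip package names (for example, scikit-learn is imported as sklearn).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Example for the Dockerfile above:
# missing_modules(["mlflow", "sklearn"])  # an empty list means everything is installed
```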