0.11.0 (2024-05-27)
Breaking Changes
Version upgrades may change connection behavior, so these changes are marked as breaking, although most users will not notice any difference.
- Update Clickhouse JDBC driver to the latest version (#249):
  * Package was renamed: ru.yandex.clickhouse:clickhouse-jdbc → com.clickhouse:clickhouse-jdbc.
  * Package version changed: 0.3.2 → 0.6.0-patch5.
  * Driver name changed: ru.yandex.clickhouse.ClickHouseDriver → com.clickhouse.jdbc.ClickHouseDriver.
  This brings several fixes for Spark <-> Clickhouse type compatibility, as well as Clickhouse cluster support.
- Update other JDBC drivers to the latest versions:
  * MSSQL 12.2.0 → 12.6.2 (#254).
  * MySQL 8.0.33 → 8.4.0 (#253, #285).
  * Oracle 23.2.0.0 → 23.4.0.24.05 (#252, #284).
  * Postgres 42.6.0 → 42.7.3 (#251).
- Update MongoDB connector to the latest version: 10.1.1 → 10.3.0 (#255, #283).
  This brings Spark 3.5 support.
- Update XML package to the latest version: 0.17.0 → 0.18.0 (#259).
  This brings a few bugfixes for datetime format handling.
- For JDBC connections, add a new SQLOptions class for the DB.sql(query, options=...) method (#272).
  Firstly, this keeps naming more consistent.
  Secondly, some options are not supported by the DB.sql(...) method but are supported by DBReader.
  For example, SQLOptions does not support partitioning_mode and requires explicit lower_bound and upper_bound values when num_partitions is greater than 1.
  ReadOptions does support partitioning_mode and allows skipping lower_bound and upper_bound values.
  This requires some code changes. Before:
from onetl.connection import Postgres

postgres = Postgres(...)

df = postgres.sql(
    """
    SELECT *
    FROM some.mytable
    WHERE key = 'something'
    """,
    options=Postgres.ReadOptions(
        partitioning_mode="range",
        partition_column="id",
        num_partitions=10,
    ),
)
After:
from onetl.connection import Postgres

postgres = Postgres(...)

df = postgres.sql(
    """
    SELECT *
    FROM some.mytable
    WHERE key = 'something'
    """,
    options=Postgres.SQLOptions(
        # partitioning_mode is not supported!
        partition_column="id",
        num_partitions=10,
        lower_bound=0,  # <-- set explicitly
        upper_bound=1000,  # <-- set explicitly
    ),
)
For now, DB.sql(query, options=...) can still accept ReadOptions to keep backward compatibility, but emits a deprecation warning. This support will be removed in v1.0.0.
- Split up the JDBCOptions class into FetchOptions and ExecuteOptions (#274).
  The new classes are used by the DB.fetch(query, options=...) and DB.execute(query, options=...) methods respectively.
  This is mostly to keep naming more consistent.
  This requires some code changes. Before:
from onetl.connection import Postgres

postgres = Postgres(...)

df = postgres.fetch(
    "SELECT * FROM some.mytable WHERE key = 'something'",
    options=Postgres.JDBCOptions(
        fetchsize=1000,
        query_timeout=30,
    ),
)

postgres.execute(
    "UPDATE some.mytable SET value = 'new' WHERE key = 'something'",
    options=Postgres.JDBCOptions(query_timeout=30),
)
After:
from onetl.connection import Postgres

postgres = Postgres(...)

# Using FetchOptions for fetching data
df = postgres.fetch(
    "SELECT * FROM some.mytable WHERE key = 'something'",
    options=Postgres.FetchOptions(  # <-- change class name
        fetchsize=1000,
        query_timeout=30,
    ),
)

# Using ExecuteOptions for executing statements
postgres.execute(
    "UPDATE some.mytable SET value = 'new' WHERE key = 'something'",
    options=Postgres.ExecuteOptions(query_timeout=30),  # <-- change class name
)
For now, DB.fetch(query, options=...) and DB.execute(query, options=...) can still accept JDBCOptions to keep backward compatibility, but emit a deprecation warning. The old class will be removed in v1.0.0.
- Serialize ColumnDatetimeHWM to Clickhouse's DateTime64(6) (precision up to microseconds) instead of DateTime (precision up to seconds) (#267).
  In previous onETL versions, the ColumnDatetimeHWM value was rounded to the second, so rows already read in previous runs could be read again, producing duplicates.
  For Clickhouse versions below 21.1, comparing a column of type DateTime with a value of type DateTime64 is not supported, returning an empty dataframe.
  To avoid this, replace:
DBReader(
    ...,
    hwm=DBReader.AutoDetectHWM(
        name="my_hwm",
        expression="hwm_column",  # <--
    ),
)
with:
DBReader(
    ...,
    hwm=DBReader.AutoDetectHWM(
        name="my_hwm",
        expression="CAST(hwm_column AS DateTime64)",  # <-- add explicit CAST
    ),
)
- Pass JDBC connection extra parameters as a properties dict instead of appending them to the URL query part (#268).
  This allows passing custom connection parameters like Clickhouse(extra={"custom_http_options": "option1=value1,option2=value2"}) without needing to urlencode the parameter value, like option1%3Dvalue1%2Coption2%3Dvalue2.
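For reference, a minimal sketch of passing such a parameter (the host, credentials and spark session below are placeholders):

from onetl.connection import Clickhouse

# placeholder connection details; only the `extra` dict matters here
clickhouse = Clickhouse(
    host="clickhouse.example.com",
    user="someuser",
    password="***",
    database="default",
    # passed as-is, no urlencoding required
    extra={"custom_http_options": "option1=value1,option2=value2"},
    spark=spark,  # an existing SparkSession
)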
Features
Improve user experience with Kafka messages and database tables containing serialized columns, like JSON/XML.
- Allow passing a custom package version as an argument to the DB.get_packages(...) method of several DB connectors (see the sketch below):
  * Clickhouse.get_packages(package_version=..., apache_http_client_version=...) (#249).
  * MongoDB.get_packages(scala_version=..., spark_version=..., package_version=...) (#255).
  * MySQL.get_packages(package_version=...) (#253).
  * MSSQL.get_packages(java_version=..., package_version=...) (#254).
  * Oracle.get_packages(java_version=..., package_version=...) (#252).
  * Postgres.get_packages(package_version=...) (#251).
  * Teradata.get_packages(package_version=...) (#256).
  Now users can downgrade or upgrade a connection without waiting for the next onETL release. Previously only Kafka and Greenplum supported this feature.
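For example, a minimal sketch of pinning a specific driver version while building the Spark session (the version string is illustrative):

from pyspark.sql import SparkSession

from onetl.connection import Postgres

# pass any published driver version instead of the default one
maven_packages = Postgres.get_packages(package_version="42.7.3")

spark = (
    SparkSession.builder
    .config("spark.jars.packages", ",".join(maven_packages))
    .getOrCreate()
)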
- Add FileFormat.parse_column(...) method to several classes:
  * Avro.parse_column(col) (#265).
  * JSON.parse_column(col, schema=...) (#257).
  * CSV.parse_column(col, schema=...) (#258).
  * XML.parse_column(col, schema=...) (#269).
  This allows parsing data in the value field of a Kafka message or a string/binary column of some table as a nested Spark structure, as sketched below.
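A minimal sketch of parsing a serialized Kafka value column with JSON.parse_column (the dataframe df and the schema are illustrative):

from pyspark.sql.types import IntegerType, StringType, StructField, StructType

from onetl.file.format import JSON

# illustrative schema of the JSON payload stored in the Kafka "value" field
schema = StructType(
    [
        StructField("id", IntegerType()),
        StructField("name", StringType()),
    ]
)

# df is a dataframe read from Kafka; replace the raw column with a parsed struct
parsed_df = df.select(JSON().parse_column("value", schema=schema))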
- Add FileFormat.serialize_column(...) method to several classes:
  * Avro.serialize_column(col) (#265).
  * JSON.serialize_column(col) (#257).
  * CSV.serialize_column(col) (#258).
  This allows saving Spark nested structures or arrays to the value field of a Kafka message or a string/binary column of some table, as sketched below.
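And the reverse direction, a minimal sketch of serializing a nested struct column back to a string (dataframe and column names are illustrative):

from onetl.file.format import JSON

# df contains a nested struct column "value"; turn it into a JSON string
# suitable for the Kafka message value
serialized_df = df.select(JSON().serialize_column("value"))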
Improvements
A few documentation improvements.
- Replace all assert statements in documentation with doctest syntax. This should make the documentation more readable (#273).
- Add a generic Troubleshooting guide (#275).
- Improve Kafka documentation:
  * Add a “Prerequisites” page describing different aspects of connecting to Kafka.
  * Improve the “Reading from” and “Writing to” pages, add more examples and usage notes.
  * Add a “Troubleshooting” page (#276).
- Improve Hive documentation:
  * Add a “Prerequisites” page describing different aspects of connecting to Hive.
  * Improve the “Reading from” and “Writing to” pages, add more examples and recommendations.
  * Improve the “Executing statements in Hive” page (#278).
- Add a “Prerequisites” page describing different aspects of using SparkHDFS and SparkS3 connectors (#279).
- Add a note about connecting to a Clickhouse cluster (#280).
- Add notes about the versions in which a specific class/method/attribute/argument was added, renamed, or changed behavior (#282).
Bug Fixes
- Fix missing pysmb package after installing onETL with pip install onetl[files].