Columns¶

hamana is designed to work with data that is often extracted in tabular form. For this reason, the library introduces the Column class, which stores useful information about the extracted data (e.g. name, type, etc.) and describes it more precisely. For example, Column objects appear in Query objects and in the definition of CSV connectors.

Although Column classes define a general behavior, they can be customized to better fit specific data types or sources. hamana provides default implementations for the most common data types:

NumberColumn: this column can be used to manage any kind of number.
IntegerColumn: column class specialised to manage integer values.
StringColumn: column class specialised to manage string values.
BooleanColumn: column class specialised to manage boolean values.
DatetimeColumn: this column is specific for datetime values.
DateColumn: this column is specific for date values.

These classes are useful because they already provide a default implementation of the ColumnParser class, which is used to convert the data from the source to the internal representation. They also provide additional class attributes suited to the corresponding datatype.

It is always possible to create custom Column classes by extending the Column class and providing a custom implementation of the ColumnParser class.

DataType¶

Before presenting the Column class, we first introduce the DataType class. This class standardizes how types are handled inside the library, and it provides a bridge between SQLite and pandas data types.

hamana.core.column.DataType ¶

Bases: Enum

Enumeration representing the datatypes of the hamana columns.

The library supports the following data types:

INTEGER: integer data type.
NUMBER: number data type.
STRING: string data type.
BOOLEAN: boolean data type.
DATETIME: datetime data type.
DATE: date data type.
CUSTOM: custom data type.

The CUSTOM data type represents a custom datatype that can be used for dedicated implementations.

Since the library is designed to be used with pandas and SQLite, the DataType enumeration also provides methods to map pandas data types to DataType and DataType to SQLite data types.

INTEGER `class-attribute` `instance-attribute` ¶

INTEGER = 'integer'

Integer data type.

NUMBER `class-attribute` `instance-attribute` ¶

NUMBER = 'number'

Number data type.

STRING `class-attribute` `instance-attribute` ¶

STRING = 'string'

String data type.

BOOLEAN `class-attribute` `instance-attribute` ¶

BOOLEAN = 'boolean'

Boolean data type.

DATETIME `class-attribute` `instance-attribute` ¶

DATETIME = 'datetime'

Datetime data type.

DATE `class-attribute` `instance-attribute` ¶

DATE = 'date'

Date data type.

CUSTOM `class-attribute` `instance-attribute` ¶

CUSTOM = 'custom'

Custom data type.

from_pandas `classmethod` ¶

from_pandas(dtype: str) -> DataType

Function to map a pandas datatype to DataType.

If no mapping is found, a warning is logged and DataType.STRING is returned as the default.

Parameters:

Name	Type	Description	Default
`dtype`	`str`	pandas data type.	required

Returns:

Type	Description
`DataType`	`DataType` mapped.

Source code in src/hamana/core/column.py

@classmethod
def from_pandas(cls, dtype: str) -> "DataType":
    """
        Function to map a `pandas` datatype to `DataType`.

        Observe that if no mapping is found, the default is `DataType.STRING`.

        Parameters:
            dtype: pandas data type.

        Returns:
            `DataType` mapped.
    """
    if "int" in dtype:
        return DataType.INTEGER
    elif "float" in dtype:
        return DataType.NUMBER
    elif dtype == "object":
        return DataType.STRING
    elif dtype == "bool":
        return DataType.BOOLEAN
    elif "datetime" in dtype:
        return DataType.DATETIME
    else:
        logger.warning(f"unknown data type: {dtype}")
        return DataType.STRING

to_sqlite `classmethod` ¶

to_sqlite(dtype: DataType) -> str

Function to map a DataType to a SQLite datatype.

Parameters:

Name	Type	Description	Default
`dtype`	`DataType`	`DataType` to be mapped.	required

Returns:

Type	Description
`str`	SQLite data type mapped.

Source code in src/hamana/core/column.py

@classmethod
def to_sqlite(cls, dtype: "DataType") -> str:
    """
        Function to map a `DataType` to a SQLite datatype.

        Parameters:
            dtype: `DataType` to be mapped.

        Returns:
            SQLite data type mapped.
    """
    match dtype:
        case DataType.INTEGER:
            return "INTEGER"
        case DataType.NUMBER:
            return "REAL"
        case DataType.STRING:
            return "TEXT"
        case DataType.BOOLEAN:
            return "INTEGER"
        case DataType.DATETIME:
            return "INTEGER"
        case DataType.DATE:
            return "INTEGER"
        case DataType.CUSTOM:
            return "BLOB"
        case _:
            return ""
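
As a rough illustration of how the two mappings compose, the sketch below re-creates them in a standalone module. It deliberately does not import hamana: the enum and both functions are copies of the listings above (without the logger warning), and `SQLITE_TYPES` and `pandas_to_sqlite` are hypothetical names introduced only for this example.

```python
from enum import Enum


class DataType(Enum):
    """Standalone copy of the members listed above (not imported from hamana)."""
    INTEGER = "integer"
    NUMBER = "number"
    STRING = "string"
    BOOLEAN = "boolean"
    DATETIME = "datetime"
    DATE = "date"
    CUSTOM = "custom"


def from_pandas(dtype: str) -> DataType:
    # Same branch order as the source listing; unknown dtypes fall back to STRING.
    if "int" in dtype:
        return DataType.INTEGER
    if "float" in dtype:
        return DataType.NUMBER
    if dtype == "object":
        return DataType.STRING
    if dtype == "bool":
        return DataType.BOOLEAN
    if "datetime" in dtype:
        return DataType.DATETIME
    return DataType.STRING


# Hypothetical lookup table mirroring the match statement in to_sqlite.
SQLITE_TYPES = {
    DataType.INTEGER: "INTEGER",
    DataType.NUMBER: "REAL",
    DataType.STRING: "TEXT",
    DataType.BOOLEAN: "INTEGER",
    DataType.DATETIME: "INTEGER",
    DataType.DATE: "INTEGER",
    DataType.CUSTOM: "BLOB",
}


def to_sqlite(dtype: DataType) -> str:
    return SQLITE_TYPES.get(dtype, "")


def pandas_to_sqlite(dtype: str) -> str:
    """Hypothetical helper: pandas dtype name -> DataType -> SQLite type."""
    return to_sqlite(from_pandas(dtype))
```

Note that booleans, datetimes, and dates all end up as SQLite INTEGER, so the DataType member (not the SQLite type) is what preserves the distinction between them.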

Parser¶

The Column class can also carry a parser attribute. When present, it is an instance of the ColumnParser class, which is used to convert the data from the source to the internal representation.

The ColumnParser class is composed of two methods:

pandas: this method must respect the PandasParser protocol, and it is specifically used to convert pandas.Series input data.
polars: currently not supported, but it will be used to convert polars.Series input data.

By default, the Column class does not provide any parser, but the NumberColumn, IntegerColumn, StringColumn, BooleanColumn, DatetimeColumn, and DateColumn classes provide a default implementation of the ColumnParser class.

hamana.core.column.ColumnParser `dataclass` ¶

ColumnParser(
    pandas: PandasParser, polars: Callable | None = None
)

Class representing a parser for a column in the hamana library.

Since the library is designed to be used with pandas and polars, the ColumnParser class provides methods that could be used to parse data coming from these libraries.

hamana.core.column.PandasParser ¶

Bases: Protocol

Protocol representing a parser for pandas series.

A pandas parser is a function that takes at least a pandas series as input and returns a pandas series as output after applying dedicated transformations.

Structure:

def parser(series: pandas.Series, *args: Any, **kwargs: Any) -> pandas.Series:
    ...

Identifier¶

There are many situations where it is necessary to identify the column datatype (string, number, date, etc.), e.g. when the data is extracted from file sources like CSV files. To solve this problem, hamana provides the ColumnIdentifier class, which identifies the column type from the input data.

Similarly to the ColumnParser class, the ColumnIdentifier class is composed of two methods:

pandas: this method must respect the protocol PandasIdentifier, and it is specifically used to identify the column type from a pandas.Series input data.
polars: currently not supported, but it will be used to identify the column type from a polars.Series input data.

hamana.core.identifier.ColumnIdentifier `dataclass` ¶

ColumnIdentifier(
    pandas: PandasIdentifier[TColumn],
    polars: Callable | None = None,
)

Bases: Generic[TColumn]

Class representing an identifier for a column in the hamana library.

Since the library is designed to be used with pandas and polars, the ColumnIdentifier class provides methods that could be used to identify the column from a set of data from both libraries.

Note

Observe that the identification process tries to infer the column type based on the data provided. The process is not perfect and could lead to wrong inferences. The user should always check the inferred column type and adjust it if needed.

is_empty `staticmethod` ¶

is_empty(
    series: PandasSeries, raise_error: bool = False
) -> bool

Check if the series is empty.

Parameters:

Name	Type	Description	Default
`series`	`PandasSeries`	the series to check.	required
`raise_error`	`bool`	if True, raise an error if the series is empty.	`False`

Returns:

Type	Description
`bool`	True if the series is empty, False otherwise.

Source code in src/hamana/core/identifier.py

@staticmethod
def is_empty(series: PandasSeries, raise_error: bool = False) -> bool:
    """
        Check if the series is empty.

        Parameters:
            series: the series to check.
            raise_error: if True, raise an error if the series is empty.

        Returns:
            True if the series is empty, False otherwise.
    """
    logger.debug("start")

    is_empty = series.empty
    if is_empty and raise_error:
        raise ColumnIdentifierEmptySeriesError("empty series")

    logger.debug("end")
    return is_empty

__call__ ¶

__call__(
    series: Any,
    column_name: str,
    order: int | None = None,
    *args: Any,
    **kwargs: Any
) -> TColumn | None

Identifies the column type from a given series.

Parameters:

Name	Type	Description	Default
`series`	`Any`	the series to identify the column type from.	required
`column_name`	`str`	the name of the column to identify.	required
`*args`	`Any`	additional arguments to pass to the identifier.	`()`
`**kwargs`	`Any`	additional keyword arguments to pass to the identifier.	`{}`

Returns:

Type	Description
`TColumn \| None`	the identified column type or `None` if the column type could not be identified.

Source code in src/hamana/core/identifier.py

def __call__(self, series: Any, column_name: str, order: int | None = None, *args: Any, **kwargs: Any) -> TColumn | None:
    """
        Identifies the column type from a given series.

        Parameters:
            series: the series to identify the column type from.
            column_name: the name of the column to identify.
            *args: additional arguments to pass to the identifier.
            **kwargs: additional keyword arguments to pass to the identifier.

        Returns:
            the identified column type or `None` if the column type
                could not be identified.
    """
    logger.debug("start")

    _series = None

    # pandas series
    if isinstance(series, PandasSeries):
        try:
            logging.debug("Identifying column type using pandas identifier.")
            _series = self.pandas(series, column_name, order, *args, **kwargs)
        except ColumnDateFormatterError as e:
            logger.error("Column date formatter error.")
            logger.exception(e)
            raise e
        except Exception as e:
            logger.info("pandas identifier failed.")
            logger.exception(e)

    logger.debug("end")
    return _series

infer `staticmethod` ¶

infer(
    series: Any,
    column_name: str,
    order: int | None = None,
    *args: Any,
    **kwargs: Any
) -> (
    NumberColumn
    | IntegerColumn
    | StringColumn
    | BooleanColumn
    | DatetimeColumn
    | DateColumn
)

Infers the column type from a given series. The function passes the series to the default hamana identifiers in the following order and returns the first column that is identified:

DateColumn
DatetimeColumn
BooleanColumn
IntegerColumn
NumberColumn
StringColumn

Note

If the column is empty, the function assigns the STRING datatype by default.

Parameters:

Name	Type	Description	Default
`series`	`Any`	the series to infer the column type from.	required
`*args`	`Any`	additional arguments to pass to the identifier.	`()`
`**kwargs`	`Any`	additional keyword arguments to pass to the identifier.	`{}`

Returns:

Type	Description
`NumberColumn \| IntegerColumn \| StringColumn \| BooleanColumn \| DatetimeColumn \| DateColumn`	the inferred column type.

Raises:

Type	Description
`ColumnIdentifierError`	if no column type could be inferred.

Source code in src/hamana/core/identifier.py

@staticmethod
def infer(series: Any, column_name: str, order: int | None = None, *args: Any, **kwargs: Any) -> NumberColumn | IntegerColumn | StringColumn | BooleanColumn | DatetimeColumn | DateColumn:
    """
        Infers the column type from a given series. The function passes 
        the series to the default `hamana` identifiers in the following
        order:

        - [`DatetimeColumn`][hamana.core.column.DatetimeColumn]
        - [`BooleanColumn`][hamana.core.column.BooleanColumn]
        - [`IntegerColumn`][hamana.core.column.IntegerColumn]
        - [`NumberColumn`][hamana.core.column.NumberColumn]
        - [`StringColumn`][hamana.core.column.StringColumn]

        in order to infer the column type.

        Note:
            If the column is empty, then by default the 
            function assign the `STRING` datatype.

        Parameters:
            series: the series to infer the column type from.
            *args: additional arguments to pass to the identifier.
            **kwargs: additional keyword arguments to pass to the identifier.

        Returns:
            the inferred column type.

        Raises:
            ColumnIdentifierError: if no column type could be inferred.
    """
    logger.debug("start")

    try:
        # infer date column
        inferred_column = date_identifier(series, column_name, order, *args, **kwargs)
        if inferred_column is not None:
            logger.info(f"date column inferred, format: {inferred_column.format}")
            return inferred_column

        # infer datetime column
        inferred_column = datetime_identifier(series, column_name, order, *args, **kwargs)
        if inferred_column is not None:
            logger.info(f"datetime column inferred, format: {inferred_column.format}")
            return inferred_column

        # infer boolean column
        inferred_column = boolean_identifier(series, column_name, order, *args, **kwargs)
        if inferred_column is not None:
            logger.info(f"boolean column inferred, true value: {inferred_column.true_value}, false value: {inferred_column.false_value}")
            return inferred_column

        # infer integer column
        inferred_column = integer_identifier(series, column_name, order, *args, **kwargs)
        if inferred_column is not None:
            logger.info(f"integer column inferred, decimal separator: {inferred_column.decimal_separator}, thousands separator: {inferred_column.thousands_separator}")
            return inferred_column

        # infer number column
        inferred_column = number_identifier(series, column_name, order, *args, **kwargs)
        if inferred_column is not None:
            logger.info(f"number column inferred, decimal separator: {inferred_column.decimal_separator}, thousands separator: {inferred_column.thousands_separator}")
            return inferred_column

        # infer string column
        inferred_column = string_identifier(series, column_name, order, *args, **kwargs)
        if inferred_column is not None:
            logger.info("string column inferred")
            return inferred_column
    except ColumnIdentifierEmptySeriesError:
        logger.warning(f"column '{column_name}' empty, assigned STRING datatype.")
        return StringColumn(name = column_name, order = order)

    raise ColumnIdentifierError("no column inferred")
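
The control flow of infer (try each identifier in order, return the first non-None result, fall back to a string column when the series is empty) can be sketched without pandas or hamana. Everything below is a hypothetical stand-in: the exception classes, the toy identifiers, and the use of plain strings instead of column objects are assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence


class EmptySeriesError(Exception):
    """Stand-in for ColumnIdentifierEmptySeriesError."""


class NoColumnInferredError(Exception):
    """Stand-in for ColumnIdentifierError."""


Identifier = Callable[[Sequence[Optional[str]]], Optional[str]]


def _non_empty(values: Sequence[Optional[str]]) -> list:
    # Drop nulls and empty strings; an empty result is signalled with an exception.
    cleaned = [v for v in values if v not in ("", None)]
    if not cleaned:
        raise EmptySeriesError("empty series")
    return cleaned


def integer_like(values: Sequence[Optional[str]]) -> Optional[str]:
    cleaned = _non_empty(values)
    return "integer" if all(v.lstrip("+-").isdigit() for v in cleaned) else None


def string_like(values: Sequence[Optional[str]]) -> Optional[str]:
    _non_empty(values)
    return "string"


def infer(values: Sequence[Optional[str]], identifiers: Sequence[Identifier]) -> str:
    try:
        for identify in identifiers:
            result = identify(values)
            if result is not None:
                return result  # first identifier that succeeds wins
    except EmptySeriesError:
        return "string"  # empty columns default to the string type
    raise NoColumnInferredError("no column inferred")
```

Because the first match wins, the order of the identifiers matters: a stricter identifier has to run before a more permissive one, which is why the string identifier comes last in the real function.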

hamana.core.identifier.PandasIdentifier ¶

Bases: Protocol[TColumn]

Protocol representing an identifier for pandas series.

A PandasIdentifier is a callable that must have at least the following input parameters:

series: the pandas series to identify the column type from.
column_name: the name of the column to identify.

The PandasIdentifier must return a column type or None if the column type could not be identified.

Structure

def __call__(self, series: PandasSeries, column_name: str, order: int | None = None, *args: Any, **kwargs: Any) -> TColumn | None:
    ...

Default Identifiers¶

hamana provides a set of default identifiers that can be used to identify hamana's default column types.

Number Identifier¶

hamana.core.identifier.number_identifier `module-attribute` ¶

number_identifier = ColumnIdentifier[NumberColumn](
    pandas=_default_numeric_pandas
)

Default identifier for the NumberColumn class.

More details on the default methods can be found in the corresponding functions' documentation.

pandas: _default_numeric_pandas
polars: None (not implemented)

hamana.core.identifier._default_numeric_pandas ¶

_default_numeric_pandas(
    series: PandasSeries,
    column_name: str,
    order: int | None = None,
) -> NumberColumn | None

This function defines the default behavior to identify a number column from a pandas series.

In order to identify a number column, the function follows the steps:

Drop null values (including empty strings).
Check whether the column contains letters (any character other than digits, signs, dot, comma, and the exponent marker e/E); if so, return None.
Count the maximum number of comma and dot separators across all the elements.
Evaluate the default configuration first (dot decimal separator, comma thousands separator).
If the default configuration does not work, evaluate the alternative configuration (comma decimal separator, dot thousands separator).
If this configuration does not work either, return None.
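
The steps above can be sketched with the standard re module on plain Python strings. The two regular expressions are taken from the source listing of this function; the function name `detect_separators`, the tuple return value, and the choice to return None for an empty input (the library raises an error instead) are assumptions made for this example.

```python
import re

# Patterns copied from the source listing of _default_numeric_pandas.
DOT_DECIMAL = re.compile(r"^[+-]?(\d+(\,\d{3})*|\d{1,2})(\.\d+)?([eE][+-]?\d+)?$")
COMMA_DECIMAL = re.compile(r"^[+-]?(\d+(\.\d{3})*|\d{1,2})(\,\d+)?([eE][+-]?\d+)?$")


def detect_separators(values):
    """Return (decimal_separator, thousands_separator), or None if not numeric."""
    # Step 1: drop nulls and empty strings.
    cleaned = [str(v) for v in values if v not in ("", None)]
    if not cleaned:
        return None  # the library raises here; this sketch simply gives up

    # Step 2: anything left after stripping numeric characters counts as a letter.
    if any(re.sub(r"[0-9.\-+,eE]", "", v) for v in cleaned):
        return None

    # Step 3: maximum number of each separator in a single element.
    max_dots = max(v.count(".") for v in cleaned)
    max_commas = max(v.count(",") for v in cleaned)

    # Step 4: default configuration (dot decimal, comma thousands).
    if max_dots in (0, 1) and all(DOT_DECIMAL.match(v) for v in cleaned):
        return (".", ",")

    # Step 5: alternative configuration (comma decimal, dot thousands).
    if max_commas in (0, 1) and all(COMMA_DECIMAL.match(v) for v in cleaned):
        return (",", ".")

    # Step 6: neither configuration fits.
    return None
```

Since the default configuration is evaluated first, values without any separator (such as "1234") are reported with a dot decimal separator.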

Parameters:

Name	Type	Description	Default
`series`	`PandasSeries`	`pandas` series to be checked.	required
`column_name`	`str`	name of the column to be checked.	required

Returns:

Type	Description
`NumberColumn \| None`	`NumberColumn` if the column is a number column, `None` otherwise.

Source code in src/hamana/core/identifier.py

def _default_numeric_pandas(series: PandasSeries, column_name: str, order: int | None = None) -> NumberColumn | None:
    """
        This function defines the default behavior to identify a number column from a `pandas` series.

        In order to identify a number column, the function follows the steps:

        - Drop null values (included empty strings)
        - Check if the column has letters
        - Count the max appearance of the comma and dot 
            separators in all the elements.
        - Evaluate first the default configuration (dot decimal separator, 
            comma thousands separator).
        - If the default configuration does not work, evaluate the 
            alternative configuration (comma decimal separator, dot 
            thousands separator).
        - If also this configuration does not work, return None.

        Parameters:
            series: `pandas` series to be checked.
            column_name: name of the column to be checked.

        Returns:
            `NumberColumn` if the column is a number column, `None` otherwise.
    """
    logger.debug("start")
    column = None

    # drop null values
    _series = series.replace("", None).dropna().astype("str")
    logger.debug(f"dropped null values: {len(series) - len(_series)}")
    ColumnIdentifier.is_empty(_series, raise_error = True)

    # check letters presence
    logger.debug("check letters")
    if _series.str.replace(r"[0-9\.\-\+\,eE]", "", regex = True).str.len().sum() > 0:
        logger.warning("letters found, no number column")
        return None

    # check separators
    comma_separator_count = _series.str.replace(r"[0-9\.\-\+eE]", "", regex = True).str.len().max()
    logger.debug(f"comma separator count: {comma_separator_count}")

    dot_separator_count   = _series.str.replace(r"[0-9\-\+\,eE]", "", regex = True).str.len().max()
    logger.debug(f"dot separator count: {dot_separator_count}")

    if (
            dot_separator_count in [0, 1]
        and _series.str.match(r"^[+-]?(\d+(\,\d{3})*|\d{1,2})(\.\d+)?([eE][+-]?\d+)?$").all()
    ):
        logger.info("possible number column: dot decimal separator, comma thousands separator")
        column = NumberColumn(name = column_name, decimal_separator = ".", thousands_separator = ",", order = order)
        column.inferred = True
    elif (
            comma_separator_count in [0, 1]
        and _series.str.match(r"^[+-]?(\d+(\.\d{3})*|\d{1,2})(\,\d+)?([eE][+-]?\d+)?$").all()
    ):
        logger.info("possible number column: comma decimal separator, dot thousands separator")
        column = NumberColumn(name = column_name, decimal_separator = ",", thousands_separator = ".", order = order)
        column.inferred = True
    else:
        logger.warning("no separator found")

    logger.debug("end")
    return column

Integer Identifier¶

hamana.core.identifier.integer_identifier `module-attribute` ¶

integer_identifier = ColumnIdentifier[IntegerColumn](
    pandas=_default_integer_pandas
)

Default identifier for the IntegerColumn class.

More details on the default methods can be found in the corresponding functions' documentation.

pandas: _default_integer_pandas
polars: None (not implemented)

hamana.core.identifier._default_integer_pandas ¶

_default_integer_pandas(
    series: PandasSeries,
    column_name: str,
    order: int | None = None,
) -> IntegerColumn | None

This function defines the default behavior to identify an integer column from a pandas series.

In order to identify an integer column, the function follows the steps:

Drop null values (including empty strings).
Check whether the column can be considered a number datatype.
If the check passes, check whether the column is composed only of integers (including the sign).
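
The last step can be illustrated on plain strings. The sketch below assumes the number check has already passed and that the thousands separator is known; it follows the source listing in stripping a trailing ".0" and in the shape of the integer pattern, while the function name `looks_like_integers` is hypothetical.

```python
import re


def looks_like_integers(values, thousands_separator=","):
    """Check that every non-null value is an optionally signed, grouped integer."""
    # Integer pattern built the same way as in the source listing.
    separator = re.escape(thousands_separator)
    pattern = re.compile(rf"^[+-]?(\d+({separator}\d{{3}})*|\d{{1,2}})$")

    # Drop nulls and empty strings, then strip a trailing ".0" (e.g. "7.0" -> "7").
    cleaned = [re.sub(r"\.0$", "", str(v)) for v in values if v not in ("", None)]

    return bool(cleaned) and all(pattern.match(v) for v in cleaned)
```

For example, `looks_like_integers(["1,000", "-25", "7.0"])` is true, while a column containing "1.5" is rejected when the thousands separator is a comma.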

Parameters:

Name	Type	Description	Default
`series`	`PandasSeries`	`pandas` series to be checked.	required
`column_name`	`str`	name of the column to be checked.	required

Returns:

Type	Description
`IntegerColumn \| None`	`IntegerColumn` if the column is an integer column, `None` otherwise.

Source code in src/hamana/core/identifier.py

def _default_integer_pandas(series: PandasSeries, column_name: str, order: int | None = None) -> IntegerColumn | None:
    """
        This function defines the default behavior to identify an integer column from a `pandas` series.

        In order to identify an integer column, the function follows the steps:

        - Drop null values (included empty strings)
        - Check if the column can be considered as number datatype
        - If the check is passed, then is checked if the column is 
            composed only by integers (included the sign).

        Parameters:
            series: `pandas` series to be checked.
            column_name: name of the column to be checked.

        Returns:
            `IntegerColumn` if the column is an integer column, `None` otherwise.
    """
    logger.debug("start")
    column = None

    # drop null values
    _series = series.replace("", None).dropna().astype("str")
    logger.debug(f"dropped null values: {len(series) - len(_series)}")

    # check number column
    inferred_column = number_identifier.pandas(series, column_name)
    if inferred_column is None:
        logger.warning("no number column found")
        return None
    logger.debug("number column inferred")

    # adjust series
    _series = _series.str.replace(r"\.0$", "", regex = True)

    # check separators
    comma_separator_count = _series.str.replace(r"[0-9\.\-\+eE]", "", regex = True).str.len().max()
    logger.debug(f"comma separator count: {comma_separator_count}")

    dot_separator_count   = _series.str.replace(r"[0-9\-\+\,eE]", "", regex = True).str.len().max()
    logger.debug(f"dot separator count: {dot_separator_count}")

    # infer thousands separator
    thousands_separator = inferred_column.thousands_separator
    decimal_separator = inferred_column.decimal_separator
    if comma_separator_count >= 0 and dot_separator_count == 0:
        thousands_separator = ","
        decimal_separator   = "."
    elif dot_separator_count >= 0 and comma_separator_count == 0:
        thousands_separator = "."
        decimal_separator   = ","

    int_regex = rf"^[+-]?(\d+(\{thousands_separator}" + r"\d{3})*|\d{1,2})$"
    if thousands_separator is not None and _series.str.match(int_regex).all():
        logger.info("integer column found")
        column = IntegerColumn(name = inferred_column.name, decimal_separator = decimal_separator, thousands_separator = thousands_separator, order = order)
        column.inferred = True
    else:
        logger.warning("no integer column found")

    logger.debug("end")
    return column

String Identifier¶

hamana.core.identifier.string_identifier `module-attribute` ¶

string_identifier = ColumnIdentifier[StringColumn](
    pandas=_default_string_pandas
)

Default identifier for the StringColumn class.

More details on the default methods can be found in the corresponding functions' documentation.

pandas: _default_string_pandas
polars: None (not implemented)

hamana.core.identifier._default_string_pandas ¶

_default_string_pandas(
    series: PandasSeries,
    column_name: str,
    order: int | None = None,
) -> StringColumn | None

Function to identify a string column from a pandas series.

The function checks if the column is a string column by converting the column to string type and checking if at least one value can be considered as string.
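
A minimal sketch of this check on plain strings, using the pattern from the source listing of this function. The name `any_string_like` is hypothetical, and returning False for an empty input is a simplification (the library raises an error in that case).

```python
import re

# Pattern copied from the source listing of _default_string_pandas.
STRING_PATTERN = re.compile(r"^[A-Za-z\d\W]+$")


def any_string_like(values):
    """True if at least one non-null value matches the string pattern."""
    cleaned = [str(v) for v in values if v not in ("", None)]
    return any(STRING_PATTERN.match(v) for v in cleaned)
```

The character class covers ASCII letters, digits, and non-word characters, so it does not include the underscore: a value such as "a_b" does not match on its own, although the column is still accepted as soon as any other value matches.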

Parameters:

Name	Type	Description	Default
`series`	`PandasSeries`	`pandas` series to be checked.	required
`column_name`	`str`	name of the column to be checked.	required

Returns:

Type	Description
`StringColumn \| None`	`StringColumn` if the column is a string column, `None` otherwise.

Source code in src/hamana/core/identifier.py

def _default_string_pandas(series: PandasSeries, column_name: str, order: int | None = None) -> StringColumn | None:
    """
        Function to identify a string column from a `pandas` series.

        The function checks if the column is a string column by 
        converting the column to string type and checking if at least 
        one value can be considered as string.

        Parameters:
            series: `pandas` series to be checked.
            column_name: name of the column to be checked.

        Returns:
            `StringColumn` if the column is a string column, `None` otherwise.
    """
    logger.debug("start")
    column = None

    # drop null values
    _series = series.replace("", None).dropna()
    logger.debug(f"dropped null values: {len(series) - len(_series)}")
    ColumnIdentifier.is_empty(_series, raise_error = True)

    # check values
    logger.debug("check values")
    if _series.astype("str").str.match(r"^[A-Za-z\d\W]+$").any():
        logger.info("string column found")
        column = StringColumn(name = column_name, order = order)
        column.inferred = True
    else:
        logger.warning("no string column found")

    logger.debug("end")
    return column

Boolean Identifier¶

hamana.core.identifier.boolean_identifier `module-attribute` ¶

boolean_identifier = ColumnIdentifier[BooleanColumn](
    pandas=_default_boolean_pandas
)

Default identifier for the BooleanColumn class.

More details on the default methods can be found in the corresponding functions' documentation.

pandas: _default_boolean_pandas
polars: None (not implemented)

hamana.core.identifier._default_boolean_pandas ¶

_default_boolean_pandas(
    series: PandasSeries,
    column_name: str,
    order: int | None = None,
    min_count: int = 1000,
) -> BooleanColumn | None

This function defines the default behavior to identify a boolean column from a pandas series.

To identify a boolean column, the function checks that the column has exactly two unique non-null values and more than min_count non-null elements. It does not check whether the values are actually boolean, so the assignment of the True and False values is arbitrary: the first unique value becomes true_value and the second becomes false_value.
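
The rule is small enough to sketch on a plain list. The function name `two_valued` and the tuple return value are hypothetical; the conditions (exactly two distinct values, more than min_count non-null elements) follow the source listing.

```python
def two_valued(values, min_count=1000):
    """Return (true_value, false_value) if the column looks boolean, else None."""
    # Drop nulls and empty strings.
    cleaned = [v for v in values if v not in ("", None)]

    # Distinct values in order of first appearance.
    distinct = list(dict.fromkeys(cleaned))

    if len(distinct) == 2 and len(cleaned) > min_count:
        return distinct[0], distinct[1]
    return None
```

With the default min_count, a column of 20 "Y"/"N" values is not reported as boolean, which guards against small samples that happen to contain only two values. The same rule also means that any large two-valued column (for instance "A"/"B" category codes) is reported as boolean, so the result is worth reviewing.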

Parameters:

Name	Type	Description	Default
`series`	`PandasSeries`	`pandas` series to be checked.	required
`column_name`	`str`	name of the column to be checked.	required
`min_count`	`int`	minimum number of elements to consider the column as a boolean column. This parameter is used to avoid wrong inferences when the column has only a few elements.	`1000`

Returns:

Type	Description
`BooleanColumn \| None`	`BooleanColumn` if the column is a boolean column, `None` otherwise.

Source code in src/hamana/core/identifier.py

def _default_boolean_pandas(series: PandasSeries, column_name: str, order: int | None = None, min_count: int = 1_000) -> BooleanColumn | None:
    """
        This function defines the default behavior to identify a boolean column from a `pandas` series.

        To identify a boolean column, the function checks if the column has only two unique values.
        Observe, that the function does not check if the values are boolean values, but only if the
        column has two unique values; for this reason the assignment of the `True` and `False` values
        is arbitrary.

        Parameters:
            series: `pandas` series to be checked.
            column_name: name of the column to be checked.
            min_count: minimum number of elements to consider the column as a boolean column.
                This parameter is used to avoid wrong inferences when the column has only a few elements.

        Returns:
            `BooleanColumn` if the column is a boolean column, `None` otherwise.
    """
    logger.debug("start")
    column = None

    # drop null values
    _series = series.replace("", None).dropna()
    logger.debug(f"dropped null values: {len(series) - len(_series)}")
    ColumnIdentifier.is_empty(_series, raise_error = True)

    # check values
    logger.debug("check values")
    count_disinct = _series.nunique()
    if count_disinct == 2 and len(_series) > min_count:
        values = _series.unique()
        logger.info(f"boolean column found, unique values: {values}")
        column = BooleanColumn(name = column_name, true_value = values[0], false_value = values[1], order = order)
        column.inferred = True
    else:
        logger.warning(f"no boolean column, unique values: {count_disinct}")

    logger.debug("end")
    return column

Datetime Identifier¶

hamana.core.identifier.datetime_identifier `module-attribute` ¶

datetime_identifier = ColumnIdentifier[DatetimeColumn](
    pandas=_default_datetime_pandas
)

Default identifier for the DatetimeColumn class.

More details on the default methods can be found in the corresponding functions' documentation.

pandas: _default_datetime_pandas
polars: None (not implemented)

hamana.core.identifier._default_datetime_pandas ¶

_default_datetime_pandas(
    series: PandasSeries,
    column_name: str,
    order: int | None = None,
    format: str | None = None,
) -> DatetimeColumn | None

This function defines the default behavior to identify a datetime column from a pandas series.

To identify this type of column, the function first removes the null values, then tries to apply pandas.to_datetime with a list of the most common datetime formats. If no format matches, the function tries to apply pandas.to_datetime without providing any format. Since this fallback could lead to wrong inferences, the function considers the column a datetime column only if all the values are converted correctly.

Default Formats:

YYYY-MM-DD HH:mm:ss
YYYY-MM-DD HH:mm
YYYY-MM-DD
YYYY/MM/DD HH:mm:ss
YYYY/MM/DD HH:mm
YYYY/MM/DD
YYYYMMDD HH:mm:ss
YYYYMMDD HH:mm
YYYYMMDD
YYYYMMDDHHmmss
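
The format search can be approximated with the standard library. The sketch below uses datetime.strptime in place of pandas.to_datetime, so edge cases may be handled differently from the library; the format strings are the ones in the source listing of this function, and `first_matching_format` is a hypothetical name.

```python
from datetime import datetime

# Format strings copied from the source listing of _default_datetime_pandas.
DEFAULT_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d %H:%M",
    "%Y-%m-%d",
    "%Y/%m/%d %H:%M:%S",
    "%Y/%m/%d %H:%M",
    "%Y/%m/%d",
    "%Y%m%d %H:%M:%S",
    "%Y%m%d %H:%M",
    "%Y%m%d",
    "%Y%m%d%H%M%S",
]


def first_matching_format(values, formats=DEFAULT_FORMATS):
    """Return the first format that parses every non-null value, else None."""
    cleaned = [str(v) for v in values if v not in ("", None)]
    if not cleaned:
        return None  # the library raises here; this sketch simply gives up

    for fmt in formats:
        try:
            for value in cleaned:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # at least one value does not fit this format
        return fmt
    return None
```

A format is accepted only when every value parses with it, so a column that mixes "2024-01-31" with "2024-01-31 10:00" matches none of the formats in this sketch.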

Parameters:

Name	Type	Description	Default
`series`	`PandasSeries`	`pandas` series to be checked.	required
`column_name`	`str`	name of the column to be checked.	required
`format`	`str \| None`	datetime format used to try to convert the series. If the format is provided, then the default formats are not used.	`None`

Returns:

Type	Description
`DatetimeColumn \| None`	`DatetimeColumn` if the column is a datetime column, `None` otherwise.

Source code in src/hamana/core/identifier.py

def _default_datetime_pandas(series: PandasSeries, column_name: str, order: int | None = None, format: str | None = None) -> DatetimeColumn | None:
    """
        This function defines the default behavior to identify a datetime column from a `pandas` series.

        To identify this type of column, the function removes first the 
        null values, then tries to apply `pandas.to_datetime` with a list 
        of the most common datetime formats. If the column is **not** 
        identified, the function tries to apply `pandas.to_datetime` 
        without providing any format. Since this last operation could 
        lead to wrong inferences, the function considers the column as
        a datetime column only if all the values are converted correctly.

        Default Formats:

        - `YYYY-MM-DD HH:mm:ss`
        - `YYYY-MM-DD HH:mm`
        - `YYYY-MM-DD`
        - `YYYY/MM/DD HH:mm:ss`
        - `YYYY/MM/DD HH:mm`
        - `YYYY/MM/DD`
        - `YYYYMMDD HH:mm:ss`
        - `YYYYMMDD HH:mm`
        - `YYYYMMDD`

        Parameters:
            series: `pandas` series to be checked.
            column_name: name of the column to be checked.
            format: datetime format used to try to convert the series.
                If the format is provided, then the default formats are 
                not used.

        Returns:
            `DatetimeColumn` if the column is a datetime column, `None` otherwise.
    """
    logger.debug("start")
    column = None

    # drop null values
    _series = series.replace("", None).dropna().astype("str")
    logger.debug(f"dropped null values: {len(series) - len(_series)}")
    ColumnIdentifier.is_empty(_series, raise_error = True)

    # set format
    format_list: list[str]
    if format is None:
        logger.debug("no format provided, check most common formats")
        format_list = [
            "%Y-%m-%d %H:%M:%S",
            "%Y-%m-%d %H:%M",
            "%Y-%m-%d",
            "%Y/%m/%d %H:%M:%S",
            "%Y/%m/%d %H:%M",
            "%Y/%m/%d",
            "%Y%m%d %H:%M:%S",
            "%Y%m%d %H:%M",
            "%Y%m%d",
            "%Y%m%d%H%M%S"
        ]
    else:
        format_list = [format]

    # check formats
    logger.debug("check datetime formats")
    for _format in format_list:
        try:
            if pd.to_datetime(_series, errors = "coerce", format = _format).notnull().all():
                logger.info(f"format '{_format}' used, datetime column found")
                column = DatetimeColumn(name = column_name, format = _format, order = order)
                column.inferred = True
                return column
        except Exception:
            logger.warning(f"format '{_format}' not recognized")

    logger.warning("no datetime column found")
    logger.debug("end")
    return None

Date Identifier¶

hamana.core.identifier.date_identifier `module-attribute` ¶

date_identifier = ColumnIdentifier[DatetimeColumn](
    pandas=_default_date_pandas
)

Default identifier for the DateColumn class.

More details on the default methods can be found in the corresponding functions' documentation.

pandas: _default_date_pandas
polars: None (not implemented)

hamana.core.identifier._default_date_pandas ¶

_default_date_pandas(
    series: PandasSeries,
    column_name: str,
    order: int | None = None,
    format: str | None = None,
) -> DateColumn | None

This function defines the default behavior to identify a date column from a pandas series.

The function relies on _default_datetime_pandas, the default pandas identifier of DatetimeColumn, to identify the column. However, it considers only formats that do not contain time information.

Default Formats:

YYYY-MM-DD
YYYY/MM/DD
YYYYMMDD

Parameters:

Name	Type	Description	Default
`series`	`PandasSeries`	`pandas` series to be checked.	required
`column_name`	`str`	name of the column to be checked.	required
`order`	`int \| None`	numerical order of the column.	`None`
`format`	`str \| None`	date format used to try to convert the series. If the format is provided, then the default formats are not used. Observe that the format must not contain time information.	`None`

Returns:

Type	Description
`DateColumn \| None`	`DateColumn` if the column is a date column, `None` otherwise.

Raises:

Type	Description
`ColumnDateFormatterError`	if the format is not valid.

Source code in src/hamana/core/identifier.py

def _default_date_pandas(series: PandasSeries, column_name: str, order: int | None = None, format: str | None = None) -> DateColumn | None:
    """
        This function defines the default behavior to identify a date column from a `pandas` series.

        The function relies on `_default_datetime_pandas`, the default pandas 
        identifier of `DatetimeColumn`, to identify the column. However, it 
        considers only formats that do not contain time information.

        Default Formats:

        - `YYYY-MM-DD`
        - `YYYY/MM/DD`
        - `YYYYMMDD`

        Parameters:
            series: `pandas` series to be checked.
            column_name: name of the column to be checked.
            format: date format used to try to convert the series.
                If the format is provided, then the default formats are 
                not used. Observe that the format must not contain time
                information.

        Returns:
            `DateColumn` if the column is a date column, `None` otherwise.

        Raises:
            ColumnDateFormatterError: if the format is not valid.
    """
    logger.debug("start")
    column = None

    # check valid format
    if format is not None:
        logger.debug("check valid format")
        DateColumn.check_format(format)

    # drop null values
    _series = series.replace("", None).dropna().astype("str")
    logger.debug(f"dropped null values: {len(series) - len(_series)}")
    ColumnIdentifier.is_empty(_series, raise_error = True)

    # set format
    format_list: list[str]
    if format is None:
        logger.debug("no format provided, check most common formats")
        format_list = [
            "%Y-%m-%d",
            "%Y/%m/%d",
            "%Y%m%d"
        ]
    else:
        format_list = [format]

    # check formats
    logger.debug("check datetime formats")
    for _format in format_list:
        if _default_datetime_pandas(_series, column_name, order, _format) is not None:
            logger.info(f"format '{_format}' used, date column found")
            column = DateColumn(name = column_name, format = _format, order = order)
            column.inferred = True

    logger.debug("end")
    return column

API¶

hamana.core.column.Column `dataclass` ¶

Column(
    name: str,
    dtype: DataType,
    parser: ColumnParser | None = None,
    order: int | None = None,
    inferred: bool = False,
)

Class representing a column in the hamana library.

A column is defined by the following attributes:

name: name of the column (required).
dtype: represents the datatype and should be an instance of DataType (required).
parser: optional parser object associated with the column, used to parse lists of values; e.g. useful when data are extracted from different data sources and must be cast and normalized.

name `instance-attribute` ¶

name: str

Name of the column.

dtype `instance-attribute` ¶

dtype: DataType

Data type of the column.

parser `class-attribute` `instance-attribute` ¶

parser: ColumnParser | None = None

Parser object for the column.

order `class-attribute` `instance-attribute` ¶

order: int | None = None

Numerical order of the column.

inferred `class-attribute` `instance-attribute` ¶

inferred: bool = False

Flag to indicate if the column was inferred.

hamana.core.column.NumberColumn ¶

NumberColumn(
    name: str,
    decimal_separator: str = ".",
    thousands_separator: str = ",",
    null_default_value: int | float | None = None,
    parser: ColumnParser | None = None,
    order: int | None = None,
)

Bases: Column

Dedicated class representing DataType.NUMBER columns.

The class provides attributes that could be used to define the properties of the number column, such as:

decimal_separator: the decimal separator used in the number. By default, the decimal separator is set to `.`.
thousands_separator: the thousands separator used in the number. By default, the thousands separator is set to `,`.
null_default_value: the default value to be used when a null value is found. By default, the default value is set to None.

The class also provides a default parser that could be used to parse the number column using pandas.

Source code in src/hamana/core/column.py

def __init__(
    self,
    name: str,
    decimal_separator: str = ".",
    thousands_separator: str = ",",
    null_default_value: int | float | None = None,
    parser: ColumnParser | None = None,
    order: int | None = None
):
    # set the attributes
    self.decimal_separator = decimal_separator
    self.thousands_separator = thousands_separator
    self.null_default_value = null_default_value
    self.parser: ColumnParser # type: ignore

    logger.debug(f"decimal separator: {self.decimal_separator}")
    logger.debug(f"thousands separator: {self.thousands_separator}")
    logger.debug(f"null default value: {self.null_default_value}")

    # set default parser
    if parser is None:
        logger.debug("set default parser")
        parser = ColumnParser(pandas = self.pandas_default_parser)

    # call the parent class constructor
    super().__init__(name, DataType.NUMBER, parser, order)

    return

decimal_separator `instance-attribute` ¶

decimal_separator: str = decimal_separator

Decimal separator used in the number.

thousands_separator `instance-attribute` ¶

thousands_separator: str = thousands_separator

Thousands separator used in the number.

null_default_value `instance-attribute` ¶

null_default_value: int | float | None = null_default_value

Default value to be used when a null value is found.

pandas_default_parser ¶

pandas_default_parser(
    series: PandasSeries,
    mode: PandasParsingModes = PandasParsingModes.RAISE,
) -> PandasSeries

Default pandas parser for the number columns. The function first converts the column to string type, replaces the thousands separator with an empty string and the decimal separator with `.`, and then tries to convert the column to a numeric type using pandas.to_numeric.

If the null_default_value is set, the function fills the null values with the default value.
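As a rough illustration of the normalization steps above, the following sketch applies them to a single value with plain Python. It is not hamana's parser: the function name `parse_number` is hypothetical, `float` stands in for `pandas.to_numeric`, and only `None` is treated as null.

```python
import math


def parse_number(value, decimal_separator=".", thousands_separator=",",
                 null_default_value=None):
    """Normalize one raw value to float, following the steps described above."""
    if value is None:
        # Nulls become the default value when one is set, NaN otherwise.
        return math.nan if null_default_value is None else float(null_default_value)
    # Strip the thousands separator first, then normalize the decimal separator.
    text = str(value).replace(thousands_separator, "")
    text = text.replace(decimal_separator, ".")
    return float(text)
```

The order of the two replacements matters: with `decimal_separator=","` and `thousands_separator="."`, normalizing the decimal separator first would turn it into a `.` that the thousands-separator removal then strips.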

Parameters:

Name	Type	Description	Default
`series`	`PandasSeries`	`pandas` series to be parsed.	required
`mode`	`PandasParsingModes`	mode to be used when parsing the number column. By default, the mode is set to `PandasParsingModes.RAISE`.	`PandasParsingModes.RAISE`

Returns:

Type	Description
`PandasSeries`	`pandas` series parsed.

Raises:

Type	Description
`ColumnParserPandasNumberError`	error parsing the number column.

Source code in src/hamana/core/column.py

def pandas_default_parser(self, series: PandasSeries, mode: PandasParsingModes = PandasParsingModes.RAISE) -> PandasSeries:
    """
        Default `pandas` parser for the number columns. The function 
        converts first the column to string type and replaces the 
        thousands separator with an empty string and the decimal 
        separator with `.`. Then, the function tries to convert the 
        column to a numeric type using the `pandas.to_numeric`.

        If the `null_default_value` is set, the function fills the 
        null values with the default value.

        Parameters:
            series: `pandas` series to be parsed.
            mode: mode to be used when parsing the number column.
                By default, the mode is set to `PandasParsingModes.RAISE`.

        Returns:
            `pandas` series parsed.

        Raises:
            `ColumnParserPandasNumberError`: error parsing the number column.
    """

    _series = pd.Series(np.nan, index = series.index)
    try:
        _series_number = pd.to_numeric(series.dropna().astype("str").str.replace(self.thousands_separator, "").str.replace(self.decimal_separator, "."), errors = mode.value) # type: ignore (pandas issue in typing)
        _series.loc[_series_number.index] = _series_number
    except Exception as e:
        logger.error(f"error parsing number: {e}")
        raise ColumnParserPandasNumberError(f"error parsing number: {e}")

    if self.null_default_value is not None:
        logger.debug(f"fill nulls, default value: {self.null_default_value}")
        _series = _series.fillna(self.null_default_value)
    return _series.astype("float")

hamana.core.column.IntegerColumn ¶

IntegerColumn(
    name: str,
    decimal_separator: str = ".",
    thousands_separator: str = ",",
    null_default_value: int | None = 0,
    parser: ColumnParser | None = None,
    order: int | None = None,
)

Bases: NumberColumn

Class representing DataType.INTEGER columns. It inherits from the NumberColumn class and provides a default parser that can be used to parse integer columns.

Similar to the NumberColumn class, the IntegerColumn class provides attributes that could be used to define the properties of the integer column, such as:

decimal_separator: the decimal separator used in the number. By default, the decimal separator is set to `.`.
thousands_separator: the thousands separator used in the number. By default, the thousands separator is set to `,`.
null_default_value: the default value to be used when a null value is found. By default, the default value is set to 0.

Source code in src/hamana/core/column.py

def __init__(
    self,
    name: str,
    decimal_separator: str = ".",
    thousands_separator: str = ",",
    null_default_value: int | None = 0,
    parser: ColumnParser | None = None,
    order: int | None = None
):

    # call the parent class constructor
    super().__init__(name, decimal_separator, thousands_separator, null_default_value, parser, order)

    # override types
    self.dtype = DataType.INTEGER

pandas_default_parser ¶

pandas_default_parser(
    series: PandasSeries,
    mode: PandasParsingModes = PandasParsingModes.RAISE,
) -> PandasSeries

Default pandas parser for the integer columns. Similar to the NumberColumn class, the function first converts the column to string type, replaces the thousands separator with an empty string and the decimal separator with `.`, and then tries to convert the column to a numeric type using pandas.to_numeric.

If the null_default_value is set, the function fills the null values with the default value, and casts the column to integer type. Otherwise, the function applies the np.floor function to the returned series.
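The two branches described above behave differently for negative non-integer inputs, which a per-value sketch makes visible. This is an assumption-laden illustration rather than the library code: `parse_integer` is a hypothetical name, `int()` stands in for the integer cast (which truncates toward zero), and `math.floor` stands in for `np.floor`.

```python
import math


def parse_integer(value, decimal_separator=".", thousands_separator=",",
                  null_default_value=0):
    """Per-value sketch of the two code paths described above."""
    if value is None:
        # With no default configured, nulls stay null.
        return null_default_value
    text = str(value).replace(thousands_separator, "")
    number = float(text.replace(decimal_separator, "."))
    if null_default_value is not None:
        # Default set: plain integer cast, which truncates toward zero.
        return int(number)
    # No default: floor the value, which rounds toward negative infinity.
    return math.floor(number)
```

For example, `"-1.5"` yields `-1` when a default value is set and `-2` when it is `None`, so the choice of `null_default_value` can affect non-null values as well.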

Parameters:

Name	Type	Description	Default
`series`	`PandasSeries`	`pandas` series to be parsed.	required
`mode`	`PandasParsingModes`	mode to be used when parsing the number column. By default, the mode is set to `PandasParsingModes.RAISE`.	`PandasParsingModes.RAISE`

Returns:

Type	Description
`PandasSeries`	`pandas` series parsed.

Raises:

Type	Description
`ColumnParserPandasNumberError`	error parsing the number column.

Source code in src/hamana/core/column.py

def pandas_default_parser(self, series: PandasSeries, mode: PandasParsingModes = PandasParsingModes.RAISE) -> PandasSeries:
    """
        Default `pandas` parser for the integer columns. Similar 
        to the `NumberColumn` class, the function converts first 
        the column to string type and replaces the thousands separator
        with an empty string and the decimal separator with `.`. 
        Then, the function tries to convert the column to a numeric
        type using the `pandas.to_numeric`.

        If the `null_default_value` is set, the function fills the
        null values with the default value, and casts the column to 
        integer type. Otherwise, the function applies the `np.floor`
        function to the returned series.

        Parameters:
            series: `pandas` series to be parsed.
            mode: mode to be used when parsing the number column.
                By default, the mode is set to `PandasParsingModes.RAISE`.

        Returns:
            `pandas` series parsed.

        Raises:
            `ColumnParserPandasNumberError`: error parsing the number column.
    """

    _series = pd.Series(np.nan, index = series.index)
    try:
        _series_number = pd.to_numeric(
            arg = series.dropna().astype("str").str.replace(self.thousands_separator, "").str.replace(self.decimal_separator, "."),
            errors = mode.value # type: ignore (pandas issue in typing)
        )
        _series.loc[_series_number.index] = _series_number
    except Exception as e:
        logger.error(f"error parsing integer: {e}")
        raise ColumnParserPandasNumberError(f"error parsing integer: {e}")

    if self.null_default_value is not None:
        logger.debug(f"fill nulls, default value: {self.null_default_value}")
        return pd.Series(_series, dtype = "float").fillna(self.null_default_value).astype("int")

    return pd.Series(_series.astype(float).apply(np.floor), dtype = "Int64")

hamana.core.column.StringColumn ¶

StringColumn(
    name: str,
    parser: ColumnParser | None = None,
    order: int | None = None,
)

Bases: Column

Class representing DataType.STRING columns.

Source code in src/hamana/core/column.py

def __init__(
    self,
    name: str,
    parser: ColumnParser | None = None,
    order: int | None = None
):

    self.parser: ColumnParser # type: ignore

    # set default parser
    if parser is None:
        logger.debug("set default parser")
        parser = ColumnParser(pandas = self.pandas_default_parser)

    # call the parent class constructor
    super().__init__(name, DataType.STRING, parser, order)

    return

pandas_default_parser ¶

pandas_default_parser(series: PandasSeries) -> PandasSeries

Default pandas parser for the string columns. The function converts the column to string type and replaces the null values with None.

Parameters:

Name	Type	Description	Default
`series`	`PandasSeries`	`pandas` series to be parsed.	required

Returns:

Type	Description
`PandasSeries`	`pandas` series parsed

Source code in src/hamana/core/column.py

def pandas_default_parser(self, series: PandasSeries) -> PandasSeries:
    """
        Default `pandas` parser for the string columns. The function
        converts the column to string type and replaces the null values
        with `None`.

        Parameters:
            series: `pandas` series to be parsed.

        Returns:
            `pandas` series parsed
    """
    _series_nulls = series.isnull()
    return series.astype("str").where(~_series_nulls, None)

hamana.core.column.BooleanColumn ¶

BooleanColumn(
    name: str,
    true_value: str | int | float = "Y",
    false_value: str | int | float = "N",
    parser: ColumnParser | None = None,
    order: int | None = None,
)

Bases: Column

Class representing DataType.BOOLEAN columns.

The class provides attributes that could be used to define the properties of the boolean column, such as:

true_value: the value to be used to represent the True value. By default, the value is set to Y.
false_value: the value to be used to represent the False value. By default, the value is set to N.

The class also provides a default parser that could be used to parse the boolean column using pandas.

Source code in src/hamana/core/column.py

def __init__(self,
    name: str,
    true_value: str | int | float = "Y",
    false_value: str | int | float = "N",
    parser: ColumnParser | None = None,
    order: int | None = None
) -> None:

    # set attributes
    self.true_value = true_value
    self.false_value = false_value
    self.parser: ColumnParser # type: ignore

    logger.debug(f"true value: {self.true_value}")
    logger.debug(f"false value: {self.false_value}")

    # set default parser
    if parser is None:
        logger.debug("set default parser")
        parser = ColumnParser(pandas = self.pandas_default_parser)

    # call the parent class constructor
    super().__init__(name, DataType.BOOLEAN, parser, order)

    return

true_value `instance-attribute` ¶

true_value: str | int | float = true_value

Value to be used to represent the True value.

false_value `instance-attribute` ¶

false_value: str | int | float = false_value

Value to be used to represent the False value.

pandas_default_parser ¶

pandas_default_parser(series: PandasSeries) -> PandasSeries

Default pandas parser for the boolean columns. The function maps the values to True and False based on the true_value and false_value attributes.

Observe that all other values are set to None.
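The mapping can be pictured as a plain dictionary lookup. The sketch below is a simplified, per-value stand-in for the series-level mapping; the name `parse_boolean` is hypothetical and `None` is used to represent a null.

```python
def parse_boolean(value, true_value="Y", false_value="N"):
    """Map one raw value to True/False; anything else becomes None."""
    mapping = {true_value: True, false_value: False}
    # Values that match neither key fall through to None.
    return mapping.get(value)
```

Matching is exact, so with the default settings a lowercase `"y"` is not recognized and maps to a null.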

Parameters:

Name	Type	Description	Default
`series`	`PandasSeries`	`pandas` series to be parsed.	required

Returns:

Type	Description
`PandasSeries`	`pandas` series parsed.

Source code in src/hamana/core/column.py

def pandas_default_parser(self, series: PandasSeries) -> PandasSeries:
    """
        Default `pandas` parser for the boolean columns.
        The function maps the values to `True` and `False` 
        based on the `true_value` and `false_value` attributes.

        Observe that all other values are set to `None`.

        Parameters:
            series: `pandas` series to be parsed.

        Returns:
            `pandas` series parsed.
    """
    return series.map({self.true_value: True, self.false_value: False})

hamana.core.column.DatetimeColumn ¶

DatetimeColumn(
    name: str,
    format: str = "%Y-%m-%d %H:%M:%S",
    null_default_value: (
        datetime | pd.Timestamp | None
    ) = None,
    parser: ColumnParser | None = None,
    order: int | None = None,
)

Bases: Column

Class representing DataType.DATETIME columns.

The class provides attributes that could be used to define the properties of the datetime column, such as:

format: the format to be used to parse the datetime. By default, the format is set to %Y-%m-%d %H:%M:%S.
null_default_value: the default value to be used when a null value is found. By default, the default value is set to None.

The class also provides a default parser that could be used to parse the datetime column using pandas.

Source code in src/hamana/core/column.py

def __init__(self,
    name: str,
    format: str = "%Y-%m-%d %H:%M:%S",
    null_default_value: datetime | pd.Timestamp | None = None,
    parser: ColumnParser | None = None,
    order: int | None = None
) -> None:

    # set attributes
    self.format = format
    self.null_default_value = null_default_value
    self.parser: ColumnParser # type: ignore

    logger.debug(f"format: {self.format}")
    logger.debug(f"null default value: {self.null_default_value}")

    # set default parser
    if parser is None:
        logger.debug("set default parser")
        parser = ColumnParser(pandas = self.pandas_default_parser)

    # call the parent class constructor
    super().__init__(name, DataType.DATETIME, parser, order)

    return

format `instance-attribute` ¶

format: str = format

Format to be used to parse the datetime.

null_default_value `instance-attribute` ¶

null_default_value: datetime | pd.Timestamp | None = (
    null_default_value
)

Default value to be used when a null value is found.

pandas_default_parser ¶

pandas_default_parser(
    series: PandasSeries,
    mode: PandasParsingModes = PandasParsingModes.RAISE,
) -> PandasSeries

Default pandas parser for the datetime columns. The function tries to convert the column to a datetime type using the pandas.to_datetime.

Observe that pandas.to_datetime could raise an OutOfBoundsDatetime error when the datetime is out of bounds. In this case, the function switches to a 'slow' mode where it first converts the column to string type and divides it into two parts:

the part that can be cast to datetime using pandas.to_datetime.
the part that cannot be cast, and is parsed using dateutil.parser instead.

This approach is slower than the default one, but can handle out of bounds datetimes.
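To show why the split is needed, the sketch below partitions values by whether they fit a nanosecond-resolution timestamp range. It is only an illustration under stated assumptions: it uses `datetime.strptime` rather than `pandas.to_datetime` or `dateutil.parser`, the names `NS_LOWER`, `NS_UPPER` and `split_by_bounds` are hypothetical, and the two bounds are conservative approximations of `pandas.Timestamp.min` and `pandas.Timestamp.max`.

```python
from datetime import datetime

# Conservative stand-ins for the pandas nanosecond timestamp limits
# (assumption: the real limits are slightly wider than these dates).
NS_LOWER = datetime(1677, 9, 22)
NS_UPPER = datetime(2262, 4, 11)


def split_by_bounds(values, fmt="%Y-%m-%d %H:%M:%S"):
    """Partition parsed values into in-bounds and out-of-bounds groups."""
    in_bounds, out_of_bounds = {}, {}
    for position, raw in enumerate(values):
        if raw is None:
            continue  # nulls are handled separately via null_default_value
        parsed = datetime.strptime(str(raw), fmt)
        if NS_LOWER <= parsed <= NS_UPPER:
            in_bounds[position] = parsed
        else:
            out_of_bounds[position] = parsed
    return in_bounds, out_of_bounds
```

Values in the second group, such as a placeholder date in the year 9999, are the ones that require the slower fallback path.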

Finally, if null_default_value is set, the function fills the null values with the default value.

Parameters:

Name	Type	Description	Default
`series`	`PandasSeries`	`pandas` series to be parsed.	required
`mode`	`PandasParsingModes`	mode to be used when parsing the datetime column. By default, the mode is set to `PandasParsingModes.RAISE`.	`PandasParsingModes.RAISE`

Returns:

Type	Description
`PandasSeries`	`pandas` series parsed.

Raises:

Type	Description
`ColumnParserPandasDatetimeError`	error parsing the datetime column.

Source code in src/hamana/core/column.py

def pandas_default_parser(self, series: PandasSeries, mode: PandasParsingModes = PandasParsingModes.RAISE) -> PandasSeries:
    """
        Default `pandas` parser for the datetime columns. The function
        tries to convert the column to a datetime type using the `pandas.to_datetime`.

        Observe that `pandas.to_datetime` could raise an `OutOfBoundsDatetime` error
        when the datetime is out of bounds. In this case, the function switches to
        a 'slow' mode where it first converts the column to string type and divides 
        it into two parts:

        - the part that can be cast to datetime using `pandas.to_datetime`.
        - the part that cannot be cast, and is parsed using `dateutil.parser` instead.

        This approach is slower than the default one, but can handle out of bounds datetimes.

        Finally, if the `null_default_value` is set, the function fills the 
        null values with the default value.

        Parameters:
            series: `pandas` series to be parsed.
            mode: mode to be used when parsing the datetime column.
                By default, the mode is set to `PandasParsingModes.RAISE`.

        Returns:
            `pandas` series parsed.

        Raises:
            `ColumnParserPandasDatetimeError`: error parsing the datetime column.
    """

    _series: PandasSeries
    _series_nulls = series.isnull()

    try:
        _series = pd.Series(pd.NaT, index = series.index)
        _series_dt = pd.to_datetime(series.dropna().astype("str"), errors = mode.value, format = self.format) # type: ignore (pandas issue in typing)
        _series.loc[_series_dt.index] = _series_dt
    except OutOfBoundsDatetime as e:
        logger.warning("[WARNING] switched to 'slow' mode due to out of bounds datetimes")
        logger.debug(f"[WARNING] parsing datetime: {e}")
        _series = pd.to_datetime(series.astype("str"), errors = "coerce", format = self.format)
        _series_not_casted = _series.isnull() & ~_series_nulls
        _series_to_cast = series.where(_series_not_casted, None)
        _series = _series.where(~_series_not_casted, _series_to_cast.dropna().apply(parser.parse))
    except Exception as e:
        logger.error(f"error parsing datetime: {e}")
        raise ColumnParserPandasDatetimeError(f"error parsing datetime: {e}")

    logger.debug("update null values")
    if _series_nulls.sum() > 0 and self.null_default_value is not None:
        logger.info("fill nulls")

        if (
                self.null_default_value >= pd.Timestamp.min
            and self.null_default_value <= pd.Timestamp.max
            and "datetime64" in _series.dtype.name
        ):
            _series = _series.fillna(self.null_default_value)
        else:
            _series = _series.mask(_series_nulls, self.null_default_value)

    return _series

hamana.core.column.DateColumn ¶

DateColumn(
    name: str,
    format: str = "%Y-%m-%d",
    null_default_value: (
        datetime | pd.Timestamp | None
    ) = None,
    parser: ColumnParser | None = None,
    order: int | None = None,
)

Bases: DatetimeColumn

Class representing DataType.DATE columns.

The class inherits from the DatetimeColumn class and can be used to store date values. Unlike the DatetimeColumn class, the DateColumn class does not store the time part of the datetime.

Note

During the initialization, the format is analysed to ensure that no time part is present. If the time part is found, an error is raised.

Similar to the DatetimeColumn class, the DateColumn class provides attributes that could be used to define the properties of the date column, such as:

format: the format to be used to parse the date. By default, the format is set to %Y-%m-%d.
null_default_value: the default value to be used when a null value is found. By default, the default value is set to None.

Raises:

Type	Description
`ColumnDateFormatterError`	error raised when the date format contains a time part.

Source code in src/hamana/core/column.py

def __init__(self,
    name: str,
    format: str = "%Y-%m-%d",
    null_default_value: datetime | pd.Timestamp | None = None,
    parser: ColumnParser | None = None,
    order: int | None = None
) -> None:

    # check format
    self.check_format(format)

    # call the parent class constructor
    super().__init__(name, format, null_default_value, parser, order)

    # override types
    self.dtype = DataType.DATE

    return

check_format `staticmethod` ¶

check_format(format: str) -> None

Function to check if the date format contains a time part.

Parameters:

Name	Type	Description	Default
`format`	`str`	date format to be checked.	required

Raises:

Type	Description
`ColumnDateFormatterError`	error raised when the date format contains a time part.
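The check amounts to a substring test against a list of time-related directives. The sketch below reproduces that idea as a boolean helper instead of raising an error; the names `TIME_DIRECTIVES` and `has_time_part` are hypothetical, while the directive list is the one that appears in the source listing.

```python
# Directives treated as "time information" in a date format.
TIME_DIRECTIVES = ("%H", "%I", "%p", "%M", "%S", "%f", "%z", "%c", "%X")


def has_time_part(fmt):
    """Return True if the format string contains any time directive."""
    return any(directive in fmt for directive in TIME_DIRECTIVES)
```

A format such as `%Y-%m-%d` passes the check, whereas `%Y-%m-%d %H:%M` is rejected because it contains `%H` and `%M`.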

Source code in src/hamana/core/column.py

@staticmethod
def check_format(format: str) -> None:
    """
        Function to check if the date format contains a time part.

        Parameters:
            format: date format to be checked.

        Raises:
            `ColumnDateFormatterError`: error raised when the date format contains a time part.
    """
    not_admissible_formats = ["%H", "%I", "%p", "%M", "%S", "%f", "%z", "%c", "%X"]
    if any([f in format for f in not_admissible_formats]):
        raise ColumnDateFormatterError(f"date format {format} should not contain time part")
