
Columns

Because hamana is designed to work with data that is usually extracted in tabular form, the library introduces the Column class, which stores useful information about the extracted data (e.g. name, type) and describes it more precisely. For example, the Column class appears in Query objects and in the definition of CSV connectors.

Although the Column class defines a general behavior, it can be customized to better fit specific data types or sources. hamana provides default implementations for the most common data types:

  • NumberColumn: this column can be used to manage any kind of number.
  • IntegerColumn: column class specialised to manage integer values.
  • StringColumn: column class specialised to manage string values.
  • BooleanColumn: column class specialised to manage boolean values.
  • DatetimeColumn: this column is specific for datetime values.
  • DateColumn: this column is specific for date values.

These classes are useful because they already provide a default implementation of the ColumnParser class, which is used to convert the data from the source into the internal representation. They also provide additional class attributes that fit the corresponding datatype.

It is always possible to create custom Column classes by extending the Column class and providing a custom implementation of the ColumnParser class.

DataType

Before presenting the Column class, we first introduce the DataType class. It defines a standard set of types used throughout the library and provides a bridge between SQLite and pandas data types.

hamana.core.column.DataType

Bases: Enum

Enumeration representing the datatypes of the hamana columns.

The library supports the following data types:

  • INTEGER: integer data type.
  • NUMBER: number data type.
  • STRING: string data type.
  • BOOLEAN: boolean data type.
  • DATETIME: datetime data type.
  • DATE: date data type.
  • CUSTOM: custom data type.

The CUSTOM data type is used to represent a custom datatype that could be used for dedicated implementations.

Since the library is designed to be used with pandas and SQLite, the DataType enumeration also provides methods that map pandas data types to DataType (from_pandas) and DataType to SQLite data types (to_sqlite).

INTEGER class-attribute instance-attribute

INTEGER = 'integer'

Integer data type.

NUMBER class-attribute instance-attribute

NUMBER = 'number'

Number data type.

STRING class-attribute instance-attribute

STRING = 'string'

String data type.

BOOLEAN class-attribute instance-attribute

BOOLEAN = 'boolean'

Boolean data type.

DATETIME class-attribute instance-attribute

DATETIME = 'datetime'

Datetime data type.

DATE class-attribute instance-attribute

DATE = 'date'

Date data type.

CUSTOM class-attribute instance-attribute

CUSTOM = 'custom'

Custom data type.

from_pandas classmethod

from_pandas(dtype: str) -> DataType

Function to map a pandas datatype to DataType.

Observe that if no mapping is found, the default is DataType.STRING.

Parameters:

  • dtype (str, required): pandas data type.

Returns:

  • DataType: DataType mapped.

Source code in src/hamana/core/column.py
@classmethod
def from_pandas(cls, dtype: str) -> "DataType":
    """
        Function to map a `pandas` datatype to `DataType`.

        Observe that if no mapping is found, the default is `DataType.STRING`.

        Parameters:
            dtype: pandas data type.

        Returns:
            `DataType` mapped.
    """
    if "int" in dtype:
        return DataType.INTEGER
    elif "float" in dtype:
        return DataType.NUMBER
    elif dtype == "object":
        return DataType.STRING
    elif dtype == "bool":
        return DataType.BOOLEAN
    elif "datetime" in dtype:
        return DataType.DATETIME
    else:
        logger.warning(f"unknown data type: {dtype}")
        return DataType.STRING
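
To make the branch order concrete, the sketch below copies the mapping into a standalone function that returns the enum values as plain strings, so it runs without hamana installed. The name `map_pandas_dtype` is hypothetical, and passing dtype names such as `str(series.dtype)` is an assumption about typical usage, not something this page states.

```python
def map_pandas_dtype(dtype: str) -> str:
    """Standalone mirror of the branching in DataType.from_pandas (illustrative only)."""
    if "int" in dtype:
        return "integer"
    elif "float" in dtype:
        return "number"
    elif dtype == "object":
        return "string"
    elif dtype == "bool":
        return "boolean"
    elif "datetime" in dtype:
        return "datetime"
    # Unknown names fall back to string (the library also logs a warning here).
    return "string"


# Print the mapping for a few dtype names, including two that hit the fallback.
for name in ["int64", "uint8", "float32", "object", "bool", "datetime64[ns]", "category", "Int64"]:
    print(f"{name:>15} -> {map_pandas_dtype(name)}")
```

Because the checks are case-sensitive substring and equality tests, a name such as `Int64` (the pandas nullable integer dtype) or `category` falls through to the STRING default, and no branch returns DATE.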

to_sqlite classmethod

to_sqlite(dtype: DataType) -> str

Function to map a DataType to a SQLite datatype.

Parameters:

  • dtype (DataType, required): DataType to be mapped.

Returns:

  • str: SQLite data type mapped.

Source code in src/hamana/core/column.py
@classmethod
def to_sqlite(cls, dtype: "DataType") -> str:
    """
        Function to map a `DataType` to a SQLite datatype.

        Parameters:
            dtype: `DataType` to be mapped.

        Returns:
            SQLite data type mapped.
    """
    match dtype:
        case DataType.INTEGER:
            return "INTEGER"
        case DataType.NUMBER:
            return "REAL"
        case DataType.STRING:
            return "TEXT"
        case DataType.BOOLEAN:
            return "INTEGER"
        case DataType.DATETIME:
            return "INTEGER"
        case DataType.DATE:
            return "INTEGER"
        case DataType.CUSTOM:
            return "BLOB"
        case _:
            return ""
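
One way such a mapping can be used is to build a CREATE TABLE statement. The sketch below is not hamana code: it copies the mapping above into a plain dictionary (`SQLITE_TYPES` and `create_table_sql` are hypothetical names) and relies only on Python's built-in sqlite3 module. The table and column names are made up for illustration.

```python
import sqlite3

# Copy of the DataType -> SQLite mapping shown above, keyed by the enum values.
SQLITE_TYPES = {
    "integer": "INTEGER",
    "number": "REAL",
    "string": "TEXT",
    "boolean": "INTEGER",
    "datetime": "INTEGER",
    "date": "INTEGER",
    "custom": "BLOB",
}


def create_table_sql(table: str, columns: dict[str, str]) -> str:
    """Build a CREATE TABLE statement from {column name: datatype value}."""
    cols = ", ".join(f'"{name}" {SQLITE_TYPES[dtype]}' for name, dtype in columns.items())
    return f'CREATE TABLE "{table}" ({cols})'


conn = sqlite3.connect(":memory:")
conn.execute(create_table_sql(
    "orders",
    {"id": "integer", "amount": "number", "paid": "boolean", "created": "datetime"},
))

# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
declared = {row[1]: row[2] for row in conn.execute('PRAGMA table_info("orders")')}
print(declared)
```

Note that boolean, datetime, and date values all share the INTEGER storage type, so the SQLite declaration alone does not tell them apart.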

Parser

The Column class may also expose a parser attribute. If present, it is an instance of the ColumnParser class, which is used to convert the data from the source into the internal representation.

The ColumnParser class has two fields:

  • pandas: a callable that must respect the PandasParser protocol; it is used to convert pandas.Series input data.
  • polars: currently not supported; it will be used to convert polars.Series input data.

By default, the Column class does not provide any parser, but the NumberColumn, IntegerColumn, StringColumn, BooleanColumn, DatetimeColumn, and DateColumn classes provide a default implementation of the ColumnParser class.

hamana.core.column.ColumnParser dataclass

ColumnParser(
    pandas: PandasParser, polars: Callable | None = None
)

Class representing a parser for a column in the hamana library.

Since the library is designed to be used with pandas and polars, the ColumnParser class provides methods that could be used to parse data coming from these libraries.

hamana.core.column.PandasParser

Bases: Protocol

Protocol representing a parser for pandas series.

A pandas parser is a function that takes at least a pandas series as input and returns a pandas series after applying dedicated transformations.

Structure:

def parser(series: pandas.Series, *args: Any, **kwargs: Any) -> pandas.Series:
    ...

Identifier

There are many situations where the column datatype (string, number, date, etc.) must be identified, e.g. when the data is extracted from file sources such as CSV files. To solve this problem, hamana provides the ColumnIdentifier class, which identifies the column type from the input data.

Like the ColumnParser class, the ColumnIdentifier class has two fields:

  • pandas: a callable that must respect the PandasIdentifier protocol; it is used to identify the column type from pandas.Series input data.
  • polars: currently not supported; it will be used to identify the column type from polars.Series input data.

hamana.core.identifier.ColumnIdentifier dataclass

ColumnIdentifier(
    pandas: PandasIdentifier[TColumn],
    polars: Callable | None = None,
)

Bases: Generic[TColumn]

Class representing an identifier for a column in the hamana library.

Since the library is designed to be used with pandas and polars, the ColumnIdentifier class provides methods that could be used to identify the column from a set of data from both libraries.

Note

Observe that the identification process tries to infer the column type based on the data provided. The process is not perfect and could lead to wrong inferences. The user should always check the inferred column type and adjust it if needed.

is_empty staticmethod

is_empty(
    series: PandasSeries, raise_error: bool = False
) -> bool

Check if the series is empty.

Parameters:

  • series (PandasSeries, required): the series to check.
  • raise_error (bool, default False): if True, raise an error if the series is empty.

Returns:

  • bool: True if the series is empty, False otherwise.

Source code in src/hamana/core/identifier.py
@staticmethod
def is_empty(series: PandasSeries, raise_error: bool = False) -> bool:
    """
        Check if the series is empty.

        Parameters:
            series: the series to check.
            raise_error: if True, raise an error if the series is empty.

        Returns:
            True if the series is empty, False otherwise.
    """
    logger.debug("start")

    is_empty = series.empty
    if is_empty and raise_error:
        raise ColumnIdentifierEmptySeriesError("empty series")

    logger.debug("end")
    return is_empty

__call__

__call__(
    series: Any,
    column_name: str,
    order: int | None = None,
    *args: Any,
    **kwargs: Any
) -> TColumn | None

Identifies the column type from a given series.

Parameters:

  • series (Any, required): the series to identify the column type from.
  • column_name (str, required): the name of the column to identify.
  • *args (Any, default ()): additional arguments to pass to the identifier.
  • **kwargs (Any, default {}): additional keyword arguments to pass to the identifier.

Returns:

  • TColumn | None: the identified column type or None if the column type could not be identified.

Source code in src/hamana/core/identifier.py
def __call__(self, series: Any, column_name: str, order: int | None = None, *args: Any, **kwargs: Any) -> TColumn | None:
    """
        Identifies the column type from a given series.

        Parameters:
            series: the series to identify the column type from.
            column_name: the name of the column to identify.
            *args: additional arguments to pass to the identifier.
            **kwargs: additional keyword arguments to pass to the identifier.

        Returns:
            the identified column type or `None` if the column type
                could not be identified.
    """
    logger.debug("start")

    _series = None

    # pandas series
    if isinstance(series, PandasSeries):
        try:
            logger.debug("Identifying column type using pandas identifier.")
            _series = self.pandas(series, column_name, order, *args, **kwargs)
        except ColumnDateFormatterError as e:
            logger.error("Column date formatter error.")
            logger.exception(e)
            raise e
        except Exception as e:
            logger.info("pandas identifier failed.")
            logger.exception(e)

    logger.debug("end")
    return _series

infer staticmethod

infer(
    series: Any,
    column_name: str,
    order: int | None = None,
    *args: Any,
    **kwargs: Any
) -> (
    NumberColumn
    | IntegerColumn
    | StringColumn
    | BooleanColumn
    | DatetimeColumn
    | DateColumn
)

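Infers the column type from a given series. The function passes the series to the default hamana identifiers in the following order:

  • DateColumn
  • DatetimeColumn
  • BooleanColumn
  • IntegerColumn
  • NumberColumn
  • StringColumn

The first identifier that recognizes the series determines the column type.

Note

If the column is empty, the function assigns the STRING datatype by default.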

Parameters:

  • series (Any, required): the series to infer the column type from.
  • *args (Any, default ()): additional arguments to pass to the identifier.
  • **kwargs (Any, default {}): additional keyword arguments to pass to the identifier.

Returns:

  • NumberColumn | IntegerColumn | StringColumn | BooleanColumn | DatetimeColumn | DateColumn: the inferred column type.

Raises:

  • ColumnIdentifierError: if no column type could be inferred.

Source code in src/hamana/core/identifier.py
@staticmethod
def infer(series: Any, column_name: str, order: int | None = None, *args: Any, **kwargs: Any) -> NumberColumn | IntegerColumn | StringColumn | BooleanColumn | DatetimeColumn | DateColumn:
    """
        Infers the column type from a given series. The function passes 
        the series to the default `hamana` identifiers in the following
        order:

        - [`DateColumn`][hamana.core.column.DateColumn]
        - [`DatetimeColumn`][hamana.core.column.DatetimeColumn]
        - [`BooleanColumn`][hamana.core.column.BooleanColumn]
        - [`IntegerColumn`][hamana.core.column.IntegerColumn]
        - [`NumberColumn`][hamana.core.column.NumberColumn]
        - [`StringColumn`][hamana.core.column.StringColumn]

        in order to infer the column type.

        Note:
            If the column is empty, then by default the 
            function assigns the `STRING` datatype.

        Parameters:
            series: the series to infer the column type from.
            *args: additional arguments to pass to the identifier.
            **kwargs: additional keyword arguments to pass to the identifier.

        Returns:
            the inferred column type.

        Raises:
            ColumnIdentifierError: if no column type could be inferred.
    """
    logger.debug("start")

    try:
        # infer date column
        inferred_column = date_identifier(series, column_name, order, *args, **kwargs)
        if inferred_column is not None:
            logger.info(f"date column inferred, format: {inferred_column.format}")
            return inferred_column

        # infer datetime column
        inferred_column = datetime_identifier(series, column_name, order, *args, **kwargs)
        if inferred_column is not None:
            logger.info(f"datetime column inferred, format: {inferred_column.format}")
            return inferred_column

        # infer boolean column
        inferred_column = boolean_identifier(series, column_name, order, *args, **kwargs)
        if inferred_column is not None:
            logger.info(f"boolean column inferred, true value: {inferred_column.true_value}, false value: {inferred_column.false_value}")
            return inferred_column

        # infer integer column
        inferred_column = integer_identifier(series, column_name, order, *args, **kwargs)
        if inferred_column is not None:
            logger.info(f"integer column inferred, decimal separator: {inferred_column.decimal_separator}, thousands separator: {inferred_column.thousands_separator}")
            return inferred_column

        # infer number column
        inferred_column = number_identifier(series, column_name, order, *args, **kwargs)
        if inferred_column is not None:
            logger.info(f"number column inferred, decimal separator: {inferred_column.decimal_separator}, thousands separator: {inferred_column.thousands_separator}")
            return inferred_column

        # infer string column
        inferred_column = string_identifier(series, column_name, order, *args, **kwargs)
        if inferred_column is not None:
            logger.info("string column inferred")
            return inferred_column
    except ColumnIdentifierEmptySeriesError:
        logger.warning(f"column '{column_name}' empty, assigned STRING datatype.")
        return StringColumn(name = column_name, order = order)

    raise ColumnIdentifierError("no column inferred")
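
The control flow above (try each identifier in turn, return the first non-None result, fall back to a string column when the data is empty) can be sketched without hamana or pandas. Everything in this block is a simplified stand-in: the toy identifiers work on plain lists, return strings instead of Column objects, and all names are hypothetical.

```python
from typing import Callable, Optional, Sequence


class EmptySeriesError(Exception):
    """Stand-in for ColumnIdentifierEmptySeriesError."""


def non_empty(values: Sequence[Optional[str]]) -> list:
    """Drop None and empty strings; raise if nothing is left."""
    kept = [v for v in values if v not in (None, "")]
    if not kept:
        raise EmptySeriesError("empty series")
    return kept


def toy_integer(values: Sequence[Optional[str]]) -> Optional[str]:
    """Toy identifier: every remaining value is an optionally signed run of digits."""
    return "integer" if all(v.lstrip("+-").isdigit() for v in non_empty(values)) else None


def toy_string(values: Sequence[Optional[str]]) -> Optional[str]:
    """Toy identifier: any non-empty data counts as string."""
    non_empty(values)
    return "string"


def infer_first_match(values, identifiers: Sequence[Callable]) -> str:
    """Return the result of the first identifier that recognizes the values."""
    try:
        for identify in identifiers:
            result = identify(values)
            if result is not None:
                return result
    except EmptySeriesError:
        # Same fallback as in the listing above: empty data is treated as string.
        return "string"
    raise LookupError("no column inferred")


print(infer_first_match(["1", "-2", ""], [toy_integer, toy_string]))
print(infer_first_match(["1", "x"], [toy_integer, toy_string]))
print(infer_first_match(["", None], [toy_integer, toy_string]))
```

The order of the identifiers matters: a more specific identifier placed after a more permissive one would never be reached.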

hamana.core.identifier.PandasIdentifier

Bases: Protocol[TColumn]

Protocol representing an identifier for pandas series.

A PandasIdentifier is a callable that must have at least the following input parameters:

  • series: the pandas series to identify the column type from.
  • column_name: the name of the column to identify.

The PandasIdentifier must return a column type or None if the column type could not be identified.

Structure

def __call__(self, series: PandasSeries, column_name: str, order: int | None = None, *args: Any, **kwargs: Any) -> TColumn | None:
    ...

Default Identifiers

hamana provides a set of default identifiers that can be used to identify hamana's default column types.

Number Identifier

hamana.core.identifier.number_identifier module-attribute

number_identifier = ColumnIdentifier[NumberColumn](
    pandas=_default_numeric_pandas
)

Default identifier for the NumberColumn class.

More details on the default methods can be found in the corresponding functions' documentation.

  • pandas: _default_numeric_pandas
  • polars: None (not implemented)

hamana.core.identifier._default_numeric_pandas

_default_numeric_pandas(
    series: PandasSeries,
    column_name: str,
    order: int | None = None,
) -> NumberColumn | None

This function defines the default behavior to identify a number column from a pandas series.

In order to identify a number column, the function follows these steps:

  • Drop null values (including empty strings).
  • Check whether the column contains characters other than digits, signs, separators, and the exponent marker (e or E); if so, return None.
  • Count the maximum number of comma and dot separators across all elements.
  • First evaluate the default configuration (dot decimal separator, comma thousands separator).
  • If the default configuration does not work, evaluate the alternative configuration (comma decimal separator, dot thousands separator).
  • If this configuration does not work either, return None.

Parameters:

  • series (PandasSeries, required): pandas series to be checked.
  • column_name (str, required): name of the column to be checked.

Returns:

  • NumberColumn | None: NumberColumn if the column is a number column, None otherwise.

Source code in src/hamana/core/identifier.py
def _default_numeric_pandas(series: PandasSeries, column_name: str, order: int | None = None) -> NumberColumn | None:
    """
        This function defines the default behavior to identify a number column from a `pandas` series.

        In order to identify a number column, the function follows the steps:

        - Drop null values (included empty strings)
        - Check if the column has letters
        - Count the max appearance of the comma and dot 
            separators in all the elements.
        - Evaluate first the default configuration (dot decimal separator, 
            comma thousands separator).
        - If the default configuration does not work, evaluate the 
            alternative configuration (comma decimal separator, dot 
            thousands separator).
        - If also this configuration does not work, return None.

        Parameters:
            series: `pandas` series to be checked.
            column_name: name of the column to be checked.

        Returns:
            `NumberColumn` if the column is a number column, `None` otherwise.
    """
    logger.debug("start")
    column = None

    # drop null values
    _series = series.replace("", None).dropna().astype("str")
    logger.debug(f"dropped null values: {len(series) - len(_series)}")
    ColumnIdentifier.is_empty(_series, raise_error = True)

    # check letters presence
    logger.debug("check letters")
    if _series.str.replace(r"[0-9\.\-\+\,eE]", "", regex = True).str.len().sum() > 0:
        logger.warning("letters found, no number column")
        return None

    # check separators
    comma_separator_count = _series.str.replace(r"[0-9\.\-\+eE]", "", regex = True).str.len().max()
    logger.debug(f"comma separator count: {comma_separator_count}")

    dot_separator_count   = _series.str.replace(r"[0-9\-\+\,eE]", "", regex = True).str.len().max()
    logger.debug(f"dot separator count: {dot_separator_count}")

    if (
            dot_separator_count in [0, 1]
        and _series.str.match(r"^[+-]?(\d+(\,\d{3})*|\d{1,2})(\.\d+)?([eE][+-]?\d+)?$").all()
    ):
        logger.info("possible number column: dot decimal separator, comma thousands separator")
        column = NumberColumn(name = column_name, decimal_separator = ".", thousands_separator = ",", order = order)
        column.inferred = True
    elif (
            comma_separator_count in [0, 1]
        and _series.str.match(r"^[+-]?(\d+(\.\d{3})*|\d{1,2})(\,\d+)?([eE][+-]?\d+)?$").all()
    ):
        logger.info("possible number column: comma decimal separator, dot thousands separator")
        column = NumberColumn(name = column_name, decimal_separator = ",", thousands_separator = ".", order = order)
        column.inferred = True
    else:
        logger.warning("no separator found")

    logger.debug("end")
    return column
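
The separator detection can be tried in isolation. The sketch below reuses the two regular expressions from the listing above but runs them with Python's re module on a plain list of strings instead of a pandas series; `guess_separators` is a hypothetical name, and the function returns a (decimal, thousands) tuple rather than a NumberColumn.

```python
import re

# Same patterns as in the listing above.
DOT_DECIMAL = re.compile(r"^[+-]?(\d+(\,\d{3})*|\d{1,2})(\.\d+)?([eE][+-]?\d+)?$")
COMMA_DECIMAL = re.compile(r"^[+-]?(\d+(\.\d{3})*|\d{1,2})(\,\d+)?([eE][+-]?\d+)?$")


def guess_separators(values):
    """Return (decimal_separator, thousands_separator), or None if not numeric."""
    cleaned = [str(v) for v in values if v is not None and v != ""]
    if not cleaned:
        raise ValueError("empty series")

    # Anything besides digits, signs, separators and e/E rules out a number column.
    if any(re.sub(r"[0-9.\-+,eE]", "", v) for v in cleaned):
        return None

    max_dots = max(v.count(".") for v in cleaned)
    max_commas = max(v.count(",") for v in cleaned)

    # Default configuration first: dot decimal, comma thousands.
    if max_dots in (0, 1) and all(DOT_DECIMAL.match(v) for v in cleaned):
        return (".", ",")
    # Alternative configuration: comma decimal, dot thousands.
    if max_commas in (0, 1) and all(COMMA_DECIMAL.match(v) for v in cleaned):
        return (",", ".")
    return None


print(guess_separators(["1,234.56", "12.5", "-3"]))
print(guess_separators(["1.234,56", "12,5"]))
print(guess_separators(["1.234"]))
```

Because the default configuration is tried first, an ambiguous value such as `1.234` is read with a dot decimal separator; only a value that cannot be parsed that way, such as `1.234,56` or `1.234.567`, switches the column to the alternative configuration.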

Integer Identifier

hamana.core.identifier.integer_identifier module-attribute

integer_identifier = ColumnIdentifier[IntegerColumn](
    pandas=_default_integer_pandas
)

Default identifier for the IntegerColumn class.

More details on the default methods can be found in the corresponding functions' documentation.

  • pandas: _default_integer_pandas
  • polars: None (not implemented)

hamana.core.identifier._default_integer_pandas

_default_integer_pandas(
    series: PandasSeries,
    column_name: str,
    order: int | None = None,
) -> IntegerColumn | None

This function defines the default behavior to identify an integer column from a pandas series.

In order to identify an integer column, the function follows these steps:

  • Drop null values (including empty strings).
  • Check whether the column can be considered a number column.
  • If it can, check whether the column is composed only of integers (sign included).

Parameters:

  • series (PandasSeries, required): pandas series to be checked.
  • column_name (str, required): name of the column to be checked.

Returns:

  • IntegerColumn | None: IntegerColumn if the column is an integer column, None otherwise.

Source code in src/hamana/core/identifier.py
def _default_integer_pandas(series: PandasSeries, column_name: str, order: int | None = None) -> IntegerColumn | None:
    """
        This function defines the default behavior to identify an integer column from a `pandas` series.

        In order to identify an integer column, the function follows the steps:

        - Drop null values (included empty strings)
        - Check if the column can be considered as number datatype
        - If the check is passed, then is checked if the column is 
            composed only by integers (included the sign).

        Parameters:
            series: `pandas` series to be checked.
            column_name: name of the column to be checked.

        Returns:
            `IntegerColumn` if the column is an integer column, `None` otherwise.
    """
    logger.debug("start")
    column = None

    # drop null values
    _series = series.replace("", None).dropna().astype("str")
    logger.debug(f"dropped null values: {len(series) - len(_series)}")

    # check number column
    inferred_column = number_identifier.pandas(series, column_name)
    if inferred_column is None:
        logger.warning("no number column found")
        return None
    logger.debug("number column inferred")

    # adjust series
    _series = _series.str.replace(r"\.0$", "", regex = True)

    # check separators
    comma_separator_count = _series.str.replace(r"[0-9\.\-\+eE]", "", regex = True).str.len().max()
    logger.debug(f"comma separator count: {comma_separator_count}")

    dot_separator_count   = _series.str.replace(r"[0-9\-\+\,eE]", "", regex = True).str.len().max()
    logger.debug(f"dot separator count: {dot_separator_count}")

    # infer thousands separator
    thousands_separator = inferred_column.thousands_separator
    decimal_separator = inferred_column.decimal_separator
    if comma_separator_count >= 0 and dot_separator_count == 0:
        thousands_separator = ","
        decimal_separator   = "."
    elif dot_separator_count >= 0 and comma_separator_count == 0:
        thousands_separator = "."
        decimal_separator   = ","

    int_regex = rf"^[+-]?(\d+(\{thousands_separator}" + r"\d{3})*|\d{1,2})$"
    if thousands_separator is not None and _series.str.match(int_regex).all():
        logger.info("integer column found")
        column = IntegerColumn(name = inferred_column.name, decimal_separator = decimal_separator, thousands_separator = thousands_separator, order = order)
        column.inferred = True
    else:
        logger.warning("no integer column found")

    logger.debug("end")
    return column
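
The integer check can be sketched in the same standalone style. The block below assumes the values have already passed the number check, works on a plain list of strings, and returns the chosen thousands separator instead of an IntegerColumn. The name `integer_thousands_separator` and the `fallback` parameter (standing in for the separator inherited from the inferred number column) are hypothetical.

```python
import re


def integer_thousands_separator(values, fallback=","):
    """Return the thousands separator if every value is an integer, else None."""
    # A trailing ".0" is stripped first, so "3.0" is treated like "3".
    cleaned = [re.sub(r"\.0$", "", v) for v in values if v not in (None, "")]
    if not cleaned:
        raise ValueError("empty series")

    max_commas = max(v.count(",") for v in cleaned)
    max_dots = max(v.count(".") for v in cleaned)

    if max_dots == 0:
        thousands = ","       # no dots left: commas can only be thousands separators
    elif max_commas == 0:
        thousands = "."       # only dots left: treat them as thousands separators
    else:
        thousands = fallback  # both present: keep the number column's choice

    pattern = re.compile(rf"^[+-]?(\d+({re.escape(thousands)}\d{{3}})*|\d{{1,2}})$")
    return thousands if all(pattern.match(v) for v in cleaned) else None


print(integer_thousands_separator(["1,000", "25", "3.0"]))
print(integer_thousands_separator(["1.000", "2.500"]))
print(integer_thousands_separator(["1.5"]))
```

One consequence worth noting: a column such as `1.000`, `2.500` passes this check with a dot thousands separator, so it is read as the integers 1000 and 2500 rather than as decimals.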

String Identifier

hamana.core.identifier.string_identifier module-attribute

string_identifier = ColumnIdentifier[StringColumn](
    pandas=_default_string_pandas
)

Default identifier for the StringColumn class.

More details on the default methods can be found in the corresponding functions' documentation.

  • pandas: _default_string_pandas
  • polars: None (not implemented)

hamana.core.identifier._default_string_pandas

_default_string_pandas(
    series: PandasSeries,
    column_name: str,
    order: int | None = None,
) -> StringColumn | None

Function to identify a string column from a pandas series.

The function checks if the column is a string column by converting the column to string type and checking if at least one value can be considered as string.

Parameters:

  • series (PandasSeries, required): pandas series to be checked.
  • column_name (str, required): name of the column to be checked.

Returns:

  • StringColumn | None: StringColumn if the column is a string column, None otherwise.

Source code in src/hamana/core/identifier.py
def _default_string_pandas(series: PandasSeries, column_name: str, order: int | None = None) -> StringColumn | None:
    """
        Function to identify a string column from a `pandas` series.

        The function checks if the column is a string column by 
        converting the column to string type and checking if at least 
        one value can be considered as string.

        Parameters:
            series: `pandas` series to be checked.
            column_name: name of the column to be checked.

        Returns:
            `StringColumn` if the column is a string column, `None` otherwise.
    """
    logger.debug("start")
    column = None

    # drop null values
    _series = series.replace("", None).dropna()
    logger.debug(f"dropped null values: {len(series) - len(_series)}")
    ColumnIdentifier.is_empty(_series, raise_error = True)

    # check values
    logger.debug("check values")
    if _series.astype("str").str.match(r"^[A-Za-z\d\W]+$").any():
        logger.info("string column found")
        column = StringColumn(name = column_name, order = order)
        column.inferred = True
    else:
        logger.warning("no string column found")

    logger.debug("end")
    return column
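
The pattern used above is permissive but not universal. The sketch below applies the same regular expression with Python's re module to plain strings, assuming that `Series.str.match` behaves like `re.match` with default flags; `any_value_matches` is a hypothetical name, and unlike the library function it returns False for empty input instead of raising.

```python
import re

# Same pattern as in the listing above.
STRING_PATTERN = re.compile(r"^[A-Za-z\d\W]+$")


def any_value_matches(values) -> bool:
    """True if at least one non-empty value fully matches the pattern."""
    kept = [str(v) for v in values if v is not None and v != ""]
    return any(STRING_PATTERN.match(v) for v in kept)


print(any_value_matches(["hello world"]))
print(any_value_matches(["snake_case"]))
print(any_value_matches(["snake_case", "42"]))
```

With default flags, `\W` excludes the underscore and non-ASCII letters, so a value like `snake_case` does not match on its own. Since a single matching value is enough, a column that mixes such values with ordinary text is still identified as a string column.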

Boolean Identifier

hamana.core.identifier.boolean_identifier module-attribute

boolean_identifier = ColumnIdentifier[BooleanColumn](
    pandas=_default_boolean_pandas
)

Default identifier for the BooleanColumn class.

More details on the default methods can be found in the corresponding functions' documentation.

  • pandas: _default_boolean_pandas
  • polars: None (not implemented)

hamana.core.identifier._default_boolean_pandas

_default_boolean_pandas(
    series: PandasSeries,
    column_name: str,
    order: int | None = None,
    min_count: int = 1000,
) -> BooleanColumn | None

This function defines the default behavior to identify a boolean column from a pandas series.

To identify a boolean column, the function checks whether the column has exactly two unique values and more than min_count non-null elements. Note that the function does not check whether the values are actually boolean; for this reason, the assignment of the True and False values is arbitrary (the first unique value encountered becomes the True value).

Parameters:

  • series (PandasSeries, required): pandas series to be checked.
  • column_name (str, required): name of the column to be checked.
  • min_count (int, default 1000): minimum number of elements to consider the column as a boolean column. This parameter is used to avoid wrong inferences when the column has only a few elements.

Returns:

  • BooleanColumn | None: BooleanColumn if the column is a boolean column, None otherwise.

Source code in src/hamana/core/identifier.py
def _default_boolean_pandas(series: PandasSeries, column_name: str, order: int | None = None, min_count: int = 1_000) -> BooleanColumn | None:
    """
        This function defines the default behavior to identify a boolean column from a `pandas` series.

        To identify a boolean column, the function checks if the column has only two unique values.
        Observe, that the function does not check if the values are boolean values, but only if the
        column has two unique values; for this reason the assignment of the `True` and `False` values
        is arbitrary.

        Parameters:
            series: `pandas` series to be checked.
            column_name: name of the column to be checked.
            min_count: minimum number of elements to consider the column as a boolean column.
                This parameter is used to avoid wrong inferences when the column has only a few elements.

        Returns:
            `BooleanColumn` if the column is a boolean column, `None` otherwise.
    """
    logger.debug("start")
    column = None

    # drop null values
    _series = series.replace("", None).dropna()
    logger.debug(f"dropped null values: {len(series) - len(_series)}")
    ColumnIdentifier.is_empty(_series, raise_error = True)

    # check values
    logger.debug("check values")
    count_distinct = _series.nunique()
    if count_distinct == 2 and len(_series) > min_count:
        values = _series.unique()
        logger.info(f"boolean column found, unique values: {values}")
        column = BooleanColumn(name = column_name, true_value = values[0], false_value = values[1], order = order)
        column.inferred = True
    else:
        logger.warning(f"no boolean column, unique values: {count_distinct}")

    logger.debug("end")
    return column
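
The rule is small enough to restate on a plain list. In the sketch below, `guess_boolean` is a hypothetical name; it returns the (true_value, false_value) pair instead of a BooleanColumn and uses `dict.fromkeys` to keep the distinct values in order of first appearance.

```python
def guess_boolean(values, min_count=1_000):
    """Return (true_value, false_value) or None, following the rule above."""
    kept = [v for v in values if v is not None and v != ""]
    if not kept:
        raise ValueError("empty series")

    # Distinct values in order of first appearance.
    distinct = list(dict.fromkeys(kept))

    # Exactly two distinct values and strictly more than min_count elements.
    if len(distinct) == 2 and len(kept) > min_count:
        return distinct[0], distinct[1]
    return None


print(guess_boolean(["Y", "N"] * 501))  # 1002 elements
print(guess_boolean(["Y", "N"] * 500))  # exactly 1000 elements
```

The comparison with min_count is strict, so a column with exactly 1000 non-null elements is not identified as boolean under the default setting. Which value becomes the True value depends only on which one appears first in the data.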

Datetime Identifier

hamana.core.identifier.datetime_identifier module-attribute

datetime_identifier = ColumnIdentifier[DatetimeColumn](
    pandas=_default_datetime_pandas
)

Default identifier for the DatetimeColumn class.

More details on the default methods can be found in the corresponding functions' documentation.

  • pandas: _default_datetime_pandas
  • polars: None (not implemented)

hamana.core.identifier._default_datetime_pandas

_default_datetime_pandas(
    series: PandasSeries,
    column_name: str,
    order: int | None = None,
    format: str | None = None,
) -> DatetimeColumn | None

This function defines the default behavior to identify a datetime column from a pandas series.

To identify this type of column, the function first removes the null values, then tries to apply pandas.to_datetime with a list of the most common datetime formats. If no format identifies the column, the function tries pandas.to_datetime without a format. Since this last operation could lead to wrong inferences, the function considers the column a datetime column only if all the values are converted correctly.

Default Formats:

  • YYYY-MM-DD HH:mm:ss
  • YYYY-MM-DD HH:mm
  • YYYY-MM-DD
  • YYYY/MM/DD HH:mm:ss
  • YYYY/MM/DD HH:mm
  • YYYY/MM/DD
  • YYYYMMDD HH:mm:ss
  • YYYYMMDD HH:mm
  • YYYYMMDD
  • YYYYMMDDHHmmss
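The format search can be illustrated with the standard library alone. This is a minimal sketch under stated assumptions, not hamana's code: `infer_datetime_format` and `matches` are hypothetical names, `datetime.strptime` stands in for pandas.to_datetime (the two are not equally strict), and the candidate list reproduces formats from the list above.

```python
from datetime import datetime

# candidate formats, tried in order (taken from the list above)
COMMON_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d %H:%M",
    "%Y-%m-%d",
    "%Y/%m/%d %H:%M:%S",
    "%Y/%m/%d %H:%M",
    "%Y/%m/%d",
    "%Y%m%d %H:%M:%S",
    "%Y%m%d %H:%M",
    "%Y%m%d",
]


def matches(value, fmt):
    """True when a single string parses under the given format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def infer_datetime_format(values, fmt=None):
    """Return the first format that parses every non-null value, else None."""
    cleaned = [str(v) for v in values if v is not None and v != ""]
    if not cleaned:
        raise ValueError("no values left after dropping nulls")

    # an explicit format replaces the default candidates
    candidates = [fmt] if fmt is not None else COMMON_FORMATS
    for candidate in candidates:
        if all(matches(v, candidate) for v in cleaned):
            return candidate
    return None


print(infer_datetime_format(["2024-01-31 10:15", "", None, "2023-12-01 08:00"]))
print(infer_datetime_format(["2024/01/31", "31-01-2024"]))
```

The first call finds a single format for all non-null values; the second returns None because no single format covers both strings.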

Parameters:

Name Type Description Default
series PandasSeries

pandas series to be checked.

required
column_name str

name of the column to be checked.

required
format str | None

datetime format used to try to convert the series. If the format is provided, then the default formats are not used.

None

Returns:

Type Description
DatetimeColumn | None

DatetimeColumn if the column is a datetime column, None otherwise.

Source code in src/hamana/core/identifier.py
def _default_datetime_pandas(series: PandasSeries, column_name: str, order: int | None = None, format: str | None = None) -> DatetimeColumn | None:
    """
        This function defines the default behavior to identify a datetime column from a `pandas` series.

        To identify this type of column, the function first removes the
        null values, then tries `pandas.to_datetime` with each of the
        most common datetime formats in turn. Because a partial match
        could lead to wrong inferences, the function considers the
        column as a datetime column only if all the values are
        converted correctly by the same format.

        Default Formats:

        - `YYYY-MM-DD HH:mm:ss`
        - `YYYY-MM-DD HH:mm`
        - `YYYY-MM-DD`
        - `YYYY/MM/DD HH:mm:ss`
        - `YYYY/MM/DD HH:mm`
        - `YYYY/MM/DD`
        - `YYYYMMDD HH:mm:ss`
        - `YYYYMMDD HH:mm`
        - `YYYYMMDD`
        - `YYYYMMDDHHmmss`

        Parameters:
            series: `pandas` series to be checked.
            column_name: name of the column to be checked.
            format: datetime format used to try to convert the series.
                If the format is provided, then the default formats are 
                not used.

        Returns:
            `DatetimeColumn` if the column is a datetime column, `None` otherwise.
    """
    logger.debug("start")
    column = None

    # drop null values
    _series = series.replace("", None).dropna().astype("str")
    logger.debug(f"dropped null values: {len(series) - len(_series)}")
    ColumnIdentifier.is_empty(_series, raise_error = True)

    # set format
    format_list: list[str]
    if format is None:
        logger.debug("no format provided, check most common formats")
        format_list = [
            "%Y-%m-%d %H:%M:%S",
            "%Y-%m-%d %H:%M",
            "%Y-%m-%d",
            "%Y/%m/%d %H:%M:%S",
            "%Y/%m/%d %H:%M",
            "%Y/%m/%d",
            "%Y%m%d %H:%M:%S",
            "%Y%m%d %H:%M",
            "%Y%m%d",
            "%Y%m%d%H%M%S"
        ]
    else:
        format_list = [format]

    # check formats
    logger.debug("check datetime formats")
    for _format in format_list:
        try:
            if pd.to_datetime(_series, errors = "coerce", format = _format).notnull().all():
                logger.info(f"format '{_format}' used, datetime column found")
                column = DatetimeColumn(name = column_name, format = _format, order = order)
                column.inferred = True
                return column
        except Exception:
            logger.warning(f"format '{_format}' not recognized")

    # reached only when no format converted every value
    logger.warning("no datetime column found")
    logger.debug("end")
    return None

Date Identifier

hamana.core.identifier.date_identifier module-attribute

date_identifier = ColumnIdentifier[DatetimeColumn](
    pandas=_default_date_pandas
)

Default identifier for the DateColumn class.

More details on the default methods can be found in the corresponding functions' documentation.

  • pandas: _default_date_pandas
  • polars: None (not implemented)

hamana.core.identifier._default_date_pandas

_default_date_pandas(
    series: PandasSeries,
    column_name: str,
    order: int | None = None,
    format: str | None = None,
) -> DateColumn | None

This function defines the default behavior to identify a date column from a pandas series.

The function relies on the DatetimeColumn default pandas identifier method, _default_datetime_pandas, to identify the column. However, it considers only formats that do not contain time information.

Default Formats:

  • YYYY-MM-DD
  • YYYY/MM/DD
  • YYYYMMDD
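A date-only variant of the same search can be sketched as follows. This is an illustration in plain Python, not the library's code: `infer_date_format` is a hypothetical helper and `datetime.strptime` stands in for the pandas-based check.

```python
from datetime import datetime

# date-only candidates, matching the default formats listed above
DATE_FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%Y%m%d"]


def infer_date_format(values, fmt=None):
    """Return the first date-only format that parses every non-null value."""
    cleaned = [str(v) for v in values if v is not None and v != ""]
    if not cleaned:
        raise ValueError("no values left after dropping nulls")

    candidates = [fmt] if fmt is not None else DATE_FORMATS
    for candidate in candidates:
        try:
            for value in cleaned:
                datetime.strptime(value, candidate)
        except ValueError:
            continue  # at least one value failed, try the next format
        return candidate
    return None


print(infer_date_format(["20240131", None, "20231201"]))
print(infer_date_format(["2024-01-31 10:15:00"]))
```

Values that carry a time part fail every date-only format, so the second call returns None rather than silently dropping the time.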

Parameters:

Name Type Description Default
series PandasSeries

pandas series to be checked.

required
column_name str

name of the column to be checked.

required
format str | None

date format used to try to convert the series. If the format is provided, then the default formats are not used. Observe that the format must not contain time information.

None

Returns:

Type Description
DateColumn | None

DateColumn if the column is a date column, None otherwise.

Raises:

Type Description
ColumnDateFormatterError

if the format is not valid.

Source code in src/hamana/core/identifier.py
def _default_date_pandas(series: PandasSeries, column_name: str, order: int | None = None, format: str | None = None) -> DateColumn | None:
    """
        This function defines the default behavior to identify a date column from a `pandas` series.

        The function relies on the `DatetimeColumn` default pandas identifier method,
        `_default_datetime_pandas`, to identify the column. However, it considers
        only formats that do not contain time information.

        Default Formats:

        - `YYYY-MM-DD`
        - `YYYY/MM/DD`
        - `YYYYMMDD`

        Parameters:
            series: `pandas` series to be checked.
            column_name: name of the column to be checked.
            format: date format used to try to convert the series.
                If the format is provided, then the default formats are 
                not used. Observe that the format must not contain time
                information.

        Returns:
            `DateColumn` if the column is a date column, `None` otherwise.

        Raises:
            ColumnDateFormatterError: if the format is not valid.
    """
    logger.debug("start")
    column = None

    # check valid format
    if format is not None:
        logger.debug("check valid format")
        DateColumn.check_format(format)

    # drop null values
    _series = series.replace("", None).dropna().astype("str")
    logger.debug(f"dropped null values: {len(series) - len(_series)}")
    ColumnIdentifier.is_empty(_series, raise_error = True)

    # set format
    format_list: list[str]
    if format is None:
        logger.debug("no format provided, check most common formats")
        format_list = [
            "%Y-%m-%d",
            "%Y/%m/%d",
            "%Y%m%d"
        ]
    else:
        format_list = [format]

    # check formats
    logger.debug("check datetime formats")
    for _format in format_list:
        if _default_datetime_pandas(_series, column_name, order, _format) is not None:
            logger.info(f"format '{_format}' used, date column found")
            column = DateColumn(name = column_name, format = _format, order = order)
            column.inferred = True

    logger.debug("end")
    return column

API

hamana.core.column.Column dataclass

Column(
    name: str,
    dtype: DataType,
    parser: ColumnParser | None = None,
    order: int | None = None,
    inferred: bool = False,
)

Class representing a column in the hamana library.

To define a column, the following attributes are required:

  • name: name of the column.
  • dtype: represents the datatype and should be an instance of DataType.
  • parser: a column in hamana can have an associated parser object that is used to parse lists of values; this is useful, for example, when data extracted from different sources must be cast and normalized.
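The shape of such a column can be sketched with a small stand-in dataclass. Everything below is an assumption for illustration: the `DataType` members and their values are placeholders, and the parser is simplified to a plain callable, whereas hamana wraps parsers in a ColumnParser object.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class DataType(Enum):
    # stand-in enum with placeholder values, only two members for brevity
    INTEGER = "integer"
    STRING = "string"


@dataclass
class Column:
    # field names and defaults follow the signature shown above
    name: str
    dtype: DataType
    parser: Optional[Callable[[list], list]] = None  # simplified parser type
    order: Optional[int] = None
    inferred: bool = False


def parse_ints(values):
    """Hypothetical parser: cast each non-null value to int."""
    return [None if v is None else int(v) for v in values]


amount = Column(name="amount", dtype=DataType.INTEGER, parser=parse_ints, order=0)
print(amount.parser(["1", None, "3"]))  # [1, None, 3]
```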

name instance-attribute

name: str

Name of the column.

dtype instance-attribute

dtype: DataType

Data type of the column.

parser class-attribute instance-attribute

parser: ColumnParser | None = None

Parser object for the column.

order class-attribute instance-attribute

order: int | None = None

Numerical order of the column.

inferred class-attribute instance-attribute

inferred: bool = False

Flag to indicate if the column was inferred.

hamana.core.column.NumberColumn

NumberColumn(
    name: str,
    decimal_separator: str = ".",
    thousands_separator: str = ",",
    null_default_value: int | float | None = None,
    parser: ColumnParser | None = None,
    order: int | None = None,
)

Bases: Column

Dedicated class representing DataType.NUMBER columns.

The class provides attributes that could be used to define the properties of the number column, such as:

  • decimal_separator: the decimal separator used in the number. By default, the decimal separator is set to ".".
  • thousands_separator: the thousands separator used in the number. By default, the thousands separator is set to ",".
  • null_default_value: the default value to be used when a null value is found. By default, the default value is set to None.

The class also provides a default parser that could be used to parse the number column using pandas.

Source code in src/hamana/core/column.py
def __init__(
    self,
    name: str,
    decimal_separator: str = ".",
    thousands_separator: str = ",",
    null_default_value: int | float | None = None,
    parser: ColumnParser | None = None,
    order: int | None = None
):
    # set the attributes
    self.decimal_separator = decimal_separator
    self.thousands_separator = thousands_separator
    self.null_default_value = null_default_value
    self.parser: ColumnParser # type: ignore

    logger.debug(f"decimal separator: {self.decimal_separator}")
    logger.debug(f"thousands separator: {self.thousands_separator}")
    logger.debug(f"null default value: {self.null_default_value}")

    # set default parser
    if parser is None:
        logger.debug("set default parser")
        parser = ColumnParser(pandas = self.pandas_default_parser)

    # call the parent class constructor
    super().__init__(name, DataType.NUMBER, parser, order)

    return

decimal_separator instance-attribute

decimal_separator: str = decimal_separator

Decimal separator used in the number.

thousands_separator instance-attribute

thousands_separator: str = thousands_separator

Thousands separator used in the number.

null_default_value instance-attribute

null_default_value: int | float | None = null_default_value

Default value to be used when a null value is found.

pandas_default_parser

pandas_default_parser(
    series: PandasSeries,
    mode: PandasParsingModes = PandasParsingModes.RAISE,
) -> PandasSeries

Default pandas parser for the number columns. The function first converts the column to string type, removes the thousands separator and replaces the decimal separator with ".". Then it tries to convert the column to a numeric type using pandas.to_numeric.

If the null_default_value is set, the function fills the null values with the default value.
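The separator handling described above can be sketched for a single value in plain Python. This is an illustration only: `parse_number` is a hypothetical helper, `float` stands in for pandas.to_numeric, and only the raising behaviour is modelled.

```python
import math


def parse_number(value, decimal_separator=".", thousands_separator=",",
                 null_default_value=None):
    """Parse one raw value following the steps described above."""
    if value is None:
        # nulls stay NaN unless a default value is configured
        return math.nan if null_default_value is None else float(null_default_value)

    # order matters: drop the thousands separator first,
    # then normalise the decimal separator to "."
    text = str(value).replace(thousands_separator, "").replace(decimal_separator, ".")
    return float(text)  # raises ValueError on input that is not a number


print(parse_number("1,234.56"))                                                   # 1234.56
print(parse_number("1.234,56", decimal_separator=",", thousands_separator="."))  # 1234.56
print(parse_number(None, null_default_value=0))                                   # 0.0
```

Removing the thousands separator before touching the decimal separator is what makes the swapped configuration in the second call work.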

Parameters:

Name Type Description Default
series PandasSeries

pandas series to be parsed.

required
mode PandasParsingModes

mode to be used when parsing the number column. By default, the mode is set to PandasParsingModes.RAISE.

PandasParsingModes.RAISE

Returns:

Type Description
PandasSeries

pandas series parsed.

Raises:

Type Description
ColumnParserPandasNumberError

error parsing the number column.

Source code in src/hamana/core/column.py
def pandas_default_parser(self, series: PandasSeries, mode: PandasParsingModes = PandasParsingModes.RAISE) -> PandasSeries:
    """
        Default `pandas` parser for the number columns. The function 
        converts first the column to string type and replaces the 
        thousands separator with an empty string and the decimal 
        separator with `.`. Then, the function tries to convert the 
        column to a numeric type using the `pandas.to_numeric`.

        If the `null_default_value` is set, the function fills the 
        null values with the default value.

        Parameters:
            series: `pandas` series to be parsed.
            mode: mode to be used when parsing the number column.
                By default, the mode is set to `PandasParsingModes.RAISE`.

        Returns:
            `pandas` series parsed.

        Raises:
            `ColumnParserPandasNumberError`: error parsing the number column.
    """

    _series = pd.Series(np.nan, index = series.index)
    try:
        _series_number = pd.to_numeric(series.dropna().astype("str").str.replace(self.thousands_separator, "").str.replace(self.decimal_separator, "."), errors = mode.value) # type: ignore (pandas issue in typing)
        _series.loc[_series_number.index] = _series_number
    except Exception as e:
        logger.error(f"error parsing number: {e}")
        raise ColumnParserPandasNumberError(f"error parsing number: {e}")

    if self.null_default_value is not None:
        logger.debug(f"fill nulls, default value: {self.null_default_value}")
        _series = _series.fillna(self.null_default_value)
    return _series.astype("float")

hamana.core.column.IntegerColumn

IntegerColumn(
    name: str,
    decimal_separator: str = ".",
    thousands_separator: str = ",",
    null_default_value: int | None = 0,
    parser: ColumnParser | None = None,
    order: int | None = None,
)

Bases: NumberColumn

Class representing DataType.INTEGER columns. It inherits from the NumberColumn class and provides a default parser that can be used to parse integer columns.

Similar to the NumberColumn class, the IntegerColumn class provides attributes that could be used to define the properties of the integer column, such as:

  • decimal_separator: the decimal separator used in the number. By default, the decimal separator is set to ".".
  • thousands_separator: the thousands separator used in the number. By default, the thousands separator is set to ",".
  • null_default_value: the default value to be used when a null value is found. By default, the default value is set to 0.
Source code in src/hamana/core/column.py
def __init__(
    self,
    name: str,
    decimal_separator: str = ".",
    thousands_separator: str = ",",
    null_default_value: int | None = 0,
    parser: ColumnParser | None = None,
    order: int | None = None
):

    # call the parent class constructor
    super().__init__(name, decimal_separator, thousands_separator, null_default_value, parser, order)

    # override types
    self.dtype = DataType.INTEGER

pandas_default_parser

pandas_default_parser(
    series: PandasSeries,
    mode: PandasParsingModes = PandasParsingModes.RAISE,
) -> PandasSeries

Default pandas parser for the integer columns. Similar to the NumberColumn class, the function first converts the column to string type, removes the thousands separator and replaces the decimal separator with ".". Then it tries to convert the column to a numeric type using pandas.to_numeric.

If the null_default_value is set, the function fills the null values with the default value, and casts the column to integer type. Otherwise, the function applies the np.floor function to the returned series.
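The two branches described above round differently for negative decimals, which a per-value sketch makes visible. This is a plain-Python illustration, not the library's code: `parse_integer` is a hypothetical helper, and it assumes that a float-to-int cast truncates toward zero while the nullable branch floors.

```python
import math


def parse_integer(value, decimal_separator=".", thousands_separator=",",
                  null_default_value=0):
    """Parse one raw value into an integer, following the steps above."""
    if value is None:
        return null_default_value  # stays None when no default is configured

    text = str(value).replace(thousands_separator, "").replace(decimal_separator, ".")
    number = float(text)

    if null_default_value is not None:
        # default-filled branch: plain integer cast, truncates toward zero
        return int(number)
    # nullable branch: the value is floored
    return math.floor(number)


print(parse_integer("1,234.9"))                         # 1234
print(parse_integer("-1.5"))                            # -1
print(parse_integer("-1.5", null_default_value=None))   # -2
```

If the difference matters for a dataset, it may be safer to round values explicitly before they reach the parser.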

Parameters:

Name Type Description Default
series PandasSeries

pandas series to be parsed.

required
mode PandasParsingModes

mode to be used when parsing the number column. By default, the mode is set to PandasParsingModes.RAISE.

PandasParsingModes.RAISE

Returns:

Type Description
PandasSeries

pandas series parsed.

Raises:

Type Description
ColumnParserPandasNumberError

error parsing the number column.

Source code in src/hamana/core/column.py
def pandas_default_parser(self, series: PandasSeries, mode: PandasParsingModes = PandasParsingModes.RAISE) -> PandasSeries:
    """
        Default `pandas` parser for the integer columns. Similar 
        to the `NumberColumn` class, the function converts first 
        the column to string type and replaces the thousands separator
        with an empty string and the decimal separator with `.`. 
        Then, the function tries to convert the column to a numeric
        type using the `pandas.to_numeric`.

        If the `null_default_value` is set, the function fills the
        null values with the default value, and casts the column to 
        integer type. Otherwise, the function applies the `np.floor`
        function to the returned series.

        Parameters:
            series: `pandas` series to be parsed.
            mode: mode to be used when parsing the number column.
                By default, the mode is set to `PandasParsingModes.RAISE`.

        Returns:
            `pandas` series parsed.

        Raises:
            `ColumnParserPandasNumberError`: error parsing the number column.
    """

    _series = pd.Series(np.nan, index = series.index)
    try:
        _series_number = pd.to_numeric(
            arg = series.dropna().astype("str").str.replace(self.thousands_separator, "").str.replace(self.decimal_separator, "."),
            errors = mode.value # type: ignore (pandas issue in typing)
        )
        _series.loc[_series_number.index] = _series_number
    except Exception as e:
        logger.error(f"error parsing integer: {e}")
        raise ColumnParserPandasNumberError(f"error parsing integer: {e}")

    if self.null_default_value is not None:
        logger.debug(f"fill nulls, default value: {self.null_default_value}")
        return pd.Series(_series, dtype = "float").fillna(self.null_default_value).astype("int")

    return pd.Series(_series.astype(float).apply(np.floor), dtype = "Int64")

hamana.core.column.StringColumn

StringColumn(
    name: str,
    parser: ColumnParser | None = None,
    order: int | None = None,
)

Bases: Column

Class representing DataType.STRING columns.

Source code in src/hamana/core/column.py
def __init__(
    self,
    name: str,
    parser: ColumnParser | None = None,
    order: int | None = None
):

    self.parser: ColumnParser # type: ignore

    # set default parser
    if parser is None:
        logger.debug("set default parser")
        parser = ColumnParser(pandas = self.pandas_default_parser)

    # call the parent class constructor
    super().__init__(name, DataType.STRING, parser, order)

    return

pandas_default_parser

pandas_default_parser(series: PandasSeries) -> PandasSeries

Default pandas parser for the string columns. The function converts the column to string type and replaces the null values with None.
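Keeping nulls as None matters because a plain string cast would turn a missing value into text such as "None". A minimal sketch of the behaviour on a plain list, where `parse_strings` is a hypothetical helper and not part of hamana:

```python
import math


def parse_strings(values):
    """Cast every value to str while keeping nulls as None."""
    def is_null(value):
        # treat both None and float NaN as null
        return value is None or (isinstance(value, float) and math.isnan(value))

    return [None if is_null(v) else str(v) for v in values]


print(parse_strings([1, "a", None, float("nan"), 2.5]))
# ['1', 'a', None, None, '2.5']
```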

Parameters:

Name Type Description Default
series PandasSeries

pandas series to be parsed.

required

Returns:

Type Description
PandasSeries

pandas series parsed

Source code in src/hamana/core/column.py
def pandas_default_parser(self, series: PandasSeries) -> PandasSeries:
    """
        Default `pandas` parser for the string columns. The function
        converts the column to string type and replaces the null values
        with `None`.

        Parameters:
            series: `pandas` series to be parsed.

        Returns:
            `pandas` series parsed
    """
    _series_nulls = series.isnull()
    return series.astype("str").where(~_series_nulls, None)

hamana.core.column.BooleanColumn

BooleanColumn(
    name: str,
    true_value: str | int | float = "Y",
    false_value: str | int | float = "N",
    parser: ColumnParser | None = None,
    order: int | None = None,
)

Bases: Column

Class representing DataType.BOOLEAN columns.

The class provides attributes that could be used to define the properties of the boolean column, such as:

  • true_value: the value to be used to represent the True value. By default, the value is set to Y.
  • false_value: the value to be used to represent the False value. By default, the value is set to N.

The class also provides a default parser that could be used to parse the boolean column using pandas.

Source code in src/hamana/core/column.py
def __init__(self,
    name: str,
    true_value: str | int | float = "Y",
    false_value: str | int | float = "N",
    parser: ColumnParser | None = None,
    order: int | None = None
) -> None:

    # set attributes
    self.true_value = true_value
    self.false_value = false_value
    self.parser: ColumnParser # type: ignore

    logger.debug(f"true value: {self.true_value}")
    logger.debug(f"false value: {self.false_value}")

    # set default parser
    if parser is None:
        logger.debug("set default parser")
        parser = ColumnParser(pandas = self.pandas_default_parser)

    # call the parent class constructor
    super().__init__(name, DataType.BOOLEAN, parser, order)

    return

true_value instance-attribute

true_value: str | int | float = true_value

Value to be used to represent the True value.

false_value instance-attribute

false_value: str | int | float = false_value

Value to be used to represent the False value.

pandas_default_parser

pandas_default_parser(series: PandasSeries) -> PandasSeries

Default pandas parser for the boolean columns. The function maps the values to True and False based on the true_value and false_value attributes.

Observe that all other values are set to None.
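The mapping can be sketched with a dictionary lookup on a plain list. `parse_booleans` is a hypothetical helper used only for illustration; the defaults follow the constructor signature shown above.

```python
def parse_booleans(values, true_value="Y", false_value="N"):
    """Map configured markers to True/False; anything else becomes None."""
    mapping = {true_value: True, false_value: False}
    return [mapping.get(v) for v in values]


print(parse_booleans(["Y", "N", "y", None]))      # [True, False, None, None]
print(parse_booleans([1, 0, 2], true_value=1, false_value=0))  # [True, False, None]
```

The lookup is exact, so a lowercase "y" is not treated as the true value; normalising case beforehand is one way to handle mixed input.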

Parameters:

Name Type Description Default
series PandasSeries

pandas series to be parsed.

required

Returns:

Type Description
PandasSeries

pandas series parsed.

Source code in src/hamana/core/column.py
def pandas_default_parser(self, series: PandasSeries) -> PandasSeries:
    """
        Default `pandas` parser for the boolean columns.
        The function maps the values to `True` and `False` 
        based on the `true_value` and `false_value` attributes.

        Observe that all other values are set to `None`.

        Parameters:
            series: `pandas` series to be parsed.

        Returns:
            `pandas` series parsed.
    """
    return series.map({self.true_value: True, self.false_value: False})

hamana.core.column.DatetimeColumn

DatetimeColumn(
    name: str,
    format: str = "%Y-%m-%d %H:%M:%S",
    null_default_value: (
        datetime | pd.Timestamp | None
    ) = None,
    parser: ColumnParser | None = None,
    order: int | None = None,
)

Bases: Column

Class representing DataType.DATETIME columns.

The class provides attributes that could be used to define the properties of the datetime column, such as:

  • format: the format to be used to parse the datetime. By default, the format is set to %Y-%m-%d %H:%M:%S.
  • null_default_value: the default value to be used when a null value is found. By default, the default value is set to None.

The class also provides a default parser that could be used to parse the datetime column using pandas.

Source code in src/hamana/core/column.py
def __init__(self,
    name: str,
    format: str = "%Y-%m-%d %H:%M:%S",
    null_default_value: datetime | pd.Timestamp | None = None,
    parser: ColumnParser | None = None,
    order: int | None = None
) -> None:

    # set attributes
    self.format = format
    self.null_default_value = null_default_value
    self.parser: ColumnParser # type: ignore

    logger.debug(f"format: {self.format}")
    logger.debug(f"null default value: {self.null_default_value}")

    # set default parser
    if parser is None:
        logger.debug("set default parser")
        parser = ColumnParser(pandas = self.pandas_default_parser)

    # call the parent class constructor
    super().__init__(name, DataType.DATETIME, parser, order)

    return

format instance-attribute

format: str = format

Format to be used to parse the datetime.

null_default_value instance-attribute

null_default_value: datetime | pd.Timestamp | None = (
    null_default_value
)

Default value to be used when a null value is found.

pandas_default_parser

pandas_default_parser(
    series: PandasSeries,
    mode: PandasParsingModes = PandasParsingModes.RAISE,
) -> PandasSeries

Default pandas parser for the datetime columns. The function tries to convert the column to a datetime type using the pandas.to_datetime.

Observe that pandas.to_datetime could raise an OutOfBoundsDatetime error when the datetime is out of bounds. In this case, the function switches to a 'slow' mode where it first converts the column to string type and divides it into two parts:

  • the part that can be cast to datetime using pandas.to_datetime.
  • the part that cannot be cast, and is parsed using dateutil.parser instead.

This approach is slower than the default one, but can handle out-of-bounds datetimes.

Finally, if null_default_value is set, the function fills the null values with the default value.
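The two-pass idea behind the 'slow' mode can be sketched in plain Python. This is an illustration under stated assumptions, not hamana's implementation: `fast_parse` and `slow_parse` are hypothetical stand-ins for the pandas path and the dateutil fallback, and the bounds only approximate the range of pandas timestamps.

```python
from datetime import datetime

# approximate range of pandas nanosecond timestamps (assumption for this sketch)
LOWER = datetime(1677, 9, 22)
UPPER = datetime(2262, 4, 11)


def fast_parse(text, fmt):
    """Stand-in for the fast path: None when the value is out of range."""
    parsed = datetime.strptime(text, fmt)
    return parsed if LOWER <= parsed <= UPPER else None


def slow_parse(text, fmt):
    """Stand-in for the per-value fallback parser."""
    return datetime.strptime(text, fmt)


def parse_datetimes(values, fmt="%Y-%m-%d %H:%M:%S", null_default_value=None):
    is_null = [v is None for v in values]

    # first pass: values the fast path cannot represent are left as None
    parsed = [None if null else fast_parse(str(v), fmt)
              for v, null in zip(values, is_null)]

    # second pass: re-parse only the non-null values that were not cast
    for i, value in enumerate(values):
        if parsed[i] is None and not is_null[i]:
            parsed[i] = slow_parse(str(value), fmt)

    # last step: fill the original nulls with the default value, if set
    if null_default_value is not None:
        parsed = [null_default_value if null else p
                  for p, null in zip(parsed, is_null)]
    return parsed


rows = ["2024-01-31 10:15:00", None, "9999-12-31 00:00:00"]
result = parse_datetimes(rows, null_default_value=datetime(1900, 1, 1))
print(result)
```

Tracking the original nulls separately is what lets the second pass distinguish "was missing" from "could not be cast".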

Parameters:

Name Type Description Default
series PandasSeries

pandas series to be parsed.

required
mode PandasParsingModes

mode to be used when parsing the datetime column. By default, the mode is set to PandasParsingModes.RAISE.

PandasParsingModes.RAISE

Returns:

Type Description
PandasSeries

pandas series parsed.

Raises:

Type Description
ColumnParserPandasDatetimeError

error parsing the datetime column.

Source code in src/hamana/core/column.py
def pandas_default_parser(self, series: PandasSeries, mode: PandasParsingModes = PandasParsingModes.RAISE) -> PandasSeries:
    """
        Default `pandas` parser for the datetime columns. The function
        tries to convert the column to a datetime type using the `pandas.to_datetime`.

        Observe that `pandas.to_datetime` could raise an `OutOfBoundsDatetime` error
        when the datetime is out of bounds. In this case, the function switches to
        a 'slow' mode where it first converts the column to string type and divides 
        it into two parts:

        - the part that can be cast to datetime using `pandas.to_datetime`.
        - the part that cannot be cast, and is parsed using `dateutil.parser` instead.

        This approach is slower than the default one, but can handle out-of-bounds datetimes.

        Finally, if the `null_default_value` is set, the function fills the null
        values with the default value.

        Parameters:
            series: `pandas` series to be parsed.
            mode: mode to be used when parsing the datetime column.
                By default, the mode is set to `PandasParsingModes.RAISE`.

        Returns:
            `pandas` series parsed.

        Raises:
            `ColumnParserPandasDatetimeError`: error parsing the datetime column.
    """

    _series: PandasSeries
    _series_nulls = series.isnull()

    try:
        _series = pd.Series(pd.NaT, index = series.index)
        _series_dt = pd.to_datetime(series.dropna().astype("str"), errors = mode.value, format = self.format) # type: ignore (pandas issue in typing)
        _series.loc[_series_dt.index] = _series_dt
    except OutOfBoundsDatetime as e:
        logger.warning("[WARNING] switched to 'slow' mode due to out of bounds datetimes")
        logger.debug(f"[WARNING] parsing datetime: {e}")
        _series = pd.to_datetime(series.astype("str"), errors = "coerce", format = self.format)
        _series_not_casted = _series.isnull() & ~_series_nulls
        _series_to_cast = series.where(_series_not_casted, None)
        _series = _series.where(~_series_not_casted, _series_to_cast.dropna().apply(parser.parse))
    except Exception as e:
        logger.error(f"error parsing datetime: {e}")
        raise ColumnParserPandasDatetimeError(f"error parsing datetime: {e}")

    logger.debug("update null values")
    if _series_nulls.sum() > 0 and self.null_default_value is not None:
        logger.info("fill nulls")

        if (
                self.null_default_value >= pd.Timestamp.min
            and self.null_default_value <= pd.Timestamp.max
            and "datetime64" in _series.dtype.name
        ):
            _series = _series.fillna(self.null_default_value)
        else:
            _series = _series.mask(_series_nulls, self.null_default_value)

    return _series

hamana.core.column.DateColumn

DateColumn(
    name: str,
    format: str = "%Y-%m-%d",
    null_default_value: (
        datetime | pd.Timestamp | None
    ) = None,
    parser: ColumnParser | None = None,
    order: int | None = None,
)

Bases: DatetimeColumn

Class representing DataType.DATE columns.

The class inherits from the DatetimeColumn class and can be used to store date values. Unlike the DatetimeColumn class, the DateColumn class does not store the time part of the datetime.

Note

During the initialization, the format is analysed to ensure that no time part is present. If the time part is found, an error is raised.

Similar to the DatetimeColumn class, the DateColumn class provides attributes that could be used to define the properties of the date column, such as:

  • format: the format to be used to parse the date. By default, the format is set to %Y-%m-%d.
  • null_default_value: the default value to be used when a null value is found. By default, the default value is set to None.

Raises:

Type Description
ColumnDateFormatterError

error raised when the date format contains a time part.

Source code in src/hamana/core/column.py
def __init__(self,
    name: str,
    format: str = "%Y-%m-%d",
    null_default_value: datetime | pd.Timestamp | None = None,
    parser: ColumnParser | None = None,
    order: int | None = None
) -> None:

    # check format
    self.check_format(format)

    # call the parent class constructor
    super().__init__(name, format, null_default_value, parser, order)

    # override types
    self.dtype = DataType.DATE

    return

check_format staticmethod

check_format(format: str) -> None

Function to check if the date format contains a time part.

Parameters:

Name Type Description Default
format str

date format to be checked.

required

Raises:

Type Description
ColumnDateFormatterError

error raised when the date format contains a time part.
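A short usage sketch shows which formats pass the check. It is a stand-alone illustration: `has_time_part` is a hypothetical helper that returns a boolean instead of raising, and the directive list is copied from the source listing in this section.

```python
# time-related directives rejected by the check (copied from the source listing)
TIME_DIRECTIVES = ["%H", "%I", "%p", "%M", "%S", "%f", "%z", "%c", "%X"]


def has_time_part(fmt):
    """True when the format contains any of the rejected directives."""
    return any(directive in fmt for directive in TIME_DIRECTIVES)


print(has_time_part("%Y-%m-%d"))        # False
print(has_time_part("%d/%m/%Y"))        # False
print(has_time_part("%Y-%m-%d %H:%M"))  # True
```

The comparison is case-sensitive, so the month directive "%m" is accepted while the minute directive "%M" is rejected.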

Source code in src/hamana/core/column.py
@staticmethod
def check_format(format: str) -> None:
    """
        Function to check if the date format contains a time part.

        Parameters:
            format: date format to be checked.

        Raises:
            `ColumnDateFormatterError`: error raised when the date format contains a time part.
    """
    not_admissible_formats = ["%H", "%I", "%p", "%M", "%S", "%f", "%z", "%c", "%X"]
    if any([f in format for f in not_admissible_formats]):
        raise ColumnDateFormatterError(f"date format {format} should not contain time part")