DataFrame

class opensearch_py_ml.DataFrame(os_client: OpenSearch = None, os_index_pattern: str | None = None, columns: List[str] | None = None, os_index_field: str | None = None, _query_compiler: QueryCompiler | None = None)

Two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns) that references data stored in OpenSearch indices. Where possible, the APIs mirror those of pandas.DataFrame. The underlying data is stored in OpenSearch rather than in local memory.

Parameters

os_client: OpenSearch client
os_index_pattern: str
    OpenSearch index pattern. This can contain wildcards. (e.g. ‘flights’)
columns: list of str, optional
    List of DataFrame columns. A subset of the OpenSearch index’s fields.
os_index_field: str, optional
    The OpenSearch index field to use as the DataFrame index. Defaults to _id if None is used.
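As a rough sketch of how the parameters above fit together, the snippet below collects them into keyword arguments and passes them to the constructor. The helper names `dataframe_kwargs` and `open_flights_frame` are hypothetical and not part of opensearch_py_ml, and the wildcard pattern 'flights*' is an assumed example. The package import is deferred so the helper can be inspected without a running cluster.

```python
from typing import Any, Dict, List, Optional


def dataframe_kwargs(
    os_client: Any,
    os_index_pattern: str,
    columns: Optional[List[str]] = None,
    os_index_field: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: gather the documented constructor parameters.

    Optional parameters are left out when they are None, so the
    constructor falls back to its defaults (for example, _id as the index).
    """
    kwargs: Dict[str, Any] = {
        "os_client": os_client,
        "os_index_pattern": os_index_pattern,
    }
    if columns is not None:
        kwargs["columns"] = list(columns)
    if os_index_field is not None:
        kwargs["os_index_field"] = os_index_field
    return kwargs


def open_flights_frame(os_client: Any):
    """Hypothetical helper: DataFrame over indices matching 'flights*'.

    Needs opensearch_py_ml installed and a reachable cluster; the index
    pattern and column names are assumptions for illustration.
    """
    # Imported lazily so this module loads even without the package.
    import opensearch_py_ml as oml

    return oml.DataFrame(
        **dataframe_kwargs(
            os_client,
            "flights*",
            columns=["AvgTicketPrice", "Cancelled"],
        )
    )
```

Omitting `columns` and `os_index_field` from the keyword arguments, rather than passing None explicitly, keeps the call equivalent to the two-argument form used in the examples.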

See Also

pandas.DataFrame

Examples

Constructing a DataFrame from OpenSearch configuration arguments and an OpenSearch index

>>> import opensearch_py_ml as oml
>>> from tests import OPENSEARCH_TEST_CLIENT
>>> df = oml.DataFrame(OPENSEARCH_TEST_CLIENT, 'flights')
>>> df.head()
   AvgTicketPrice  Cancelled  ... dayOfWeek           timestamp
0      841.265642      False  ...         0 2018-01-01 00:00:00
1      882.982662      False  ...         0 2018-01-01 18:27:00
2      190.636904      False  ...         0 2018-01-01 17:11:14
3      181.694216       True  ...         0 2018-01-01 10:33:28
4      730.041778      False  ...         0 2018-01-01 05:13:00

[5 rows x 27 columns]

Constructing a DataFrame from an OpenSearch client and an OpenSearch index, restricted to a subset of columns

>>> df = oml.DataFrame(os_client=OPENSEARCH_TEST_CLIENT, os_index_pattern='flights', columns=['AvgTicketPrice', 'Cancelled'])
>>> df.head()
   AvgTicketPrice  Cancelled
0      841.265642      False
1      882.982662      False
2      190.636904      False
3      181.694216       True
4      730.041778      False

[5 rows x 2 columns]

Constructing a DataFrame from an OpenSearch client and an OpenSearch index, with ‘timestamp’ as the DataFrame index field (TODO: currently index_field must also be a field if it is not _id)

>>> df = oml.DataFrame(
...     os_client=OPENSEARCH_TEST_CLIENT,
...     os_index_pattern='flights',
...     columns=['AvgTicketPrice', 'timestamp'],
...     os_index_field='timestamp'
... )
>>> df.head()
                     AvgTicketPrice           timestamp
2018-01-01T00:00:00      841.265642 2018-01-01 00:00:00
2018-01-01T00:02:06      772.100846 2018-01-01 00:02:06
2018-01-01T00:06:27      159.990962 2018-01-01 00:06:27
2018-01-01T00:33:31      800.217104 2018-01-01 00:33:31
2018-01-01T00:36:51      803.015200 2018-01-01 00:36:51

[5 rows x 2 columns]