DataFrame
- class opensearch_py_ml.DataFrame(os_client: OpenSearch = None, os_index_pattern: str | None = None, columns: List[str] | None = None, os_index_field: str | None = None, _query_compiler: QueryCompiler | None = None)
Two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns) that references data stored in OpenSearch indices. Where possible, the APIs mirror those of pandas.DataFrame. The underlying data is stored in OpenSearch rather than in local memory.
Parameters
os_client: OpenSearch client
os_index_pattern: str
    OpenSearch index pattern. This can contain wildcards (e.g. 'flights').
columns: list of str, optional
    List of DataFrame columns. A subset of the OpenSearch index's fields.
os_index_field: str, optional
    The OpenSearch index field to use as the DataFrame index. Defaults to _id if None is used.
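The parameter descriptions above mention two behaviors that are easy to overlook: os_index_pattern may contain wildcards, and os_index_field falls back to _id when it is None. The sketch below illustrates both using only the Python standard library. It does not call opensearch_py_ml or OpenSearch; the index names and both helper functions are hypothetical, and fnmatchcase is only a local stand-in for the pattern resolution that OpenSearch performs on the server.

```python
from fnmatch import fnmatchcase
from typing import List, Optional

# Hypothetical index names, invented for this illustration.
EXISTING_INDICES = ["flights", "flights-2018", "ecommerce"]


def matching_indices(os_index_pattern: str, indices: List[str]) -> List[str]:
    """Return the index names that a '*' wildcard pattern would select.

    fnmatchcase is a local approximation only: it also accepts '?' and
    '[...]', which this sketch does not rely on.
    """
    return [name for name in indices if fnmatchcase(name, os_index_pattern)]


def effective_index_field(os_index_field: Optional[str] = None) -> str:
    """Mirror the documented default: _id is used when None is passed."""
    return "_id" if os_index_field is None else os_index_field


print(matching_indices("flights", EXISTING_INDICES))   # ['flights']
print(matching_indices("flights*", EXISTING_INDICES))  # ['flights', 'flights-2018']
print(effective_index_field())                         # _id
print(effective_index_field("timestamp"))              # timestamp
```

A pattern without a wildcard selects only the index with exactly that name, which is the case in the 'flights' examples on this page.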
Examples
Constructing a DataFrame from OpenSearch configuration arguments and an OpenSearch index
>>> import opensearch_py_ml as oml
>>> from tests import OPENSEARCH_TEST_CLIENT
>>> df = oml.DataFrame(OPENSEARCH_TEST_CLIENT, 'flights')
>>> df.head()
   AvgTicketPrice  Cancelled  ... dayOfWeek           timestamp
0      841.265642      False  ...         0 2018-01-01 00:00:00
1      882.982662      False  ...         0 2018-01-01 18:27:00
2      190.636904      False  ...         0 2018-01-01 17:11:14
3      181.694216       True  ...         0 2018-01-01 10:33:28
4      730.041778      False  ...         0 2018-01-01 05:13:00

[5 rows x 27 columns]
Constructing a DataFrame from an OpenSearch client and an OpenSearch index, restricted to a subset of columns
>>> df = oml.DataFrame(os_client=OPENSEARCH_TEST_CLIENT, os_index_pattern='flights', columns=['AvgTicketPrice', 'Cancelled'])
>>> df.head()
   AvgTicketPrice  Cancelled
0      841.265642      False
1      882.982662      False
2      190.636904      False
3      181.694216       True
4      730.041778      False

[5 rows x 2 columns]
Constructing a DataFrame from an OpenSearch client and an OpenSearch index, with 'timestamp' as the DataFrame index field (TODO: currently, the index field must also be a field if it is not _id)
>>> df = oml.DataFrame(
...     os_client=OPENSEARCH_TEST_CLIENT,
...     os_index_pattern='flights',
...     columns=['AvgTicketPrice', 'timestamp'],
...     os_index_field='timestamp'
... )
>>> df.head()
                     AvgTicketPrice           timestamp
2018-01-01T00:00:00      841.265642 2018-01-01 00:00:00
2018-01-01T00:02:06      772.100846 2018-01-01 00:02:06
2018-01-01T00:06:27      159.990962 2018-01-01 00:06:27
2018-01-01T00:33:31      800.217104 2018-01-01 00:33:31
2018-01-01T00:36:51      803.015200 2018-01-01 00:36:51

[5 rows x 2 columns]
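For readers who know pandas better than OpenSearch, the shape of the last result (the timestamp appearing both as the row label and as a column) can be approximated with plain in-memory pandas. This is only a rough analogy, assuming that columns behaves like column selection and os_index_field like set_index with drop=False; it does not use opensearch_py_ml, and the two rows are copied from the first example's output.

```python
import pandas as pd

# Two rows copied from the first example's output; held in memory here,
# whereas opensearch_py_ml keeps the data in OpenSearch.
rows = pd.DataFrame(
    {
        "AvgTicketPrice": [841.265642, 882.982662],
        "Cancelled": [False, False],
        "timestamp": pd.to_datetime(
            ["2018-01-01 00:00:00", "2018-01-01 18:27:00"]
        ),
    }
)

# Rough analogue of columns=['AvgTicketPrice', 'timestamp'].
subset = rows[["AvgTicketPrice", "timestamp"]]

# Rough analogue of os_index_field='timestamp': label each row by that
# field while keeping it as a regular column (drop=False).
indexed = subset.set_index("timestamp", drop=False)

print(indexed.shape)          # (2, 2)
print(list(indexed.columns))  # ['AvgTicketPrice', 'timestamp']
```

One visible difference from the output above: pandas keeps datetime values in the index here, while the opensearch_py_ml example shows the index labels as ISO 8601 strings.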