pandas_to_opensearch

etl.pandas_to_opensearch(pd_df: DataFrame, os_client: str | List[str] | Tuple[str, ...] | OpenSearch, os_dest_index: str, os_if_exists: str = 'fail', os_refresh: bool = False, os_dropna: bool = False, os_type_overrides: Mapping[str, str] | None = None, os_verify_mapping_compatibility: bool = True, thread_count: int = 4, chunksize: int | None = None, use_pandas_index_for_os_ids: bool = True) → DataFrame

Append a pandas DataFrame to an OpenSearch index. Mainly used in testing. Modifies the OpenSearch destination index.

Parameters

os_client: OpenSearch client

os_dest_index: str

Name of OpenSearch index to be appended to

os_if_exists: {‘fail’, ‘replace’, ‘append’}, default ‘fail’

How to behave if the index already exists.

  • fail: Raise a ValueError.

  • replace: Delete the index before inserting new values.

  • append: Insert new values into the existing index. Create the index if it does not exist.
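
The three modes can be pictured with a small in-memory model. This is only a sketch of the behaviour described above, not the library's implementation: resolve_if_exists and the dict standing in for the cluster are hypothetical names introduced for illustration.

```python
from typing import Dict, List


def resolve_if_exists(indices: Dict[str, List[dict]], name: str, mode: str) -> None:
    """Prepare indices[name] for writing, following the documented modes.

    `indices` is a hypothetical in-memory stand-in for a cluster:
    index name -> list of documents.
    """
    if mode not in ("fail", "replace", "append"):
        raise ValueError(f"unknown mode: {mode!r}")
    if name in indices:
        if mode == "fail":
            raise ValueError(f"index {name!r} already exists")
        if mode == "replace":
            indices[name] = []  # drop the old documents, start from an empty index
        # mode == "append": keep the existing documents untouched
    else:
        indices[name] = []  # assumption: a missing index is created in every mode


cluster = {"existing": [{"A": 1}]}

resolve_if_exists(cluster, "existing", "append")
append_result = list(cluster["existing"])   # old document is still present

resolve_if_exists(cluster, "existing", "replace")
replace_result = list(cluster["existing"])  # index was emptied before the insert
```

With the default 'fail', the same call on "existing" would raise a ValueError instead of touching the index.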

os_refresh: bool, default ‘False’

Refresh os_dest_index after bulk index

os_dropna: bool, default ‘False’
  • True: Remove missing values (see pandas.Series.dropna)

  • False: Include missing values - may cause bulk to fail
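
The effect of os_dropna on a single row can be previewed with plain pandas. The sketch below assumes each row becomes one document body; row_to_doc is a hypothetical helper, and the library's internal row handling may differ.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1.0, float("nan")], "C": ["foo", None]},
    index=["0", "1"],
)


def row_to_doc(row: pd.Series, dropna: bool) -> dict:
    """Turn one DataFrame row into a document body (illustrative helper)."""
    return (row.dropna() if dropna else row).to_dict()


# With dropna=True, missing fields are simply absent from the document.
docs_dropped = [row_to_doc(row, dropna=True) for _, row in df.iterrows()]

# With dropna=False, missing values stay in the document and are sent as-is.
docs_kept = [row_to_doc(row, dropna=False) for _, row in df.iterrows()]
```

Here the second row has no usable values, so with dropna=True its document body is empty, while with dropna=False it still carries both fields with missing values.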

os_type_overrides: dict, default None

Dict of field_name: os_data_type that overrides the default OpenSearch data types
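
As a rough illustration of how overrides combine with defaults, the sketch below resolves a field type per column. The default dtype-to-type table is an assumption made for this example (only object → keyword is implied by the example on this page), and resolve_field_types is a hypothetical helper, not part of the library.

```python
from typing import Dict, Mapping, Optional

# Assumed defaults, for illustration only; the library's real table may differ.
ASSUMED_DEFAULTS: Dict[str, str] = {
    "float64": "double",
    "int64": "long",
    "bool": "boolean",
    "datetime64[ns]": "date",
    "object": "keyword",
}


def resolve_field_types(
    dtypes: Mapping[str, str],
    overrides: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Pick a field type per column: the override if given, else the assumed default."""
    overrides = overrides or {}
    return {
        field: overrides.get(field, ASSUMED_DEFAULTS[dtype])
        for field, dtype in dtypes.items()
    }


field_types = resolve_field_types(
    {"A": "float64", "C": "object", "H": "object"},
    overrides={"H": "text"},  # index 'H' as text rather than keyword
)
```

Columns without an entry in the overrides keep their default type, so only 'H' changes here.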

os_verify_mapping_compatibility: bool, default ‘True’
  • True: Verify that the dataframe schema matches the OpenSearch index schema

  • False: Do not verify schema

thread_count: int, default 4

Number of threads to use for the bulk requests

chunksize: int, default None

Number of pandas.DataFrame rows to read before each bulk index request to OpenSearch
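
One way to picture chunksize: rows are gathered into batches of at most chunksize before each bulk request. The helper below is a minimal sketch under that assumption; chunked is a hypothetical name, and the library's internal batching may differ.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(rows: Iterable[dict], chunksize: int) -> Iterator[List[dict]]:
    """Yield lists of at most `chunksize` rows, preserving order."""
    if chunksize < 1:
        raise ValueError("chunksize must be a positive integer")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, chunksize))
        if not batch:
            return
        yield batch


# Five rows with chunksize=2 give two full batches and one partial batch.
batches = list(chunked(({"G": i} for i in range(5)), chunksize=2))
batch_sizes = [len(batch) for batch in batches]
```

A smaller chunksize keeps fewer rows in memory per request at the cost of more round trips.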

use_pandas_index_for_os_ids: bool, default ‘True’
  • True: pandas.DataFrame.index fields will be used to populate OpenSearch ‘_id’ fields.

  • False: Ignore pandas.DataFrame.index when indexing into OpenSearch
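
The difference between the two settings shows up in the bulk actions sent to OpenSearch. The sketch below builds action dicts in the shape accepted by the opensearch-py bulk helpers; to_bulk_actions is a hypothetical helper written for illustration, not the library's own code.

```python
from typing import List

import pandas as pd


def to_bulk_actions(df: pd.DataFrame, index_name: str, use_index_for_ids: bool) -> List[dict]:
    """Build one bulk action per row, optionally setting '_id' from the row label."""
    actions = []
    for row_label, row in df.iterrows():
        action = {"_index": index_name, "_source": row.to_dict()}
        if use_index_for_ids:
            action["_id"] = str(row_label)  # the pandas index value becomes the document id
        actions.append(action)
    return actions


df = pd.DataFrame({"G": [1, 2, 3]}, index=["0", "1", "2"])
with_ids = to_bulk_actions(df, "pandas_to_opensearch", use_index_for_ids=True)
without_ids = to_bulk_actions(df, "pandas_to_opensearch", use_index_for_ids=False)
```

When an action carries no '_id', OpenSearch generates a document ID itself, so repeated appends of the same rows create new documents rather than overwriting existing ones.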

Returns

opensearch_py_ml.DataFrame

opensearch_py_ml.DataFrame referencing data in os_dest_index

Examples

>>> import pandas as pd
>>> import opensearch_py_ml as oml
>>> from tests import OPENSEARCH_TEST_CLIENT
>>> pd_df = pd.DataFrame(data={'A': 3.141,
...                            'B': 1,
...                            'C': 'foo',
...                            'D': pd.Timestamp('20190102'),
...                            'E': [1.0, 2.0, 3.0],
...                            'F': False,
...                            'G': [1, 2, 3],
...                            'H': 'Long text - to be indexed as os type text'},
...                      index=['0', '1', '2'])
>>> type(pd_df)
<class 'pandas.core.frame.DataFrame'>
>>> pd_df
       A  B  ...  G                                          H
0  3.141  1  ...  1  Long text - to be indexed as os type text
1  3.141  1  ...  2  Long text - to be indexed as os type text
2  3.141  1  ...  3  Long text - to be indexed as os type text

[3 rows x 8 columns]
>>> pd_df.dtypes
A           float64
B             int64
C            object
D    datetime64[ns]
E           float64
F              bool
G             int64
H            object
dtype: object

Convert the pandas.DataFrame to an opensearch_py_ml.DataFrame. This creates an OpenSearch index called pandas_to_opensearch, overwrites any existing index of that name (os_if_exists="replace"), and refreshes the index so that it is readable on return (os_refresh=True).

>>> oml_df = oml.pandas_to_opensearch(pd_df,
...                            OPENSEARCH_TEST_CLIENT,
...                            'pandas_to_opensearch',
...                            os_if_exists="replace",
...                            os_refresh=True,
...                            os_type_overrides={'H':'text'}) # index field 'H' as text not keyword
>>> type(oml_df)
<class 'opensearch_py_ml.dataframe.DataFrame'>
>>> oml_df
       A  B  ...  G                                          H
0  3.141  1  ...  1  Long text - to be indexed as os type text
1  3.141  1  ...  2  Long text - to be indexed as os type text
2  3.141  1  ...  3  Long text - to be indexed as os type text

[3 rows x 8 columns]
>>> oml_df.dtypes
A           float64
B             int64
C            object
D    datetime64[ns]
E           float64
F              bool
G             int64
H            object
dtype: object

See Also

opensearch_py_ml.opensearch_to_pandas: Create a pandas.DataFrame from an opensearch_py_ml.DataFrame