pandas_to_opensearch
- etl.pandas_to_opensearch(pd_df: DataFrame, os_client: str | List[str] | Tuple[str, ...] | OpenSearch, os_dest_index: str, os_if_exists: str = 'fail', os_refresh: bool = False, os_dropna: bool = False, os_type_overrides: Mapping[str, str] | None = None, os_verify_mapping_compatibility: bool = True, thread_count: int = 4, chunksize: int | None = None, use_pandas_index_for_os_ids: bool = True) → DataFrame
Append a pandas DataFrame to an OpenSearch index. Mainly used in testing. Modifies the OpenSearch destination index.
Parameters
- pd_df: pandas.DataFrame
pandas DataFrame to append to the index
- os_client: OpenSearch client
- os_dest_index: str
Name of OpenSearch index to be appended to
- os_if_exists: {‘fail’, ‘replace’, ‘append’}, default ‘fail’
How to behave if the index already exists.
fail: Raise a ValueError.
replace: Delete the index before inserting new values.
append: Insert new values into the existing index. Create the index if it does not exist.
- os_refresh: bool, default ‘False’
Refresh os_dest_index after bulk index
- os_dropna: bool, default ‘False’
True: Remove missing values (see pandas.Series.dropna)
False: Include missing values; this may cause the bulk request to fail
- os_type_overrides: dict, default None
Dict of field_name: os_data_type that overrides the default OpenSearch data types
- os_verify_mapping_compatibility: bool, default ‘True’
True: Verify that the dataframe schema matches the OpenSearch index schema
False: Do not verify schema
- thread_count: int, default 4
Number of threads to use for the bulk requests
- chunksize: int, default None
Number of pandas.DataFrame rows to read before bulk index into OpenSearch
- use_pandas_index_for_os_ids: bool, default ‘True’
True: pandas.DataFrame.index fields will be used to populate OpenSearch ‘_id’ fields.
False: Ignore pandas.DataFrame.index when indexing into OpenSearch
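The interaction between os_if_exists and chunksize described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: plan_index_action and iter_chunks are hypothetical helper names, rejecting unknown os_if_exists values is an assumption, and the sketch requires an explicit integer chunksize rather than modelling the None default.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def plan_index_action(index_exists: bool, os_if_exists: str = "fail") -> str:
    """Return the step implied by os_if_exists (hypothetical helper)."""
    # Assumption: unknown values are rejected up front.
    if os_if_exists not in ("fail", "replace", "append"):
        raise ValueError(f"unknown os_if_exists value: {os_if_exists!r}")
    if not index_exists:
        # Nothing to conflict with, so every mode creates the index.
        return "create"
    if os_if_exists == "fail":
        raise ValueError("index already exists and os_if_exists='fail'")
    if os_if_exists == "replace":
        # Delete the existing index before inserting new values.
        return "delete_then_create"
    # 'append': insert new values into the existing index.
    return "append"


def iter_chunks(rows: Sequence[T], chunksize: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most chunksize rows."""
    for start in range(0, len(rows), chunksize):
        yield list(rows[start:start + chunksize])


if __name__ == "__main__":
    print(plan_index_action(index_exists=True, os_if_exists="replace"))
    for chunk in iter_chunks(["row0", "row1", "row2"], chunksize=2):
        print(chunk)
```

Each chunk here stands in for one group of rows that would be sent as a bulk request; a smaller chunksize means more, smaller requests.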
Returns
- opensearch_py_ml.DataFrame
opensearch_py_ml.DataFrame referencing data in os_dest_index
Examples
>>> import pandas as pd
>>> from tests import OPENSEARCH_TEST_CLIENT
>>> pd_df = pd.DataFrame(data={'A': 3.141,
...                            'B': 1,
...                            'C': 'foo',
...                            'D': pd.Timestamp('20190102'),
...                            'E': [1.0, 2.0, 3.0],
...                            'F': False,
...                            'G': [1, 2, 3],
...                            'H': 'Long text - to be indexed as os type text'},
...                      index=['0', '1', '2'])
>>> type(pd_df)
<class 'pandas.core.frame.DataFrame'>
>>> pd_df
       A  B  ...  G                                          H
0  3.141  1  ...  1  Long text - to be indexed as os type text
1  3.141  1  ...  2  Long text - to be indexed as os type text
2  3.141  1  ...  3  Long text - to be indexed as os type text

[3 rows x 8 columns]
>>> pd_df.dtypes
A           float64
B             int64
C            object
D    datetime64[ns]
E           float64
F              bool
G             int64
H            object
dtype: object
Convert pandas.DataFrame to opensearch_py_ml.DataFrame. This creates an OpenSearch index called pandas_to_opensearch. Setting os_if_exists="replace" overwrites the index if it already exists, and os_refresh=True refreshes the index so that it is readable on return.
>>> import opensearch_py_ml as oml
>>> from tests import OPENSEARCH_TEST_CLIENT
>>> oml_df = oml.pandas_to_opensearch(pd_df,
...                                   OPENSEARCH_TEST_CLIENT,
...                                   'pandas_to_opensearch',
...                                   os_if_exists="replace",
...                                   os_refresh=True,
...                                   os_type_overrides={'H':'text'}) # index field 'H' as text not keyword
>>> type(oml_df)
<class 'opensearch_py_ml.dataframe.DataFrame'>
>>> oml_df
       A  B  ...  G                                          H
0  3.141  1  ...  1  Long text - to be indexed as os type text
1  3.141  1  ...  2  Long text - to be indexed as os type text
2  3.141  1  ...  3  Long text - to be indexed as os type text

[3 rows x 8 columns]
>>> oml_df.dtypes
A           float64
B             int64
C            object
D    datetime64[ns]
E           float64
F              bool
G             int64
H            object
dtype: object
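To make the effect of os_dropna and use_pandas_index_for_os_ids more concrete, the sketch below turns rows into bulk-style action dicts using only the standard library. It is an illustration under stated assumptions, not the library's code: build_bulk_actions and is_missing are hypothetical names, rows are modelled as (index label, column dict) pairs instead of a real DataFrame, and the `_index` / `_id` / `_source` keys follow the action format used by the opensearch-py bulk helpers.

```python
import math
from typing import Any, Dict, Iterable, List, Tuple


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing (a simplification of pandas' rules)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def build_bulk_actions(
    rows: Iterable[Tuple[str, Dict[str, Any]]],
    os_dest_index: str,
    os_dropna: bool = False,
    use_pandas_index_for_os_ids: bool = True,
) -> List[Dict[str, Any]]:
    """Build one action dict per row (hypothetical helper)."""
    actions = []
    for label, values in rows:
        # With os_dropna=True, missing fields are left out of the document.
        source = {
            key: value
            for key, value in values.items()
            if not (os_dropna and is_missing(value))
        }
        action: Dict[str, Any] = {"_index": os_dest_index, "_source": source}
        if use_pandas_index_for_os_ids:
            # The DataFrame index label becomes the document '_id'.
            action["_id"] = str(label)
        actions.append(action)
    return actions


if __name__ == "__main__":
    rows = [("0", {"E": 1.0, "G": 1}), ("1", {"E": float("nan"), "G": 2})]
    for action in build_bulk_actions(rows, "pandas_to_opensearch", os_dropna=True):
        print(action)
```

When use_pandas_index_for_os_ids is False the `_id` key is omitted from each action, which in OpenSearch means the document ID is generated on the server side.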
See Also
opensearch_py_ml.opensearch_to_pandas: Create a pandas.DataFrame from opensearch_py_ml.DataFrame