csv_to_opensearch
- opensearch_py_ml.etl.csv_to_opensearch(filepath_or_buffer, os_client: str | List[str] | Tuple[str, ...] | OpenSearch, os_dest_index: str, os_if_exists: str = 'fail', os_refresh: bool = False, os_dropna: bool = False, os_type_overrides: Mapping[str, str] | None = None, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, prefix=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, chunksize=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, warn_bad_lines: bool = True, error_bad_lines: bool = True, on_bad_lines: str = 'error', delim_whitespace=False, low_memory: bool = True, memory_map=False, float_precision=None) -> DataFrame
Read a comma-separated values (csv) file into opensearch_py_ml.DataFrame (i.e. an OpenSearch index).
Modifies an OpenSearch index.
Note: pandas iteration options are not supported.
Parameters
- os_client: OpenSearch client
- os_dest_index: str
Name of the OpenSearch index to be appended to
- os_if_exists: {‘fail’, ‘replace’, ‘append’}, default ‘fail’
How to behave if the index already exists.
fail: Raise a ValueError.
replace: Delete the index before inserting new values.
append: Insert new values into the existing index. Create the index if it does not exist.
- os_dropna: bool, default False
True: Remove missing values (see pandas.Series.dropna)
False: Include missing values (may cause the bulk request to fail)
- os_type_overrides: dict, default None
Dict of column name to OpenSearch field type, used to override the default OpenSearch datatype mappings
- chunksize
Number of CSV rows to read before each bulk index request to OpenSearch
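The interaction of os_if_exists, os_dropna and chunksize described above can be sketched in plain Python. This is only an illustration of the documented behaviour, not the library's implementation: the names ingest_csv, clean_row and SAMPLE_CSV are hypothetical, a dict of lists stands in for the OpenSearch cluster, and an empty CSV cell is assumed to count as a missing value.

```python
import csv
import io
from itertools import islice

# Hypothetical three-row sample; the empty "plan" cell in row 1 is the missing value.
SAMPLE_CSV = "id,state,plan\n0,KS,yes\n1,OH,\n2,NJ,no\n"


def clean_row(row, dropna):
    """Turn empty cells into None and, if dropna is set, drop those fields."""
    doc = {key: (value if value != "" else None) for key, value in row.items()}
    if dropna:
        doc = {key: value for key, value in doc.items() if value is not None}
    return doc


def ingest_csv(buffer, indices, dest_index, if_exists="fail", dropna=False, chunksize=1000):
    """Load CSV rows into indices[dest_index] (a list of documents).

    `indices` is an in-memory stand-in for a cluster: {index_name: [documents]}.
    Returns the number of chunks ("bulk requests") that were written.
    """
    if if_exists not in ("fail", "replace", "append"):
        raise ValueError(f"unknown if_exists value: {if_exists!r}")
    if dest_index in indices:
        if if_exists == "fail":
            raise ValueError(f"index {dest_index!r} already exists")
        if if_exists == "replace":
            del indices[dest_index]  # delete the index before inserting
    # For "append" (and for a missing index) create the index if needed.
    target = indices.setdefault(dest_index, [])

    reader = csv.DictReader(buffer)
    chunks_written = 0
    while True:
        chunk = list(islice(reader, chunksize))  # read at most chunksize rows
        if not chunk:
            break
        # One simulated bulk request per chunk.
        target.extend(clean_row(row, dropna) for row in chunk)
        chunks_written += 1
    return chunks_written


if __name__ == "__main__":
    store = {}
    print(ingest_csv(io.StringIO(SAMPLE_CSV), store, "churn", chunksize=2))
    print(store["churn"])
```

In this sketch a second call with the default if_exists="fail" raises ValueError because the index already exists, while "append" adds to the existing documents and "replace" starts from an empty index.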
Other Parameters
Parameters derived from pandas.read_csv.
See Also
Notes
The pandas.read_csv iterator option is not supported.
Examples
Check whether the ‘churn’ index exists in OpenSearch:
>>> from opensearchpy import OpenSearch
>>> osclient = OpenSearch()
>>> osclient.indices.exists(index="churn")
False
Read ‘churn.csv’ and use the first column as _id (and as the opensearch_py_ml.DataFrame index):
# churn.csv
,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0
...
>>> import opensearch_py_ml as oml
>>> oml.csv_to_opensearch(
...     "churn.csv",
...     os_client=osclient,
...     os_dest_index='churn',
...     os_refresh=True,
...     index_col=0
... )
      account length  area code  churn  customer service calls  ...  total night calls  total night charge  total night minutes  voice mail plan
0                128        415      0                       1  ...                 91               11.01                244.7              yes
1                107        415      0                       1  ...                103               11.45                254.4              yes
2                137        415      0                       0  ...                104                7.32                162.6               no
3                 84        408      0                       2  ...                 89                8.86                196.9               no
4                 75        415      0                       3  ...                121                8.41                186.9               no
...              ...        ...    ...                     ...  ...                ...                 ...                  ...              ...
3328             192        415      0                       2  ...                 83               12.56                279.1              yes
3329              68        415      0                       3  ...                123                8.61                191.3               no
3330              28        510      0                       2  ...                 91                8.64                191.9               no
3331             184        510      0                       2  ...                137                6.26                139.2               no
3332              74        415      0                       0  ...                 77               10.86                241.4              yes

[3333 rows x 21 columns]
Validate that the data now exists in the ‘churn’ index:
>>> osclient.search(index="churn", size=1)
{'took': 1, 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 3333, 'relation': 'eq'}, 'max_score': 1.0,
          'hits': [{'_index': 'churn', '_id': '0', '_score': 1.0,
                    '_source': {'state': 'KS', 'account length': 128, 'area code': 415,
                                'phone number': '382-4657', 'international plan': 'no',
                                'voice mail plan': 'yes', 'number vmail messages': 25,
                                'total day minutes': 265.1, 'total day calls': 110,
                                'total day charge': 45.07, 'total eve minutes': 197.4,
                                'total eve calls': 99, 'total eve charge': 16.78,
                                'total night minutes': 244.7, 'total night calls': 91,
                                'total night charge': 11.01, 'total intl minutes': 10.0,
                                'total intl calls': 3, 'total intl charge': 2.7,
                                'customer service calls': 1, 'churn': 0}}]}}
TODO - currently the opensearch_py_ml.DataFrame may not retain the order of the data in the csv.