csv_to_opensearch

opensearch_py_ml.etl.csv_to_opensearch(filepath_or_buffer, os_client: str | List[str] | Tuple[str, ...] | OpenSearch, os_dest_index: str, os_if_exists: str = 'fail', os_refresh: bool = False, os_dropna: bool = False, os_type_overrides: Mapping[str, str] | None = None, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, prefix=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, chunksize=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, warn_bad_lines: bool = True, error_bad_lines: bool = True, on_bad_lines: str = 'error', delim_whitespace=False, low_memory: bool = True, memory_map=False, float_precision=None) → DataFrame

Read a comma-separated values (CSV) file into an opensearch_py_ml.DataFrame (i.e. an OpenSearch index).

Modifies an OpenSearch index.

Note: pandas iteration options are not supported.

Parameters

os_client: OpenSearch client

os_dest_index: str

Name of OpenSearch index to be appended to

os_if_exists: {‘fail’, ‘replace’, ‘append’}, default ‘fail’

How to behave if the index already exists.

  • fail: Raise a ValueError.

  • replace: Delete the index before inserting new values.

  • append: Insert new values into the existing index. Create the index if it does not exist.
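
The three modes can be read as a small decision table. The sketch below is illustrative only: `plan_index_actions` is a hypothetical helper, not part of opensearch_py_ml, and it assumes the outcome depends only on whether the index already exists and on the `os_if_exists` value.

```python
def plan_index_actions(index_exists: bool, os_if_exists: str = "fail") -> list:
    """Return the steps implied by os_if_exists (hypothetical helper, not library code)."""
    if os_if_exists not in ("fail", "replace", "append"):
        raise ValueError(f"unknown os_if_exists value: {os_if_exists!r}")
    if not index_exists:
        # Nothing to conflict with: the index is created, then rows are inserted.
        return ["create", "insert"]
    if os_if_exists == "fail":
        raise ValueError("index already exists and os_if_exists='fail'")
    if os_if_exists == "replace":
        # Delete the existing index before inserting new values.
        return ["delete", "create", "insert"]
    # append: keep the existing index and add the new documents to it.
    return ["insert"]
```

For example, `plan_index_actions(True, "replace")` returns `["delete", "create", "insert"]`, while `plan_index_actions(True, "fail")` raises a ValueError.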

os_dropna: bool, default False
  • True: Remove missing values (see pandas.Series.dropna)

  • False: Include missing values. This may cause the bulk indexing request to fail.
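
To make the effect of `os_dropna=True` concrete, here is a minimal standard-library sketch of dropping missing fields from a single row before it becomes a document. It only approximates pandas.Series.dropna; `drop_missing` and the sample row are assumptions made for illustration, not library code.

```python
import math

def drop_missing(row: dict) -> dict:
    """Remove None and NaN values from one row (stand-in for pandas.Series.dropna)."""
    def is_missing(value) -> bool:
        return value is None or (isinstance(value, float) and math.isnan(value))
    return {key: value for key, value in row.items() if not is_missing(value)}

# Hypothetical row in which one numeric field is missing.
row = {"state": "KS", "area code": 415, "total day minutes": float("nan")}
document = drop_missing(row)  # the NaN field is omitted from the document
```

With missing values kept (the `False` case), the NaN would be passed on in the document as-is.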

os_type_overrides: dict, default None

Dict of {column name: OpenSearch field type} used to override the default OpenSearch datatype mappings
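
As an illustration of the expected shape, the sketch below merges an override dict into a set of inferred types. The inferred values are hypothetical and chosen only for the example; the merge shows the intent (overrides win, other columns are untouched) and is not the library's internal code.

```python
# Hypothetical inferred mapping for three churn.csv columns (illustrative values).
inferred_types = {"state": "keyword", "area code": "long", "total day charge": "double"}

# Same shape as os_type_overrides: {column name: OpenSearch field type}.
os_type_overrides = {"area code": "keyword", "total day charge": "float"}

# Overrides replace the inferred type; columns without an override are unchanged.
final_types = {**inferred_types, **os_type_overrides}
```

A dict like `os_type_overrides` above is what would be passed as the `os_type_overrides` argument.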

chunksize: int, default None

Number of CSV rows to read before each bulk index request to OpenSearch
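
The batching idea behind `chunksize` can be sketched with the standard library alone. `iter_chunks` is a hypothetical helper, and the sample data is invented for the example; in this sketch each yielded chunk stands for the rows that would go into one bulk indexing step.

```python
import csv
import io
from itertools import islice

def iter_chunks(buffer, chunksize):
    """Yield lists of at most `chunksize` CSV rows (as dicts) from a text buffer."""
    reader = csv.DictReader(buffer)
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk  # one chunk corresponds to one bulk indexing step

# Three data rows read two at a time give one full chunk and one partial chunk.
sample = io.StringIO("id,state\n0,KS\n1,OH\n2,NJ\n")
chunk_sizes = [len(chunk) for chunk in iter_chunks(sample, chunksize=2)]
```

Smaller chunks keep each step lighter in memory, while larger chunks mean fewer steps overall.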

Other Parameters

Parameters derived from pandas.read_csv.

See Also

pandas.read_csv

Notes

The pandas iterator option is not supported.

Examples

Check whether the ‘churn’ index exists in OpenSearch:

>>> from opensearchpy import OpenSearch 
>>> osclient = OpenSearch() 
>>> osclient.indices.exists(index="churn") 
False

Read ‘churn.csv’ and use the first column as _id (and as the opensearch_py_ml.DataFrame index):

# churn.csv
,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0
...
>>> import opensearch_py_ml as oml
>>> oml.csv_to_opensearch(
...     "churn.csv",
...     os_client=OPENSEARCH_TEST_CLIENT,
...     os_dest_index='churn',
...     os_refresh=True,
...     index_col=0
... )
          account length  area code  churn  customer service calls  ... total night calls  total night charge total night minutes voice mail plan
0                128        415      0                       1  ...                91               11.01               244.7             yes
1                107        415      0                       1  ...               103               11.45               254.4             yes
2                137        415      0                       0  ...               104                7.32               162.6              no
3                 84        408      0                       2  ...                89                8.86               196.9              no
4                 75        415      0                       3  ...               121                8.41               186.9              no
...              ...        ...    ...                     ...  ...               ...                 ...                 ...             ...
3328             192        415      0                       2  ...                83               12.56               279.1             yes
3329              68        415      0                       3  ...               123                8.61               191.3              no
3330              28        510      0                       2  ...                91                8.64               191.9              no
3331             184        510      0                       2  ...               137                6.26               139.2              no
3332              74        415      0                       0  ...                77               10.86               241.4             yes

[3333 rows x 21 columns]

Validate that the data now exists in the ‘churn’ index:

>>> osclient.search(index="churn", size=1)
{'took': 1, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 3333, 'relation': 'eq'}, 'max_score': 1.0, 'hits': [{'_index': 'churn', '_id': '0', '_score': 1.0, '_source': {'state': 'KS', 'account length': 128, 'area code': 415, 'phone number': '382-4657', 'international plan': 'no', 'voice mail plan': 'yes', 'number vmail messages': 25, 'total day minutes': 265.1, 'total day calls': 110, 'total day charge': 45.07, 'total eve minutes': 197.4, 'total eve calls': 99, 'total eve charge': 16.78, 'total night minutes': 244.7, 'total night calls': 91, 'total night charge': 11.01, 'total intl minutes': 10.0, 'total intl calls': 3, 'total intl charge': 2.7, 'customer service calls': 1, 'churn': 0}}]}}

TODO: Currently, the opensearch_py_ml.DataFrame may not retain the row order of the data in the CSV.