Elasticsearch + Python: Querying More Than 10,000 Records [Solution]
Background
I've recently been working on data collection and analysis. The collected data lives in Elasticsearch and has grown past 100,000 records. When I tried to pull all of it out of ES for analysis, I hit a problem: paging with from/size worked fine for records 0–10,000, but querying records 10,000–20,000 raised an error. The query looked like this:
```json
GET index/_search
{
  "from": 10000,
  "size": 10000,
  "query": {
    "match_all": {}
  }
}
```
The error was:
```json
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [20000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : "new_channel",
        "node" : "dLHMyyNfQVuY-RSE1tPguQ",
        "reason" : {
          "type" : "illegal_argument_exception",
          "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [20000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
        }
      }
    ],
    "caused_by" : {
      "type" : "illegal_argument_exception",
      "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [20000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.",
      "caused_by" : {
        "type" : "illegal_argument_exception",
        "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [20000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
      }
    }
  },
  "status" : 400
}
```
How to fix it
As the error message explains, ES has a protection mechanism: a plain search can only reach the first 10,000 results. To get more, you either have to paginate a different way or change the index settings.
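The guard is simple arithmetic: a request is rejected whenever from + size exceeds index.max_result_window (10,000 by default). A tiny sketch of the check (the function name is mine, not an ES API):

```python
def exceeds_result_window(from_, size, max_result_window=10_000):
    # Elasticsearch rejects a search when from + size > index.max_result_window
    return from_ + size > max_result_window

# the failing request above: from=10000, size=10000 -> 20000 > 10000
```

That is why records 0–10,000 worked (0 + 10000 is exactly the limit) while 10,000–20,000 did not.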
Here are two solutions.
1. Change the setting on the index directly. This is quick and convenient, but it requires the right permissions, and raising the limit can degrade ES performance.
The request looks like this:
```json
PUT index/_settings
{
  "index": {
    "max_result_window": 10000000
  }
}
```
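The same update can be sent from Python; a minimal sketch that only builds the raw request, so nothing here needs a live cluster (the function name and "my_index" are my own, not from the original):

```python
import json

def max_result_window_request(index, new_max=10_000_000):
    # Build the PUT <index>/_settings request that raises the 10,000-result cap.
    # With the elasticsearch-py client you would send the same body via
    # es.indices.put_settings(index=index, body=body) -- assuming that client.
    path = f"/{index}/_settings"
    body = json.dumps({"index": {"max_result_window": new_max}})
    return path, body
```

Note that raising max_result_window only moves the limit: deep from/size pages still cost memory on every shard, which is why the search_after approach in solution 2 scales better.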
2. Use ES's search_after.
First, the initial query looks like this:
```json
GET index/_search
{
  "size": 10000,
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_id.keyword": "desc",   // sort by the document's own unique ID
      "upload_date": "asc"     // then by upload time
    }
  ]
}
```
Each hit in the result looks like this:
```json
{
  "_source" : {
    "platform" : "ig",
    "channel_id" : "fffe365d-0e9f-4128-a53f-74fc47490edc",
    "post_id" : "2436337532322256681",
    "main_id" : "28ea9b5c-ceeb-485f-984e-c815b9e957d2",
    "upload_date" : "2020-11-06T09:14:48",
    "post_title" : "",
    "description" : "",
    "categories" : [ ],
    "tag_person" : "",
    "tags" : [ ],
    "view_count" : 0,
    "like_count" : 1634,
    "dislike_count" : 0,
    "average_rating" : 0,
    "shortcode" : "CHPnBPPBRcp",
    "comment_counts" : 13,
    "last_update" : "2021-03-19T03:17:14.820854",
    "follow" : 42804
  },
  "sort" : [
    "fffe365d-0e9f-4128-a53f-74fc47490edc",
    1604654088000
  ]
}
```
Now just take the sort value of the last hit and pass it as search_after in the next query, and you have pagination.
The complete Python code:
```python
from elasticsearch import Elasticsearch

es = Elasticsearch()          # connection details omitted in the original

page = {
    "size": 10000,
    "query": {"match_all": {}},
    "sort": [
        {"channel_id.keyword": "desc", "created_at": "asc"}
    ],
}

seen_ids = []                 # collected document IDs
data = es.search(index='', body=page)

while True:
    for hit in data['hits']['hits']:
        if hit['_id'] in seen_ids:        # skip duplicates
            continue
        seen_ids.append(hit['_id'])

    if data['hits']['hits']:
        after = data['hits']['hits'][-1]['sort']   # last hit's sort value
    else:
        print('done')                              # empty page: all fetched
        break

    print(len(seen_ids))
    page['search_after'] = after
    data = es.search(index='', body=page)
```
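The script above needs a live cluster, but the paging logic itself can be pulled out and exercised against any search callable; a sketch with hypothetical names (not from the original post):

```python
def paginate_search_after(search_fn, query, page_size=10000):
    """Yield every hit by feeding the last hit's sort value back in
    as search_after -- the same loop as the script above, minus the client."""
    body = dict(query, size=page_size)
    while True:
        hits = search_fn(body)['hits']['hits']
        if not hits:          # empty page: we've seen everything
            break
        for hit in hits:
            yield hit
        body = dict(body, search_after=hits[-1]['sort'])
```

With the real client, search_fn would be something like `lambda body: es.search(index='', body=body)`.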