Elasticsearch with Python: Querying More Than 10,000 Records [Solution]

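Background

I have recently been collecting and analyzing data. The data is stored in ES (Elasticsearch) and has already passed 100,000 records. When I tried to pull all of it out of ES for analysis, I ran into a problem. I was paging with from and size: fetching records 0 to 10,000 worked, but the request for records 10,000 to 20,000 failed. The query was: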


GET index/_search
{
  "from": 10000,
  "size": 10000,
  "query": {
    "match_all": {}
  }
}


The error was:


{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [20000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : "new_channel",
        "node" : "dLHMyyNfQVuY-RSE1tPguQ",
        "reason" : {
          "type" : "illegal_argument_exception",
          "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [20000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
        }
      }
    ],
    "caused_by" : {
      "type" : "illegal_argument_exception",
      "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [20000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.",
      "caused_by" : {
        "type" : "illegal_argument_exception",
        "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [20000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
      }
    }
  },
  "status" : 400
}
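
As a rough illustration of the check behind this error (a sketch, not Elasticsearch's actual code): a from/size page is accepted only while from + size stays within index.max_result_window, which defaults to 10000. The name fits_result_window is made up for this example.

```python
DEFAULT_MAX_RESULT_WINDOW = 10000  # Elasticsearch default for index.max_result_window


def fits_result_window(from_, size, max_result_window=DEFAULT_MAX_RESULT_WINDOW):
    """Return True if a from/size page stays within the result window."""
    return from_ + size <= max_result_window


print(fits_result_window(0, 10000))      # first page from the article
print(fits_result_window(10000, 10000))  # second page from the article
```

The second call corresponds to the failing request: 10000 + 10000 = 20000, which is the number shown in the error message.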

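How to fix it
As the error message says, ES has a safeguard: by default from + size may not exceed 10,000 (the index.max_result_window setting). To read further, you either have to raise that setting or page through the results a different way.
Two solutions are described here.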

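1. Change the setting on the index directly. This is quick and convenient, but you need permission to update the index settings, and a very large result window can hurt ES performance, since a deep from/size page makes every shard collect and sort from + size hits.
The request is: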


PUT index/_settings
{
  "index":{
    "max_result_window":10000000
  }
}
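
The same change can probably be made from Python. The sketch below assumes `es` is an elasticsearch-py client and uses its `indices.put_settings` call; the helper name set_max_result_window is invented for this example.

```python
def set_max_result_window(es, index, window):
    """Set index.max_result_window for one index.

    Assumption: `es` is an elasticsearch-py client. `index` and `window`
    are the same values you would put in the PUT request above.
    """
    settings = {"index": {"max_result_window": window}}
    # elasticsearch-py exposes the _settings endpoint as indices.put_settings.
    return es.indices.put_settings(index=index, body=settings)
```

A call would look like set_max_result_window(es, "index", 20000). Newer 8.x clients may warn that `body` is deprecated in favor of a `settings` keyword, so adjust to the client version you have installed.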

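2. Use ES search_after
The first request has no search_after, only a sort. Here the results are sorted by channel_id.keyword (an ID field) and then by upload_date (time):


GET index/_search
{
  "size": 10000,
  "query": {
    "match_all": {}
  },
  "sort": [
    { "channel_id.keyword": "desc" },
    { "upload_date": "asc" }
  ]
}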

Each hit in the response carries a sort array. One hit looks like this:


{
  "_source" : {
    "platform" : "ig",
    "channel_id" : "fffe365d-0e9f-4128-a53f-74fc47490edc",
    "post_id" : "2436337532322256681",
    "main_id" : "28ea9b5c-ceeb-485f-984e-c815b9e957d2",
    "upload_date" : "2020-11-06T09:14:48",
    "post_title" : "",
    "description" : "",
    "categories" : [ ],
    "tag_person" : "",
    "tags" : [ ],
    "view_count" : 0,
    "like_count" : 1634,
    "dislike_count" : 0,
    "average_rating" : 0,
    "shortcode" : "CHPnBPPBRcp",
    "comment_counts" : 13,
    "last_update" : "2021-03-19T03:17:14.820854",
    "follow" : 42804
  },
  "sort" : [
    "fffe365d-0e9f-4128-a53f-74fc47490edc",
    1604654088000
  ]
}
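
To make the role of the sort array concrete, here is a minimal sketch of turning one response into the request body for the next page. The helper name next_page_body is invented for this example, and the response is trimmed down to the fields the helper reads.

```python
import copy


def next_page_body(body, response):
    """Build the request body for the page that follows `response`.

    Returns None when `response` has no hits, i.e. nothing is left to fetch.
    """
    hits = response["hits"]["hits"]
    if not hits:
        return None
    new_body = copy.deepcopy(body)               # leave the caller's body untouched
    new_body["search_after"] = hits[-1]["sort"]  # sort values of the last hit
    return new_body


# Trimmed response; the sort values are the ones from the hit above.
response = {
    "hits": {
        "hits": [
            {"sort": ["fffe365d-0e9f-4128-a53f-74fc47490edc", 1604654088000]}
        ]
    }
}
body = {"size": 10000, "query": {"match_all": {}}}
print(next_page_body(body, response)["search_after"])
```

The next request must keep the same sort clause as the first one, because search_after is interpreted against that sort order.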

To fetch the next page, pass the sort values of the last hit as search_after.
The complete Python code is:


from elasticsearch import Elasticsearch

# Assumption: a local cluster without authentication; adjust to your setup.
es = Elasticsearch("http://localhost:9200")

INDEX = ''  # fill in your index name

body = {
    "size": 10000,
    "query": {
        "match_all": {}
    },
    "sort": [
        {"channel_id.keyword": "desc"},
        {"upload_date": "asc"}
    ]
}

seen = set()  # _id of every document fetched so far
data = es.search(index=INDEX, body=body)
while True:
    hits = data['hits']['hits']
    if not hits:
        print('done')
        break
    for hit in hits:
        seen.add(hit['_id'])  # a set ignores IDs that were already collected
    print(len(seen))
    # Resume after the sort values of the last hit on this page.
    body["search_after"] = hits[-1]['sort']
    data = es.search(index=INDEX, body=body)
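
The loop above can also be wrapped in a generator so that callers just iterate over hits. This is only a sketch: iter_hits and fake_search are names invented here, and fake_search is an in-memory stand-in so the example runs without a cluster. One caveat: search_after resumes strictly after the given sort values, so if several documents share exactly the same sort values as the last hit of a page, some of them can be skipped. Including a field that is unique per document in the sort avoids this.

```python
def iter_hits(search, body):
    """Yield every hit, page by page, using search_after.

    `search` is any callable that takes a request body and returns an
    Elasticsearch-style response dict, for example
    lambda b: es.search(index="my-index", body=b)  # index name is hypothetical
    """
    body = dict(body)  # shallow copy so the caller's dict is not modified
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        body["search_after"] = hits[-1]["sort"]


def fake_search(body, _docs=tuple(range(25))):
    """In-memory stand-in for es.search: 25 docs sorted ascending by value."""
    after = body.get("search_after", [-1])[0]
    page = [d for d in _docs if d > after][: body["size"]]
    return {"hits": {"hits": [{"_id": str(d), "sort": [d]} for d in page]}}


ids = [hit["_id"] for hit in iter_hits(fake_search, {"size": 10})]
print(len(ids))
```

With a real client, the first argument would be a small wrapper around es.search, and the body would carry the same query and sort as in the article's code.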


