Elasticsearch + Python: Querying More Than 10,000 Records [Solution]
Background
I've recently been working on data collection and analysis. The collected data lives in Elasticsearch and has grown past 100,000 records. When I tried to pull all of it out of ES for analysis, I hit a problem: paging with from/size worked fine for records 0–10,000, but querying records 10,000–20,000 raised an error. The query looked like this:
```json
GET index/_search
{
  "from": 10000,
  "size": 10000,
  "query": {
    "match_all": {}
  }
}
```
The error was:
```json
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [20000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : "new_channel",
        "node" : "dLHMyyNfQVuY-RSE1tPguQ",
        "reason" : {
          "type" : "illegal_argument_exception",
          "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [20000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
        }
      }
    ],
    "caused_by" : {
      "type" : "illegal_argument_exception",
      "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [20000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.",
      "caused_by" : {
        "type" : "illegal_argument_exception",
        "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [20000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
      }
    }
  },
  "status" : 400
}
```
How to fix it
As the error message explains, ES has a protection mechanism: a plain search can only reach the first 10,000 results. To get more, you either have to paginate a different way or change the index settings.
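The guard is simple arithmetic: a request is rejected whenever from + size exceeds index.max_result_window (10,000 by default). A tiny sketch of the check (the function name is mine, not an ES API):

```python
def exceeds_result_window(from_, size, max_result_window=10_000):
    # Elasticsearch rejects a search when from + size > index.max_result_window
    return from_ + size > max_result_window

# the failing request above: from=10000, size=10000 -> 20000 > 10000
```

That is why records 0–10,000 worked (0 + 10000 is exactly the limit) while 10,000–20,000 did not.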
Here are two solutions.
1. Change the setting on the index directly. This is quick and convenient, but it requires the right permissions, and raising the limit can degrade ES performance.
The request looks like this:
```json
PUT index/_settings
{
  "index": {
    "max_result_window": 10000000
  }
}
```
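The same update can be sent from Python; a minimal sketch that only builds the raw request, so nothing here needs a live cluster (the function name and "my_index" are my own, not from the original):

```python
import json

def max_result_window_request(index, new_max=10_000_000):
    # Build the PUT <index>/_settings request that raises the 10,000-result cap.
    # With the elasticsearch-py client you would send the same body via
    # es.indices.put_settings(index=index, body=body) -- assuming that client.
    path = f"/{index}/_settings"
    body = json.dumps({"index": {"max_result_window": new_max}})
    return path, body
```

Note that raising max_result_window only moves the limit: deep from/size pages still cost memory on every shard, which is why the search_after approach in solution 2 scales better.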
2. Use ES's search_after.
First, the initial query looks like this:
```json
GET index/_search
{
  "size": 10000,
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_id.keyword": "desc",   // sort by the document's own unique ID
      "upload_date": "asc"     // then by upload time
    }
  ]
}
```
Each hit in the result looks like this:
```json
{
  "_source" : {
    "platform" : "ig",
    "channel_id" : "fffe365d-0e9f-4128-a53f-74fc47490edc",
    "post_id" : "2436337532322256681",
    "main_id" : "28ea9b5c-ceeb-485f-984e-c815b9e957d2",
    "upload_date" : "2020-11-06T09:14:48",
    "post_title" : "",
    "description" : "",
    "categories" : [ ],
    "tag_person" : "",
    "tags" : [ ],
    "view_count" : 0,
    "like_count" : 1634,
    "dislike_count" : 0,
    "average_rating" : 0,
    "shortcode" : "CHPnBPPBRcp",
    "comment_counts" : 13,
    "last_update" : "2021-03-19T03:17:14.820854",
    "follow" : 42804
  },
  "sort" : [
    "fffe365d-0e9f-4128-a53f-74fc47490edc",
    1604654088000
  ]
}
```
Now just take the sort value of the last hit and pass it as search_after in the next query, and you have pagination.
The complete Python code:
```python
from elasticsearch import Elasticsearch

es = Elasticsearch()          # connection details omitted in the original

page = {
    "size": 10000,
    "query": {"match_all": {}},
    "sort": [
        {"channel_id.keyword": "desc", "created_at": "asc"}
    ],
}

seen_ids = []                 # collected document IDs
data = es.search(index='', body=page)

while True:
    for hit in data['hits']['hits']:
        if hit['_id'] in seen_ids:        # skip duplicates
            continue
        seen_ids.append(hit['_id'])

    if data['hits']['hits']:
        after = data['hits']['hits'][-1]['sort']   # last hit's sort value
    else:
        print('done')                              # empty page: all fetched
        break

    print(len(seen_ids))
    page['search_after'] = after
    data = es.search(index='', body=page)
```
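The script above needs a live cluster, but the paging logic itself can be pulled out and exercised against any search callable; a sketch with hypothetical names (not from the original post):

```python
def paginate_search_after(search_fn, query, page_size=10000):
    """Yield every hit by feeding the last hit's sort value back in
    as search_after -- the same loop as the script above, minus the client."""
    body = dict(query, size=page_size)
    while True:
        hits = search_fn(body)['hits']['hits']
        if not hits:          # empty page: we've seen everything
            break
        for hit in hits:
            yield hit
        body = dict(body, search_after=hits[-1]['sort'])
```

With the real client, search_fn would be something like `lambda body: es.search(index='', body=body)`.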