Optimizing Django QuerySet Queries

People often compare raw SQL against Django's ORM queries (QuerySets) and conclude that QuerySets execute slowly. In fact, a QuerySet merely adds a query-construction step; actual performance depends on how you write the QuerySet, just as different SQL statements for the same task can differ widely in speed.

1. Simple QuerySet queries

Simple QuerySets usually need little optimization. For the basics, see: QuerySet query basics. A simple QuerySet normally queries a single table; the main thing to avoid is the __in lookup. It maps to SQL's IN condition, which can be comparatively slow to evaluate. If the IN list holds only a few values, consider combining OR conditions instead. For example:

qs = Blog.objects.filter(id__in=[1, 5, 20])

可修改为如下两种写法:

# Option 1: combine QuerySets directly with |
qs = Blog.objects.filter(id=1) | Blog.objects.filter(id=5) | Blog.objects.filter(id=20)
# Option 2: use Q objects
from django.db.models import Q
qs = Blog.objects.filter(Q(id=1) | Q(id=5) | Q(id=20))
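A quick way to convince yourself the two forms select the same rows is to run both conditions as plain SQL. A minimal sketch using Python's sqlite3 module and a hypothetical blog_blog table (whether OR actually beats IN in practice depends on your database and indexes):

```python
import sqlite3

# hypothetical tiny table standing in for blog_blog
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blog_blog (id INTEGER PRIMARY KEY, caption TEXT)")
conn.executemany("INSERT INTO blog_blog VALUES (?, ?)",
                 [(i, "post %d" % i) for i in range(1, 31)])

# what id__in=[1, 5, 20] generates
via_in = conn.execute(
    "SELECT id FROM blog_blog WHERE id IN (1, 5, 20) ORDER BY id").fetchall()
# what the OR-combined filters generate
via_or = conn.execute(
    "SELECT id FROM blog_blog WHERE id = 1 OR id = 5 OR id = 20 ORDER BY id").fetchall()

print(via_in == via_or, [row[0] for row in via_in])  # True [1, 5, 20]
```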

2. Complex QuerySet queries

Complex queries generally involve foreign keys joining other tables. For QuerySet queries across foreign keys, see: foreign-key QuerySet queries. Assume two simple models:

# coding:utf-8
from django.db import models
from django.contrib.auth.models import User
 
# blog post model
class Blog(models.Model):
    # title
    caption = models.CharField(max_length=50)
    # other fields
    author = models.ForeignKey(User, on_delete=models.CASCADE)
    content = models.TextField()
    publish_time = models.DateTimeField(auto_now_add=True)
    update_time = models.DateTimeField(auto_now=True)
    recommend = models.BooleanField(default=False)
 
# per-view click record
class Recorder(models.Model):
    blog = models.ForeignKey(Blog, on_delete=models.CASCADE)  # FK to Blog
    # IP address and time of the click
    ip_address = models.CharField(max_length=15)
    view_time = models.DateTimeField(auto_now=True)

The original design had one Blog model plus two ContentTypes-based models, one recording total read counts and one recording read details. To keep the example focused, the irrelevant parts are stripped away, leaving just a Blog model and a click-detail model. Each Recorder row records the time and IP address of one visit to a blog post, so we can use it to count clicks on a post for a given day, reads within the last 7 days, and so on.

Now the requirement: fetch the 15 most-clicked blog posts of the last 7 days. The Blog table holds roughly 125 posts, each receiving 10-50 clicks a day, so the Recorder table holds several hundred thousand rows. Blog's database table is named blog_blog and Recorder's is blog_recorder. Let's first write the query as raw SQL and measure how long it takes. Test code:

# coding:utf-8
import time, datetime
from .models import Blog, Recorder
 
# compute the date range
now = datetime.datetime.now()
date_end = datetime.datetime(now.year, now.month, now.day, 0, 0)
date_start = date_end - datetime.timedelta(7)
 
# record the start time
start = time.time()
# fetch the last 7 days of data
# (dates are interpolated directly for brevity; prefer
# Blog.objects.raw(sql, params) with placeholders in real code)
sql = """
    select blog_blog.id, blog_blog.caption, count(c.blog_id) as read_num 
    from blog_blog
    left join (
        select * from blog_recorder 
        where view_time between '%s' and '%s'
    ) as c
    on blog_blog.id = c.blog_id
    group by blog_blog.id, blog_blog.caption
    having count(c.blog_id) > 0
    order by count(c.blog_id) desc
    limit 15
""" % (date_start, date_end)
 
# run the query
blogs = Blog.objects.raw(sql)
 
# print the results and the elapsed time
print([blog.read_num for blog in blogs])
print('%.2f' % (time.time() - start))

That takes 0.43 seconds. Now let's implement the same requirement with a QuerySet:

# coding:utf-8
import time, datetime
from .models import Blog, Recorder
from django.db import models
 
# compute the date range
now = datetime.datetime.now()
date_end = datetime.datetime(now.year, now.month, now.day, 0, 0)
date_start = date_end - datetime.timedelta(7)
 
# record the start time
start = time.time()
blogs = Blog.objects
# annotate each blog with its read count
blogs = blogs.annotate(read_num=models.Count('recorder'))
# restrict to the last 7 days
blogs = blogs.filter(recorder__view_time__range=[date_start, date_end])
 
# sort descending and take the top 15
blogs = blogs.order_by('-read_num')[:15]
 
# print the results and the elapsed time
print([blog.read_num for blog in blogs])
print('%.2f' % (time.time() - start))

This time, the query never seems to finish. Printing the generated SQL with print(blogs.query) gives:

SELECT 
    "blog_blog"."id", "blog_blog"."caption", "blog_blog"."author_id", 
    "blog_blog"."content", "blog_blog"."publish_time", 
    "blog_blog"."update_time", "blog_blog"."recommend", 
    COUNT("blog_recorder"."id") AS "read_num" 
FROM "blog_blog" 
LEFT OUTER JOIN "blog_recorder" 
    ON ("blog_blog"."id" = "blog_recorder"."blog_id") 
INNER JOIN "blog_recorder" T3 
    ON ("blog_blog"."id" = T3."blog_id") 
 
WHERE T3."view_time" BETWEEN 2017-04-19 00:00:00 AND 2017-04-26 00:00:00 
GROUP BY 
    "blog_blog"."id", "blog_blog"."caption", "blog_blog"."author_id", 
    "blog_blog"."content", "blog_blog"."publish_time", 
    "blog_blog"."update_time", "blog_blog"."recommend" 
ORDER BY "read_num" DESC 
LIMIT 15
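Note that this SQL joins blog_recorder twice: once with a LEFT OUTER JOIN for the count and once with an INNER JOIN (T3) for the date filter. A minimal sqlite3 sketch with hypothetical tiny tables shows why that is so expensive: each of a blog's N recorder rows pairs with each of the M rows matched by the second join, so the count (and the intermediate row set) grows to N*M:

```python
import sqlite3

# hypothetical tiny tables standing in for blog_blog and blog_recorder
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blog_blog (id INTEGER PRIMARY KEY, caption TEXT);
    CREATE TABLE blog_recorder (id INTEGER PRIMARY KEY, blog_id INTEGER);
    INSERT INTO blog_blog VALUES (1, 'post');
    INSERT INTO blog_recorder (blog_id) VALUES (1), (1), (1);  -- 3 clicks
""")

# single join: the correct count of 3
single = conn.execute("""
    SELECT COUNT(r.id) FROM blog_blog b
    JOIN blog_recorder r ON b.id = r.blog_id
    GROUP BY b.id
""").fetchone()[0]

# double join, as in the generated SQL: 3 * 3 = 9 intermediate rows
double = conn.execute("""
    SELECT COUNT(r.id) FROM blog_blog b
    JOIN blog_recorder r ON b.id = r.blog_id
    JOIN blog_recorder t3 ON b.id = t3.blog_id
    GROUP BY b.id
""").fetchone()[0]

print(single, double)  # 3 9
```

With hundreds of thousands of recorder rows per week, that multiplication is what makes the query hang.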

Running this SQL directly against the database also hangs. The problem is that the query joins blog_recorder, the table with several hundred thousand rows, twice, producing an enormous intermediate result. Swap the order of filter and annotate:

# coding:utf-8
import time, datetime
from .models import Blog, Recorder
from django.db import models
 
# compute the date range
now = datetime.datetime.now()
date_end = datetime.datetime(now.year, now.month, now.day, 0, 0)
date_start = date_end - datetime.timedelta(7)
 
# record the start time
start = time.time()
blogs = Blog.objects
# restrict to the last 7 days first
blogs = blogs.filter(recorder__view_time__range=[date_start, date_end])
# then annotate each blog with its read count
blogs = blogs.annotate(read_num=models.Count('recorder'))
# sort descending and take the top 15
blogs = blogs.order_by('-read_num')[:15]
 
# print the results and the elapsed time
print([blog.read_num for blog in blogs])
print('%.2f' % (time.time() - start))

After the change, the query finishes in 1.03 seconds, dramatically better. The generated SQL is:

SELECT 
    "blog_blog"."id", "blog_blog"."caption", "blog_blog"."author_id", 
    "blog_blog"."content", "blog_blog"."publish_time", 
    "blog_blog"."update_time", "blog_blog"."recommend", 
    COUNT("blog_recorder"."id") AS "read_num" 
FROM "blog_blog" 
INNER JOIN "blog_recorder" 
    ON ("blog_blog"."id" = "blog_recorder"."blog_id") 
 
WHERE "blog_recorder"."view_time" BETWEEN 2017-04-19 00:00:00 AND 2017-04-26 00:00:00 
GROUP BY 
    "blog_blog"."id", "blog_blog"."caption", "blog_blog"."author_id", 
    "blog_blog"."content", "blog_blog"."publish_time", 
    "blog_blog"."update_time", "blog_blog"."recommend" 
ORDER BY "read_num" DESC 
LIMIT 15

The resulting SQL is clearly much simpler. If the front-end page only needs the id, caption, and read_num fields, you can change the line blogs = Blog.objects to:

blogs = Blog.objects.values('id', 'caption')

No other code needs to change. This improves performance further, to 0.54 seconds, with the following SQL:

SELECT 
    "blog_blog"."id", "blog_blog"."caption", 
    COUNT("blog_recorder"."id") AS "read_num" 
FROM "blog_blog" 
INNER JOIN "blog_recorder" 
    ON ("blog_blog"."id" = "blog_recorder"."blog_id") 
 
WHERE "blog_recorder"."view_time" BETWEEN 2017-04-19 00:00:00 AND 2017-04-26 00:00:00 
GROUP BY 
    "blog_blog"."id", "blog_blog"."caption" 
ORDER BY "read_num" DESC 
LIMIT 15

Here values() determines which columns appear in the SELECT clause. Since the database no longer has to GROUP BY every column of blog_blog, execution gets faster. To summarize: when a complex QuerySet uses annotate or aggregate for grouped statistics, apply filter and other narrowing conditions before the aggregation; and if you don't need every field, use values() to trim the query further.
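The GROUP BY point can be sanity-checked with another minimal sqlite3 sketch (hypothetical tiny tables; since id is the primary key, adding more columns to the GROUP BY cannot change the groups, it only adds comparison work):

```python
import sqlite3

# hypothetical tiny tables standing in for blog_blog and blog_recorder
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blog_blog (id INTEGER PRIMARY KEY, caption TEXT, content TEXT);
    CREATE TABLE blog_recorder (id INTEGER PRIMARY KEY, blog_id INTEGER);
    INSERT INTO blog_blog VALUES (1, 'a', 'x'), (2, 'b', 'y');
    INSERT INTO blog_recorder (blog_id) VALUES (1), (1), (2);
""")

# GROUP BY only the selected columns, as the values() query does
narrow = conn.execute("""
    SELECT b.id, b.caption, COUNT(r.id) FROM blog_blog b
    JOIN blog_recorder r ON b.id = r.blog_id
    GROUP BY b.id, b.caption ORDER BY b.id
""").fetchall()

# GROUP BY every column of blog_blog, as the full-model query does
wide = conn.execute("""
    SELECT b.id, b.caption, COUNT(r.id) FROM blog_blog b
    JOIN blog_recorder r ON b.id = r.blog_id
    GROUP BY b.id, b.caption, b.content ORDER BY b.id
""").fetchall()

print(narrow == wide, narrow)  # True [(1, 'a', 2), (2, 'b', 1)]
```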

Originally published on 杨仕航's blog.