Optimizing Django QuerySet Queries
People often compare raw SQL against Django's ORM queries (i.e. QuerySets) and conclude that QuerySets are slow. In reality a QuerySet only adds a translation step; actual performance depends on how you write the query, just as different SQL statements for the same task can differ widely in speed.
1. Simple QuerySet queries
Simple QuerySets usually need little optimization. (For QuerySet basics, see the earlier post on QuerySet fundamentals.) A simple QuerySet typically queries a single table; the main thing to watch is the __in lookup, which maps to SQL's IN condition and can perform poorly on some databases. If the IN list is short, you can merge the conditions with OR instead. For example:
qs = Blog.objects.filter(id__in=[1, 5, 20])
This can be rewritten in either of the following two ways:
# Option 1: union the QuerySets directly
qs = Blog.objects.filter(id=1) | Blog.objects.filter(id=5) | Blog.objects.filter(id=20)
# Option 2: combine Q objects
from django.db.models import Q
qs = Blog.objects.filter(Q(id=1) | Q(id=5) | Q(id=20))
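When the list of ids is not fixed in advance, the Q objects can be combined programmatically with functools.reduce and operator.or_. Below is a minimal plain-Python sketch of that combination step so it can run without a database; the rows list and matches_any helper are made up for the demo, and the Django equivalent is shown in the comment:

```python
import operator
from functools import reduce

# Hypothetical stand-in data; with Django installed you would reduce over Q objects:
#     qs = Blog.objects.filter(reduce(operator.or_, (Q(id=i) for i in ids)))
ids = [1, 5, 20]
rows = [{"id": i, "caption": "post %d" % i} for i in range(1, 26)]

def matches_any(row, wanted):
    # OR together one equality test per wanted id, as the combined Q expression would
    return reduce(operator.or_, (row["id"] == i for i in wanted))

matched = [row["id"] for row in rows if matches_any(row, ids)]
print(matched)  # [1, 5, 20]
```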
2. Complex QuerySet queries
Complex queries usually involve foreign keys joining other tables. (For QuerySet queries across foreign keys, see the earlier post on that topic.) Assume two simple models:
# coding:utf-8
from django.db import models
from django.contrib.auth.models import User

# blog post model
class Blog(models.Model):
    # title
    caption = models.CharField(max_length=50)
    # other fields
    author = models.ForeignKey(User)  # on Django >= 2.0, add on_delete=models.CASCADE
    content = models.TextField()
    publish_time = models.DateTimeField(auto_now_add=True)
    update_time = models.DateTimeField(auto_now=True)
    recommend = models.BooleanField(default=False)

# per-click read record
class Recorder(models.Model):
    blog = models.ForeignKey(Blog)  # foreign key to Blog (same on_delete note as above)
    # IP address and time of the click
    ip_address = models.CharField(max_length=15)
    view_time = models.DateTimeField(auto_now=True)
The original schema was a Blog model plus two ContentTypes-based models, one recording the total read count and one recording individual reads. To keep the example focused, the irrelevant parts are stripped out, leaving just a Blog model and a Recorder model of per-click details. Each Recorder row stores the time and IP address at which a visitor opened a post, so we can derive statistics such as a post's clicks on a given day or its reads over the last 7 days.
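As a quick illustration of the statistics these Recorder rows enable, here is a plain-Python sketch of the per-day grouping; the (blog_id, view_time) tuples are made-up stand-ins for Recorder rows, and on Django 1.9+ a filter such as Recorder.objects.filter(blog_id=1, view_time__date=some_day).count() would produce the same number:

```python
import datetime
from collections import Counter

# made-up stand-ins for Recorder rows: (blog_id, view_time)
records = [
    (1, datetime.datetime(2017, 4, 20, 9, 15)),
    (1, datetime.datetime(2017, 4, 20, 18, 30)),
    (2, datetime.datetime(2017, 4, 20, 11, 5)),
    (1, datetime.datetime(2017, 4, 21, 12, 0)),
]

# clicks per (blog_id, day) -- the grouping the statistics above rely on
daily = Counter((blog_id, ts.date()) for blog_id, ts in records)
print(daily[(1, datetime.date(2017, 4, 20))])  # 2
```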
The task: fetch the 15 most-clicked posts of the last 7 days. The Blog table holds about 125 posts, each receiving 10 to 50 clicks a day, and the Recorder table holds several hundred thousand rows. The Blog model's table is blog_blog and the Recorder model's is blog_recorder. First, write the query as raw SQL and time it:
# coding:utf-8
import time, datetime
from .models import Blog, Recorder

# compute the date range
now = datetime.datetime.now()
date_end = datetime.datetime(now.year, now.month, now.day, 0, 0)
date_start = date_end - datetime.timedelta(7)

# record the start time
start = time.time()

# fetch the last 7 days of data
sql = """
select blog_blog.id, blog_blog.caption, count(c.blog_id) as read_num
from blog_blog
left join (
    select * from blog_recorder
    where view_time between '%s' and '%s'
) as c
on blog_blog.id = c.blog_id
group by blog_blog.id, blog_blog.caption
having count(c.blog_id) > 0
order by count(c.blog_id) desc
limit 15
""" % (date_start, date_end)

# run the query
blogs = Blog.objects.raw(sql)

# print the results and the elapsed time
print([blog.read_num for blog in blogs])
print('%.2f' % (time.time() - start))
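One caveat about the raw SQL above: interpolating the dates with % builds the statement by string formatting. Blog.objects.raw() also accepts a params argument that lets the database driver do the quoting, which is the safer habit. Here is a sqlite3 sketch of the same parameterized pattern, runnable without Django; the table and values are made up for the demo, and the Django equivalent is shown in the comment:

```python
import sqlite3

# rough Django equivalent (raw() always uses %s placeholders):
#     Blog.objects.raw("... where view_time between %s and %s ...",
#                      [date_start, date_end])
conn = sqlite3.connect(":memory:")
conn.execute("create table blog_recorder (blog_id integer, view_time text)")
conn.executemany(
    "insert into blog_recorder values (?, ?)",
    [(1, "2017-04-20"), (1, "2017-04-25"), (2, "2017-04-10")],
)
# the driver substitutes the parameters; no manual quoting needed
rows = conn.execute(
    "select blog_id, count(*) from blog_recorder"
    " where view_time between ? and ? group by blog_id",
    ("2017-04-19", "2017-04-26"),
).fetchall()
print(rows)  # [(1, 2)]
```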
This run takes 0.43 seconds. Now the same requirement as a QuerySet:
# coding:utf-8
import time, datetime
from .models import Blog, Recorder
from django.db import models

# compute the date range
now = datetime.datetime.now()
date_end = datetime.datetime(now.year, now.month, now.day, 0, 0)
date_start = date_end - datetime.timedelta(7)

# record the start time
start = time.time()

blogs = Blog.objects
# annotate each post with its read count
blogs = blogs.annotate(read_num=models.Count('recorder'))
# restrict to the last 7 days
blogs = blogs.filter(recorder__view_time__range=[date_start, date_end])
# sort descending and take the top 15
blogs = blogs.order_by('-read_num')[:15]

# print the results and the elapsed time
print([blog.read_num for blog in blogs])
print('%.2f' % (time.time() - start))
This version never finishes in reasonable time. Printing the generated SQL with print(blogs.query) gives:
SELECT
"blog_blog"."id", "blog_blog"."caption", "blog_blog"."author_id",
"blog_blog"."content", "blog_blog"."publish_time",
"blog_blog"."update_time", "blog_blog"."recommend",
COUNT("blog_recorder"."id") AS "read_num"
FROM "blog_blog"
LEFT OUTER JOIN "blog_recorder"
ON ("blog_blog"."id" = "blog_recorder"."blog_id")
INNER JOIN "blog_recorder" T3
ON ("blog_blog"."id" = T3."blog_id")
WHERE T3."view_time" BETWEEN 2017-04-19 00:00:00 AND 2017-04-26 00:00:00
GROUP BY
"blog_blog"."id", "blog_blog"."caption", "blog_blog"."author_id",
"blog_blog"."content", "blog_blog"."publish_time",
"blog_blog"."update_time", "blog_blog"."recommend"
ORDER BY "read_num" DESC
LIMIT 15
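To see why that SQL is so slow, look at the row counts it implies: the LEFT OUTER JOIN and the INNER JOIN onto blog_recorder are independent, so before grouping, every recorder row of a blog is paired with every other recorder row of the same blog, roughly (rows per blog)² combinations. A small plain-Python simulation of that blow-up (the sizes are illustrative, not the article's real data):

```python
# illustrative sizes: one blog with 300 recorder rows
blogs = [1]
recorder = [(rec_id, 1) for rec_id in range(300)]  # (recorder_id, blog_id)

# one join: each recorder row matches its blog once
one_join = [(b, r) for b in blogs for r, rb in recorder if rb == b]

# two independent joins to the same table: the matches multiply
two_joins = [
    (b, r1, r2)
    for b in blogs
    for r1, rb1 in recorder if rb1 == b
    for r2, rb2 in recorder if rb2 == b
]

print(len(one_join), len(two_joins))  # 300 90000
```

With several hundred thousand recorder rows, that quadratic intermediate result is what makes both the QuerySet and the equivalent hand-written SQL hang.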
Running that SQL directly against the database also hangs. The problem is that the query joins the several-hundred-thousand-row recorder table twice: annotate() adds a LEFT OUTER JOIN, and the subsequent filter() adds a second, independent INNER JOIN, multiplying the rows that must be grouped. Swapping the order of filter() and annotate() fixes this:
# coding:utf-8
import time, datetime
from .models import Blog, Recorder
from django.db import models

# compute the date range
now = datetime.datetime.now()
date_end = datetime.datetime(now.year, now.month, now.day, 0, 0)
date_start = date_end - datetime.timedelta(7)

# record the start time
start = time.time()

blogs = Blog.objects
# restrict to the last 7 days first
blogs = blogs.filter(recorder__view_time__range=[date_start, date_end])
# then annotate each post with its read count
blogs = blogs.annotate(read_num=models.Count('recorder'))
# sort descending and take the top 15
blogs = blogs.order_by('-read_num')[:15]

# print the results and the elapsed time
print([blog.read_num for blog in blogs])
print('%.2f' % (time.time() - start))
After the change, the query runs in 1.03 seconds, a dramatic improvement. The generated SQL is now:
SELECT
"blog_blog"."id", "blog_blog"."caption", "blog_blog"."author_id",
"blog_blog"."content", "blog_blog"."publish_time",
"blog_blog"."update_time", "blog_blog"."recommend",
COUNT("blog_recorder"."id") AS "read_num"
FROM "blog_blog"
INNER JOIN "blog_recorder"
ON ("blog_blog"."id" = "blog_recorder"."blog_id")
WHERE "blog_recorder"."view_time" BETWEEN 2017-04-19 00:00:00 AND 2017-04-26 00:00:00
GROUP BY
"blog_blog"."id", "blog_blog"."caption", "blog_blog"."author_id",
"blog_blog"."content", "blog_blog"."publish_time",
"blog_blog"."update_time", "blog_blog"."recommend"
ORDER BY "read_num" DESC
LIMIT 15
The resulting SQL is clearly much simpler. If the front end only needs the id, caption and read_num fields, change the line blogs = Blog.objects to:
blogs = Blog.objects.values('id', 'caption')
No other code changes. This improves performance further, bringing the run down to 0.54 seconds, with the following SQL:
SELECT
"blog_blog"."id", "blog_blog"."caption",
COUNT("blog_recorder"."id") AS "read_num"
FROM "blog_blog"
INNER JOIN "blog_recorder"
ON ("blog_blog"."id" = "blog_recorder"."blog_id")
WHERE "blog_recorder"."view_time" BETWEEN 2017-04-19 00:00:00 AND 2017-04-26 00:00:00
GROUP BY
"blog_blog"."id", "blog_blog"."caption"
ORDER BY "read_num" DESC
LIMIT 15
Here values() plays the role of the SELECT column list; with fewer columns, the GROUP BY also compares fewer columns, so grouping is cheaper. To summarize: when a complex QuerySet needs annotate() or aggregate() for grouped statistics, apply filter() and other restrictions first and aggregate afterwards; and when you do not need every field, use values() to trim the query.
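The overall shape of the final query -- filter to the date window, count per post, take the top N -- can also be mirrored in plain Python, which is handy for checking the ORM's numbers against a small fixture. The sample (blog_id, view_time) rows below are made up:

```python
import datetime
import heapq
from collections import Counter

date_start = datetime.datetime(2017, 4, 19)
date_end = datetime.datetime(2017, 4, 26)

# made-up rows standing in for blog_recorder: (blog_id, view_time)
records = [
    (1, datetime.datetime(2017, 4, 20, 10, 0)),
    (1, datetime.datetime(2017, 4, 22, 10, 0)),
    (2, datetime.datetime(2017, 4, 21, 10, 0)),
    (3, datetime.datetime(2017, 4, 10, 10, 0)),  # outside the window
]

# filter -> group-count -> top N, mirroring WHERE, GROUP BY and ORDER BY ... LIMIT
in_range = (blog_id for blog_id, ts in records if date_start <= ts <= date_end)
read_num = Counter(in_range)
top = heapq.nlargest(2, read_num.items(), key=lambda item: item[1])
print(top)  # [(1, 2), (2, 1)]
```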
Originally published on Yang Shihang's blog.