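Python Quick Study
This article is a quick tour of commonly used Python APIs. It summarizes the material from 雨敲窗's Python quick-learning site; matching tutorial videos are available on Bilibili.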
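Basic Python syntax
Python operators
Arithmetic operators
Logical operators
Membership operators
Identity operators
is => returns True if the two variables refer to the same object
is not => returns True if the two variables do not refer to the same object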
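The four operator families are only named above, so here is a minimal Python 3 sketch with one or two uses of each. The variable names (`numbers`, `a`, `b`, `c`) are illustrative and do not come from the original.

```python
# Arithmetic operators
print(7 + 2, 7 - 2, 7 * 2)    # 9 5 14
print(7 / 2)                  # 3.5 (true division)
print(7 // 2, 7 % 2, 7 ** 2)  # 3 1 49

# Logical operators
print(True and False, True or False, not True)  # False True False

# Membership operators
numbers = [1, 2, 3]
print(2 in numbers, 5 not in numbers)  # True True

# Identity operators
a = [1, 2, 3]
b = a          # b refers to the same list object as a
c = [1, 2, 3]  # c is an equal but separate list object
print(a is b, a is c, a == c, a is not c)  # True False True True
```

Note that `==` compares values while `is` compares object identity, which is why `a == c` is True but `a is c` is False.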
if statement
In Python 3, print is a function and must be called with parentheses.

```python
a = 3
if a == 1:
    print("a == 1")
elif a == 2:
    print("a == 2")
else:
    print("a != 1 and a != 2")
```
for statement
continue skips the rest of the loop body below it and goes straight to the next iteration.

```python
list_a = [1, 2, 'test']
for i in list_a:
    print(i)
    if i == 2:
        print('this is 2')
        continue
    print('~~~')
```

break stops the remaining iterations: it exits the enclosing for loop immediately and execution continues with the statements after it.
while statement

```python
a = 1
while a != 10:
    print(a)
    a += 1

# break only leaves the innermost loop
for a in range(1, 10):
    for b in range(1, 10):
        print(a, b)
        if b == 5:
            break
        print('inner loop')
    print('outer loop')
```
Python list/tuple/dict/set
list
A data structure built into Python
Ordered
Mutable (elements can be added and removed)

```python
>>> game = ["dota", "dota2", "lol"]
>>> game
['dota', 'dota2', 'lol']
>>> len(game)
3
>>> game[0]
'dota'
>>> game[0] = "dota3"
>>> game
['dota3', 'dota2', 'lol']
>>> game.append("wow")
>>> game[3]
'wow'
>>> game.insert(2, "war3")
>>> game
['dota3', 'dota2', 'war3', 'lol', 'wow']
>>> game.pop()
'wow'
>>> game.pop(1)
'dota2'
```
tuple
A data structure built into Python
Ordered
Immutable
All elements are fixed at the moment of assignment

```python
>>> game = ('dota', 'war3', 'lol')
>>> game
('dota', 'war3', 'lol')
>>> len(game)
3
>>> game[0]
'dota'
```
dict
dict is a typical example of trading space for time. It uses more memory, but lookups and insertions are fast, and on average their cost does not grow with the number of elements. list is the opposite trade, time for space: it uses less memory, but looking up a value takes longer and longer as the number of elements grows.

```python
>>> name = {1: "alan", 2: "bob", 3: "lucy"}
>>> name[1]
'alan'
>>> 5 in name
False
>>> name.get(5)
>>> name.get(5, "default")
'default'
>>> name.pop(1)
'alan'
>>> name
{2: 'bob', 3: 'lucy'}
>>> name.keys()
dict_keys([2, 3])
>>> name.values()
dict_values(['bob', 'lucy'])
>>> name.items()
dict_items([(2, 'bob'), (3, 'lucy')])
```
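The space-for-time trade-off described above can be observed with a rough timing sketch. The names (`n`, `as_list`, `as_dict`, `target`) and the sizes are arbitrary choices for illustration, and the absolute numbers depend on the machine, so no expected timings are shown.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_dict = dict.fromkeys(range(n))  # keys 0..n-1, all values None
target = n - 1                     # worst case for the list: the last element

# A list is scanned element by element; a dict hashes the key instead
list_time = timeit.timeit(lambda: target in as_list, number=100)
dict_time = timeit.timeit(lambda: target in as_dict, number=100)
print(f"list: {list_time:.4f}s  dict: {dict_time:.6f}s")
```

On a typical machine the dict lookup should be faster by several orders of magnitude, and the gap widens as `n` grows.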
set
A set can be thought of as a dict without values: it stores only keys. It is mostly used for removing duplicates and for set operations such as intersection and union.

```python
>>> girls_1 = set(['lucy', 'lily'])
>>> girls_2 = set(['lily', 'anna'])
>>> girls_1 & girls_2
{'lily'}
>>> girls_1 | girls_2
{'lily', 'lucy', 'anna'}
>>> girls_1.add('marry')
>>> girls_1
{'marry', 'lily', 'lucy'}
>>> girls_1.remove('lucy')
>>> girls_1
{'marry', 'lily'}
```

A set is unordered, so the order in which its elements are displayed may differ from the above.
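The set section mentions removing duplicates but only demonstrates intersection and union. A small sketch of deduplication follows; the `names` list is made up for illustration.

```python
names = ["lucy", "lily", "lucy", "anna", "lily"]

# A set drops duplicates but does not keep the original order
unique_names = set(names)
print(sorted(unique_names))   # ['anna', 'lily', 'lucy']

# dict.fromkeys keeps the first occurrence of each item, in order
ordered_unique = list(dict.fromkeys(names))
print(ordered_unique)         # ['lucy', 'lily', 'anna']
```

The second form is handy when the order of first appearance matters.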
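Copying vs. referencing, shallow copy vs. deep copy in Python
A copy is separate from the original, so the two do not affect each other. An assignment in Python, however, never copies: it only binds another name to the same object (a reference to its address).
Shallow copy
Plain assignment makes b another name for the same list, so a change made through a is visible through b:

```python
>>> a = [1, 2, 3]
>>> b = a
>>> b
[1, 2, 3]
>>> a[0] = 0
>>> a
[0, 2, 3]
>>> b
[0, 2, 3]
```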
To keep a and b independent, copy the list with a full slice:

```python
>>> a = [1, 2, 3]
>>> b = a[::]
>>> b
[1, 2, 3]
>>> a[0] = 0
>>> a
[0, 2, 3]
>>> b
[1, 2, 3]
```

But this copies only the top level; nested objects are still shared:

```python
>>> a = [0, [1, 2], 3]
>>> b = a[::]
>>> a
[0, [1, 2], 3]
>>> b
[0, [1, 2], 3]
>>> a[0] = 1
>>> a
[1, [1, 2], 3]
>>> b
[0, [1, 2], 3]
>>> a[1][0] = 0
>>> a
[1, [0, 2], 3]
>>> b
[0, [0, 2], 3]
```
Deep copy

```python
>>> import copy
>>> a = [0, [1, 2], 3]
>>> b = copy.deepcopy(a)
>>> a
[0, [1, 2], 3]
>>> b
[0, [1, 2], 3]
>>> a[0] = 1
>>> a
[1, [1, 2], 3]
>>> b
[0, [1, 2], 3]
>>> a[1][0] = 0
>>> a
[1, [0, 2], 3]
>>> b
[0, [1, 2], 3]
>>> b[1][0] = 2
>>> a
[1, [0, 2], 3]
>>> b
[0, [2, 2], 3]
```
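The difference between the two kinds of copy can also be checked directly with `is`, which tests whether two names refer to the same object. This is a small sketch using `copy.copy` from the standard library as the shallow-copy counterpart of `copy.deepcopy`; the names `shallow` and `deep` are illustrative.

```python
import copy

a = [0, [1, 2], 3]
shallow = copy.copy(a)      # same effect as a[::] for a list
deep = copy.deepcopy(a)

print(shallow is a)         # False: the outer list is a new object
print(shallow[1] is a[1])   # True: the inner list is shared
print(deep[1] is a[1])      # False: the inner list was copied as well

a[1].append(99)
print(shallow[1])           # [1, 2, 99]
print(deep[1])              # [1, 2]
```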
Python functions
Definition syntax

```python
def print_hello():
    print("hello")

print_hello()


def print_str(s):
    print(s)
    return s * 2

print_str("fuck")


# A parameter with a default value
def print_default(s="hello"):
    print(s)

print_default()
print_default("default")


# A variable number of positional arguments
def print_args(s, *arg):
    print(s)
    for a in arg:
        print(a)
    return

print_args("hello")
print_args("hello", "world", "1")


# Keyword arguments can be passed in any order
def print_two(a, b):
    print(a, b)

print_two(a="a", b="b")
print_two(b="b", a="a")
```

Output

```
hello
fuck
hello
default
hello
hello
world
1
a b
a b
```
Anonymous functions

```python
add = lambda x, y: x + y
print(add(1, 2))
```
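Anonymous functions are most often passed inline to another function rather than bound to a name. A short sketch with `sorted` and `max`; the `scores` data is made up for illustration.

```python
scores = [("alan", 3), ("bob", 1), ("lucy", 2)]

# Sort by the second field of each tuple
by_score = sorted(scores, key=lambda item: item[1])
print(by_score)   # [('bob', 1), ('lucy', 2), ('alan', 3)]

# Pick the entry with the longest name
longest = max(scores, key=lambda item: len(item[0]))
print(longest)    # ('alan', 3)
```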
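Python classes
Definition

```python
class Human(object):
    pass
```

Class attributes

```python
class Human(object):
    taisheng = True
```

Instance attributes
Instance attributes can be set at any time after the instance has been created.
They are usually set in the constructor, __init__():

```python
class Human(object):
    def __init__(self, name):
        self.name = name

human_a = Human("alan")
```

Instantiating a Python class does not need the new keyword.
Calling the class name like a function, with the constructor arguments, creates an instance.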
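To make the difference between class attributes and instance attributes concrete, here is a small sketch that reuses the Human class from above. The `age` attribute and the second instance `human_b` are hypothetical additions for illustration.

```python
class Human(object):
    taisheng = True           # class attribute, shared by every instance

    def __init__(self, name):
        self.name = name      # instance attribute, one per instance


human_a = Human("alan")
human_b = Human("bob")

human_a.age = 30              # an instance attribute added after creation
print(human_a.taisheng, human_b.taisheng)   # True True
print(hasattr(human_b, "age"))              # False

# Assigning through an instance creates an instance attribute
# that shadows the class attribute for that instance only
human_a.taisheng = False
print(Human.taisheng, human_a.taisheng, human_b.taisheng)  # True False True
```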
Methods

```python
class Human(object):
    def __init__(self, name):
        self.name = name

    def walk(self):
        print(self.name + " is walking")

human_a = Human("alan")
human_a.walk()
```

Output:
alan is walking
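Access control
In the example above, code outside the class can change the name attribute freely. To keep outside code from accessing it directly, prefix the attribute name with two underscores (__name). Python then stores it under a mangled name (_Human__name), so human_a.__name no longer works from outside the class. If read access is still needed, add a get method.

```python
class Human(object):
    def __init__(self, name):
        self.__name = name

    def walk(self):
        print(self.__name + " is walking")

    def get_name(self):
        return self.__name

human_a = Human("alan")
print(human_a.get_name())  # alan
print(human_a.__name)      # raises AttributeError
```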
If the __name field still has to be changed from outside, add a set method as well.

```python
class Human(object):
    def __init__(self, name):
        self.__name = name

    def walk(self):
        print(self.__name + " is walking")

    def get_name(self):
        return self.__name

    def set_name(self, name):
        self.__name = name

human_a = Human("alan")
human_a.set_name("bob")
print(human_a.get_name())  # bob
```
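This may look like needless extra work, but it is not: it makes the code more robust. Exposing an attribute directly can have unexpected consequences, whereas going through a method lets the class stay in control. For example, the set method can limit the length of name.

```python
class Human(object):
    def __init__(self, name):
        self.__name = name

    def walk(self):
        print(self.__name + " is walking")

    def get_name(self):
        return self.__name

    def set_name(self, name):
        # Names longer than 10 characters are silently ignored
        if len(name) <= 10:
            self.__name = name

human_a = Human("alan")
human_a.set_name("bob")
print(human_a.get_name())  # bob
```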
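The original uses explicit get and set methods. As an alternative that the source does not cover, Python's built-in `property` decorator gives the same control while keeping attribute-style access. The sketch below is one possible variant, not the author's method; it uses a single leading underscore (`_name`) by convention and applies the same 10-character rule as set_name above.

```python
class Human(object):
    def __init__(self, name):
        self._name = name

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        # Same rule as set_name: ignore names longer than 10 characters
        if len(value) <= 10:
            self._name = value


human_a = Human("alan")
human_a.name = "bob"
print(human_a.name)                       # bob
human_a.name = "a-very-long-name-indeed"  # rejected by the setter
print(human_a.name)                       # bob
```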
Inheritance

```python
class Man(Human):
    def __init__(self, name, has_wife):
        super(Man, self).__init__(name)
        self.__has_wife = has_wife
```

super(Man, self).__init__(name) calls the constructor of the parent class Human, so its body does not have to be copied and pasted again.
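A self-contained sketch of the subclass in use. It assumes a simplified Human with a public `name` attribute and a public `has_wife` attribute so the values can be printed directly; in Python 3, `super().__init__(name)` is an equivalent shorthand for `super(Man, self).__init__(name)`.

```python
class Human(object):
    def __init__(self, name):
        self.name = name

    def walk(self):
        print(self.name + " is walking")


class Man(Human):
    def __init__(self, name, has_wife):
        super().__init__(name)      # runs Human.__init__, which sets self.name
        self.has_wife = has_wife


man = Man("alan", False)
man.walk()                          # alan is walking (method inherited from Human)
print(isinstance(man, Human))       # True
print(man.has_wife)                 # False
```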
Python modules and packages
A real project rarely has a single flat directory, and nobody wants to append every path to sys.path one by one. The usual approach is a package: a directory together with its subdirectories forms a package (which can be thought of as a library).

```
/Users/hushiking/hu/course
├── c.py
├── fn.py
├── for.py
├── hello
│   ├── a.py
│   ├── hello.py
│   ├── hello_b
│   │   └── b.py
│   └── run.py
├── if.py
├── run.py
├── start.py
├── sum.py
└── while.py
```

This is a course folder. It contains a hello folder, which in turn contains a subfolder hello_b.

```python
# course/hello/a.py
def hello_a():
    print("hello a")


# course/hello/hello_b/b.py
def hello_b():
    print("hello b")


# The calling script (one of the run.py files)
import sys
sys.path.append('/Users/hushiking/hu')  # the parent directory of course
from course.hello import a
from course.hello.hello_b import b

if __name__ == '__main__':
    a.hello_a()
    b.hello_b()
```
String handling in Python
Searching

```python
>>> s = "abc"
>>> s.find("b")
1
>>> s.find("bc")
1
>>> s.find("xx")
-1
```

Slicing

```python
>>> s = "1234567"
>>> s[2:5]
'345'
>>> s[:5]
'12345'
>>> s[3:]
'4567'
>>> s[3:-1]
'456'
```

Joining

```python
>>> s = ['a', 'b', 'c']
>>> ",".join(s)
'a,b,c'
```

Reversing

```python
>>> s = "abc"
>>> s[::-1]
'cba'
```
Reading and writing files in Python
Reading a file
First create a test file from the shell:

```shell
echo 'hello world' > test.txt
```

```python
f = open("test.txt")
content = f.read()
f.close()
print(content)
```
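read() is usually avoided for large files: if the file is larger than the available memory, the program will crash. A common alternative is readlines() with an argument. The argument is a size hint: roughly how many bytes (characters) worth of lines to read per call, not the size of the file.

```python
f = open("test.txt")
while True:
    lines = f.readlines(10000)
    print(lines)
    if not lines:   # an empty list means the end of the file
        break
    for line in lines:
        print(line.strip())
f.close()
```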
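Another common way to read a large file without loading it all at once is to iterate over the file object itself, which yields one line at a time. The sketch below also uses a `with` block so the file is closed automatically. It writes its own throwaway file (the path and contents are made up) so that it runs on its own.

```python
import os
import tempfile

# Create a small temporary file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("hello world\nsecond line\n")

stripped = []
with open(path, encoding="utf-8") as f:  # closed automatically on exit
    for line in f:                       # reads one line at a time
        stripped.append(line.strip())

print(stripped)  # ['hello world', 'second line']
```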
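Writing a file

```python
# "w" creates the file, or truncates it if it already exists
f = open("test.txt", "w")
f.writelines(["hhhhhh", "lllll"])  # writelines does not add newlines
f.close()

# "a" appends to the end of the file
f = open("test.txt", "a")
f.writelines(["oooooo", "kkkkk"])
f.close()
```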
Python hands-on exercises
Print the sum of the numbers from 1 to 100

```python
total = 0
for i in range(1, 101):
    total += i
print(total)
```

Print the prime numbers up to 100

```python
def judge(x):
    # x is prime if no number in [2, x) divides it
    for i in range(2, x):
        if x % i == 0:
            return False
    return True

for i in range(2, 101):
    if judge(i):
        print(i)
```

Count how many times each English word appears in a file

```python
f = open("test.txt")
lines = f.readlines()
f.close()

count = {}
for line in lines:
    # split() with no argument splits on any whitespace and skips empty tokens
    tokens = line.strip().split()
    for token in tokens:
        if token not in count:
            count[token] = 0
        count[token] += 1

for word in count:
    print(word, count[word])
```
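For the word-count exercise, the standard library's `collections.Counter` does the same bookkeeping as the hand-written dict. A brief sketch, using an in-memory string (made up for illustration) instead of a file:

```python
from collections import Counter

text = "the cat and the dog and the bird"
count = Counter(text.split())

print(count["the"])          # 3
print(count["fish"])         # 0 (missing keys count as zero)
print(count.most_common(2))  # [('the', 3), ('and', 2)]
```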
Queues and stacks in Python
Queue
First in, first out
Data can only be inserted at the tail of the queue
Data can only be taken from the head of the queue

```python
class Queue(object):
    def __init__(self):
        self.data_list = []

    def init_queue(self):
        self.data_list = []

    def insert(self, data):
        self.data_list.append(data)

    def pop(self):
        if len(self.data_list) == 0:
            return None
        data = self.data_list[0]
        del self.data_list[0]
        return data

    def size(self):
        return len(self.data_list)


queue = Queue()
print(queue.size())
queue.insert(1)
queue.insert(2)
queue.insert(3)
print(queue.data_list)
head = queue.pop()
print(head)
head = queue.pop()
print(head)
head = queue.pop()
print(head)
head = queue.pop()
print(head)
```

The output, line by line, is: 0, [1, 2, 3], 1, 2, 3, None.
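The list-based queue above removes the head with `del self.data_list[0]`, which shifts every remaining element and so takes time proportional to the queue length. As an alternative (not part of the original), the standard library's `collections.deque` supports constant-time removal from the left end:

```python
from collections import deque

queue = deque()
queue.append(1)          # insert at the tail
queue.append(2)
queue.append(3)

print(queue.popleft())   # 1 (taken from the head)
print(queue.popleft())   # 2
print(len(queue))        # 1
```

Unlike the pop method of the Queue class above, `popleft()` raises IndexError on an empty deque instead of returning None.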
Stack
Last in, first out
Data can only be inserted at the tail
Data can only be taken from the tail

```python
class Stack(object):
    def __init__(self):
        self.data_stack = []

    def init_stack(self):
        self.data_stack = []

    def insert(self, data):
        self.data_stack.append(data)

    def pop(self):
        if len(self.data_stack) == 0:
            return None
        data = self.data_stack[-1]
        del self.data_stack[-1]
        return data

    def size(self):
        return len(self.data_stack)


stack = Stack()
stack.insert(1)
stack.insert(2)
stack.insert(3)
print(stack.data_stack)
tail = stack.pop()
print(tail)
tail = stack.pop()
print(tail)
tail = stack.pop()
print(tail)
tail = stack.pop()
print(tail)
```

The output, line by line, is: [1, 2, 3], 3, 2, 1, None.
Python data structures: trees
Tree
Binary tree
Traversal
Pre-order traversal (root, left, right)
For the example tree (the one constructed in the code below, with node 1 as the root), the pre-order sequence is [1, 2, 5, 6, 3, 7, 8, 9]
In-order traversal (left, root, right)
For the same tree, the in-order sequence is [5, 2, 6, 1, 8, 7, 9, 3]
Post-order traversal (left, right, root)
For the same tree, the post-order sequence is [5, 6, 2, 8, 9, 7, 3, 1]

```python
class Node(object):
    def __init__(self, index):
        self.index = index
        self.left_child = None
        self.right_child = None


class BinaryTree(object):
    def __init__(self, root):
        self.root = root

    def pre_travel(self, node):
        # Visit the node first, then its left and right subtrees
        if not node:
            return
        print(node.index)
        self.pre_travel(node.left_child)
        self.pre_travel(node.right_child)


node_dict = {}
for i in range(1, 10):
    node_dict[i] = Node(i)

node_dict[1].left_child = node_dict[2]
node_dict[1].right_child = node_dict[3]
node_dict[2].left_child = node_dict[5]
node_dict[2].right_child = node_dict[6]
node_dict[3].left_child = node_dict[7]
node_dict[7].left_child = node_dict[8]
node_dict[7].right_child = node_dict[9]

tree = BinaryTree(node_dict[1])
tree.pre_travel(tree.root)
```

The output is 1, 2, 5, 6, 3, 7, 8, 9, one value per line.
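The code above implements only the pre-order traversal. The other two orders differ only in where the node itself is visited relative to its subtrees. The sketch below uses standalone functions `in_order` and `post_order` (hypothetical names, not from the original) that collect values into a list, and it rebuilds the same tree shape.

```python
class Node(object):
    def __init__(self, index):
        self.index = index
        self.left_child = None
        self.right_child = None


def in_order(node, out):
    """Left subtree, then the node, then the right subtree."""
    if node is None:
        return out
    in_order(node.left_child, out)
    out.append(node.index)
    in_order(node.right_child, out)
    return out


def post_order(node, out):
    """Left subtree, then the right subtree, then the node."""
    if node is None:
        return out
    post_order(node.left_child, out)
    post_order(node.right_child, out)
    out.append(node.index)
    return out


# Same shape as the example tree (node 4 is created but not linked)
nodes = {i: Node(i) for i in range(1, 10)}
nodes[1].left_child, nodes[1].right_child = nodes[2], nodes[3]
nodes[2].left_child, nodes[2].right_child = nodes[5], nodes[6]
nodes[3].left_child = nodes[7]
nodes[7].left_child, nodes[7].right_child = nodes[8], nodes[9]

print(in_order(nodes[1], []))    # [5, 2, 6, 1, 8, 7, 9, 3]
print(post_order(nodes[1], []))  # [5, 6, 2, 8, 9, 7, 3, 1]
```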
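Python data structures: heap
A heap is a binary tree.
Leaf nodes appear only in the bottom two levels.
From the root down to the second-to-last level, every level is completely filled.
A node cannot have only a right child.
Both children of a node are no larger than the node (max-heap), or no smaller than it (min-heap).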
Heap implementation in Python

```python
class Heap(object):
    """A max-heap stored in a list; index 0 is unused so the root is at index 1."""

    def __init__(self):
        self.data_list = [None]

    def size(self):
        return len(self.data_list) - 1

    def left_child(self, root):
        return root * 2

    def right_child(self, root):
        return root * 2 + 1

    def father(self, node):
        return node // 2  # integer division, indices must be ints

    def heapify(self, root):
        # Sift the value at root down until both children are no larger
        if root > self.size():
            return
        left_node = self.left_child(root)
        right_node = self.right_child(root)
        largest = root
        if left_node <= self.size():
            if self.data_list[left_node] > self.data_list[largest]:
                largest = left_node
        if right_node <= self.size():
            if self.data_list[right_node] > self.data_list[largest]:
                largest = right_node
        if largest != root:
            self.data_list[root], self.data_list[largest] = self.data_list[largest], self.data_list[root]
            self.heapify(largest)

    def build_heap(self):
        for i in range(self.size() // 2, 0, -1):
            self.heapify(i)

    def get_max(self):
        if self.size() == 0:
            return None
        ret = self.data_list[1]
        self.data_list[1] = self.data_list[-1]
        del self.data_list[-1]
        self.heapify(1)
        return ret

    def insert(self, data):
        # Append at the end, then sift up while the parent is smaller
        self.data_list.append(data)
        now_index = self.size()
        pre = self.father(now_index)
        # Check the index first: data_list[0] is None and cannot be compared
        while now_index != 1 and self.data_list[pre] < data:
            self.data_list[pre], self.data_list[now_index] = self.data_list[now_index], self.data_list[pre]
            now_index = pre
            pre = self.father(now_index)


heap = Heap()
heap.insert(9)
heap.insert(10)
heap.insert(7)
heap.insert(12)
heap.insert(3)
heap.insert(4)
print(heap.get_max())
print(heap.get_max())
print(heap.get_max())
print(heap.get_max())
print(heap.get_max())
print(heap.get_max())
print(heap.get_max())

heap.insert(10)
heap.insert(9)
heap.insert(8)
heap.insert(7)
heap.insert(7)
heap.insert(12)
print(heap.get_max())

heap.data_list = [None, 1, 2, 3, 4, 5, 6, 7]
heap.build_heap()
print(heap.get_max())
```

The output, line by line, is: 12, 10, 9, 7, 4, 3, None, 12, 7.
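For everyday use, the standard library already ships a heap in the `heapq` module. Note that it is a min-heap operating on a plain list, unlike the max-heap class above. The sketch below shows the basic calls and one common way to emulate a max-heap by negating values; the sample numbers are the ones inserted in the example above.

```python
import heapq

data = [9, 10, 7, 12, 3, 4]
heapq.heapify(data)              # rearranges the list in place into a min-heap
print(heapq.heappop(data))       # 3 (the smallest element)
heapq.heappush(data, 1)
print(heapq.heappop(data))       # 1

# Emulate a max-heap by storing negated values
max_heap = [-x for x in [9, 10, 7, 12, 3, 4]]
heapq.heapify(max_heap)
largest = -heapq.heappop(max_heap)
print(largest)                   # 12

# nlargest returns the n biggest items without modifying the input
print(heapq.nlargest(2, [9, 10, 7, 12, 3, 4]))  # [12, 10]
```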
Python data structures: binary search
The list to be searched must be sorted in ascending order,
for example search_list = [4, 6, 7, 8, 9]

```python
def binary_search(search_list, target):
    left = 0
    right = len(search_list) - 1
    while left <= right:
        mid = (left + right) // 2
        if search_list[mid] < target:
            left = mid + 1   # target can only be in the right half
            continue
        if search_list[mid] == target:
            return mid
        if search_list[mid] > target:
            right = mid - 1  # target can only be in the left half
    return None


search_list = [4, 6, 7, 8, 9]
print(binary_search(search_list, 1))
print(binary_search(search_list, 6))
print(binary_search(search_list, 7))
```

The output, line by line, is: None, 1, 2.
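The same lookup can be written on top of the standard library's `bisect` module, which performs the binary search itself. This is an alternative sketch, not the original author's code: `bisect_left` returns the position where the target would be inserted, so the result still has to be checked against the target.

```python
import bisect


def binary_search(search_list, target):
    # Index of the first element that is >= target
    i = bisect.bisect_left(search_list, target)
    if i < len(search_list) and search_list[i] == target:
        return i
    return None


search_list = [4, 6, 7, 8, 9]
print(binary_search(search_list, 1))  # None
print(binary_search(search_list, 6))  # 1
print(binary_search(search_list, 7))  # 2
```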
Python data structures: common sorting code
Insertion sort

```python
def insert_sort(origin_list):
    sorted_list = []
    for i in range(0, len(origin_list)):
        if len(sorted_list) == 0:
            sorted_list.append(origin_list[i])
            continue
        # Scan the sorted part from the back to find the insertion point
        for j in range(len(sorted_list) - 1, -1, -1):
            if sorted_list[j] <= origin_list[i]:
                sorted_list.insert(j + 1, origin_list[i])
                break
            if j == 0:
                # Smaller than every sorted element: goes to the front
                sorted_list.insert(0, origin_list[i])
    return sorted_list


origin_list = [5, 3, 1, 7, 9, 8]
sort = insert_sort(origin_list)
print(sort)  # [1, 3, 5, 7, 8, 9]
```
Regular expressions in Python
A few small examples

```python
import re

m = re.findall("abc", "aaaaabcccabcc")
print(m)
m = re.findall(r"\d", "abc1ab2c")
print(m)
m = re.findall(r"\d\d\d\d", "123abc1234abc")
print(m)
m = re.findall(r"<div>(.*)</div>", "<div>hello</div>")
print(m)
# .* is greedy and matches as much as possible
m = re.findall(r"<div>(.*)</div>", "<div>hello</div><div>world</div>")
print(m)
# .*? is non-greedy and matches as little as possible
m = re.findall(r"<div>(.*?)</div>", "<div>hello</div><div>world</div>")
print(m)
```

Output:

```
['abc', 'abc']
['1', '2']
['1234']
['hello']
['hello</div><div>world']
['hello', 'world']
```
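Regular expression syntax

```python
import re

# . matches any character except a newline
m = re.findall(".", "aa\nabbcc")
print(m)  # ['a', 'a', 'a', 'b', 'b', 'c', 'c']

# \. matches a literal dot
m = re.findall(r"\.", "a.c")
print(m)  # ['.']

# [...] matches any one character from the set
m = re.findall('a[bcd]e', 'abeaceade')
print(m)  # ['abe', 'ace', 'ade']

# \d matches a digit, \D matches a non-digit
m = re.findall(r"\d", "abc1ab2c")
print(m)  # ['1', '2']
m = re.findall(r"\D", "abc1ab2c")
print(m)  # ['a', 'b', 'c', 'a', 'b', 'c']

# ^ matches at the start of the string, $ at the end
m = re.findall("^abc", "abcabc")
print(m)  # ['abc']
m = re.findall("abc$", "abcabc")
print(m)  # ['abc']

# re.S lets . match newlines as well
s = "<div>hello\nworld</div>"
m = re.findall(r"<div>(.*)</div>", s)
print(m)  # []
m = re.findall(r"<div>(.*)</div>", s, re.S)
print(m)  # ['hello\nworld']

# re.M lets ^ match at the start of every line
m = re.findall("^abc", "abc\nabc")
print(m)  # ['abc']
m = re.findall("^abc", "abc\nabc", re.M)
print(m)  # ['abc', 'abc']

# ? is zero or one, + is one or more, * is zero or more
s = "aabbbbabb"
m = re.findall("ab?", s)
print(m)  # ['a', 'ab', 'ab']
m = re.findall("ab+", s)
print(m)  # ['abbbb', 'abb']
m = re.findall("ab*", s)
print(m)  # ['a', 'abbbb', 'abb']

# \w matches a word character (letter, digit or underscore)
m = re.findall(r"\w+@\w+\.org", "7636874@qq.com;763687@qq.org")
print(m)  # ['763687@qq.org']
```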
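If the same regular expression is used many times, it is best to compile it once and reuse the compiled pattern, so that it does not have to be processed again on every call.

```python
import re

p = re.compile("^abc")
m = p.findall("abc\nabc")
print(m)  # ['abc']
m = p.findall("abcdef\nfdsfabc")
print(m)  # ['abc']
m = p.findall("dabcdef\nefdsfabc")
print(m)  # []
```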
Python web crawlers
Crawling jokes from 糗事百科 (Qiushibaike)

```python
import re
import requests
import html
import time


def crawl_joke_list(page=1):
    url = "http://www.qiushibaike.com/8hr/page/" + str(page)
    res = requests.get(url)
    # Each joke is one "article block" div that ends after its content div
    pattern = re.compile("<div class=\"article block untagged mb15.*?<div class=\"content\">.*?</div>", re.S)
    body = html.unescape(res.text).replace("<br/>", "\n")
    m = pattern.findall(body)
    user_pattern = re.compile("<div class=\"author clearfix\">.*?<h2>(.*?)</h2>", re.S)
    content_pattern = re.compile("<div class=\"content\">(.*?)</div>", re.S)
    for joke in m:
        user = user_pattern.findall(joke)
        output = []
        if len(user) > 0:
            output.append(user[0])
        content = content_pattern.findall(joke)
        if len(content) > 0:
            output.append(content[0].replace("\n", ""))
        print("\t".join(output))
    # Pause between pages to avoid hammering the server
    time.sleep(1)


if __name__ == '__main__':
    for i in range(1, 10):
        crawl_joke_list(i)
```
Crawling images from 糗事百科 (Qiushibaike)

```python
import os
import re
import requests


def crawl_image(image_url, image_local_path):
    r = requests.get(image_url, stream=True)
    with open(image_local_path, "wb") as f:
        f.write(r.content)


def crawl(page):
    url = "http://www.qiushibaike.com/imgrank/page/" + str(page)
    res = requests.get(url)
    content_list = re.findall("<div class=\"thumb\">(.*?)</div>", res.content.decode("utf-8"), re.S)
    # Make sure the target directory exists before saving into it
    os.makedirs("./images", exist_ok=True)
    for content in content_list:
        image_list = re.findall("<img src=\"(.*?)\"", content)
        for image_url in image_list:
            # Use the last path segment of the URL as the local file name
            crawl_image(image_url, "./images/" + image_url.strip().split('/')[-1])


if __name__ == '__main__':
    crawl(1)
```
Crawling 糗事百科 (Qiushibaike) jokes with BeautifulSoup

```python
import time
import requests
from bs4 import BeautifulSoup


def crawl_joke_list_use_bs4(page=1):
    url = "http://www.qiushibaike.com/8hr/page/" + str(page)
    res = requests.get(url)
    # The html5lib parser must be installed separately
    soup = BeautifulSoup(res.text, "html5lib")
    joke_list = soup.find_all("div", class_="article block untagged mb15")
    for child in joke_list:
        print(child.find("h2").string + "\t" + "".join(child.find("div", class_="content").stripped_strings))
    time.sleep(1)


if __name__ == '__main__':
    crawl_joke_list_use_bs4(1)
```
Crawling a dynamic site with Python: Baidu food search

```python
import json
import re
import requests


def crawl(page):
    pn = page * 8
    url = "https://sp0.baidu.com/8aQDcjqpAAV3otqbppnN2DJv/api.php?resource_id=6875&from_mid=1&&format=json&ie=utf-8&oe=utf-8&query=%E7%BE%8E%E9%A3%9F&sort_key=&sort_type=1&stat0=&stat1=&stat2=&stat3=&pn=" + str(pn) + "&rn=8&cb=jQuery110200319478991186668_1472651805605&_=1472651805613"
    res = requests.get(url)
    # The response is JSONP: callback({...}); keep only the JSON object
    json_str_re = re.compile("{.*}")
    json_str = json_str_re.search(res.text).group()
    food_dict = json.loads(json_str)
    for food in food_dict["data"][0]["disp_data"]:
        print(food["ename"])


if __name__ == '__main__':
    crawl(1)
```
The crawler's heavy weapon: PhantomJS + Selenium

```python
import urllib.parse

from bs4 import BeautifulSoup
from selenium import webdriver

# Note: webdriver.PhantomJS exists only in older Selenium releases;
# it was removed in Selenium 4.
driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')


def search(keyword):
    url_keyword = urllib.parse.quote(keyword)
    url = "http://www.tianyancha.com/search/" + url_keyword + "?checkFrom=searchBox"
    print(url)
    driver.get(url)
    # page_source is the HTML after the page's JavaScript has run
    bsObj = BeautifulSoup(driver.page_source, "html5lib")
    print(bsObj)
    company_list = bsObj.find_all("span", attrs={"ng-bind-html": "node.name | trustHtml"})
    for company in company_list:
        print(company.get_text())


if __name__ == '__main__':
    search("阿里巴巴 马云")
```
MongoDB, a perfect match for crawlers

```python
import unittest

import pymongo


class MongoAPI(object):
    def __init__(self, db_ip, db_port, db_name, table_name):
        self.db_ip = db_ip
        self.db_port = db_port
        self.db_name = db_name
        self.table_name = table_name
        self.conn = pymongo.MongoClient(host=self.db_ip, port=self.db_port)
        self.db = self.conn[self.db_name]
        self.table = self.db[self.table_name]

    def get_one(self, query):
        return self.table.find_one(query, projection={"_id": False})

    def get_all(self, query):
        return self.table.find(query)

    def add(self, kv_dict):
        return self.table.insert_one(kv_dict)

    def delete(self, query):
        return self.table.delete_many(query)

    def check_exist(self, query):
        return self.get_one(query) is not None

    def update(self, query, kv_dict):
        ret = self.table.update_many(
            query,
            {
                "$set": kv_dict,
            }
        )
        if ret.matched_count == 0:
            # Nothing matched: insert the data as a new document
            self.add(kv_dict)
        elif ret.matched_count > 1:
            # Several documents matched: replace them with a single one
            self.delete(query)
            self.add(kv_dict)


class DBAPITest(unittest.TestCase):
    def setUp(self):
        self.db_api = MongoAPI("127.0.0.1", 27017, "test", "test_table")

    def test(self):
        db_api = self.db_api
        db_api.add({"url": "test_url", "k": "v"})
        self.assertEqual(db_api.get_one({"url": "test_url"})["k"], "v")
        db_api.update({"url": "test_url"}, {"url_update": "url_update"})
        ob = db_api.get_one({"url": "test_url"})
        self.assertEqual(ob["url_update"], "url_update")
        db_api.delete({"url": "test_url"})
        self.assertEqual(db_api.get_one({"url": "test_url"}), None)


if __name__ == '__main__':
    unittest.main()
```