scikit-learn Study Notes 1: Loading a Text Corpus
Source: Internet. Editor: 程序博客网. Time: 2024/04/29 10:23
First, the official documentation:
http://scikit-learn.org/stable/user_guide.html
API reference:
http://scikit-learn.org/stable/modules/classes.html
The documentation for load_files, the function that loads a text corpus:
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html#sklearn.datasets.load_files
Directory layout of the corpus:
container_folder/
    category_1_folder/
        file_1.txt
        file_2.txt
        ...
        file_42.txt
    category_2_folder/
        file_43.txt
        file_44.txt
        ...
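To make the layout concrete, here is a minimal sketch that builds a tiny corpus with this structure in a temporary directory and loads it with load_files. The helper build_toy_corpus and the document texts are made up for illustration; only load_files and its parameters come from scikit-learn.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files


def build_toy_corpus(root):
    """Create a two-category corpus under root (hypothetical file contents)."""
    docs = {
        "category_1_folder": {
            "file_1.txt": "first document",
            "file_2.txt": "second document",
        },
        "category_2_folder": {
            "file_43.txt": "third document",
        },
    }
    for folder, files in docs.items():
        folder_path = Path(root) / folder
        folder_path.mkdir(parents=True)
        for name, text in files.items():
            (folder_path / name).write_text(text, encoding="utf-8")


with tempfile.TemporaryDirectory() as container_folder:
    build_toy_corpus(container_folder)
    # shuffle=False keeps the sorted folder/file order, so labels are easy to read
    corpus = load_files(container_folder, encoding="utf-8", shuffle=False)

target_names = list(corpus.target_names)  # one name per sub-folder
targets = corpus.target.tolist()          # index into target_names, one per file
texts = list(corpus.data)                 # decoded file contents

print(target_names)
print(targets)
print(texts)
```

Each sub-folder name becomes a label name, and every file inside it gets that folder's index as its label.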
Source code analysis:
# Imports this excerpt needs when read on its own
from os import listdir
from os.path import isdir, join

import numpy as np
from sklearn.utils import Bunch, check_random_state


def load_files(container_path, description=None, categories=None,
               load_content=True, shuffle=True, encoding=None,
               decode_error='strict', random_state=0):
    # Label index of each document
    target = []
    # Label names
    target_names = []
    # Paths of the corpus files
    filenames = []

    # Collect the names of all sub-directories; these are the label names
    folders = [f for f in sorted(listdir(container_path))
               if isdir(join(container_path, f))]

    # Restrict loading to the requested categories
    if categories is not None:
        folders = [f for f in folders if f in categories]

    # Fill the target_names, target and filenames lists
    for label, folder in enumerate(folders):
        target_names.append(folder)
        folder_path = join(container_path, folder)
        documents = [join(folder_path, d)
                     for d in sorted(listdir(folder_path))]
        target.extend(len(documents) * [label])
        filenames.extend(documents)

    # Convert to arrays so that fancy indexing can be used below
    filenames = np.array(filenames)
    target = np.array(target)

    # Optionally shuffle the order (filenames and target stay aligned)
    if shuffle:
        random_state = check_random_state(random_state)
        indices = np.arange(filenames.shape[0])
        random_state.shuffle(indices)
        filenames = filenames[indices]
        target = target[indices]

    # Optionally load the file contents. An encoding must be given to get
    # text; without one the contents are kept as raw bytes.
    if load_content:
        data = []
        for filename in filenames:
            with open(filename, 'rb') as f:
                data.append(f.read())
        if encoding is not None:
            data = [d.decode(encoding, decode_error) for d in data]
        return Bunch(data=data,
                     filenames=filenames,
                     target_names=target_names,
                     target=target,
                     DESCR=description)

    # Either way, the result is a Bunch object
    return Bunch(filenames=filenames,
                 target_names=target_names,
                 target=target,
                 DESCR=description)
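The branches in the source above can be checked with a short experiment. The sketch below is only an illustration: the folder names neg, pos and unused and the file contents are invented, and it assumes the installed scikit-learn behaves like the listing above for the encoding, categories, load_content and random_state parameters.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

with tempfile.TemporaryDirectory() as root:
    # Hypothetical corpus: three categories with three small files each
    for folder in ("neg", "pos", "unused"):
        (Path(root) / folder).mkdir()
        for i in range(3):
            path = Path(root) / folder / f"{i}.txt"
            path.write_text(f"{folder} text {i}", encoding="utf-8")

    # No encoding: contents stay as raw bytes
    raw = load_files(root, shuffle=False)
    # With an encoding: contents are decoded to str
    decoded = load_files(root, shuffle=False, encoding="utf-8")
    # categories keeps only the listed sub-folders
    subset = load_files(root, shuffle=False, categories=["neg", "pos"])
    # load_content=False returns paths and labels but no data
    paths_only = load_files(root, load_content=False)
    # Same random_state gives the same shuffled order
    run_a = load_files(root, random_state=0)
    run_b = load_files(root, random_state=0)

print(type(raw.data[0]), type(decoded.data[0]))
print(subset.target_names, len(subset.filenames))
print("data" in paths_only)
print(list(run_a.target) == list(run_b.target))
```

Because the shuffle permutes filenames and target with the same index array, each path keeps its label no matter what order comes out.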