Is k-means a clustering algorithm or a classification algorithm?

2024-05-18 10:17

1. Is k-means a clustering algorithm or a classification algorithm?

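I. How k-means clustering works
k-means is a clustering algorithm, not a classification algorithm. It takes a parameter k and partitions n input data objects into k clusters so that objects in the same cluster are highly similar, while objects in different clusters are dissimilar. Similarity is measured against each cluster's "center object" (its center of gravity), which is obtained as the mean of the objects in that cluster.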
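k-means is the most classic partition-based clustering method and one of the ten classic data mining algorithms. Its basic idea is to cluster around k points in space, assigning each object to the center nearest to it, and to update the cluster centers iteratively until the result converges (in general to a local optimum, not a guaranteed global one).
Suppose the sample set is to be divided into c classes. The algorithm is as follows:
(1) Choose suitable initial centers for the c classes;
(2) In each iteration, for every sample, compute its distance to each of the c centers and assign the sample to the class whose center is nearest;
(3) Update the center of each class, for example as the mean of its members;
(4) If none of the c cluster centers changes after the update in (2) and (3), stop; otherwise continue iterating.
The algorithm's main strengths are simplicity and speed. Its key design choices are the initial centers and the distance measure.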
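The four steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the parameters `max_iter` and `seed`, the choice of Euclidean distance, and the rule that an empty cluster keeps its previous center are all assumptions made for the example.

```python
import math
import random


def kmeans(points, c, max_iter=100, seed=0):
    """Partition `points` (a list of equal-length tuples) into c clusters."""
    rng = random.Random(seed)
    # Step (1): pick c distinct samples as the initial centers.
    centers = rng.sample(points, c)
    labels = []
    for _ in range(max_iter):
        # Step (2): assign every sample to its nearest center.
        labels = [
            min(range(c), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Step (3): move each center to the mean of its members.
        new_centers = []
        for j in range(c):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[j])
        # Step (4): stop once no center has moved.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centers = kmeans(data, c=2)
```

Because the result depends on the initial centers, different values of `seed` may give different partitions on less clearly separated data.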

2. What is the kmeans algorithm?

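K-means is a distance-based clustering algorithm, also called K-means or K-averages, and often referred to as Lloyd's algorithm. It iteratively assigns each point in the data set to the cluster nearest to it, where distance means the distance from the data point to the cluster center.
The idea behind K-means is simple: given a sample set, divide the samples into K clusters according to the distances between them, keeping the points within each cluster as close together as possible and the clusters as far apart as possible.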

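Algorithm steps
1. Select K objects in the data space as the initial centers; each object represents one cluster center.
2. For each data object in the sample, compute its Euclidean distance to every cluster center and assign it to the class of the nearest (most similar) center.
3. Update the cluster centers: take the mean of all objects in each class as that class's new center, then compute the value of the objective function.
4. Check whether the cluster centers and the objective function value have changed. If not, output the result; if so, return to step 2.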

3. How the Kmeans algorithm works

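Kmeans is an unsupervised, distance-based clustering algorithm; Kmeans++ is one of its variants.
Note that some cluster centers may end up with no samples assigned to them. Such centers are discarded, which means the final number of clusters may be smaller than K.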
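The note on discarded centers can be sketched as an update step that simply skips empty clusters. The function name `update_centers` and the sample data are hypothetical, chosen only for illustration.

```python
def update_centers(points, labels, k):
    """Recompute centers, dropping any cluster that received no samples."""
    centers = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            continue  # empty cluster: its center is discarded
        centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return centers


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
labels = [0, 0, 2]  # nothing was assigned to cluster 1
centers = update_centers(points, labels, k=3)
print(centers)  # [(1.0, 0.0), (10.0, 10.0)]
```

Three clusters were requested, but only two centers remain after the update.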
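Like other machine learning algorithms, K-Means also evaluates and minimizes a clustering cost: the squared distance between each sample and the center of the cluster it is assigned to, accumulated over all samples.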
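The cost function is commonly written in the following form. The symbols here (m samples x^{(i)}, cluster indices c^{(i)}, centers \mu_k) are assumed notation for this sketch and are not taken from the answer itself.

```latex
\begin{align*}
c^{(i)} &\in \{1,\dots,K\}
  && \text{index of the cluster to which sample } x^{(i)} \text{ is assigned} \\
\mu_k &\in \mathbb{R}^{d}
  && \text{center of cluster } k \\
J\bigl(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K\bigr)
  &= \frac{1}{m}\sum_{i=1}^{m}\bigl\lVert x^{(i)}-\mu_{c^{(i)}}\bigr\rVert^{2}
  && \text{average squared distance to the assigned center}
\end{align*}
```

The assignment step lowers J with the centers held fixed, and the mean update lowers J with the assignments held fixed, so J never increases from one iteration to the next.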
5) It is fairly sensitive to noise and outliers.
Kmeans works best on data whose clusters are round, convex, compact, and approximately Gaussian in shape.
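The sensitivity to outliers follows from the use of the mean as the cluster center. A small illustration with made-up numbers (the helper `centroid` is hypothetical):

```python
def centroid(points):
    """Mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
outlier = (101.5, 101.5)

print(centroid(cluster))              # (1.5, 1.5)
print(centroid(cluster + [outlier]))  # (21.5, 21.5)
```

A single distant point drags the center far away from the four points that actually form the cluster.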


4. How to improve the selection of k in the kmeans algorithm

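K-means clustering consists of the following steps:

I. Initialize the cluster centers (using one of the following methods)
1. Based on the specific problem and on experience, pick C suitable samples from the sample set as the initial cluster centers.
2. Use the first C samples as the initial cluster centers.
3. Randomly divide all samples into C classes, compute the sample mean of each class, and use these means as the initial cluster centers.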
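Methods 2 and 3 above can be sketched as follows; method 1 depends on domain knowledge and is not shown. The function names and the round-robin split (used here so that no class starts empty) are assumptions of this example, and the sketch assumes there are at least C samples.

```python
import random


def init_first_c(samples, c):
    """Method 2: use the first c samples as the initial centers."""
    return [tuple(s) for s in samples[:c]]


def init_random_partition(samples, c, seed=0):
    """Method 3: randomly split the samples into c classes, return class means."""
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)
    # Deal the shuffled samples round-robin so that every class is non-empty.
    groups = [[] for _ in range(c)]
    for pos, idx in enumerate(order):
        groups[pos % c].append(samples[idx])
    return [tuple(sum(col) / len(g) for col in zip(*g)) for g in groups]


samples = [(1.0, 2.0), (2.0, 1.0), (9.0, 8.0), (8.0, 9.0), (5.0, 5.0)]
print(init_first_c(samples, 2))  # [(1.0, 2.0), (2.0, 1.0)]
centers = init_random_partition(samples, 2)
```

Method 2 is deterministic but depends on the order of the data; method 3 tends to start all centers near the overall mean of the data.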

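II. Initial clustering (using one of the following methods)
1. Assign every sample to the class of its nearest cluster center.
2. Take one sample, assign it to the class of its nearest cluster center, recompute that class's sample mean, and update the cluster center. Then take the next sample and repeat until every sample has been assigned to a class.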
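Method 2 above, which updates a center after every single assignment, can be sketched with a running mean. The function name `sequential_assign` is hypothetical, and the sketch assumes that every class starts with no members, so the first sample assigned to a class becomes its center.

```python
import math


def sequential_assign(samples, centers):
    """Assign samples one at a time, updating the winning center after each."""
    centers = [list(c) for c in centers]
    counts = [0] * len(centers)  # assumption: every class starts empty
    labels = []
    for x in samples:
        j = min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
        counts[j] += 1
        # Running mean: mu_new = mu_old + (x - mu_old) / n
        centers[j] = [m + (xi - m) / counts[j] for m, xi in zip(centers[j], x)]
        labels.append(j)
    return labels, centers


samples = [(0.0, 0.0), (10.0, 10.0), (2.0, 0.0), (10.0, 12.0)]
labels, centers = sequential_assign(samples, centers=[(1.0, 1.0), (9.0, 9.0)])
print(labels)   # [0, 1, 0, 1]
print(centers)  # [[1.0, 0.0], [10.0, 11.0]]
```

Unlike method 1, the result of this variant depends on the order in which the samples are visited.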

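III. Check whether the clustering is reasonable
Use the sum-of-squared-errors criterion function to judge whether the clustering is reasonable; if it is not, modify the assignment. Repeat the check and the modification until the algorithm's termination condition is met.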
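The sum-of-squared-errors criterion can be computed as in the sketch below. The function name `sse` and the data are illustrative assumptions; the second call shows how a single misassigned sample raises the criterion.

```python
def sse(samples, labels, centers):
    """Sum of squared Euclidean distances from each sample to its class center."""
    return sum(
        sum((xi - ci) ** 2 for xi, ci in zip(x, centers[lab]))
        for x, lab in zip(samples, labels)
    )


samples = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
centers = [(1.0, 0.0), (10.0, 11.0)]
print(sse(samples, [0, 0, 1, 1], centers))  # 4.0
print(sse(samples, [0, 1, 1, 1], centers))  # 188.0
```

Moving the second sample back to class 0 lowers the criterion from 188.0 to 4.0, which is the kind of modification step III calls for.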

5. How to implement the K-MEANS algorithm

Here is a Matlab implementation:

function [cid,nr,centers] = cskmeans(x,k,nc)
% CSKMEANS K-Means clustering - general method.
%
% This implements the more general k-means algorithm, where
% HMEANS is used to find the initial partition and then each
% observation is examined for further improvements in minimizing
% the within-group sum of squares.
%
% [CID,NR,CENTERS] = CSKMEANS(X,K,NC) Performs K-means
% clustering using the data given in X.
%
% INPUTS: X is the n x d matrix of data,
% where each row indicates an observation. K indicates
% the number of desired clusters. NC is a k x d matrix for the
% initial cluster centers. If NC is not specified, then the
% centers will be randomly chosen from the observations.
%
% OUTPUTS: CID provides a set of n indexes indicating cluster
% membership for each point. NR is the number of observations
% in each cluster. CENTERS is a matrix, where each row
% corresponds to a cluster center.
%
% See also CSHMEANS
% W. L. and A. R. Martinez, 9/15/01
% Computational Statistics Toolbox

warning off
[n,d] = size(x);
if nargin < 3
    % Then pick some observations to be the cluster centers.
    ind = ceil(n*rand(1,k));
    % We will add some noise to make it interesting.
    nc = x(ind,:) + randn(k,d);
end
% set up storage
% integer 1,...,k indicating cluster membership
cid = zeros(1,n);
% Make this different to get the loop started.
oldcid = ones(1,n);
% The number in each cluster.
nr = zeros(1,k);
% Set up maximum number of iterations.
maxiter = 100;
iter = 1;
while ~isequal(cid,oldcid) && iter < maxiter
    % Remember the previous assignment so the loop can detect convergence.
    oldcid = cid;
    % Implement the hmeans algorithm
    % For each point, find the distance to all cluster centers
    for i = 1:n
        dist = sum((repmat(x(i,:),k,1)-nc).^2,2);
        [m,ind] = min(dist); % assign it to this cluster center
        cid(i) = ind;
    end
    % Find the new cluster centers
    for i = 1:k
        % find all points in this cluster
        ind = find(cid==i);
        % find the centroid (mean over rows, also for a single-point cluster)
        nc(i,:) = mean(x(ind,:),1);
        % Find the number in each cluster;
        nr(i) = length(ind);
    end
    iter = iter + 1;
end
% Now check each observation to see if the error can be minimized some more.
maxiter = 2;
iter = 1;
move = 1;
while iter < maxiter && move ~= 0
    move = 0;
    % Loop through all points.
    for i = 1:n
        % find the distance to all cluster centers
        dist = sum((repmat(x(i,:),k,1)-nc).^2,2);
        r = cid(i); % This is the cluster id for x
        dadj = nr./(nr+1).*dist'; % All adjusted distances
        [m,ind] = min(dadj); % minimum should be the cluster it belongs to
        if ind ~= r % if not, then move x
            cid(i) = ind;
            ic = find(cid == ind);
            nc(ind,:) = mean(x(ic,:),1);
            % Keep the old cluster's center and both counts in sync.
            ir = find(cid == r);
            nc(r,:) = mean(x(ir,:),1);
            nr(r) = nr(r) - 1;
            nr(ind) = nr(ind) + 1;
            move = 1;
        end
    end
    iter = iter+1;
end
centers = nc;
if move == 0
    disp('No points were moved after the initial clustering procedure.')
else
    disp('Some points were moved after the initial clustering procedure.')
end
warning on


6. Introduction to the K-MEANS algorithm

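The K-MEANS algorithm takes as input the number of clusters k and a database of n data objects, and outputs k clusters that satisfy the minimum-variance criterion.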

7. Advantages of the K-means algorithm

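The main advantages of K-Means clustering are: 1. it is fast and simple; 2. it is efficient and scalable on large data sets; 3. its time complexity is close to linear in the number of objects, which makes it suitable for mining large-scale data sets. The time complexity of K-Means is O(nkt), where n is the number of objects in the data set, t is the number of iterations, and k is the number of clusters.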
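As a worked example of the O(nkt) bound, the number of point-to-center distance computations can be counted directly. The values of n, k and t below are illustrative assumptions, not measurements.

```latex
\begin{align*}
n &= 10^{6}, \qquad k = 10, \qquad t = 20 \\
n \cdot k \cdot t &= 10^{6} \times 10 \times 20 = 2 \times 10^{8}
  \quad \text{distance computations}
\end{align*}
```

Doubling n doubles this count, which is the sense in which the cost is close to linear when k and t are much smaller than n.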


8. Basic overview of the K-MEANS algorithm

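The k-means algorithm takes an input k and partitions n data objects into k clusters so that objects in the same cluster are highly similar, while objects in different clusters are dissimilar. Similarity is measured against each cluster's "center object" (its center of gravity), which is obtained as the mean of the objects in that cluster.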