---

layout: post
title: "Java中的equals和=="
category: Reading Notes

tags: ["读文章", “Java”]

{% include JB/setup %}

equals和==

StringBuilder s1 = new StringBuilder();
StringBuilder s2 = new StringBuilder();
String s = "abc";
s1.append(s);
s2.append(s);
System.out.println("s1.equals(s2):\t\t" + (s1.equals(s2)));
System.out.println("s1 == s2:\t\t" + (s1 == s2));

String s3 = s + "1";
String s4 = s + "1";
System.out.println("s3 == s4:\t\t" + (s3 == s4));
System.out.println("s3.equals(s4):\t\t" + (s3.equals(s4)));

Object obj1 = new Object();
Object obj2 = new Object();
System.out.println("obj1.equals(obj2):\t" + (obj1.equals(obj2)));
System.out.println("obj1 == obj2:\t\t" + (obj1 == obj2));

执行结果如下

s1.equals(s2):      false
s1 == s2:       false
s3 == s4:       false
s3.equals(s4):      true
obj1.equals(obj2):  false
obj1 == obj2:       false

Object类的equals方法的实现:

public boolean equals(Object obj) {
    return (this == obj);
}

Object中的equals就是用==来比较当前对象和传入的参数的。

String的equals实现:

public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String)anObject;
        int n = count;
        if (n == anotherString.count) {
        char v1[] = value;
        char v2[] = anotherString.value;
        int i = offset;
        int j = anotherString.offset;
        while (n-- != 0) {
            if (v1[i++] != v2[j++])
            return false;
        }
        return true;
        }
    }
    return false;
}

它去比较内容了。

StringBuilder,在其源码里没有发现equals方法,那么它就继承了Object的实现,用==来比较传进来的参数。

Reference:

http://www.ticmy.com/?p=186

2015/10/25

---

layout: post
title: "一致性hash算法"
category: Reading Notes

tags: ["读文章", "hash"]

{% include JB/setup %}

一致性hash算法

基本场景

比如有N个cache服务器,那么如何将一个对象object映射到N个cache上,你可能会采用下面的通用方法计算object的hash值,然后均匀的映射到N个cache

hash(object) % N

一切都运行正常,再考虑如下的两种情况:

  1. 一个cache服务器M挂掉,这样所有映射到cache M的对象都会失效。怎么办?需要把cache M从cache中移除,这时候cache是N-1台,映射公式变成了

    hash(object) % (N-1)

  2. 由于访问加重,需要添加cache,这时候cache是N+1台,映射公式变成了

    hash(object) % (N+1)

1和2意味着什么?

这意味着突然之间所有的cache都失效了。对于服务器而言,这是一场灾难,洪水般的访问都会直接冲向后台服务器。

再考虑第三个问题,由于硬件能力越来越强,你可能会让后面添加的节点多做点活,显然上面的hash算法也做不到。

hash算法和单调性

hash算法的一个衡量指标是单调性(Monotonicity),定义如下:

单调性是指如果已经有一些内容通过哈希分派到了响应的缓冲中,又有新的缓冲加入到系统中。哈希的结果应能够保证原有已分配的内容可以被映射到新的缓冲中去,而不会被映射到旧的缓冲集合中的其他缓冲区

当缓冲区大小变化时Consistent hashing尽量保护已分配的内容不会被重新映射到新缓冲区。

因此上面提到的 hash(object) % N 难以满足单调性要求

consistent hashing算法的原理

consistent hashing是一种hash算法,简单的说,在移除/添加一个cache时,它能够尽可能小的改变已存在key映射关系,尽可能的满足单调性的要求

环形hash空间

考虑通常的hash算法都是将value映射到一个32位的key值,也就是 0~232-1 次方的数值空间;我们可以将这个空间想象成首(0)尾(232-1)相接的圆环,如下图所示:

一致性hash算法1

把对象映射到hash空间

接下来考虑 4 个对象 object1~object4 ,通过 hash 函数计算出的 hash 值 key 在环上的分布如图 2 所示。

hash(object1) = key1;

… …

hash(object4) = key4;

一致性hash算法2

把cache映射到hash空间

Consistent hashing 的基本思想就是将对象和 cache 都映射到同一个 hash 数值空间中,并且使用相同的 hash 算法

假设当前有 A,B 和 C 共 3 个cache ,那么其映射结果将如图 3 所示,他们在 hash 空间中,以对应的 hash 值排列。

hash(cache A) = key A;

… …

hash(cache C) = key C;

一致性hash算法3

顺便提一下 cache 的 hash 计算,一般的方法可以使用 cache 机器的 IP 地址或者机器名作为 hash 输入

把对象映射到cache

现在cache和对象都已经通过同一个hash算法映射到hash数值空间中了,接下来要考虑如何将对象映射到cache上面

在这个环形空间中,如果沿着顺时针方向从对象的key值出发,直到遇见一个cache,那么就将该对象存储在这个cache上,因为对象和cache的hash值是固定的,因此这个cache必然是唯一和确定的。这样就找到了对象和cache的映射方法了。

依然继续上面的例子(参见图 3 ),那么根据上面的方法,对象 object1 将被存储到 cache A 上; object2 和 object3 对应到 cache C ; object4 对应到 cache B

考察cache的变动

分析consistent hashing算法

移除cache

假设cache B挂掉了,根据上面讲到的映射方法,这时受影响的将仅是那些沿 cache B 逆时针遍历直到下一个 cache ( cache C )之间的对象,也即是本来映射到 cache B 上的那些对象

一致性hash算法4

添加cache

再考虑添加一台新的 cache D 的情况,假设在这个环形 hash 空间中, cache D 被映射在对象 object2 和 object3 之间。这时受影响的将仅是那些沿 cache D 逆时针遍历直到下一个 cache ( cache B )之间的对象(它们是也本来映射到 cache C 上对象的一部分),将这些对象重新映射到 cache D 上即可。

一致性hash算法5

虚拟节点

考量hash算法的另一个指标是平衡性(balance),定义如下:

平衡性:

平衡性是指哈希的结果能够尽可能分布到所有的缓冲中去,这样可以使得所有的缓冲空间都得到利用。

hash算法并不是保证部署的绝对平衡,如果cache较少的话,对象并不能被均匀的映射到cache上,比如在上面的例子中,仅部署cacheA和cacheC的情况下,在 4 个对象中, cache A 仅存储了 object1 ,而 cache C 则存储了 object2 、 object3 和 object4 ;分布是很不均衡的。

为了解决这种情况, consistent hashing 引入了“虚拟节点”的概念,它可以如下定义:

“虚拟节点(virtual node)”是实际节点在hash空间的复制品(replica),一个实际节点对应了若干个“虚拟节点”,这个对应个数也成为“复制个数”,“虚拟节点”在hash空间中以hash值排列

仍以仅部署 cache A 和 cache C 的情况为例,在图 4 中我们已经看到, cache 分布并不均匀。现在我们引入虚拟节点,并设置“复制个数”为 2 ,这就意味着一共会存在 4 个“虚拟节点”, cache A1, cache A2 代表了 cache A ; cache C1, cache C2 代表了 cache C ;

一致性hash算法6

此时,对象到“虚拟节点”的映射关系为:

objec1->cache A2 ; 
objec2->cache A1 ; 
objec3->cache C1 ; 
objec4->cache C2 ;

因此对象 object1 和 object2 都被映射到了 cache A 上,而 object3 和 object4 映射到了 cache C 上;平衡性有了很大提高

引入“虚拟节点”后,映射关系就从 { 对象 -> 节点 } 转换到了 { 对象 -> 虚拟节点 } 。查询物体所在 cache 时的映射关系如图 7 所示。

一致性hash算法7


“虚拟节点”的 hash 计算可以采用对应节点的 IP 地址加数字后缀的方式。例如假设 cache A 的 IP 地址为 202.168.14.241 。

引入“虚拟节点”前,计算 cache A 的 hash 值:

Hash(“202.168.14.241”);

引入“虚拟节点”后,计算“虚拟节”点 cache A1 和 cache A2 的 hash 值:

Hash(“202.168.14.241#1”);  // cache A1

Hash(“202.168.14.241#2”);  // cache A2

The circle is represented as a sorted map of integers, which represent the hash values, to caches (of type T here).

When a ConsistentHash object is created each node is added to the circle map a number of times (controlled by numberOfReplicas).

The location of each replica is chosen by hashing the node's name along with a numerical suffix, and the node is stored at each of these points in the map.

To find a node for an object (the get method), the hash value of the object is used to look in the map. Most of the time there will not be a node stored at this hash value (since the hash value space is typically much larger than the number of nodes, even with replicas), so the next node is found by looking for the first key in the tail map. If the tail map is empty then we wrap around the circle by getting the first key in the circle.

2015/10/25

---

layout: post
title: "个性化推荐的十大挑战"
category:

tags: ["读文章"]

{% include JB/setup %}

挑战一: 数据稀疏性问题

两个用户之间选择的重叠非常少。数据非常稀疏,使得绝大部分基于关联分析的算法(譬如协同过滤)效果都不好。

为了解决这个问题,也有很多方法,譬如可以通过扩散的算法,从原来的一阶关联(两个用户有多少相似打分或者共同购买的商品)到二阶甚至更高阶的关联(假设关联性或者说相似性本身是可以传播的),也可以添加一些缺省的打分。

挑战二:冷启动问题

新用户因为罕有可以利用的行为信息,很难给出精确的推荐。反过来,新商品由于被选择次数很少,也难以找到合适的方法推荐给用户

一种方法是利用文本信息进行辅助推荐,或者通过注册以及询问得到一些用户的属性信息。

新用户更容易选择特别流行的商品,说明使用热销榜也能获得不错的结果。

冷启动问题还可以通过多维数据的交叉推荐部分解决,其精确度和多样性又远胜于热销榜。

挑战三:大数据处理与增量计算问题

设计增量算法,也就是当产生新用户,新商品以及新的连接关系时,算法的结果不需要在整个数据集上重新计算,而只需要考虑所增加节点和连接局部的信息,对原有的结果进行微扰,快速得到新结果。一般而言,这种算法随着加入的信息量的增多,其误差会积累变大,最终每过一段时间还是需要利用全局数据重新进行计算。

挑战四:多样性与精确性的两难困境

应用个性化推荐技术的商家,也希望推荐中有更多的品类出现,从而激发用户新的购物需求。但是推荐多样的商品和新颖的商品与推荐的精确性之间存在矛盾,因为前者风险很大--一个没什么人看过或者打分较低的东西推荐出手,很可能被用户憎恶,或者牺牲精确性来提高多样性。

挑战五:推荐系统的脆弱性问题

一些心怀不轨的用户通过提供一些恶意虚假的行为,故意增加或者压制某些商品被推荐的可能性。

挑战六:用户行为模式的挖掘和利用

例如:新用户和老用户具有很不一样的选择模式。

有些混合算法可以通过一个单参数调节推荐结果的多样性和热门程度,在这种情况下就可以考虑为给不同用户赋予不同参数,甚至允许用户自己移动一个滑钮调节这个参数--当用户想看热门的时候,算法提供热门推荐;当用户想找点很酷的产品时,算法也可以提供冷门推荐。

挑战七:推荐系统效果评估

常见的评估指标可以分成四大类:分别是准确度、多样性、新颖性和覆盖率

挑战八:用户界面与用户体验

即推荐结果的可解释性,对于用户体验有至关重要的影响--用户希望知道这个推荐是怎么来的。

协同过滤有明显的优势,譬如亚马逊基于商品的协同过滤在发送推荐的电子邮件时会告诉用户之所以向其推荐某书,是因为用户之前购买过某些书。

不同类别往往来自于不同的推荐方法,譬如看过还看过(浏览过本商品的客户还浏览过的商品)、买过还买过(购买过本商品的客户还购买过的商品)、看过最终购买(浏览过本商品的客户最终购买的商品)、个性化热销榜(个性化流行品推荐)、猜你喜欢(个性化冷门商品推荐)等等。

挑战九:多维数据的交叉利用

譬如你可能既有新浪微博的帐号,又是人人网的注册用户,还是用手机,那么你已经同时在三个巨大的社会网络中了。与此同时,你可能还经常在淘宝、京东、麦包包、1号店、库巴网……这些地方进行网购,那么你也是一张巨大的用户-商品二部分图中的一员。想象如果能够把这些网络数据整合起来,特别是知道每个节点身份的对应关系(不需要知道你真实身份,只需要知道不同网络中存在的一些节点是同一个人),其中有特别巨大的社会经济价值。

挑战十:社会推荐

用户更喜欢来自朋友的推荐而不是被系统“算出来的推荐”

Reference:

周涛,个性化推荐的十大挑战, http://blog.sciencenet.cn/home.php?mod=space&uid=3075&do=blog&id=554630, viewed on 03/05/2012

2015/10/25

---

layout: post
title: "基础算法(1)"
category:

tags: ["读文章", "算法"]

{% include JB/setup %}

Linear search:                  O(N)
Binary search:                  O(logN)
Insertion in unordered array:   O(1)
Insertion in ordered array:     O(N)
Deletion in unordered array:    O(N)
Deletion in ordered array:      O(N)
  • Bubble Sort:

    • Compare two players
    • If the one on the left is taller, swap them
    • Move one position right

      You continue down the line this way until you reach the right end. You have by no means finishes sorting the kids, but you do know that the tallest kid is on the right.

    • When you reach the first sorted player, start over at the left end of the line.

  • Selection Sort:

    • Making a pass through all the players and picking the shortest one.
    • This shortest player is then swapped with the player on the left end of the line, at postion 0.
    • Now the leftmost player is sorted and wont need to be moved again.
    • The next time you pass down the row of players, you start at position 1, and, finding the minimum, swap with postion 1.
  • Insertion Sort:

  • A stack is used to reverse the letters(字段一个单词一个单词放入stack中) and delimiter matching(开始的(或{放入stack中,如果遇到}或者)就pop出来,出现pop的和输入的不符,则不匹配).

  • Stack: FILO

  • Queue: FIFO

  • Priority Queue: Like an ordinary queue, a priority queue has a front and a rear, and items are removed from the front. However, in a priority queue, items are ordered by key value so that the item with the lowest key is always at the front. Items are inserted in the proper position to maintain the order.

  • The mergesort is fairly easy to implement. The downside of the mergesort is that it requires an additional array in memory, equal in size of the one being sorted. If your original array barely fits in memory, the mergesort wont work. However, if you have enough space, it's a good choice.

2015/10/25

---

layout: post
title: "Database Systems(1)"
category:

tags: ["读书"]

{% include JB/setup %}

  • Database: Collection of data central to some enterprise/ organisation; essential to operation of enterprise; an asset in its own right: historial data can guide enterprise strtegy, of interest to other enterprises; database is persistent; all qualfied users have access to the same data for use in a variety of activities.

  • A database management system is a program that manages a database: stores the database on some mass storage providing fail safety(backup/ recovery); supports a high-level access language(SQL); provdes transaction management to guarantee correct concurrent access to shared data.

  • Disadvantages of File processing:

    1) Program-data dependence: all programs contain full descriptions of each data file they use

    2) Data Redundancy: different systems/ programs have separate copies of the same data; no centralized control of data

    3) Limited data sharing: required data in several, potentially incompatible files.

    4) Excessive progrmam maintenance

  • Advantages of Database:

    1) Program-data independence: metadata stored in DBMS, so applications dont need to worry about data formats; results in reduced application development time, increased maintenance productivity and efficient access.

    2) minimal data redundancy: leads to increased data integrity/ consistency

    3) Improved Data Sharing: different users get different views of the data; efficient concurrent access

    4) Enforcement of standards: all data access is done in the same way

    5) Improved data quality: integrity constraints, data validation rules

    6) Better data accessibility/ responsiveness: use of standard data query language(SQL)

    7) security, backup/recovery, concurrency

  • Relational Databases

    • Stores data as rows with multiple attributes
    • Rows of the same format('type') form a table
    • A relational database is a collection of such tables (which typically are related to each other by key attributes)
  • What is a transaction: When an event in the real world changes the state of the enterprise, a transaction is executed to cause the corresponding change in the database state.

  • DB system requirements:
    1) High Availability: online -> must be operational while enterprise is functioning
    2) High Reliability: correctly tracks state, does not loose data, controlled concurrency
    3) High Throughput: many users -> many transactions per second
    4) Low Response Time: on line -> users are waiting
    5) Long lifetime: complex systems are not easily replaced
    6) Security: sensitive information must be carefully protected since system is accessible to many users.

  • Data model: a collection of concepts for describing data

  • Schema: a description of a particular collection of data at some abstraction level, using a given data model.

  • Relational data model:

    relation, basically a table with rows and columns

    schema, which describes the columns or fields

  • Relation is a set of tuples (is a named, two-dimensional table of data)

    1) tuple ordering immaterial
    2) No duplicates
    3) Cardinality of relation = number of tuples per entity

  • All tuples in a relation have the same structure; constructed from the same set of attributes.

    1) Arity of relation = number of attributes
    2) attributes are named (ordering is immaterial)

  • Not all tables qualify as a relation

    1) Every relation must have a unique name
    2) Attributes in tables must have unique names
    3) Every attribute value is atomic. The range of allowed values of an attribute is called its domain
    4) The order of the columns is irrellevant
    5) Every row is unique (cant have two rows with exactly the same values for all their fields)
    6) The order of the rows is irrelevant

  • A relation R(表格中具体的数据) has a relation schema(表格的输入格式), which specifies name of relation, and name and data type of each attribute

  • Relation schema:

    Students(sid: string, name: string, login: string, addr: string, gender: char)

  • Relation:
    R = { (Jones, Main, Sydney),

    (Smith, North, Perth),
    (Curry, North, Brisbane),
    (Lindsay, Park, Sydney) }

  • A relation instance: a set of tuples(table) for a schema

rows = cardinality

fields = degree (or arity) of a relation

  • First Normal Form (1NF) is the restriction of atomic attributes

  • Relational database management system(RDBMS) allows duplicate rows

    In a SQL-based RDBMS, it is possible to insert a row where every attribute has the same value as an existing row. The table will then contain two identical rows.

  • RDBMS allows a special "null" value in a column to represent facts that are not relevant, or not yet known.

  • It is very dangerous to instead use an ordinary value with special meaning: Eg if salary = -1 is used for "unknown", then averages wont be sensible

  • keys are special attributes that serve two main purposes:

    1) Primary keys are unique, minimal identifiers of the relation in question.
    2) Foreign keys are identifiers that enable a dependent relation (on the many side of a relationship) to refer to its parent relation (on the one side of the relationship)

  • Keys can be simple (a single attribute) or composite (more than one attribute)

  • Keys usually are used as indices to speed up the response to user queries

  • Possibly many candidate keys (specified using UNIQUE), one of which is chosen as the primary key

  • Foreign key: set of attributes in a relation that is used to 'refer' to a tuple in a parent relation

  • Referential Integrity: for each tuple in the referring relation whose foreign key value is a, there must be a tuple in the referred relation whose primary key value is also a.

  • CREATE TABLE Enrolled ( sid INTEGER, ucode CHAR(8), semester VARCHAR,

    CONSTRAINT Enrolled_FK1 FOREIGN KEY (sid) REFERENCES Student,

    CONSTRAINT Enrolled_FK2 FOREIGN KEY (ucode) REFERENCES UoS,

    CONSTRAINT Enrolled_PK PRIMARY KEY (sid,ucode)

    );

  • Data Structure: A relational database is a set of relations (tables) with tuples (rows) and fields (columns) - a simple and consistent structure

  • The union of the corresponding relational schemata is the relational database schema

  • Data manipulation: powerful operators to manipulate the data stored in relations

  • Data Integrity: facilities to specify a variety of rules to maintain the integrity of data when it is manipulated.

  • Integrity Constraint (IC): condition that must be true for any instance of the database.

    ICs can be declared in the schema. They are specified when schema is defined. All declared ICs are checked whenever relations are modified.

  • A legal instance of a relation is one that satisfies all specified ICs. If ICs are declared, DBMS will not allow illegal instances.

  • Logical schema:

    • Student(sid:integer, name:string, login:string, bdate: date, sex:char)
    • UoS(ucode:string, cname:string, credits: integer)
    • Enrolled(sid:integer, ucode :string, semester: integer, grade:string)
  • Physical Schema:

    • describes details of how data is stored:
    • E.g. relations stored as unordered files
    • Index on first column of Students
    • non-database applications typically have only on a physical schema
  • External Schema (view)

    • UoS_info(ucode :string, enrolment: integer)
  • A view is a virtual relation, but we store a definition, rather than a set of tuples. A mechanism to hide data from the view of certain users.Views can be used to present necessary information, while hiding details in underlying relations.

  • Entity: a person, place, object, event, or concept about which you want to gather and store data

  • Entity type (also: entity set): a collection of entities that share common properties or characteristics, which is described by a set of attributes

  • Attribute: describe one aspect of an entity type

  • domain: possible values of an attribute

  • key: minimal set of attributes that uniquely identifies an entity in the set

  • Relationship: relates two or more entities

  • number of entities is also known as the degree of the relationship

  • relation (relational model): set of tuples

    relationship (E-R model): describes relationship between entities

    Both entity sets and relationship sets (E-R model) may be represented as relations (in the relational model)

  • Weak entity type: an entity type that does not have a primary key.

    1) can be seen as an exclusive 'part-of' relationship

    2) its existence depends on the existence of one or more identifying entity types

    3) it must relate to the identifying entity set via a total, one-to-many identifying relationship type from the identifying to the weak entity set

    Eg: child from parents, payment of a loan

  • The discriminator (or partial key) of a weak entity type is the set of attributes that distinguishes among all the entities of a weak entity type related to the same owning entity

  • The most important requirement is adequacy: that the design should allow representing all the important facts about the design

  • If a design is adequate, then we seek to avoid redundancy in the data

  • Redundant data is where the same information is repeated in several places in the db

  • Redundancy is at the root of several problems associated with relational schemas:

    1) Insertion Anomaly: Adding new rows forces user to create duplicate data or to use null value

    2) Deletion Anomaly: Deleting rows may causes a loss of data that would be needed for other future rows.

    3) Update Anomaly: Changing data in a row forces changes to other rows because of duplication

  • To say a table relation, it should contain unique rows and no multivalued attributes.

  • Functional Dependency: the value of one attribute determines the value of another attribute

  • FDs must be identified based on the semantics of an application

  • A relation R is in First Normal Form(1NF) if the domains of all attributes of R are atomic (所有元素都是不能拆分的)

  • Domain is atomic if its elements are considered to be indivisible units; Examples of non-atomic domains: multivalued attributes, composite attributes

  • Second Normal Form (2NF): 1NF + every non-key attribute is fully functionally dependent on the primary key, this means no partial dependencies, no FD x -> y where x is a strict subset of some key(1NF + 所有非键元素都可以通过主键查找到)

  • Third Normal Form (3NF):2NF + no transitive dependencies (functional dependencies between non-primary-key attributes). No FD of the form x -> y, where x is not a key and y is not at least a subset of a key

  • 在任何一个关系数据库中,第一范式(1NF)是对关系模式的基本要求,不满足第一范式(1NF)的数据库就不是关系数据库。在第一范式(1NF)中表的每一行只包含一个实例的信息。

  • 第二范式(2NF)要求数据库表中的每个实例或行必须可以被唯一地区分。所谓完全依赖是指不能存在仅依赖主关键字一部分的属性。第二范式就是非主属性非部分依赖于主关键字。

  • 第三范式(3NF)要求一个数据库表中不包含已在其它表中已包含的非主关键字信息

2015/10/25