Flags and Lollipops

Sunday, January 13, 2008

Lazyweb, I invoke thee

Dear maths geeks,

On Bookshare each user owns a set of books. The same book can be owned by multiple users.

What algorithm can I use to find the smallest possible set of books that contains at least one book from each user?

Comments and trackbacks Feel free to post your comments Blogger LudoA Anonymous Stew Blogger baoilleach Blogger LudoA Blogger LudoA Blogger DMK Blogger LudoA Blogger DMK Blogger baoilleach Anonymous Stew . This post has trackbacks.

Trackbacks:

10 Comments:

At January 13, 2008 5:10 PM, Blogger LudoA said...

Probably a stupid comment, but is there any reason this can't be done very easily with a SQL query?

 
At January 14, 2008 10:09 AM, Anonymous Stew said...

What would the query be, though? The difficult bit is the 'smallest possble set' part.

 
At January 14, 2008 10:31 AM, Blogger baoilleach said...

Are there are cases where the following naive algorithm fails?

succesively retain the book with the highest numbers of users, and then remove the book and its users from the list (in case of a tie, choose the book whose users have the highest number of connections to their books [or something])

 
At January 14, 2008 11:26 AM, Blogger LudoA said...

Well, I suppose you've got a many-to-many table which links the users to the books, something like this:

tblUsersBooks
user_id: number
book_id: number



To get the least owned book, you've got the problem that there can be multiple books owned 1 times (or whatever the lowest is). If you just need one book of all those owned the least, you can do this:


create database bookshare;
use bookshare;
create table tblOwnersBooks (owner_id int, book_id int);
/* there's owner 1,2 & 3. each owner owns has a book corresponding to his/her own id, and book 4*/
insert into tblownersbooks (owner_id, book_id) values (1, 1), (1, 4), (2, 4), (2, 2), (3, 3), (3, 4);
/* as you can see below, each book is owned once, except book 4 which is owned 3 times*/
select book_id, count(owner_id) from tblownersbooks group by book_id;
/*returns:
+---------+-----------------+
| book_id | count(owner_id) |
+---------+-----------------+
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 3 |
+---------+-----------------+*/
/* now we want to show them in the correct order, so we can limit ourselves to the top result afterwards. */
select book_id, count(owner_id) AS count from tblownersbooks group by book_id order by count ASC;
/*returns:
+---------+-------+
| book_id | count |
+---------+-------+
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 3 |
+---------+-------+*/
/* now we want to limit ourselves to the top result. to do this, we use a subquery, so we can use LIMIT (which doesn't work otherwise because of GROUP BY I think*/
SELECT * from (select book_id, count(owner_id) AS count from tblownersbooks group by book_id order by count ASC) SUBQUERY LIMIT 1;
/*returns:
+---------+-------+
| book_id | count |
+---------+-------+
| 1 | 1 |
+---------+-------+*/


If you want all those used the least, you can use a WHERE clause with IN(). The last command gave the number of times the least-owned book is owned. Suppose we already know the number of times the least owned book is owned is 1. In that case we could do:


SELECT book_id
FROM (select book_id, count(owner_id) AS count from tblownersbooks GROUP BY book_id order by count ASC) SUBQUERY
WHERE SUBQUERY.count IN(1);
/*returns:
+---------+
| book_id |
+---------+
| 1 |
| 2 |
| 3 |
+---------+*/

But since we don't know that, we need to get this "1". This can be done like so:


SELECT count
FROM
(
SELECT book_id, count(owner_id) AS count
FROM tblownersbooks
GROUP BY book_id
order by count ASC
) SUBQUERY LIMIT 1;
/*returns:
+-------+
| count |
+-------+
| 1 |
+-------+*/


Now to combine the two last queries, we'd need another subquery, like this;


SELECT book_id
FROM (select book_id, count(owner_id) AS count from tblownersbooks GROUP BY book_id order by count ASC) SUBQUERY
WHERE SUBQUERY.count IN
(
SELECT count
FROM
(
SELECT book_id, count(owner_id) AS count
FROM tblownersbooks
GROUP BY book_id
order by count ASC
) SUBQUERY LIMIT 1
);


But that returns: "ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'"

So I'm not sure if there's something wrong with the construction of my query, or if it's really a MySQL thing, like the error says. I'm using MySQL 5.0. Maybe you're using another database?


I don't know if your question was because you had the exact same problem as me? Well, I hope you got something out of my comment...

If you're using Oracle or Postgresql, this could be done easily with PL/SQL & PG/SQL, respectively. I'd definitely recommend PG/SQL.

Does anyone know if something's wrong with my query, or it's really a MySQL thing? I'd love to know if this works on e.g. Postgresql. I'll try it out on Oracle in a sec.

 
At January 14, 2008 11:48 AM, Blogger LudoA said...

Apparently oracle has a different limit syntax, so I'm not going to bother. I find the differences in SQL (which is supposed to be an ANSI standard) rather annoying, to say the least.

 
At January 14, 2008 2:49 PM, Blogger DMK said...

baoilleach is thinking along the same lines as I am.

An SQL query won't do it -- that sort set intersection calculation doesn't scale all that well, and ludoa's special case assumption of only a single common book is insufficiently general.

I had a thought that represented books as a bit field/vector, reducing ownership to a series of bit vectors. AND'ing all of the ownership strings would give a set of all the elements common to all of the ownership strings -- the minimal set of books shared by all.

Seems to me this would take some setup, so I wonder whether the above would scale.

 
At January 14, 2008 3:07 PM, Blogger LudoA said...

DMK: what do you mean by my "special case assumption of only a single common book is insufficiently general"? My query works with as many books as you want, no matter if there's a tie or not. Not saying you're wrong or anything, just that I probably don't understand it :)

Won't it scale if you index it correctly? If you just get the whole list and process it in Python/Java/whatever, will it be much faster? (I'm wondering as you don't have the advantages of indexes.)
Just curious as I'm not really experienced in this sort of stuff (as I'm sure you've guessed).

 
At January 14, 2008 3:30 PM, Blogger DMK said...

Ack. I shouldn't comment on blogs on Monday mornings.

I've read the question wrong -- I interpreted it as the minimal common set of books to all users, but that's not exactly the question (although it is a solution...).

I think this is a sort of covering set issue.

That being said, baoilleach's greedy algorithm may not be sufficient. Here's a counter-example:

Imagine 4 books: A, B, C, D. And users u1, u2, ... u9.

u1 = AC
u2 = AD
u3 = AB
u4 = AB
u5 = D
u6 = D
u7 = B
u8 = C
u9 = C

With baoilleach's algorithm, we remove retain A, leaving us with u5, u6, u7, u8, and u9. Retain, then, D, C, and then B for a final set of { A, B, C, D }.

However, if we choose B instead, this leaves us with u1, u2, u5, u6, u8, u9. Then, retain D, with u1, u8, and u9 remaining. Retaining C covers the rest, for a solution of { B, C, D }, better than the one above.

 
At January 14, 2008 3:43 PM, Blogger baoilleach said...

Nice counter example, dmk. In fact because u7 has only one book, B, you know in advance that you *need* to include B. So it's clear that this is the first thing you would do.

So how about pick the user with the smallest number of books. Then pick the book in that set with the largest number of users. Though it still won't work in every case, I guess. I declare this an NP-hard problem, proof left as an exercise. :-)

Lazyweb, over and out.

 
At January 15, 2008 7:15 PM, Anonymous Stew said...

Thanks for the suggestions. Yeah, I think I could have explained the problem better - as dmk says it's a covering set thing.

Going to try baoilleach's algorithm (also got suggested by Alf), with the additional first step of removing users with only one book and adding those books to the results set.

Ta lazy web!

 

Post a Comment

<< Home


See all posts from: July 2005 August 2005 September 2005 October 2005 November 2005 December 2005 January 2006 February 2006 March 2006 April 2006 May 2006 June 2006 July 2006 September 2006 October 2006 November 2006 December 2006 January 2007 February 2007 March 2007 April 2007 May 2007 June 2007 July 2007 August 2007 October 2007 November 2007 December 2007 January 2008 February 2008 March 2008 April 2008 May 2008