What if it were easy to query a complex set of Java objects at runtime? What if there were an API that kept your object indexes (really just TreeMaps and HashMaps) in sync? Well, then you would have Boon's data repo. This article shows how to use Boon's data repo utilities to query Java objects. This is part one. There can be many, many parts. :)
Boon's data repo makes doing index based queries on collections a lot easier.
Why Boon's data repo
Boon's data repo allows you to treat Java collections more like a database, at least when it comes to querying them. Boon's data repo is not an in-memory database, and it cannot substitute for arranging your objects into data structures optimized for your application.
If you want to spend your time providing customer value and building your objects and classes and using the Collections API for your data structures, then DataRepo is meant for you. This does not preclude breaking out the Knuth books and coming up with an optimized data structure. It just helps keep the mundane things easy so you can spend your time making the hard things possible.
Born out of need
This project came out of a need. I was working on a project that planned to store a large collection of domain objects in-memory for speed, and somebody asked an all too important question that I had overlooked: how are we going to query this data? My answer was that we would use the Collections API and the Streaming API. Then I tried to do this... Hmmm...
Boon's data repo augments the streaming API.
Boon's data repo does not endeavor to replace the JDK 8 stream API, and in fact it works well with it. Boon's data repo allows you to create indexed collections. The indexes can be anything (it is pluggable).
At this moment in time, Boon's data repo indexes are based on ConcurrentHashMap and ConcurrentSkipListMap.
By design, Boon's data repo works with standard collection libraries. There is no plan to create a set of custom collections. One should be able to plug in Guava, Concurrent Trees or Trove if one desires to do so.
It provides a simplified API for doing so. It supports linear search for completeness, but I recommend using it primarily for indexed lookups and then using the streaming API for the rest (for type safety and speed).
Sneak peek before the step by step
Let's say you have a method that creates 200,000 employee objects like this:
List<Employee> employees = TestHelper.createMetricTonOfEmployees(200_000);
So now we have 200,000 employees. Let's search them...
First wrap Employees in a searchable query:
employees = query(employees);
Now search:
List<Employee> results = query(employees, eq("firstName", firstName));
So what is the main difference between the above and the stream API?
employees.stream().filter(emp -> emp.getFirstName().equals(firstName)).collect(Collectors.toList());
Using Boon's DataRepo is roughly 20,000% faster! Ah, the power of HashMaps and TreeMaps. :)
Update: a question from a reader
On Saturday, November 2, 2013, Chris B wrote:
"Very interesting, Rick - I only had time to do a quick read of the article, so forgive me if this is answered within your write up. But, I was curious as to the overhead of building the indexes when you wrap your collection in a query object... Say for the 200_000 employees in the example below. How long would it take to build the indexed structure ?"
Thanks Chris. Good question! It would be quite expensive, so you would only use this construct if you plan on holding on to the collection for a while. There is also a repo object if you want to gradually update a collection's indexes.
For a use case: imagine you are pulling a list of employees from memcached every 20 minutes. You want to query against those employees, so you need the indexes, and you want something faster than storing every possible query combination in memcached (yes, I have seen people do this, to the tune of 12 GB... 20x more than the data in the DB, and I have seen it more than once). So you pull the list down and query against the searchable collection. This avoids both the call to memcached and the explosion of every possible query being cached. Anyway... the data repo was written to avoid many of the anti-patterns that I have seen with caching. It gives you a way to query against Java objects.
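To make the "gradually update a collection's indexes" idea concrete, here is a minimal conceptual sketch in plain Java. It is not Boon's actual API (the class and method names here are invented for illustration); it only shows the underlying idea of a repo that updates an index incrementally as objects are added, instead of rebuilding it from scratch on every refresh.

```java
import java.util.*;

// Conceptual sketch only -- NOT Boon's API. A "repo" that keeps a
// search index in sync as objects are added.
class EmployeeRepo {
    static class Employee {
        final String firstName;
        Employee(String firstName) { this.firstName = firstName; }
    }

    // firstName -> employees with that first name (the "search index")
    private final Map<String, List<Employee>> firstNameIndex = new HashMap<>();
    private final List<Employee> all = new ArrayList<>();

    // Adding an object updates the index incrementally.
    void add(Employee e) {
        all.add(e);
        firstNameIndex.computeIfAbsent(e.firstName, k -> new ArrayList<>()).add(e);
    }

    // An indexed query is a map lookup, not a scan over `all`.
    List<Employee> byFirstName(String name) {
        return firstNameIndex.getOrDefault(name, Collections.emptyList());
    }
}
```

The payoff is the same as in the memcached scenario above: you pay the indexing cost once per object as it arrives, and every subsequent query is a cheap map lookup.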
There is an API that looks just like your built-in collections. There is also an API that looks more like a DAO object or a Repo Object.
A simple query with the Repo/DAO object looks like this (the repo allows gradual update of the indexes):
List<Employee> employees = repo.query(eq("firstName", "Diana"));
A more involved query would look like this:
List<Employee> employees = repo.query(
and(eq("firstName", "Diana"), eq("lastName", "Smith"), eq("ssn", "21785999")));
Or this:
List<Employee> employees = repo.query(
and(startsWith("firstName", "Bob"), eq("lastName", "Smith"), lte("salary", 200_000),
gte("salary", 190_000)));
Or even this:
List<Employee> employees = repo.query(
and(startsWith("firstName", "Bob"), eq("lastName", "Smith"), between("salary", 190_000, 200_000)));
Or, if you want to use the JDK 8 stream API, the data repo works with it, not against it:
int sum = repo.query(eq("lastName", "Smith")).stream().filter(emp -> emp.getSalary()>50_000)
.mapToInt(b -> b.getSalary())
.sum();
The above would be much faster if the number of employees were large. It first narrows down to the employees whose last name is Smith, then filters those by salary above 50,000. Say you have 100,000 employees and only 50 named Smith: the index pulls those 50 out of the 100,000, and the stream filter then runs over just 50 objects instead of the whole 100,000.
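The narrowing step can be sketched in plain Java (not Boon's API; `Employee`, `sumSalariesOver`, and the index map here are illustrative assumptions): an index maps lastName to its matching employees, so the stream pipeline only touches the small indexed subset.

```java
import java.util.*;

// Sketch of index narrowing: look up the small subset via the index,
// then run the stream pipeline over that subset only.
class NarrowingDemo {
    record Employee(String lastName, int salary) {}

    static int sumSalariesOver(Map<String, List<Employee>> byLastName,
                               String lastName, int floor) {
        // 1) index lookup narrows, say, 100,000 employees down to the few Smiths
        List<Employee> matches = byLastName.getOrDefault(lastName, List.of());
        // 2) the filter/map/sum pipeline runs over that small subset only
        return matches.stream()
                      .filter(e -> e.salary() > floor)
                      .mapToInt(Employee::salary)
                      .sum();
    }
}
```

The stream code is identical to a full scan; the only difference is that step 1 replaces "iterate everything" with a map lookup.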
Here is a benchmark run from the data repo of a linear search versus an indexed search, in nanoseconds:
Name index Time 218 Boon data repo!
Name linear Time 3542320 Not boon. :(
Name index Time 218
Name linear Time 3511667
Name index Time 218
Name linear Time 3709120
Name index Time 213
Name linear Time 3606171
Name index Time 219
Name linear Time 3528839
Someone recently said to me: "But with the streaming API, you can run the filter in parallel."
Let's see how the math holds up:
3,528,839 / 16 threads vs. 219
220,552 vs. 219.
Indexes win, but it was a photo finish. :)
It was only about 100,000% faster instead of 1,600,000% faster. So close...
By default all search indexes and lookup indexes allow duplicates (except for primary key index).
repoBuilder.primaryKey("ssn")
.searchIndex("firstName").searchIndex("lastName")
.searchIndex("salary").searchIndex("empNum", true)
.usePropertyForAccess(true);
You can override that by providing a true flag as the second argument to searchIndex.
Notice empNum is a searchable unique index.
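The unique/non-unique distinction can be sketched in plain Java (again, a conceptual sketch under assumed names, not Boon's internals): a non-unique search index maps a key to a list of matches, while a unique index such as empNum or the primary key maps a key to exactly one object.

```java
import java.util.*;

// Sketch: non-unique index -> key maps to a list of matches;
// unique index -> key maps to exactly one object, duplicates rejected.
class IndexKinds {
    static class Employee {
        final int empNum; final String lastName;
        Employee(int empNum, String lastName) {
            this.empNum = empNum; this.lastName = lastName;
        }
    }

    // non-unique: duplicates allowed, so values are lists
    final Map<String, List<Employee>> lastNameIndex = new HashMap<>();
    // unique: one object per key; a duplicate empNum is an error
    final Map<Integer, Employee> empNumIndex = new HashMap<>();

    void add(Employee e) {
        lastNameIndex.computeIfAbsent(e.lastName, k -> new ArrayList<>()).add(e);
        if (empNumIndex.putIfAbsent(e.empNum, e) != null) {
            throw new IllegalStateException("duplicate empNum: " + e.empNum);
        }
    }
}
```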
Using Boon's data repo by example
Brief announcement from our sponsor: "Boon = simple, opinionated Java for the novice to expert level Java programmer. Low ceremony. High productivity. A real boon to Java developers!"