Data Modeling and Scalability in Google App


Google App Engine Java and GWT Application Development

Build powerful, scalable, and interactive web applications in the cloud

  • Comprehensive coverage of building scalable, modular, and maintainable applications with GWT and GAE using Java
  • Leverage the Google App Engine services and enhance your app functionality and performance
  • Integrate your application with Google Accounts, Facebook, and Twitter
  • Safely deploy, monitor, and maintain your GAE applications
  • A practical guide with a step-by-step approach that helps you build an application in stages

In deciding how to design your application’s data models, there are a number of ways in which your approach can increase the app’s scalability and responsiveness. Here, we discuss several such approaches and how they are applied in the Connectr app. In particular, we describe how the Datastore access latency can sometimes be reduced; ways to split data models across entities to increase the efficiency of data object access and use; and how property lists can be used to support “join-like” behavior with Datastore entities.

Reducing latency—read consistency and Datastore access deadlines

By default, once an entity is updated in the Datastore, all subsequent reads of that entity see the update; this is called strong consistency. To achieve it, each entity has a primary storage location, and a strongly consistent read waits for a machine at that location to become available.

However, App Engine allows you to change this default and use eventual consistency for a given Datastore read. With eventual consistency, the query may access a copy of the data from a secondary location if the primary location is temporarily unavailable. Changes to data propagate to the secondary locations fairly quickly, but an “eventually consistent” read may reach a secondary location before the changes have been incorporated. In exchange, eventually consistent reads are faster on average: they trade consistency for availability. In many contexts this is an acceptable tradeoff. For example, web apps such as Connectr that display “activity stream” information do not require completely up-to-date information.


See http://googleappengine.blogspot.com/2010/03/read-consistency-deadlines-more-control.html, http://googleappengine.blogspot.com/2009/09/migration-to-better-datastore.html, and http://code.google.com/events/io/2009/sessions/TransactionsAcrossDatacenters.html for more background on this and related topics.

In Connectr, we will add the use of eventual consistency to some of our feed object reads; specifically, those for feed content updates. We are willing to take the small chance that a feed object is slightly out-of-date in order to have the advantage of quicker reads on these objects.

The following code shows how to set eventual read consistency for a query, using server.servlets.FeedUpdateFriendServlet as an example.

Query q = pm.newQuery("select from " + FeedInfo.class.getName() +
    " where urlstring == :keys");
// Use eventual read consistency for this query
q.addExtension("datanucleus.appengine.datastoreReadConsistency",
    "EVENTUAL");

App Engine also allows you to change the default Datastore access deadline. By default, the Datastore will retry access automatically for up to about 30 seconds. You can set this deadline to a smaller amount of time. It can often be appropriate to set a shorter deadline if you are concerned with response latency, and are willing to use a cached version of the data for which you got the timeout, or are willing to do without it.

The following code shows how to set an access timeout interval (in milliseconds) for a given JDO query.

Query q = pm.newQuery("...");
// Set a Datastore access timeout
q.setTimeoutMillis(10000);
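
The fallback idea mentioned above (use a cached copy when the read misses its deadline) can be sketched in plain Java. This is only an illustration under stated assumptions: CachedReader, its in-memory map, and the Callable standing in for the Datastore read are all hypothetical names, not part of Connectr or the App Engine API. The sketch assumes the timed-out read surfaces as an exception.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical helper: not part of Connectr or the App Engine API.
class CachedReader {
    // Stand-in for a cache layer; a plain map keeps the sketch self-contained.
    private final Map<String, String> cache = new HashMap<String, String>();

    /**
     * Runs the (deadline-bounded) read. On success, remembers the value.
     * If the read throws, for example because the deadline was exceeded,
     * returns the last cached value, or null if nothing was cached.
     */
    String read(String key, Callable<String> datastoreRead) {
        try {
            String fresh = datastoreRead.call();
            if (fresh != null) {
                cache.put(key, fresh);
            }
            return fresh;
        } catch (Exception timeoutOrFailure) {
            // Assumption: a missed deadline shows up here as an exception.
            return cache.get(key);
        }
    }
}
```

A caller would wrap the query execution in the Callable; when the deadline is missed, the user sees slightly stale data rather than waiting for retries.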

Splitting big data models into multiple entities to make access more efficient

Often, the fields in a data model can be divided into two groups: main or summary information that you need often or first, and details, which you might not need at all or tend not to need immediately. If this is the case, then it can be productive to split the data model into multiple entities and set the details entity to be a child of the summary entity, for instance, by using JDO owned relationships. The child field will be fetched lazily, and so the child entity won’t be pulled in from the Datastore unless needed.

In our app, the Friend model can be viewed like this: initially, only a small amount of summary information about each Friend (the Friend’s name) is sent over RPC to the app’s frontend. More information is needed only when there is a request to view or edit the details of a particular Friend.

So, we can make retrieval more efficient by defining a parent summary entity and a child details entity. We do this by keeping the “summary” information in Friend, and placing “details” in a FriendDetails object, which is set as a child of Friend via a JDO bidirectional, one-to-one owned relationship, as shown in Figure 1. We store the Friend’s e-mail address and its list of associated URLs in FriendDetails. We’ll keep the name information in Friend. That way, when we construct the initial “FriendSummaries” list displayed on application load, and send it over RPC, we only need to access the summary object.

Figure 1: Splitting Friend data between a “main” Friend persistent class and a FriendDetails child class.

A details field of Friend points to the FriendDetails child, which we create when we create a Friend. In this way, the details will always be transparently available when we need them, but they will be lazily fetched—the details child object won’t be initially retrieved from the database when we query Friend, and won’t be fetched unless we need that information.

As you may have noticed, the Friend model is already set up in this manner—this is the rationale for that design.
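
To make the lazy-fetch behavior concrete, here is a minimal plain-Java sketch. It does not use JDO; LazyFriend and its Supplier are hypothetical stand-ins for the owned relationship, which JDO handles transparently in the real model. The loader plays the role of the Datastore fetch of the child entity.

```java
import java.util.function.Supplier;

// Hypothetical model of a summary object with a lazily loaded details child.
class LazyFriend {
    private final String name;                     // summary data, always available
    private final Supplier<String> detailsLoader;  // stands in for the child fetch
    private String details;                        // null until first requested
    private int loads = 0;                         // counts simulated fetches

    LazyFriend(String name, Supplier<String> detailsLoader) {
        this.name = name;
        this.detailsLoader = detailsLoader;
    }

    String getName() {
        return name;  // never touches the details child
    }

    String getDetails() {
        if (details == null) {
            details = detailsLoader.get();  // fetched only on first access
            loads++;
        }
        return details;
    }

    int loadCount() {
        return loads;
    }
}
```

Building a list of names for the initial display calls only getName(), so in this sketch no details fetch happens until a specific Friend is opened.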

Discussion

When splitting a data model like this, consider the queries your app will perform and how the design of the data objects will support those queries. For example, if your app often needs to query for property1 == x and property2 == y, and especially if both individual filters can produce large result sets, you are probably better off keeping both those properties on the same entity (for example, retaining both fields on the “main” entity, rather than moving one to a “details” entity).

For persistent classes (that is, “data classes”) that you often access and update, it is also worth considering whether any of their fields do not require indexes. This is the case for any field that never appears in a query. The fewer indexed fields a persistent class has, the quicker the writes of objects of that class.

Splitting a model by creating an “index” and a “data” entity

You can also consider splitting a model if you identify fields that you access only when performing queries, but don’t require once you’ve actually retrieved the object. Often, this is the case with multi-valued properties. For example, in the Connectr app, this is the case with the friendKeys list of the server.domain.FeedIndex class. This multi-valued property is used to find relevant feed objects but is not used when displaying feed content information.

With App Engine, there is no way for a query to retrieve only the fields that you need, so the full object must always be pulled in. If the multi-valued property lists are long, this is inefficient.

To avoid this inefficiency, we can split up such a model into two parts, and put each one in a different entity—an index entity and a data entity. The index entity holds only the multi-valued properties (or other data) used only for querying, and the data entity holds the information that we actually want to use once we’ve identified the relevant objects. The trick to this new design is that the data entity key is defined to be the parent of the index entity key.

More specifically, when an entity is created, its key can be defined as a “child” of another entity’s key, which becomes its parent. The child is then in the same entity group as the parent. Because such a child key is based on the path of its parent key, it is possible to derive the parent key given only the child key, using the getParent() method of Key, without requiring the child to be instantiated.

So with this design, we can first do a keys-only query on the index kind (which is faster than full object retrieval) to get a list of the keys of the relevant index entities. With that list, even though we’ve not actually retrieved the index objects themselves, we can derive the parent data entity keys from the index entity keys. We can then do a batch fetch with the list of relevant parent keys to grab all the data entities at once. This lets us retrieve the information we’re interested in, without having to retrieve the properties that we do not need.
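
The three steps above (keys-only query, parent-key derivation, batch fetch) can be modeled with an in-memory sketch. This does not use the App Engine API: SketchKey and IndexDataSketch are hypothetical names, the maps stand in for the two entity kinds, and the loop over the index map stands in for a keys-only query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for a Datastore key: kind, name, optional parent.
final class SketchKey {
    final String kind;
    final String name;
    final SketchKey parent;

    SketchKey(String kind, String name, SketchKey parent) {
        this.kind = kind;
        this.name = name;
        this.parent = parent;
    }

    // The parent is part of the key itself, so no entity needs to be loaded.
    SketchKey getParent() {
        return parent;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SketchKey)) return false;
        SketchKey other = (SketchKey) o;
        return kind.equals(other.kind) && name.equals(other.name)
            && Objects.equals(parent, other.parent);
    }

    @Override
    public int hashCode() {
        return Objects.hash(kind, name, parent);
    }
}

class IndexDataSketch {
    // Index entities: key -> multi-valued property used only for filtering.
    private final Map<SketchKey, Set<String>> indexEntities =
        new LinkedHashMap<SketchKey, Set<String>>();
    // Data entities: key -> the content we actually want to display.
    private final Map<SketchKey, String> dataEntities =
        new HashMap<SketchKey, String>();

    void addFeed(String url, String content, String... friendKeys) {
        SketchKey dataKey = new SketchKey("FeedInfo", url, null);
        SketchKey indexKey = new SketchKey("FeedIndex", url, dataKey);
        dataEntities.put(dataKey, content);
        indexEntities.put(indexKey, new HashSet<String>(Arrays.asList(friendKeys)));
    }

    // Step 1: simulated keys-only query; returns keys, never the index contents.
    List<SketchKey> indexKeysFor(String friendKey) {
        List<SketchKey> keys = new ArrayList<SketchKey>();
        for (Map.Entry<SketchKey, Set<String>> e : indexEntities.entrySet()) {
            if (e.getValue().contains(friendKey)) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    // Steps 2 and 3: derive each parent key, then fetch the data entities.
    List<String> feedsFor(String friendKey) {
        List<String> feeds = new ArrayList<String>();
        for (SketchKey indexKey : indexKeysFor(friendKey)) {
            feeds.add(dataEntities.get(indexKey.getParent()));
        }
        return feeds;
    }
}
```

In this sketch the caller of feedsFor never receives a friendKeys set; only the index keys and the data entities are returned, which is the point of the split.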

See Brett Slatkin’s presentation, Building scalable, complex apps on App Engine (http://code.google.com/events/io/2009/sessions/BuildingScalableComplexApps.html) for more on this index/data design.

Figure 2: Splitting the feed model into an “index” part (server.domain.FeedIndex) and a “data” part (server.domain.FeedInfo).

Our feed model maps well to this design—we filter on the FeedIndex.friendKeys multi-valued property (which contains the list of keys of Friends that point to this feed) when we query for the feeds associated with a given Friend.

But once we have retrieved those feeds, we no longer need the friendKeys list, so we would like to avoid retrieving it along with the feed content. With our app’s sample data, these property lists will not comprise a lot of data, but they would be likely to do so if the app were scaled up. For example, many users might have the same friends, or many different contacts might include the same company blog in their associated feeds.

So, we split up the feed model into an index part and a parent data part, as shown in Figure 2. The index class is server.domain.FeedIndex; it contains the friendKeys list for a feed. The data part, containing the actual feed content, is server.domain.FeedInfo. When a new FeedIndex object is created, its key is constructed so that the key of its corresponding FeedInfo object is its parent key. This construction must take place at object creation, as Datastore entity keys cannot be changed.

For a small-scale app, the payoff from this split model would perhaps not be worth it. But for the sake of example, let’s assume that we expect our app to grow significantly.

The FeedInfo persistent class (the parent class) simply uses an app-assigned String primary key, urlstring (the feed URL string). The server.domain.FeedIndex constructor, shown in the code below, uses the key of its FeedInfo parent (the URL string) to construct its key. This places the two entities into the same entity group and allows the parent FeedInfo key to be derived from the FeedIndex entity’s key.

@PersistenceCapable(identityType = IdentityType.APPLICATION,
    detachable="true")
public class FeedIndex implements Serializable {

    @PrimaryKey
    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    private Key key;
    ...

    public FeedIndex(String fkey, String url) {
        this.friendKeys = new HashSet<String>();
        this.friendKeys.add(fkey);
        // Start the key path at the parent FeedInfo entity, named by the feed URL
        KeyFactory.Builder keyBuilder =
            new KeyFactory.Builder(FeedInfo.class.getSimpleName(), url);
        // Append this FeedIndex as a child of that FeedInfo key
        keyBuilder.addChild(FeedIndex.class.getSimpleName(), url);
        Key ckey = keyBuilder.getKey();
        this.key = ckey;
    }
    ...
}

The following code, from server.servlets.FeedUpdateFriendServlet, shows how this model is used to efficiently retrieve the FeedInfo objects associated with a given Friend. Given a Friend key, a query is performed for the keys of the FeedIndex entities that contain this Friend key in their friendKeys list. Because this is a keys-only query, it is much more efficient than returning the actual objects. Then, each FeedIndex key is used to derive the parent (FeedInfo) key. Using that list of parent keys, a batch fetch is performed to fetch the FeedInfo objects associated with the given Friend. We did this without needing to actually fetch the FeedIndex objects.

... imports...
@SuppressWarnings("serial")
public class FeedUpdateFriendServlet extends HttpServlet {

    private static Logger logger =
        Logger.getLogger(FeedUpdateFriendServlet.class.getName());

    public void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {

        PersistenceManager pm = PMF.get().getPersistenceManager();

        Query q = null;
        try {
            String fkey = req.getParameter("fkey");
            if (fkey != null) {
                logger.info("in FeedUpdateFriendServlet, updating feeds for:"
                    + fkey);
                // query for matching FeedIndex keys
                q = pm.newQuery("select key from " + FeedIndex.class.getName()
                    + " where friendKeys == :id");
                List ids = (List) q.execute(fkey);
                if (ids.size() == 0) {
                    return;
                }
                // else, get the parent keys of the ids
                Key k = null;
                List<Key> parentlist = new ArrayList<Key>();
                for (Object id : ids) {
                    // cast to key
                    k = (Key) id;
                    parentlist.add(k.getParent());
                }
                // fetch the parents using the keys
                Query q2 = pm.newQuery("select from " + FeedInfo.class.getName()
                    + " where urlstring == :keys");
                // allow eventual consistency on read
                q2.addExtension(
                    "datanucleus.appengine.datastoreReadConsistency",
                    "EVENTUAL");
                List<FeedInfo> results =
                    (List<FeedInfo>) q2.execute(parentlist);
                if (results.iterator().hasNext()) {
                    for (FeedInfo fi : results) {
                        fi.updateRequestedFeed(pm);
                    }
                }
            }
        }
        catch (Exception e) {
            logger.warning(e.getMessage());
        }
        finally {
            if (q != null) {
                q.closeAll();
            }
            pm.close();
        }
    }
}//end class
