This post is brought to you by Luca Miglioli, an Information System Analyst who works at WebResults (Engineering group) in the Solution Team, a highly innovative team devoted to the evangelization of Salesforce products.
Introduction
These days, people create data constantly. All day long, every day. So. Much. Data. Suddenly, your org has accumulated millions of records, thousands of users, and several gigabytes of data storage. That's why designing for high performance is a critical part of the success of software and applications: you don't want your largest and most important customers to find performance issues with your app before you do.
Fortunately, if you work in IT you also know that the technology in this industry is built with scalability in mind, that is, the ability to handle a growing amount of work by adding data or resources to the system. That's also why Salesforce provides Large Data Volume (LDV) orgs: they have 60 GB of data storage (roughly 31,000,000 records!) plus more space allocation for File Storage and Big Objects. They're worth using!
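To see where that ~31 million figure comes from: Salesforce counts most records as about 2 KB of data storage, so a quick back-of-the-envelope calculation turns 60 GB into an approximate record count. Below is a minimal sketch of that arithmetic, assuming the standard 2 KB-per-record sizing (a few object types count for more).

```python
# Rough capacity estimate for a 60 GB data-storage allocation,
# assuming the standard Salesforce sizing of ~2 KB per record
# (a few object types count for more against data storage).
storage_gb = 60
bytes_per_record = 2 * 1024           # 2 KB per record

total_bytes = storage_gb * 1024 ** 3  # 60 GB expressed in bytes
approx_records = total_bytes // bytes_per_record

print(f"~{approx_records:,} records")  # ~31,457,280 records
```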
Use Cases
Large Data Volume testing is performance testing with a large amount of data (millions of records) in an org under a significant load (thousands) of concurrent users. It can help you answer questions like: can my system scale to handle a large amount of data elegantly?
Regarding this topic and according to the Salesforce Developer’s Guide, there are a lot of use cases your business might encounter:
- Data Aggregation / Custom Search Functionality / Indexing with Nulls
- Rendering Related Lists with Large Data Volumes / API Performance
- Sort Optimization on a Query / Multi-Join Report Performance
In general, you need LDV testing and performance checks to see whether your application or architecture can handle a very large amount of data.
Salesforce enables customers to easily scale up their applications from small to large amounts of data. This scaling usually happens automatically, but as data sets get larger, the time required for certain operations grows. How architects design and configure data structures and operations can increase or decrease those operation times by several orders of magnitude.
The main processes affected by differing architectures and configurations are:
- Loading or updating of large numbers of records, either directly or with integrations.
- Extraction of data through reports and queries, or through views.
Data Load
You can easily import external data into Salesforce. Supported data sources include any program that can save data in comma-separated values format (.csv). Whether we're talking about an LDV migration or ongoing large data sync operations, the best practice is to minimize the impact these actions have on business-critical operations. A smart strategy for accomplishing this is loading lean: including only the data and configuration you need to meet your business-critical operations. Here are some suggestions on what should be in place before you load (a minimal load sketch follows the list):
- Parent records with master-detail children. You won’t be able to load child records if the parents don’t already exist.
- Record owners. In most cases, your records will be owned by individual users, and the owners need to exist in the system before you can load the data.
- Role hierarchy. You might think that loading would be faster if the owners of your records were not members of the role hierarchy. But in almost all cases, the performance would be the same, and it would be considerably faster if you were loading portal accounts. So there’s no benefit to deferring this aspect of the configuration.
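As a concrete illustration of a lean bulk load, here is a minimal sketch of a CSV insert through the Bulk API 2.0 REST endpoints using Python and the requests library. It assumes you already have an access token and instance URL (for example from an OAuth flow); the object name, API version, and file name are placeholders to adapt to your own org.

```python
import time
import requests

# Assumed inputs: obtain these from your own OAuth flow / org.
INSTANCE_URL = "https://yourInstance.my.salesforce.com"   # placeholder
ACCESS_TOKEN = "00D...token"                              # placeholder
API_VERSION = "v58.0"                                     # adjust to your org

HEADERS = {
    "Authorization": f"Bearer {ACCESS_TOKEN}",
    "Content-Type": "application/json",
}
BASE = f"{INSTANCE_URL}/services/data/{API_VERSION}/jobs/ingest"

# 1. Create an ingest job (insert Accounts from CSV).
job = requests.post(BASE, headers=HEADERS, json={
    "object": "Account",
    "operation": "insert",
    "contentType": "CSV",
    "lineEnding": "LF",
}).json()
job_id = job["id"]

# 2. Upload the CSV data (loading lean: only the columns you actually need).
with open("accounts.csv", "rb") as csv_file:              # placeholder file
    requests.put(
        f"{BASE}/{job_id}/batches",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}",
                 "Content-Type": "text/csv"},
        data=csv_file,
    )

# 3. Close the job so Salesforce starts processing it.
requests.patch(f"{BASE}/{job_id}", headers=HEADERS,
               json={"state": "UploadComplete"})

# 4. Poll until the job reaches a terminal state.
while True:
    state = requests.get(f"{BASE}/{job_id}", headers=HEADERS).json()["state"]
    if state in ("JobComplete", "Failed", "Aborted"):
        print("Final state:", state)
        break
    time.sleep(10)
```

The same pattern scales from a one-off migration to a scheduled sync job; only the CSV preparation step changes.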
Data Extract
You’ve been tasked with extracting data from a Salesforce object. If you’re dealing with small volumes of data, this operation might be simple, involving only a few button clicks using some of the great tools available on the AppExchange. But when it comes to dealing with millions of records in a limited time frame, you might need to take extra steps to optimize the data throughput. Here are some hints:
- Chunking Data. When extracting data with the Bulk API, queries are split into 100,000-record chunks by default; you can use the chunkSize header field to configure smaller chunks, or larger ones up to 250,000. Larger chunk sizes use up fewer Bulk API batches but may not perform as well (see the extraction sketch after this list).
- Idempotence. Remember that idempotence is an important design consideration in successful extraction processes. Make sure that your job is designed so that resubmitting failed requests fills in the missing records without creating duplicate records for partial extractions.
- Caching. The more tests you run, the more likely that the data extraction will complete faster because of the underlying database cache utilization. While it is great to have better performance, don’t schedule your batch jobs based on the assumption that you will always see the best results.
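To make the extraction side concrete, here is a minimal sketch of a Bulk API 2.0 query job in Python with requests. With Bulk API 2.0, result chunking is handled for you and pages are consumed through the Sforce-Locator header; with the original Bulk API you would instead enable PK chunking via the Sforce-Enable-PKChunking header (e.g. chunkSize=250000) when creating the job, which is the chunkSize setting mentioned above. Token, instance URL, API version, query, and output file are placeholders.

```python
import time
import requests

# Assumed inputs: adapt to your own org / auth flow.
INSTANCE_URL = "https://yourInstance.my.salesforce.com"   # placeholder
ACCESS_TOKEN = "00D...token"                              # placeholder
API_VERSION = "v58.0"

HEADERS = {
    "Authorization": f"Bearer {ACCESS_TOKEN}",
    "Content-Type": "application/json",
}
BASE = f"{INSTANCE_URL}/services/data/{API_VERSION}/jobs/query"

# 1. Create the query job.
job = requests.post(BASE, headers=HEADERS, json={
    "operation": "query",
    "query": "SELECT Id, Name FROM Account",   # placeholder query
}).json()
job_id = job["id"]

# 2. Wait for the job to reach a terminal state.
while True:
    state = requests.get(f"{BASE}/{job_id}", headers=HEADERS).json()["state"]
    if state in ("JobComplete", "Failed", "Aborted"):
        break
    time.sleep(10)

if state != "JobComplete":
    raise RuntimeError(f"Query job ended in state {state}")

# 3. Page through the CSV results using the Sforce-Locator header.
locator = None
while True:
    params = {"maxRecords": 50000}
    if locator:
        params["locator"] = locator
    resp = requests.get(f"{BASE}/{job_id}/results",
                        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
                        params=params)
    # Append each CSV page to a local file (handle repeated header rows as needed).
    with open("accounts_export.csv", "a", encoding="utf-8") as out:
        out.write(resp.text)
    locator = resp.headers.get("Sforce-Locator")
    if not locator or locator == "null":
        break
```

Note how the paging loop is naturally idempotent-friendly: a failed page can be re-requested by locator without duplicating the rows already written, which is exactly the design concern raised above.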
Tools
There are a lot of tools you can use for performing these upload and extract operations:
- Data Import Wizard (up to 50,000 records): An in-browser wizard that imports your org’s accounts, contacts, leads, solutions, campaign members, and custom objects.
- Data Loader (up to 5 million records): Data Loader is an application for the bulk import or export of data. Use it to insert, update, delete, or export Salesforce records.
- Dataloader.io (varies by purchase plan): It has a clean and simple interface that makes it easy to import, export, and delete data in Salesforce, no matter what edition you use. This third-party tool also allows you to schedule task and opportunity imports on a daily, weekly, or monthly basis.
- Jitterbit: This free tool runs on both Mac and PC, and allows Salesforce administrators to manage the import and export of data. It is compatible with all Salesforce editions and supports multiple logins.
Best Practices
Be aware of the governor limits. Follow best practices for deployments with large data volumes to reduce the risk of hitting limits when executing jobs and SOQL queries. Here are some examples:
- Perform Large Data Volume testing only against a sandbox or test org. Keep in mind that you're not working in production, and you'll need to migrate this data and configuration to another org to go live.
- Use the Salesforce Bulk API when you have more than a few hundred thousand records.
- Disable Apex triggers, workflow rules, and validations during loads; investigate the use of batch Apex to process records after the load is complete.
- When updating, send only fields that have changed (delta-only loads); a small sketch of this appears after the list.
- When using a query that can return more than one million results, consider using the query capability of the Bulk API, which might be more suitable.
- Use External or Big Objects: with the first, there’s no need to bring data into Salesforce and with the second you can provide consistent performance for a billion records or more, and access a standard set of APIs to your org or external system. In general, you avoid both storing large amounts of data in your org, and the performance issues associated with LDV.
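To illustrate the delta-only idea from the list above, here is a minimal, library-free Python sketch that compares the previously loaded snapshot of each record with the new source values and builds an update payload containing only the fields that actually changed (plus the Id needed to match the record). The field names and record data are made up for the example.

```python
def build_delta(previous: dict, current: dict, key_field: str = "Id"):
    """Return a minimal update payload with only the changed fields,
    or None when nothing changed."""
    changed = {
        field: value
        for field, value in current.items()
        if field != key_field and previous.get(field) != value
    }
    if not changed:
        return None
    return {key_field: current[key_field], **changed}


# Hypothetical example data: the snapshot from the last load vs. the new extract.
previous_record = {"Id": "001xx0000001", "Name": "Acme", "Phone": "555-0100",
                   "BillingCity": "Milan"}
current_record = {"Id": "001xx0000001", "Name": "Acme", "Phone": "555-0199",
                  "BillingCity": "Milan"}

delta = build_delta(previous_record, current_record)
print(delta)   # {'Id': '001xx0000001', 'Phone': '555-0199'}
```

The resulting payloads can then be fed to whichever load tool you use (Data Loader, the Bulk API, and so on), keeping each update job as small as possible.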
Monitoring
Finally, it's important to monitor the situation: in the Setup > Data Usage section you can easily find all the information you need, for example how much space is used or which objects take up the most storage.
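If you prefer to watch those numbers programmatically (for alerting or dashboards outside Salesforce), the REST API Limits resource exposes the org's data and file storage figures. Below is a minimal sketch with Python and requests; token, instance URL, and API version are placeholders, and the exact set of keys returned depends on your org and API version.

```python
import requests

INSTANCE_URL = "https://yourInstance.my.salesforce.com"   # placeholder
ACCESS_TOKEN = "00D...token"                              # placeholder
API_VERSION = "v58.0"

# The Limits resource returns the org's current limit consumption as JSON,
# including data and file storage (values are reported in MB).
resp = requests.get(
    f"{INSTANCE_URL}/services/data/{API_VERSION}/limits",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
limits = resp.json()

for key in ("DataStorageMB", "FileStorageMB"):
    entry = limits.get(key)
    if entry:
        used = entry["Max"] - entry["Remaining"]
        print(f"{key}: {used} MB used of {entry['Max']} MB")
```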
Conclusion
With the exponential growth of data in the age of IT, it's becoming more and more important for customers to have integrated, end-to-end solutions in place for storing, archiving, and analyzing their data. The Salesforce platform offers several features that make it easy to develop a common-sense approach to data management, one that can deliver happier constituents, a more effective user experience, improved organizational agility, and reduced maintenance and cost. You only have to use it!