Abstract
Schemalessness, one of the major advantages of the JSON representation format, comes with high penalties in querying and operations because it rules out critical functions such as query optimization, indexing, and data verification. There have been continuous efforts to develop an algorithm that accurately discovers a JSON schema from a bag of JSON documents. Unfortunately, existing schema discovery techniques are top-down algorithms and therefore lack visibility into the child nodes of the JSON tree. In the absence of information about lower-level JSON elements, top-down algorithms must rely on assumptions and heuristics to decide the schema type of each node. However, such static decisions are often violated in datasets, which causes top-down algorithms to perform poorly. To overcome this, we propose an algorithm, called ReCG, that processes JSON documents in a bottom-up manner. It builds up schemas from the leaf elements upward in the JSON document tree and can thus make more informed decisions about schema node types. In addition, we systematically apply MDL (Minimum Description Length) principles while building up the schemas, so as to choose, among candidate schemas, the most concise yet accurate one with well-balanced generality. Evaluations show that our technique improves the recall and precision of the discovered schemas by as much as 47%, resulting in a 46% better F1 score, while also running 2.11× faster on average than the state-of-the-art.
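The two ideas in the abstract, resolving child nodes before deciding a parent's schema type and using description length to pick among candidate schemas, can be illustrated with a toy sketch. This is not ReCG itself. The function names (`kind_of`, `infer`, `infer_object`), the constant `KEY_BITS`, and the cost model are all assumptions made for illustration. The cost model simply charges a flat number of bits per key name, and a map-like candidate is considered only when every key has an identical child schema.

```python
import json
from collections import defaultdict

KEY_BITS = 64  # assumed flat cost, in bits, of writing out one key name


def kind_of(value):
    """Map a parsed JSON value to its JSON Schema type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"not a JSON value: {value!r}")


def infer(values):
    """Infer one schema for a non-empty bag of JSON values at the same position."""
    groups = defaultdict(list)
    for v in values:
        groups[kind_of(v)].append(v)
    if len(groups) > 1:  # mixed kinds: union of the per-kind schemas
        return {"anyOf": [infer(groups[k]) for k in sorted(groups)]}
    kind = next(iter(groups))
    if kind == "array":
        items = [x for v in values for x in v]
        return {"type": "array", "items": infer(items)} if items else {"type": "array"}
    if kind == "object":
        return infer_object(values)
    return {"type": kind}


def infer_object(objs):
    """Choose between a record-like and a map-like schema for a bag of objects."""
    by_key = defaultdict(list)
    for obj in objs:
        for key, value in obj.items():
            by_key[key].append(value)
    # Children are resolved first, so the decision below can inspect them.
    children = {key: infer(vals) for key, vals in by_key.items()}
    required = sorted(k for k, vals in by_key.items() if len(vals) == len(objs))
    record = {"type": "object", "properties": children, "required": required}

    # Toy rule: a map-like schema is a candidate only if all keys share one child schema.
    distinct = {json.dumps(s, sort_keys=True) for s in children.values()}
    if len(distinct) != 1:
        return record
    mapping = {"type": "object", "additionalProperties": next(iter(children.values()))}

    # Two-part description length: bits for the schema plus bits for the data.
    optional = len(children) - len(required)
    record_bits = len(children) * KEY_BITS + len(objs) * optional  # keys in schema, 1 bit per optional key per object
    mapping_bits = sum(len(obj) for obj in objs) * KEY_BITS        # every key spelled out in the data
    return mapping if mapping_bits < record_bits else record


docs = [
    {"id": 1, "visits": {"u1": 3, "u2": 5}},
    {"id": 2, "visits": {"u3": 1}},
    {"id": 3, "visits": {"u4": 2, "u5": 7, "u6": 1}},
]
schema = infer(docs)
print(json.dumps(schema, indent=2))
```

In this sketch the top-level objects have stable keys and come out record-like, while `visits`, whose keys never repeat, is cheaper to describe as a map from arbitrary keys to numbers. A real system would need a far more careful cost model and a way to group similar instances before generalizing them.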
Original language | English |
---|---|
Pages (from-to) | 3538-3550 |
Number of pages | 13 |
Journal | Proceedings of the VLDB Endowment |
Volume | 17 |
Issue number | 11 |
State | Published - 2024 |
Event | 50th International Conference on Very Large Data Bases, VLDB 2024 - Guangzhou, China; Duration: 24 Aug 2024 → 29 Aug 2024 |