-
-
Notifications
You must be signed in to change notification settings - Fork 158
feat: new schema based checker #761
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## master #761 +/- ##
==========================================
- Coverage 99.15% 99.11% -0.04%
==========================================
Files 46 48 +2
Lines 2834 3162 +328
==========================================
+ Hits 2810 3134 +324
- Misses 24 28 +4 ☔ View full report in Codecov by Sentry. |
i have fixed the missing coverages, but the coverage ci test needs to be rerun |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution!
I really like this to be part of dotenv-linter
! 🚀
I've left some comments for first pass
@dotenv-linter/core please take a look
Another idea in my mind in continue to this one, is to allow deriving of the schema from To provide a way for users to generate a schema from existing files - they can be large. And then it can be edited manually. This is definitely not for this PR. |
I have one failing test because the text returned by failure to open a file is different on windows (My main dev system) and linux. I will try to fix, You code coverage tool is a hard taskmaster - but very good at making sure everything is tested. I think I have fixed it |
dotenv-linter/src/schema.rs
Outdated
pub struct DotEnvSchema { | ||
pub version: String, | ||
pub allow_other_keys: bool, | ||
pub entries: HashMap<String, SchemaEntry>, | ||
} | ||
#[derive(Deserialize, Default)] | ||
#[serde(default)] | ||
pub struct SchemaEntry { | ||
pub key: String, | ||
pub required: bool, | ||
#[serde(rename = "type")] | ||
pub value_type: SchemaValueType, | ||
pub regex: Option<String>, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we do conversion of elements to map anyway, what do you think of using enums
?
It also would introduce some basic validation on schema reading (of the schema itself)
I mean something like:
enum EnvVariableKind {
String { regex },
Email,
Nunber { min, max },
..
}
So we will have HashMap<String, EnvVariableKind>
after validation of the schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wondered about this too, I dont think so though. The only useful one at the moment is String(Option<Regex>)
. If the idea is to encapsulate the entire definition of the 'rule' in the enum they get too complex. Imagine rules that say
- either KEY1 or KEY2 must be present but not both
- if KEY1 present then KEY2 must also be present
- these are the allowed values for KEY1 (could be a regex)
- ...
The code then still needs to look at other things in the schema definition , not just the type enum so my vote is no.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we need type-independent checks, it could be more extended wrapper, like:
// entries: HashMap<String, EnvValidationRules>
struct EnvValidationRules {
ty: EnvVariableType, // contains specific validations for the type
required: bool,
excludes: Vec<String>, // keys which can't present in case of this one found (just for example)
... // other common rules which doesn't depend on type
}
The main point here that there are inappropriate configs for particular types (i.e depends only on type).
For now only Regex
for string, but we already have understanding of validations for numbers and can have more. UPD: see below suggestion for boolean
, one more possible type-specific rule
It's kinda a good way to ensure schema is passed in correct way, and let's say there is no such confusing definitions:
{
"key": "SECRET",
"required": true,
"type": "Number",
"regex": "^[a-z]*$"
},
Another benefit - the logic will be more separated, for type specific checks it can be just a method on enum, so we can do something like that: ty.validate(value)
And in general this type can be de-serialized with serde
as is, so no need to manually map them
Arg::new("schema") | ||
.short('S') | ||
.long("schema") | ||
.help("Use schema file to check .env files") | ||
.value_parser(value_parser!(PathBuf)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, it can be useful to have an ability to pass schema as an in-place argument (json string) to the CLI.
E.g it can be used when you're piping commands
But we don't have to do this initially, can be separate issue/PR to support this later
I also messed up the fix for that missing file test, hopefully I have it correct now |
I apologize for a number of comments and suggestions. I understand that this may be annoing. Many of items can be done in iterative way, in separate PRs. Thus, we just need to collect features which we gonna schedule for the future enhancement. It also would allow other folks to contribute 🙂 Thanks for all efforts addressing everything! 🙏 |
@DDtKey are you waiting for me to do something. I think I have done all that you asked, I have all checks passing and 100% coverage |
Actually I'm not the only maintainer, so we need @dotenv-linter/core to review as well (only @mgrachev can merge / make final decision here). But from my side, I still think it would be nice this part to be implemented 🤔 |
I dont think serde can do it when we are not deserializing the enums directly. I can make serde do it directly from a different json syntax that I dont like so much
or
which is very odd. I can change the enum to
if you like ( I dont think it adds anything) I will have to remap it manually - which is fine, just like the vec to hashmap translation I will then reject a regex on anything other than a string. Or I can simply do that without changing the enum |
That's possible, did you consider to use Working representation#[derive(Deserialize, Debug)]
pub struct SchemaEntry {
pub key: String,
pub required: bool,
#[serde(flatten)]
pub ty: SchemaValueType,
}
#[derive(Deserialize, Debug)]
#[serde(tag = "type")]
pub enum SchemaValueType {
String {
regex: Option<String>
},
Integer,
Float,
Boolean,
Url,
Email,
}
I believe this makes the code safer & stricter since it is not possible to get specific fields of a type from branches for other types (e.g Moreover, in the future we may end up with a monster type that contains many of these type-specific rules, but they do not belong to anything (only to high-level struct). Thus, it may be difficult for contributors to understand when and what to use (confuse common rules with specific ones) |
@pm100 Thank you for your contribution 👍 Sorry, but I need more time to review your PR, so I'll look at it around the weekend. |
dotenv-linter/src/cli/options.rs
Outdated
match schema::DotEnvSchema::load(schema_path) { | ||
Ok(schema) => Some(schema), | ||
Err(e) => { | ||
println!("Error loading schema: {}", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we print the error ignoring quiet mode flag, I think it should be stderr
rather than stdout
then 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was stderr originally but the cli tests only look at stdout
All done except CliOptions - see my comment |
dotenv-linter/src/schema.rs
Outdated
pub fn load(path: &Path) -> Result<Self, std::io::Error> { | ||
let file = File::open(path)?; | ||
let reader = BufReader::new(file); | ||
let read_schema: ReadDotEnvSchema = serde_json::from_reader(reader)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let read_schema: ReadDotEnvSchema = serde_json::from_reader(reader)?; | |
let read_schema: DotEnvSchema = serde_json::from_reader(reader)?; |
Why do you use Read*
structs if deserialization directly into structure DotEnvSchema
works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it does not work, it did originally but once the entries was changed to HashMap and the regex was compiled into the schema at load time it is no longer possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we change the schema format to this one, it will work:
{
"version": "1.0",
"allow_other_keys": false,
"entries": {
"SECRET": {
"required": true,
"type": "String",
"regex": "^[a-z]*$"
},
"PORT": {
"type": "Integer"
},
"URL": {
"type": "Url"
}
}
}
Working example:
#[derive(Deserialize, Default)]
pub struct DotEnvSchema {
pub version: String,
pub allow_other_keys: bool,
pub entries: HashMap<String, SchemaEntry>,
}
#[derive(Deserialize, Default)]
pub struct SchemaEntry {
#[serde(default)]
pub required: bool,
#[serde(rename = "type")]
pub value_type: SchemaValueType,
#[serde(with = "serde_regex", default)]
pub regex: Option<Regex>,
}
#[derive(Deserialize, Default)]
pub enum SchemaValueType {
#[default]
String,
Integer,
Float,
Boolean,
Url,
Email,
}
impl DotEnvSchema {
pub fn load(path: &Path) -> Result<Self, std::io::Error> {
let file = File::open(path)?;
let reader = BufReader::new(file);
let schema: DotEnvSchema = serde_json::from_reader(reader)?;
Ok(schema)
}
}
@DDtKey What do you think about the new schema format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm totally good with that. Actually I mentioned this way earlier:
Also alternative to consider: instead of
array
, try to useObject
inJSON
, wherekey
is env variable name 🤔 This can be deserialized into map
Any valid Unicode string is valid json key - so we should have no issues with that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another "unresolved" question regarding schema deserialization part is #761 (comment) (also there is an example in rust-playground)
I tend to keep type-specific rules attached to the type itself (like regex
belongs to string
). Just to keep our code more extensible & clear for the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another "unresolved" question regarding schema deserialization part is #761 (comment) (also there is an example in rust-playground)
I tend to keep type-specific rules attached to the type itself (like
regex
belongs tostring
). Just to keep our code more extensible & clear for the future
I like your idea 👍
for ch in &mut checks { | ||
let end_warns = ch.end(); | ||
warnings.extend(end_warns); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for ch in &mut checks { | |
let end_warns = ch.end(); | |
warnings.extend(end_warns); | |
} | |
if disabled_checks.contains(&LintKind::SchemaViolation) { | |
return warnings; | |
} | |
if let Some(schema) = schema { | |
let keys = lines.iter().filter_map(|l| l.get_key()).collect::<Vec<&str>>(); | |
let required_warnings = schema_violation::check_required_keys(schema, &keys); | |
warnings.extend(required_warnings); | |
} |
It seems to me that it's more obvious what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Highly related to #761 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the 'end' call was intended as a generic call for any checkers that may want to do a final pass of some kind
This now elevates the schema checker to a special case
fn end(&mut self) -> Vec<Warning> { | ||
let mut warnings = Vec::new(); | ||
if let Some(schema) = self.schema { | ||
for (key, entry) in &schema.entries { | ||
if entry.required && !self.seen_keys.contains(key) { | ||
warnings.push(Warning::new( | ||
self.last_line_number, | ||
self.name(), | ||
format!("The {} key is required", entry.key), | ||
)); | ||
} | ||
} | ||
} | ||
warnings | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn end(&mut self) -> Vec<Warning> { | |
let mut warnings = Vec::new(); | |
if let Some(schema) = self.schema { | |
for (key, entry) in &schema.entries { | |
if entry.required && !self.seen_keys.contains(key) { | |
warnings.push(Warning::new( | |
self.last_line_number, | |
self.name(), | |
format!("The {} key is required", entry.key), | |
)); | |
} | |
} | |
} | |
warnings | |
} | |
pub(crate) fn check_required_keys(schema: &crate::schema::DotEnvSchema, keys: &[&str]) -> Vec<Warning> { | |
let mut warnings = Vec::new(); | |
for (key, entry) in &schema.entries { | |
if entry.required && !keys.contains(&key.as_str()) { | |
warnings.push(Warning::new( | |
keys.len() + 1, | |
LintKind::SchemaViolation, | |
format!("The {} key is required", entry.key), | |
)); | |
} | |
} | |
warnings | |
} |
The end
function is not too obvious, so I propose this solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, I think there is a better alternative.
- extract all required keys from schema in
SchemaViolationChecker::new
and keep asHashSet
- every call of
run
we remove the key from the hashset - If at the end there are keys left in the
hashset
, we return warnings for them.
This must be more efficient than going through the keys we collected one more time. Also, we won't be forced to keep all the keys from the file in memory.
This will still require the checker to have a callback, much like now, but instead of end
I would suggest something like remaining_warnings
or so. I believe this is even better, because instead of hardcode we will get expandable functionality for checkers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the 'end' call is intended as a generic callback for all checkers that may want to do a final pass of some kind
Removing the keys once seen will prevent the validation of duplicated entries, ie regex check... The second one may be the 'intended' one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing the keys once seen will prevent the validation of duplicated entries, ie regex check... The second one may be the 'intended' one
Didn't get it. What do you mean? Could you provide an example?
I expect to have something like required_keys: HashSet<String>
and we remove the keys from it once seen. It shouldn't be used for anything but required keys
If required key present - we remove it from the set. The logic is almost the same as current one with second pass through all keys.
the 'end' call is intended as a generic callback for all checkers that may want to do a final pass of some kind
Regarding this one, as I said, callback makes sense in general - especially with suggested approach, but would be nice to have better naming (see comment above).
reverted all Cli changes using serde_regex |
Why not accept the PR, which is fully featured and passes 100% test coverage - as far as I all the proposed changes are stylistic choices - then change it however you like. There are tons of extra features that can be added once the base support is in:
Or close it |
I think there is no point in merging the PR with an unfinished schema. This is the most important part of PR. It’s not stylistic choice - schema is user-facing feature, and changes may lead to major updates which we can avoid easily at this point of time. If we accept it unfinished, then we need someone to work on it to get the job done. We are not asking for additional features, just changes to an important part. I don't think merging the intermediate state of a feature is a good idea. Even if we did, we wouldn't be able to release it until we were ready to introduce the schema-format without making incompatible changes immediately afterwards. But you have already done a huge amount of work (thank you!) and are familiar with how this can be changed |
I do not know what changes to the schema are being asked, I only see implementation changes. Can somebody please post the proposed schema syntax and I will implement it |
Sure, the suggestion was in the thread above #761 (comment) |
the problem with that schema is that it is not describable in a json schema itself, so json aware editors wont like it. The only reason for the proposed change is to slightly simplify the implementation on reading the schema. Which isnt a compelling reason IMHO |
Didn't really get your point. It's valid JSON and any editor with JSON support works with this (at least I checked my editors).
Well, I see at least couple of reasons:
version: '1.0'
allow_other_keys: false
entries:
SECRET:
required: true
type: String
regex: "^[a-z]*$"
PORT:
type: Integer
URL:
type: Url
|
ok, I can change to that format. But its much harder to detect duplicated keys - which I have to do and forgot in the original code. The problem is that serde happily accepts duplicates. Ie this
It means that I have to deserialize some other way And since the only real purpose of this change was to make the deserialze easier I dont really like it. BTW - it is possible tp describe this format using json schema - it just wasnt obvious |
Most common solution is to use Just declare the field with #[serde_as(as = "MapPreventDuplicates<_, _>")] That's why I mentioned that we can get duplicate-check for free.
Please stop claiming that this is the only reason |
Sorry, but did you read what I suggested? (don't consider it rude, just can't follow the conversation) |
done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Couple of iterations before releasing should be performed
@pm100 Thank you for your help 👍 |
I do not expect this to be accepted as is. I am sure there will be feedback
There are a couple of big changed to the structure of the code
First I have unified the 3 different Options structs (Cli, Fix, Compare) into one CliOptions. Why, becuase its seems useful to pass the options to the checkers, rather than pull out one field (skips). But fix accepts options but calls check and fix, so either I had to synthesize a checkoptions from the fixoptions, or just make it one options type.
Next, I have added a call to the Check trait that allows them to be called after all lines have been processed. Useful for detecting missing things. Check::run loops over the checker at the end calling this new entry point.
Features
sample schema
to use include -S | --schema on command line
if schema not found or is invalid then error message is printed and app exits
Not done yet
Full set of tests included